As microservices architecture continues to evolve, interservice communication has become a significant challenge to manage. Service meshes are becoming the standard solution, but how do popular and up-and-coming service meshes compare?
Written by Guillaume Dury
This article was originally published in toptal.com Blog. Follow total.com for more articles.
When discussing microservices architecture and containerization, one set of production-proven tools has captured most of the attention in recent years: the service mesh.
Indeed, microservices architecture and Kubernetes (often stylized “K8s”) have quickly become the norm for scalable apps, making the problem of managing communications between services a hot topic—and service meshes an attractive solution. I myself have used service meshes in production, specifically Linkerd, Istio, and an earlier form of Ambassador. But what do service meshes do, exactly? Which one is the best to use? How do you know whether you should use one at all?
To answer these questions, it helps to understand why service meshes were developed. Historically, traditional IT infrastructure had applications running as monoliths. One service ran on one server; if it needed more capacity, the solution was to scale it vertically by provisioning a bigger machine.
Reaching the limits of this approach, big tech companies rapidly adopted a three-tier architecture, separating a load balancer from application servers and a database tier. While this architecture remained somewhat scalable, they decided to break down the application tier into microservices. The communication between these services became critical to monitor and secure in order for applications to scale.
To allow interservice communication, these companies developed in-house libraries: Finagle at Twitter, Hystrix at Netflix, and Stubby at Google (which became gRPC in 2016). These libraries aimed to fix problems raised by microservices architecture: security between services, latency, monitoring, and load balancing. But managing a big library as a dependency—in multiple languages—is complex and time-consuming.
In the end, that type of library was replaced with an easier-to-use, lightweight proxy. Such proxies were externally independent of the application layer—potentially transparent for the application—and easier to update, maintain, and deploy. Thus, the service mesh was born.
A service mesh is a software infrastructure layer for controlling communication between services; it’s generally made of two components:
Service meshes have three main goals around interservice communication:
This aspect of service mesh architecture allows for service discovery and dynamic routing. It also covers communication resiliency, such as retries, timeouts, circuit breaking, and rate limiting.
A staple feature of service meshes is load balancing. All services being meshed by proxies allows for the implementation of load-balancing policies between services, such as round robin, random, and least requests. Those policies are the strategy used by the service mesh to decide which replica will receive the original request, just like if you had tiny load balancers in front of each service.
Finally, service meshes offer routing control in the form of traffic shifting and mirroring.
In traditional microservices architecture, services communicate between themselves with unencrypted traffic. Unencrypted internal traffic is nowadays considered a bad practice in terms of security, particularly for public cloud infrastructure and zero-trust networks.
On top of protecting the privacy of client data where there’s no direct control over hardware, encrypting internal traffic adds a welcome layer of extra complexity in case of a system breach. Hence, all service meshes use mutual TLS (mTLS) encryption for interservice communication, i.e., all interproxy communication.
Service meshes can even enforce complex matrices of authorization policy, allowing or rejecting traffic based on policies targeting particular environments and services.
The goal of service meshes is to bring visibility to interservice communications. By controlling the network, a service mesh enforces observability, providing layer-seven metrics, which in turn allow for automatic alerts when traffic reaches some customizable threshold.
Usually supported by third-party tools or plug-ins like Jaeger or Zipkin, such control also allows for tracing by injecting HTTP tracing headers.
The service mesh was created to offset some of the operational burden created by a microservices architecture. Those with experience in microservices architectures know that they take a significant amount of work to operate on a daily basis. Taking full advantage of the potential of microservices normally requires external tooling to handle centralized logging, configuration management, and scalability mechanisms, among others. Using a service mesh standardizes these capabilities, and their setup and integration.
Service mesh observability, in particular, provides extremely versatile debugging and optimization methods. Thanks to granular and complete visibility on exchanges between services, engineers—particularly SREs—can more quickly troubleshoot possible bugs and system misconfigurations. With service mesh tracing, they can follow a request from its entry into the system (at a load balancer or external proxy) all the way to private services inside the stack. They can use logging to map a request and record the latency it encounters in each service. The end result: detailed insights into system performance.
Traffic management provides powerful possibilities before ramping up to a full release of a new versions of a service:
Service meshes streamline all of the above scenarios, making it easier to avoid and/or mitigate any surprises in production.
In many ways, service meshes are the ultimate set of tools for microservices architecture; many of them run on one of the top container orchestration tools, Kubernetes. We selected three of the main service meshes running on Kubernetes today: Linkerd (v2), Istio, and Consul Connect. We’ll also discuss some other service meshes: Kuma, Traefik Mesh, and AWS App Mesh. While currently less prominent in terms of usage and community, they’re promising enough to review here and to keep tabs on generally.
Not all Kubernetes service meshes take the same architectural approach, but a common one is to leverage the sidecar pattern. This involves attaching a proxy (sidecar) to the main application to intercept and regulate the inbound and outbound network traffic of the application. In practice, this is done in Kubernetes via a secondary container in each application pod that will follow the life cycle of the application container.
There are two main advantages to the sidecar approach to service meshes:
But sidecars are a mixed blessing because of how they directly impact an application:
Given those advantages and drawbacks, the sidecar approach is frequently a subject of debate in the service mesh community. That said, four of the six service meshes compared here use the Envoy sidecar proxy, and Linkerd uses its own sidecar implementation; Traefik Mesh does not use sidecars in its design.
Linkerd, which debuted in 2017, is the oldest service mesh on the market. Created by Buoyant (a company started by two ex-Twitter engineers), Linkerd v1 was based on Finagle and Netty.
Linkerd v1 was described as being ahead of its time since it was a complete, production-ready service mesh. At the same time, it was a bit heavy in terms of resource use. Also, significant gaps in documentation made it difficult to set up and run in production.
With that, Buoyant had a chance to work with a complete production model, gain experience from it, and apply that wisdom. The result was Conduit, the complete Linkerd rewrite the company released in 2018 and renamed Linkerd v2 later that year. Linkerd v2 brought with it several compelling improvements; since Buoyant’s active development of Linkerd v1 ceased long ago, our mentions of “Linkerd” throughout the rest of this article refer to v2.
Fully open source, Linkerd relies on a homemade proxy written in Rust for the data plane and on source code written in Go for the control plane.
Linkerd proxies have retry and timeout features but have no circuit breaking or delay injection as of this writing. Ingress support is extensive; Linkerd boasts integration with the following ingress controllers:
Service profiles in Linkerd offer enhanced routing capabilities, giving the user metrics, retry tuning, and timeout settings, all on a per-route basis. As for load balancing, Linkerd offers automatic proxy injection, its own dashboard, and native support for Grafana.
The mTLS support in Linkerd is convenient in that its initial setup is automatic, as is its automatic daily key rotation.
Out of the box, Linkerd stats and routes are observable via a CLI. On the GUI side, options include premade Grafana dashboards and a native Linkerd dashboard.
Linkerd can plug into an instance of Prometheus.
Tracing can be enabled via an add-on with OpenTelemetry (formerly OpenCensus) as the collector, and Jaeger doing the tracing itself.
Linkerd installation is done by injecting a sidecar proxy, which is done by adding an annotation to your resources in Kubernetes. There are two ways to go about this:
The second method starts with downloading and running an installation script:
curl -sL https://run.linkerd.io/install | sh
From there, the Linkerd CLI tool linkerd
provides a useful toolkit that helps install the rest of Linkerd and interact with the app cluster and the control plane.
linkerd check --pre
will run all the preliminary checks necessary for your Linkerd install, providing clear and precise logs on why a potential Linkerd install might not work just yet. Without --pre
, this command can be used for post-installation debugging.
The next step is to install Linkerd in the cluster by running:
linkerd install | kubectl apply -f -
Linkerd will then install many different components with a very tiny resource footprint; hence, they themselves have a microservices approach:
Linkerd bases most of its configuration on CustomResourceDefinitions
(CRDs). This is considered the best practice when developing Kubernetes add-on software—it’s like durably installing a plug-in in a Kubernetes cluster.
Adding distributed tracing—which may or may not be what Linkerd users are actually after, due to some common myths—requires another step with linkerd-collector and linkerd-jaeger. For that, we would first create a config file:
cat >> config.yaml << EOF
tracing:
enabled: true
EOF
To enable tracing, we would run:
linkerd upgrade --config config.yaml | kubectl apply -f -
As with any service mesh based on sidecar proxies, you will need to modify your application code to enable tracing. The bulk of this is simply adding a client library to propagate tracing headers; it then needs to be included in each service.
The traffic split feature of Linkerd, exposed through its Service Mesh Interface (SMI)-compliant API, already allows for canary releases. But to automate them and traffic management, you will also need external tooling like Flagger.
Flagger is a progressive delivery tool that will assess what Linkerd calls the “golden” metrics: “request volume, success rate, and latency distributions.” (Originally, Google SREs used the term golden signals and included a fourth one—saturation—but Linkerd doesn’t cover it because that would require metrics that are not directly accessible, such as CPU and memory usage.) Flagger tracks these while splitting traffic using a feedback loop; hence, you can implement automated and metrics-aware canary releases.
After delving into the installation process, it becomes clear that to have a Linkerd service mesh operational and exploiting commonly desired capabilities, it’s easy to end up with at least a dozen services running. That said, more of them are supplied by Linkerd upon installation than with other service meshes.
Advantages:
Linkerd benefits from the experience of its creators, two ex-Twitter engineers who had worked on the internal tool, Finagle, and later learned from Linkerd v1. As one of the first service meshes, Linkerd has a thriving community (its Slack has more than 5,000 users, plus it has an active mailing list and Discord server) and an extensive set of documentation and tutorials. Linkerd is mature with its release of version 2.9, as evidenced by its adoption by big corporations like Nordstrom, eBay, Strava, Expedia, and Subspace. Paid, enterprise-grade support from Buoyant is available for Linkerd.
Drawbacks:
There’s a pretty strong learning curve to use Linkerd service meshes to their full potential. Linkerd is only supported within Kubernetes containers (i.e., there’s no VM-based, “universal” mode). The Linkerd sidecar proxy is not Envoy. While this does allow Buoyant to tune it as they see fit, it removes the inherent extensibility that Envoy offers. It also means Linkerd is missing support for circuit breaking, delay injection, and rate limiting. No particular API is exposed to control the Linkerd control plane easily. (You can find the gRPC API binding, though.)
Linkerd has made great progress since v1 in its usability and ease of installation. The lack of an officially exposed API is a notable omission. But thanks to otherwise well-thought-out documentation, out-of-the-box functionality in Linkerd is easy to test.
Our next service mesh contender, Consul Connect, is a unique hybrid. Consul from HashiCorp is better known for its key-value storage for distributed architectures, which has been around for many years. After the evolution of Consul into a complete suite of service tools, HashiCorp decided to build a service mesh on top of it: Consul Connect.
To be precise about its hybrid nature:
The Consul Connect data plane is based on Envoy, which is written in C++. The control plane of Consul Connect is written in Go. This is the part that is backed by Consul KV, a key-value store.
Like most of the other service meshes, Consul Connect works by injecting a sidecar proxy inside your application pod. In terms of architecture, Consul Connect is based around agents and servers. Out of the box, Consul Connect is meant to have high availability (HA) using three or five servers as a StatefulSet
specifying pod anti-affinity. Pod anti-affinity is the practice of making sure pods of a distributed software system won’t end up on the same node (server), thereby guaranteeing availability in case any single node fails.
There’s not much that makes Consul Connect stand out in this area; it provides what any service mesh does (which is quite a bit), plus layer-seven features for path-based routing, traffic shifting, and load balancing.
As with the other service meshes, Consul Connect provides basic mTLS capabilities. It also features native integration between Consul and Vault (also by HashiCorp), which can be used as a CA provider to manage and sign certificates in a cluster.
Consul Connect takes the most common observability approach by incorporating Envoy as a sidecar proxy to provide layer-seven capabilities. Configuring the UI of Consul Connect to fetch metrics involves changing a built-in configuration file and also enabling a metric provider like Prometheus. It’s also possible to configure some Grafana dashboards to show relevant service-specific metrics.
Consul Connect is installed into a Kubernetes cluster using a Helm chart:
helm repo add hashicorp https://helm.releases.hashicorp.com
You will need to modify the default values.yaml
if you want to enable the UI or make the Consul Connect module inject its sidecar proxy:
helm install -f consul-values.yml hashicorp hashicorp/consul
To consult members and check the various nodes, Consul recommends exec
ing into one of the containers then using the CLI tool consul
.
To serve up the out-of-the-box web UI that Consul provides, run kubectl port-forward service/hashicorp-consul-ui 18500:80
.
Advantages:
Drawbacks:
Consul Connect can be a very good choice for those who seek an enterprise-grade product. HashiCorp is known for the quality of its products, and Consul Connect is no exception. I can see two strong advantages here compared to other service meshes: strong support from the company with the enterprise version and a tool ready for all kinds of architectures (not just Kubernetes).
In May 2017, Google, IBM, and Lyft announced Istio. When Istio entered the race of service mesh tools, it gained very good exposure in the tech space. Its authors have added features based on user feedback all the way through version 1.9.
Istio promised important new features over its competitors at the time: automatic load balancing, fault injection, and many more. This earned it copious attention from the community, but as we’ll detail below, using Istio is far from trivial: Istio has been recognized as particularly complex to put into production.
Historically, the Istio project has bounced around frequently in terms of source code changes. Once adopting a microservice architecture for the control plane, Istio has now (since version 1.5) moved back to a monolithic architecture. Istio’s rationale for returning to centralization was that microservices were hard for operators to monitor, the codebase was too redundant, and the project had reached organizational maturity—it no longer needed to have many small teams working in silos.
However, along the way Istio gained notoriety for having one of the highest volumes of open GitHub issues. As of this writing, the count stands at about 800 issues open and about 12,000 closed. While issue counts can be deceiving, in this case, they do represent a meaningful improvement in terms of previously broken features and out-of-control resource usage.
Istio is pretty strong at traffic management compared to Consul Connect and Linkerd. This is thanks to an extensive offering of sub-features: request routing, fault injection, traffic shifting, request timeouts, circuit breaking, and controlling ingress and egress traffic to the service mesh. The notion of virtual services and destination rules makes it very complete in terms of traffic management.
However, all those possibilities come with a learning curve, plus the management of those new resources on your Kubernetes cluster.
Istio boasts a comprehensive set of security-related tools, with two main axes: authentication and authorization. Istio can enforce different levels of policies on different scopes: workload-specific, namespacewide, or meshwide. All such security resources are managed through Istio CRDs such as AuthorizationPolicy
or PeerAuthentication
.
Beyond the standard mTLS support, Istio can also be configured to accept or reject unencrypted traffic.
Here, Istio is pretty advanced out of the box, with several types of telemetry offering solid insights into the service mesh. Metrics are based on the four golden signals (latency, traffic, errors, and, to some extent, saturation).
Notably, Istio provides metrics for the control plane itself. It also serves distributed traces and access logs, boasting explicit compatibility with Jaeger, Lightstep, and Zipkin to enable tracing.
There is no native dashboard, but there is official support for the Kiali management console.
Installation is as straightforward as following the official steps. Istio is also natively integrated into GKE as a beta feature, but there GKE uses Istio 1.4.X, the older microservices version as opposed to the latest monolith version.
A native install starts with downloading the latest release:
curl -L https://istio.io/downloadIstio | sh -
After cd
ing into the newly created istio-*
folder, you can add it to your path so you can use the istioctl
utility tool:
export PATH=$PWD/bin:$PATH
From there, you can install Istio to your Kubernetes cluster via istioctl
:
istioctl install
This will install Istio with a default
profile. istioctl
profiles allow you to create different installation configurations and switch between them if necessary. But even in a single-profile scenario, you can deeply customize the install of Istio by modifying some CRDs.
Istio resources will be harder to manage as you will need to manage several kinds of CRDs—VirtualService
, DestinationRule
, and Gateway
at a minimum—to make sure traffic management is in place.
VirtualService
resource is a configuration declaring a service and the different routing rules that are applied to requests.DestinationRule
resource is used to group and enforce target-specific traffic policies.Gateway
resource is created to manage inbound and outbound service mesh traffic (i.e., additional Envoy proxies, but running at the edge rather than as sidecars.)The semantic details are beyond the scope of this review, but let’s look at a quick example showing each of these working together. Suppose we have an e-commerce website with a service called products
. Our VirtualService
might look like this:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: products-route
namespace: ecommerce
spec:
hosts:
- products # interpreted as products.ecommerce.svc.cluster.local
http:
- match:
- uri:
prefix: "/listv1"
- uri:
prefix: "/catalog"
rewrite:
uri: "/listproducts"
route:
- destination:
host: products # interpreted as products.ecommerce.svc.cluster.local
subset: v2
- route:
- destination:
host: products # interpreted asproducts.ecommerce.svc.cluster.local
subset: v1
The corresponding DestinationRule
could then be:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: products
spec:
host: products
trafficPolicy:
loadBalancer:
simple: RANDOM # or LEAST_CONN or ROUND_ROBIN
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
- name: v3
labels:
version: v3
Lastly, our Gateway
:
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
name: cert-manager-gateway
namespace: istio-system
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 80
name: http
protocol: HTTP
hosts:
- "*"
With these three files in place, an Istio installation would be ready to handle basic traffic.
Advantages:
Drawbacks:
istiod
has stabilized around using 1 vCPU and 1.5 GB of memory.)Istio is a great example of tech giants coming together to create an open-source project to address a challenge they’re all facing. It took some time for the Istio project as a whole to structure itself (recalling the microservices-to-monolith architectural shift) and resolve its many initial issues. Today, Istio is doing absolutely everything one would expect from a service mesh and can be greatly extended. But all these possibilities come with steep requirements in terms of knowledge, work hours, and computing resources to support its usage in a production environment.
Created by Kong and then open-sourced, Kuma reached 1.0 in late 2020. To some extent, it was created in response to the first service meshes being rather heavy and difficult to operate.
Its feature list suggests it to be very modular; the idea behind Kuma is to orient it toward integration with applications already running on Kubernetes or other infrastructure.
The universal (non-Kubernetes) version of Kuma requires PostgreSQL as a dependency, and Kong’s CTO has been notably against supporting SMI. But Kuma was developed with the idea of multicloud and multicluster deployments from the start, and its dashboard reflects this.
While Kuma is still young in the service mesh space, with few cases of production use so far, it’s an interesting and promising contender.
Originally named Maesh, Traefik Mesh (by Traefik Labs) is another newcomer in the service mesh tooling race. The project mission is to democratize the service mesh by making it easy to use and configure; the developers’ experience with the well-thought-out Traefik Proxy put them in a prime position to accomplish that.
Traefik Mesh needs CoreDNS to be installed on the cluster. (While Azure has used CoreDNS by default since 1.12, GKE defaults to kube-dns as of this writing, meaning there’s a significant extra step involved in that case.) It also lacks multicluster capabilities.
That said, Traefik Mesh is unique within our service mesh comparison in that it doesn’t use sidecar injection. Instead, it’s deployed as a DaemonSet on all nodes to act as a proxy between services, making it non-invasive. Overall, Traefik Mesh is simple to install and use.
In the world of cloud providers, AWS is the first one to have implemented a native service mesh pluggable with Kubernetes (or EKS in particular) but also its other services. AWS App Mesh was released in November 2018 and AWS has been iterating on it since then. The main advantage of AWS App Mesh is the preexisting AWS ecosystem and market position; the big community behind AWS overall will continue to drive its usage and usability.
The project is not open source, doesn’t support SMI, and there’s not much info online about HA standards for the control plane. AWS App Mesh is more complex to set up than other Kubernetes-native service meshes and has very little community online (24 answers on Stack Overflow, 400 stars on GitHub) but that’s because users are meant to benefit from AWS support.
AWS App Mesh has native integration with AWS, starting with EKS and extending to other AWS services like ECS (Fargate) and EC2. Unlike Traefik Mesh, it has multicluster support. Plus, like most service meshes, it’s based on injecting Envoy, the battle-tested sidecar proxy.
The six Kubernetes service mesh options presented here have a few things in common:
But there are certainly differences worth highlighting when it comes to service mesh infrastructure, traffic management, observability, deployment, and other aspects.
Linkerd | Consul | Istio | Kuma | Traefik Mesh | AWS App Mesh | |
---|---|---|---|---|---|---|
Platforms | Kubernetes | Kubernetes, VM (Universal) | Kubernetes; VM (Universal) is in beta as of 1.9 | Kubernetes, VM (Universal) | Kubernetes | AWS EKS, ECS, FARGATE, EC2 |
High Availability for Control Plane | Yes (creates exactly three control planes) | Yes (with extra servers and agents) | Yes (through Horizontal Pod Autoscaler [HPA] on Kubernetes) | Yes (horizontal scaling) | Yes (horizontal scaling) | Yes (by virtue of supporting AWS tech being HA) |
Sidecar Proxy | Yes, linkerd-proxy | Yes, Envoy (can use others) | Yes, Envoy | Yes, Envoy | No | Yes, Envoy |
Per-node Agent | No | Yes | No | No | Yes | No |
Ingress Controller | Any | Envoy and Ambassador | Istio Ingress or Istio Gateway | Any | Any | AWS Ingress Gateway |
Linkerd | Consul | Istio | Kuma | Traefik Mesh | AWS App Mesh | |
---|---|---|---|---|---|---|
Blue-green Deployment | Yes | Yes (using traffic splitting) | Yes | Yes | Yes (using traffic splitting) | Yes |
Circuit Breaking | No | Yes (through Envoy) | Yes | Yes | Yes | Yes |
Fault Injection | Yes | No | Yes | Yes | No | No |
Rate Limiting | No | Yes (through Envoy, with the help of official Consul docs) | Yes | Not yet, except by configuring Envoy directly | Yes | No |
Linkerd | Consul | Istio | Kuma | Traefik Mesh | AWS App Mesh | |
---|---|---|---|---|---|---|
Monitoring with Prometheus | Yes | No | Yes | Yes | Yes | No |
Integrated Grafana | Yes | No | Yes | Yes | Yes | No |
Distributed Tracing | Yes (OpenTelemetry*) | Yes | Yes (OpenTelemetry*) | Yes | Yes (OpenTelemetry*) | Yes (AWS X-Ray or any open-source alternative) |
* OpenCensus and OpenTracing merged into OpenTelemetry in 2019, but you might find Linkerd articles referring to OpenCensus, as well as Istio and Traefik Mesh articles referring to OpenTracing.
Linkerd | Consul | Istio | Kuma | Traefik Mesh | AWS App Mesh | |
---|---|---|---|---|---|---|
Multicluster | Yes (recently) | Yes (federated) | Yes | Yes (multizone) | No | Yes |
Mesh expansion | No | Yes | Yes | Yes | No | Yes (for AWS services) |
GUI | Yes (out of the box) | Yes (Consul UI) | No native GUI but can use Kiali | Yes (native Kuma GUI) | No | Yes (through Amazon CloudWatch) |
Deployment | via CLI | via Helm chart | via CLI, via Helm chart, or via operator container | via CLI, via Helm chart | via Helm chart | via CLI |
Management Complexity | Low | Medium | High | Medium | Low | Medium |
Linkerd | Consul | Istio | Kuma | Traefik Mesh | AWS App Mesh | |
---|---|---|---|---|---|---|
Open Source | Yes | Yes | Yes | Yes | Yes | No |
Exposed API | Yes, but not documented | Yes, and fully documented | Yes, entirely through CRDs | Yes, and fully documented | Yes, but intended for debugging (GET-only); also, SMI via CRDs | Yes, and fully documented |
SMI Specification Support | Yes | Yes (partial) | Yes | No | Yes | No |
Special Notes | Needs PostgreSQL to run outside of Kubernetes | Needs CoreDNS installed on its cluster | Fully managed by AWS |
Now that we have seen what service meshes are, how they work, and the multitude of differences between them, we can ask: Do we need a service mesh?
That’s the big question for SREs and cloud engineers these past few years. Indeed, microservices bring operational challenges in network communication that a service mesh can solve. But service meshes, for the most part, bring their own challenges when it comes to installation and operation.
One problem we can see emerging in many projects is that with service meshes, there is a gap between the proof-of-concept stage and the production stage. That is, it’s unfortunately rare for companies to achieve staging environments that are identical to production in every aspect; with service meshes involving crucial infrastructure, scale- and edge-related effects can bring deployment surprises.
Service meshes are still under heavy development and improvement. This could actually be attractive for teams with high deployment velocities—those who have mastered “the art of staying state-of-the-art” and can closely follow the development cycle of cloud-native tools.
For others, though, the pace of service mesh evolution could be more of a pitfall. It would be easy enough to set up a service mesh but then forget about the need to maintain it. Security patches may go unapplied or, even if remembered, may carry with them unplanned issues in the form of deprecated features or a modified set of dependencies.
There’s also a notable cost in terms of manpower to set up a service mesh in production. It would be a sensible goal for any team to evaluate this and understand if the benefits from a service mesh would outweigh the initial setup time. Service meshes are hard, no matter what the “easy” demo installations show.
In short, service meshes can solve some of the problems typical to projects deployed at scale but may introduce others, so be prepared to invest time and energy. In a hypothetical infrastructure involving 25 microservices and load of five queries per second, I would recommend having at least one person (preferably two) dedicated for at least a month to preparing a proof of concept and validating key aspects before thinking about running it in production. Once it’s set up, anticipate the need for frequent upgrades—they will impact a core component of your infrastructure, namely network communications.
We’ve seen what service meshes are: a suite of tooling to answer new challenges introduced by microservices architecture. Through traffic management, observability, service discovery, and enhanced security, service meshes can reveal deep insights into app infrastructure.
There are multiple actors on the market, sometimes promoted by GAFAM et al., that have in some cases open-sourced or promoted their own internal tooling. Despite two different implementation types, service meshes will always have a control plane—the brain of the system—and a data plane, made of proxies that will intercept the requests of your application.
Reviewing the three biggest service meshes (Linkerd, Consul Connect, and Istio) we have seen the different strategies they’ve chosen to implement and the advantages they bring. Linkerd, being the oldest service mesh on the market, benefits from its creators’ experience at Twitter. HashiCorp, for its part, offers the enterprise-ready Consul Connect, backed by a high level of expertise and support. Istio, which initially garnered ample attention online, has evolved into a mature project, delivering a feature-complete platform in the end.
But these actors are far from being the only ones, and some less-talked-about service meshes have emerged: Kuma, Traefik Mesh, and AWS App Mesh, among others. Kuma, from Kong, was created with the idea of making the service mesh idea “simple” and pluggable into any system, not just Kubernetes. Traefik Mesh benefited from experience with the preexisting Traefik Proxy and made the rare decision to eschew sidecar proxies. Last but not least, AWS decided to develop its own version of the service mesh, but one that relies on the dependable Envoy sidecar proxy.
In practice, service meshes are still hard to implement. Although service mesh benefits are compelling, their impact, critically, cuts both ways: Failure of a service mesh could render your microservices unable to communicate with each other, possibly bringing down your entire stack. A notorious example of this: Incompatibility between a Linkerd version and a Kubernetes version created a complete production outage at Monzo, an online bank.
Nonetheless, the whole industry is structuring itself around Kubernetes and initiatives like the Microsoft-spearheaded SMI, a standard interface for service meshes on Kubernetes. Among numerous other projects, the Cloud Native Computing Foundation (CNCF) has the Envoy-based Open Service Mesh (OSM) initiative, which was also originally introduced by Microsoft. The service mesh ecosystem remains abuzz, and I predict a champion emerging in the coming years, the same way Kubernetes became the de facto container orchestration tool.
This article was originally published in toptal.com Blog. Follow total.com for more articles.
About the Author
Guillaume is a seasoned DevOps engineer who is constantly searching for ways to improve automation to help developers deliver faster. With an entrepreneurial mindset, he is always ready to take on new challenges. His taste for fast-paced environments led him to Hong Kong where he set up the infrastructure of two startups and one hedge fund’s web application.
Disclaimer:
The views expressed and the content shared in all published articles on this website are solely those of the respective authors, and they do not necessarily reflect the views of the author’s employer or the techbeatly platform. We strive to ensure the accuracy and validity of the content published on our website. However, we cannot guarantee the absolute correctness or completeness of the information provided. It is the responsibility of the readers and users of this website to verify the accuracy and appropriateness of any information or opinions expressed within the articles. If you come across any content that you believe to be incorrect or invalid, please contact us immediately so that we can address the issue promptly.
Tags: kubernetes · OpenShift · Service Mesh · service mesh comparison
Editorial
techbeatly is a platform to share knowledge on Cloud, DevOps, Infrastructure and Opensource. We are focusing on technology tracks like Automation, Containerisation, CI/CD, Infrastructure as Code etc.
Subscribe to YouTube
This site uses Akismet to reduce spam. Learn how your comment data is processed.
Leave a Reply