An Exploratory Guide to the Service Mesh Platforms
The shift to microservices comes with its own set of challenges. If architecting, designing, and developing microservices is considered to be complex, deploying and managing them is no less complex.
Developers need to ensure that communication across the services is secure. They also need to implement distributed tracing that tells how long each invocation takes. Best practices for distributed services, such as retries and circuit breakers, bring resiliency to services. Microservices are typically polyglot and use disparate libraries and SDKs. Writing generic, reusable software to manage inter-service communication across different protocols such as HTTP, gRPC, and GraphQL is complex, expensive, and time-consuming.
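To make the resiliency patterns mentioned above a little more concrete, here is a minimal Python sketch of a circuit breaker. It is an illustration only: the class name `CircuitBreaker`, the `max_failures` and `reset_after` parameters, and the half-open behavior are assumptions chosen for this example, not the implementation used by any particular library or mesh.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Toy circuit breaker that opens after N consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down period in seconds
        self.clock = clock                # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Every service in a polyglot system would need its own version of this logic in its own language, which is exactly the duplication a service mesh aims to remove.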
After a microservices-based application is deployed, day-two operations are performed by the DevOps teams. They need to monitor service health, latency, logs, events, and tracing. DevOps teams are also expected to implement policy-based routing to configure blue/green deployments, canary releases, and rolling upgrades. Finally, the metrics, events, logs, and alerts originating from multiple microservices need to be aggregated and integrated with existing observability and monitoring stacks.
Service mesh, a recent phenomenon in the cloud native and microservices world, attempts to solve these problems for developers and operators. After container orchestration, if there is one technology that has gained the attention of developers and operators, it is definitely the service mesh. Cloud native advocates recommend using service mesh when running microservices in production environments.
Service mesh frees developers from building language-specific SDKs and tools to manage inter-service communication. For operators, service mesh delivers out-of-the-box traffic policies, observability, and insights from the stack.
The best thing about a service mesh is that it is “zero-touch” software that doesn’t force changes to the application’s code or configuration. By leveraging the sidecar pattern, a service mesh injects a proxy alongside every service, and that proxy acts as an agent for the host service. Since the proxy intercepts every inbound and outbound call, it gains unmatched visibility into the call stack. Each proxy sends the telemetry it collects to a centralized component, which also acts as the control plane. When operators configure a traffic policy, they submit it to the control plane, which pushes it to the proxies to influence the traffic. Site Reliability Engineers (SREs) leverage the observability of the service mesh to gain insights into the application.
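The interplay between proxy and control plane can be sketched in a few lines of Python. This is a toy, in-process model under stated assumptions: the `ControlPlane` and `SidecarProxy` classes, the `max_retries` policy key, and the telemetry record shape are all hypothetical names invented for this illustration. A real mesh runs the proxy as a separate process and intercepts network traffic rather than function calls.

```python
import time


class ControlPlane:
    """Collects telemetry and pushes policies to registered proxies."""

    def __init__(self):
        self.proxies = []
        self.telemetry = []

    def register(self, proxy):
        self.proxies.append(proxy)

    def report(self, record):
        self.telemetry.append(record)

    def push_policy(self, service, policy):
        # Copy the policy into every proxy that fronts the named service.
        for proxy in self.proxies:
            if proxy.service_name == service:
                proxy.policy = dict(policy)


class SidecarProxy:
    """Wraps a service handler; the handler itself is left unchanged."""

    def __init__(self, service_name, handler, control_plane):
        self.service_name = service_name
        self.handler = handler
        self.control_plane = control_plane
        self.policy = {"max_retries": 0}
        control_plane.register(self)

    def handle(self, request):
        attempts = 0
        while True:
            attempts += 1
            start = time.perf_counter()
            try:
                response = self.handler(request)
                ok = True
            except Exception:
                response = None
                ok = False
            # Every intercepted call is reported, successful or not.
            self.control_plane.report({
                "service": self.service_name,
                "ok": ok,
                "latency_s": time.perf_counter() - start,
            })
            if ok:
                return response
            if attempts > self.policy["max_retries"]:
                raise RuntimeError(
                    f"{self.service_name} failed after {attempts} attempt(s)")
```

Note that the retry behavior changes when the control plane pushes a new policy, while the wrapped handler is never modified. That separation is the essence of the “zero-touch” claim.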
Service mesh integrates with a Kubernetes ingress controller or an existing API gateway. While the API gateway and ingress tackle the north-south traffic, a service mesh is responsible for the east-west traffic.
To summarize, a service mesh is an infrastructure layer that enables secure service-to-service communication. It relies on lightweight network proxies deployed alongside each microservice. A centralized control plane orchestrates the proxies to manage traffic policies, security, and observability.
Even though service mesh is predominantly used with microservices packaged as containers, it can also be integrated with VMs and even physical servers. By leveraging the traffic policies of service mesh efficiently, applications running across multiple environments can be seamlessly integrated. This makes service mesh one of the key enablers of hybrid cloud and multi-cloud.
There are multiple service mesh choices available to businesses. This article attempts to help compare and contrast some of the mainstream service mesh platforms available in the cloud native ecosystem.
AWS App Mesh
Launched at AWS re:Invent 2018, AWS App Mesh is designed to bring the benefits of a service mesh to Amazon Web Services’ compute and container services. It can be easily configured with Amazon EC2, Amazon ECS, AWS Fargate, Amazon EKS, and even AWS Outposts.
Since App Mesh can act as a service mesh for both VMs and containers, Amazon created an abstraction layer based on virtual services, virtual nodes, virtual routers, and virtual routes.
A virtual service represents an actual service deployed in a VM or a container. Each version of a virtual service is mapped to a virtual node, so there is a one-to-many relationship between a virtual service and its virtual nodes. When a new version of a microservice is deployed, it is simply configured as a virtual node. Similar to a network router, a virtual router acts as an endpoint for the virtual node. The virtual router has one or more virtual routes that adhere to the traffic policies and retry policies. A mesh object acts as a logical boundary for all the related entities and services.
Each service participating in the mesh is associated with a proxy that handles all the traffic flowing within the mesh.
Let’s assume that we are running two services in AWS – servicea.apps.local and serviceb.apps.local.
We can easily mesh-enable these services without modifying the code.
We notice that serviceb.apps.local has a virtual service, a virtual node, and a virtual router with two virtual routes that decide the percentage of traffic sent to v1 and v2 of the microservice.
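The percentage-based split between v1 and v2 boils down to weighted random selection. The Python sketch below shows the idea; it is not App Mesh code or configuration, and the function `pick_target`, the target names `serviceb-v1` and `serviceb-v2`, and the 90/10 weights are assumptions made up for this example.

```python
import random


def pick_target(weighted_targets, rng=random):
    """Return one target name, chosen in proportion to its weight."""
    names = [name for name, _ in weighted_targets]
    weights = [weight for _, weight in weighted_targets]
    if sum(weights) <= 0:
        raise ValueError("at least one target needs a positive weight")
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical route for serviceb.apps.local: 90% to v1, 10% to v2.
route = [("serviceb-v1", 90), ("serviceb-v2", 10)]
```

Calling `pick_target(route)` once per request sends roughly one request in ten to v2. Shifting more traffic to the new version is then only a matter of changing the weights, with no change to either version of the service.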
Like most of the service mesh platforms, AWS App Mesh also relies on the open source Envoy proxy data plane. The App Mesh control plane is built with AWS compute services in mind. Amazon has also customized the Envoy proxy to support this control plane.
When using AWS App Mesh with Amazon EKS, you get the benefits of automated sidecar injection along with the ability to define the App Mesh entities in YAML. Amazon has built CRDs for EKS to simplify the configuration of App Mesh with standard Kubernetes objects.
The telemetry generated by AWS App Mesh can be integrated with Amazon CloudWatch. The metrics may be exported to third-party services such as Splunk, Prometheus, and Grafana, as well as tracing solutions like Zipkin and LightStep.
For customers using AWS compute services, AWS App Mesh comes at no additional charge.
Consul
Consul from HashiCorp was launched as a service discovery platform with a built-in key/value store. It acts as an efficient, lightweight load balancer for services running within the same host or in a distributed environment. Consul exposes a DNS query interface for discovering the registered services. It also performs health checks for all the registered services.
Consul was created long before containers and Kubernetes became mainstream. But the rise of microservices and service mesh prompted HashiCorp to augment Consul into a full-blown service mesh platform. Consul leverages its service mesh feature called Connect to provide service-to-service connection authorization and encryption using mutual Transport Layer Security (TLS).
Since the sidecar pattern is the most preferred approach to service mesh, Consul Connect comes with its own proxy to handle inbound and outbound service connections. Based on a plugin architecture, Envoy can be used as an alternative proxy for Consul.
Connect adds two essential capabilities to Consul: security and observability.
By default, Consul adds a TLS certificate to the service endpoints to implement mutual TLS (mTLS). This ensures that the service-to-service communication is always encrypted. Security policies are implemented through intentions that define access control for services and are used to control which services may establish connections. Intentions can either deny or allow traffic originating from a specific service. For example, a database service can deny the inbound traffic coming directly from the web service but allow the request made via the business logic service.
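The allow/deny logic behind intentions can be illustrated with a small Python sketch. It models the database example above and nothing more: the `is_allowed` function, the dictionary layout, the `*` wildcard handling, and the `default_allow` flag are simplifying assumptions for this illustration and do not reproduce Consul’s API or its full precedence rules.

```python
def is_allowed(intentions, source, destination, default_allow=False):
    """Decide whether `source` may open a connection to `destination`.

    An exact (source, destination) rule wins over a wildcard-source rule;
    if neither exists, the default applies.
    """
    for key in ((source, destination), ("*", destination)):
        if key in intentions:
            return intentions[key] == "allow"
    return default_allow


# Hypothetical rules: only the business logic tier may reach the database.
intentions = {
    ("business-logic", "database"): "allow",
    ("*", "database"): "deny",
}
```

With these rules, `is_allowed(intentions, "web", "database")` is false, while the same request coming from `business-logic` is permitted.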
When Envoy is used as the proxy with Consul Connect, the mesh can take advantage of Envoy’s L7 observability features. Envoy integrated with Consul Connect can be configured to send the telemetry to a variety of sinks including statsd, dogstatsd, and Prometheus.
Depending on the context, Consul can act as a client (agent) or a server. It supports sidecar injection when integrated with orchestrators such as Nomad and Kubernetes. There is a Helm chart to deploy Consul Connect in Kubernetes. The Consul Connect configuration and metadata are added as annotations to the pod spec submitted to Kubernetes. Consul Connect can integrate with Ambassador, an ingress controller from Datawire that handles the north-south traffic.
Consul lacks advanced traffic routing and splitting capabilities for implementing blue/green deployments or canary releases. Compared to other service mesh choices, its security traffic policies are not very flexible. With the integration of Envoy, some of the advanced routing policies may be configured, but Consul Connect doesn’t offer an interface for that.
Overall, Consul and Consul Connect are robust service discovery and mesh platforms that are simple to manage.
Istio
Istio is one of the most popular open source service mesh platforms backed by Google, IBM, and Red Hat.
Istio is also one of the first service mesh technologies to use Envoy as the proxy. It follows the standard approach of a centralized control plane and distributed data plane associated with microservices.
Though Istio can be used with virtual machines, it’s predominantly integrated with Kubernetes. Pods deployed in a specific namespace can be configured for automatic sidecar injection, where Istio attaches the data plane component to the pod.
Istio delivers three chief capabilities to microservices developers and operators:
- Traffic management: Istio simplifies the configuration of service-level attributes such as circuit breakers, timeouts, and retries, and makes it easy to implement configurations like A/B testing, canary rollouts, and staged rollouts with percentage-based traffic splits. It also provides out-of-the-box failure recovery features that help make your application more robust against failures of dependent services or the network. Istio comes with its own Ingress that handles the north-south traffic. For an end-to-end guide on implementing blue/green deployments with Istio, refer to my past tutorial.
- Security: Istio provides out-of-the-box security capabilities for inter-service communication. It provides the underlying secure communication channel and manages authentication, authorization, and encryption of service communication at scale. With Istio, service communications are secured by default, letting developers and operators enforce policies consistently across diverse protocols and runtimes with no code or configuration changes.
- Observability: Since Istio’s data plane intercepts the inbound and outbound traffic, it has visibility into the current state of deployment. Istio delivers robust tracing, monitoring, and logging features that provide deep insights into the service mesh deployment. Istio comes with integrated, pre-configured Prometheus and Grafana dashboards for observability. Refer to my tutorial on configuring and accessing Istio’s observability dashboards.
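The staged rollouts mentioned under traffic management follow a simple loop: shift a little more traffic to the new version, check its health, and roll back if something looks wrong. The Python sketch below captures that loop under stated assumptions. The function `staged_rollout`, the `stable` and `canary` labels, and the `is_healthy` callback are hypothetical names for this illustration; in Istio the weights would live in its routing configuration rather than in application code.

```python
def staged_rollout(steps, is_healthy):
    """Shift traffic to the canary step by step.

    `steps` is a list of canary percentages, for example [10, 50, 100].
    `is_healthy` receives the current weights and returns True or False.
    On the first unhealthy step, all traffic returns to the stable version.
    Returns the sequence of weight settings that were applied.
    """
    history = []
    for canary_percent in steps:
        weights = {"stable": 100 - canary_percent, "canary": canary_percent}
        history.append(weights)
        if not is_healthy(weights):
            history.append({"stable": 100, "canary": 0})  # roll back
            break
    return history
```

In practice the health check would consult the metrics gathered by the mesh, which is where the observability and traffic management capabilities meet.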
Google and IBM offer managed Istio as a part of their hosted Kubernetes platforms. Google built Knative as a serverless compute environment based on Istio. For Google services such as Anthos and Cloud Run, Istio has become the core foundation. When compared to other offerings, Istio is considered to be a complex and heavy service mesh platform. But the extensibility and rich capabilities make it the preferred platform for enterprises.
Kuma
Launched in September 2019, Kuma is one of the recent entrants into the service mesh ecosystem. It is developed and maintained by Kong, Inc., an API gateway company that built the open source and commercial product of the same name, Kong.
Kuma is a logical extension to Kong’s API gateway. The gateway handles the north-south traffic while Kuma manages the east-west traffic.
Like most of the service mesh platforms, Kuma comes with separate data plane and control plane components. The control plane is the core enabler for the service mesh: it holds the source of truth for all the service configurations and is designed to scale to tens of thousands of services across an organization. Kuma couples a fast data plane with an advanced control plane that allows users to easily set permissions, expose metrics, and set routing policies through Custom Resource Definitions (CRDs) in Kubernetes or a REST API.
Kuma’s data plane is tightly integrated with the Envoy proxy, which lets the data plane run on virtual machines or in containers deployed in Kubernetes.
Kuma has two modes of deployment: 1) Universal and 2) Kubernetes. When running in Kubernetes, Kuma leverages the API server and etcd database to store the configuration. In universal mode, it needs an external PostgreSQL database as the datastore.
Kuma-cp, the control plane component, manages one or more data plane components, kuma-dp. Each microservice registered with the mesh runs an exclusive copy of kuma-dp. In Kubernetes, kuma-cp runs as a deployment within the kuma-system namespace. A namespace that’s annotated for Kuma can inject the data plane into each pod.
Kuma comes with a GUI that provides an overview of the deployment including the state of each data plane registered with the control plane. The same interface can be used to view the health checks, traffic policies, routes, and traces from the proxies attached to the microservices.
Kuma service mesh has a built-in CA that issues the certificates used to encrypt traffic with mTLS. Traffic permissions can be configured based on labels associated with the microservices. Tracing can be integrated with Zipkin while metrics can be redirected to Prometheus.
Some of the advanced resilience features such as circuit breaking, retries, fault injection, and delay injection are missing in Kuma.
Kuma is a well-designed, clean implementation of a service mesh. Its integration with Kong Gateway may drive its adoption among existing users and customers.
Linkerd
Linkerd 2.x is an open source service mesh exclusively built for Kubernetes by Buoyant. It’s licensed under the Apache License 2.0 and is a Cloud Native Computing Foundation incubating project.
Linkerd is an ultra-lightweight, easy-to-install service mesh platform. It has three components: 1) CLI and UI, 2) control plane, and 3) data plane.
Once the CLI is installed on a machine that can talk to a Kubernetes cluster, the control plane can be installed with a single command. All the components of the control plane are installed as Kubernetes deployments within the linkerd namespace. The web and CLI tools use the API server of the controller. The destination component gives the proxies running in the data plane their routing information. The injector is a Kubernetes admission controller, which receives a webhook request every time a pod is created. This service injects the proxy as a sidecar into every pod launched in an annotated namespace. The identity component is responsible for managing the certificates that are essential to implementing the mTLS connection between proxies. The tap component receives requests from the CLI and the web UI to watch requests and responses in real time.
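What the injector does on each webhook request can be approximated in Python as a pure function over the pod definition. This is a simplified sketch, not Linkerd’s code: the function `inject_sidecar` and the placeholder image names are assumptions for illustration, and a real admission controller returns a patch to the Kubernetes API server with many more fields on the injected containers.

```python
import copy

# Annotation that opts a namespace or pod into proxy injection.
INJECT_ANNOTATION = "linkerd.io/inject"


def inject_sidecar(pod, namespace_annotations):
    """Return a copy of `pod` with a proxy sidecar added, if injection is enabled.

    `pod` is a dict shaped like a Kubernetes pod manifest. The original
    object is returned untouched when no injection is needed.
    """
    pod_annotations = pod.get("metadata", {}).get("annotations", {})
    # In this sketch a pod-level annotation overrides the namespace-level one.
    setting = pod_annotations.get(
        INJECT_ANNOTATION, namespace_annotations.get(INJECT_ANNOTATION))
    if setting != "enabled":
        return pod

    containers = pod.get("spec", {}).get("containers", [])
    if any(c.get("name") == "linkerd-proxy" for c in containers):
        return pod  # already injected, nothing to do

    patched = copy.deepcopy(pod)
    spec = patched.setdefault("spec", {})
    # The init container sets up traffic redirection before the app starts.
    spec.setdefault("initContainers", []).append(
        {"name": "linkerd-init", "image": "example/proxy-init:placeholder"})
    spec.setdefault("containers", []).append(
        {"name": "linkerd-proxy", "image": "example/proxy:placeholder"})
    return patched
```

Because the mutation happens at admission time, the application’s own manifest never has to mention the proxy.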
Linkerd comes with pre-configured Prometheus and Grafana components providing out-of-the-box dashboards.
The data plane has a lightweight proxy that attaches itself to the service as a sidecar. A Kubernetes Init Container configures iptables to define the flow of traffic and connects the proxy to the control plane.
Linkerd complies with all the attributes of a service mesh: traffic routing and splitting, security, and observability.
For a detailed overview of Linkerd, refer to my previous analysis.
It’s interesting to note that Linkerd doesn’t use Envoy as the proxy. Instead, it relies on a purpose-built, lightweight proxy written in the Rust programming language. Linkerd doesn’t have an ingress built into the stack, but it can work in conjunction with an ingress controller.
After Istio, Linkerd is one of the most popular service mesh platforms. It has the attention and mindshare of developers and operators looking for a lightweight, easy-to-use service mesh.
Maesh
Maesh comes from Containous, the company that built the popular ingress, Traefik. Similar to Kong, Inc., Containous built Maesh to complement Traefik. While Maesh handles the east-west traffic flowing within the microservices, Traefik drives the north-south traffic. Like Kuma, Maesh can also work with other ingress controllers.
Maesh takes a different approach compared to other service mesh platforms. It doesn’t use a sidecar pattern to manipulate the traffic. Instead, it deploys a pod per Kubernetes node to provide a well-defined service endpoint. Microservices can continue to work as is even when Maesh is deployed. But when they use the alternative endpoint exposed by Maesh, they can take advantage of the service mesh capabilities.
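The opt-in model can be pictured as a choice between two hostnames for the same service. The Python sketch below is an illustration only: the helper `service_endpoint` is hypothetical, and the `.maesh` suffix reflects my understanding of the naming convention Maesh used for its alternative endpoints, so treat it as an assumption and check the project documentation before relying on it.

```python
def service_endpoint(service, namespace, use_mesh):
    """Build the hostname a client would call for a Kubernetes service.

    With use_mesh=False the client talks to the regular cluster DNS name.
    With use_mesh=True it uses the alternative mesh endpoint (assumed here
    to follow the `<service>.<namespace>.maesh` pattern).
    """
    suffix = "maesh" if use_mesh else "svc.cluster.local"
    return f"{service}.{namespace}.{suffix}"
```

A client that keeps calling `service_endpoint("orders", "shop", use_mesh=False)` is unaffected by the mesh, while switching the flag routes the same call through the per-node Maesh pod.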
The objective of Maesh is to provide a non-intrusive and non-invasive infrastructure that provides an opt-in capability to developers. But it also means that the platform lacks some of the key capabilities such as transparent TLS encryption.
Maesh supports the baseline features of a service mesh, including routing and observability, with the exception of security. It supports the latest specs defined by the Service Mesh Interface (SMI) project.
Out of all the service mesh technologies that I deployed in Kubernetes, I found Maesh to be the simplest and fastest platform.
Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.