Service Mesh and the Promise of Istio
At this year’s DockerCon 2018, the concept of the service mesh was a hot topic. The service mesh, as embodied by software such as Istio, came up in sessions and hallway conversations, and even made an appearance on the keynote stage.
The thesis behind service mesh is that there is a set of common network-related tasks, such as routing, retries, load balancing, and even authentication, that can be abstracted away from both the applications and the underlying networks. The mesh, therefore, is nothing but a network of software entities that perform such tasks for different services. Without such an abstraction, you either embed these tasks in the networking infrastructure (e.g., at L3 or L4) or code them into the application layer (as in an L7 network overlay).
In a microservices environment, neither of these options is ideal. The application overlay approach is application aware and can perform sophisticated content-based routing, but it leads to a large amount of redundant code in each service and potentially lower performance. Conversely, traditional L3 or L4 networking has neither a concept of service requests nor visibility into them, both of which are critical to making optimal routing decisions.
This is why service mesh is so appealing for the microservice environment — it operates at the L7 level, but is separate from the application code and can enforce L3/L4 policies with app-level insight. To understand this point, we must first dig into the architecture of service mesh.
The Service Mesh Architecture and the Power of Abstraction
Service mesh is an intentionally designed abstraction that consists of a control plane and a data plane.
- Data plane: The data plane deals with the actual traffic from one application to another. Any networking aspect of the actual service requests, such as routing, forwarding, load balancing, and even authentication and authorization, is part of the service mesh data plane. The data plane concept is not new. In a traditional network, routers and switches are the components that form the data plane. In a service mesh, however, the data plane is a collection of sidecar proxies, one proxy for each service.
The service itself knows nothing about the network beyond how it interacts with the proxy. The proxy, in this case, acts as an abstraction for the network. This means that developers of the service can concentrate on what the service is about without worrying about the nuances of the network. An open source example of such a sidecar proxy is Envoy.
- Control plane: Put simply, the control plane is the entity that connects the various data-plane proxies into a distributed network. This is the policy and management layer of the service mesh, largely responsible for collecting telemetry data, making smart decisions about configuration and who can talk to whom, and enforcing those policies. An open source example of such a service mesh control plane is Istio.
This architecture achieves the goals that service mesh sets out to deliver, namely high-performance, intelligent, and secure traffic management. The data plane is a set of high-performance proxies that intercept network traffic and interact with the network layer to route it. The control plane has Layer 7 insight and can instruct the data plane to make complex routing decisions based on policies, security postures, and real-time telemetry information.
The abstractions provided by service mesh offer powerful separations that help developers, the ops team, and security. The data plane abstracts the underlying network from the application. The control plane further abstracts away the decision engine logic, which means the data plane can focus on being a high-performing traffic interceptor and router. Together, they let a service mesh make intelligent, dynamic routing decisions automatically without requiring any changes to the application code.
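To make the split between the two planes concrete, here is a minimal Python sketch. Every name in it (`Route`, `SidecarProxy`, `ControlPlane`, `set_route`) is hypothetical and chosen for illustration; it is not how Envoy or Istio is implemented. It only models the idea that a service talks to its own proxy, while a control plane pushes routing and retry configuration to every proxy.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Dict, Iterator, List

# A backend instance is modeled as a plain function: request -> response.
Handler = Callable[[str], str]


@dataclass
class Route:
    """Configuration for one destination service (hypothetical shape)."""
    backends: List[Handler]
    retries: int = 0


class SidecarProxy:
    """Toy data-plane proxy: handles load balancing and retries for its owner."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self._routes: Dict[str, Route] = {}
        self._pickers: Dict[str, Iterator[Handler]] = {}

    def apply_config(self, routes: Dict[str, Route]) -> None:
        # Called by the control plane; the application code is never touched.
        self._routes = {name: r for name, r in routes.items() if r.backends}
        self._pickers = {name: cycle(r.backends) for name, r in self._routes.items()}

    def call(self, destination: str, request: str) -> str:
        route = self._routes.get(destination)
        if route is None:
            raise LookupError(f"no route to {destination}")
        last_error: Exception = ConnectionError("no attempt was made")
        # Round-robin over backends, retrying on connection failures.
        for _ in range(max(0, route.retries) + 1):
            backend = next(self._pickers[destination])
            try:
                return backend(request)
            except ConnectionError as exc:
                last_error = exc
        raise last_error


class ControlPlane:
    """Toy control plane: holds the desired config and pushes it to all proxies."""

    def __init__(self) -> None:
        self._proxies: List[SidecarProxy] = []
        self._routes: Dict[str, Route] = {}

    def register(self, proxy: SidecarProxy) -> None:
        self._proxies.append(proxy)
        proxy.apply_config(self._routes)

    def set_route(self, destination: str, backends: List[Handler], retries: int = 0) -> None:
        self._routes[destination] = Route(list(backends), retries)
        for proxy in self._proxies:
            proxy.apply_config(self._routes)


def healthy(request: str) -> str:
    return f"ok:{request}"


def flaky(request: str) -> str:
    raise ConnectionError("backend unavailable")


control = ControlPlane()
frontend_proxy = SidecarProxy(owner="frontend")
control.register(frontend_proxy)
control.set_route("reviews", [flaky, healthy], retries=1)

# The first attempt hits the failing backend; the retry reaches the healthy one.
result = frontend_proxy.call("reviews", "GET /reviews/1")
print(result)
```

The "frontend" code only ever calls `frontend_proxy.call(...)`. Changing the backends or the retry budget is a `set_route` call on the control plane, which is the property the architecture is after.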
Another fundamental concept in service mesh is service identity. That is, each service is assigned a cryptographically strong identity. Managing services in the context of strong identities enables fine-grained, identity-based policies that previously were not possible to enforce.
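As a rough illustration of identity-based policy, the sketch below names each service with a SPIFFE-style URI and checks a default-deny allow list of (source, destination) pairs. The class names, the `example.local` trust domain, and the service names are assumptions made up for this example. The cryptographic part, proving that a caller really holds an identity (for instance through certificates exchanged in mutual TLS), is assumed to have happened already and is not modeled here.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class ServiceIdentity:
    """Hypothetical service identity rendered as a SPIFFE-style URI."""
    trust_domain: str
    namespace: str
    service_account: str

    @property
    def uri(self) -> str:
        return f"spiffe://{self.trust_domain}/ns/{self.namespace}/sa/{self.service_account}"


class IdentityPolicy:
    """Default-deny policy: only explicitly allowed identity pairs may talk."""

    def __init__(self) -> None:
        self._allowed: Set[Tuple[str, str]] = set()

    def allow(self, source: ServiceIdentity, destination: ServiceIdentity) -> None:
        self._allowed.add((source.uri, destination.uri))

    def is_allowed(self, source: ServiceIdentity, destination: ServiceIdentity) -> bool:
        return (source.uri, destination.uri) in self._allowed


frontend = ServiceIdentity("example.local", "shop", "frontend")
reviews = ServiceIdentity("example.local", "shop", "reviews")
ratings = ServiceIdentity("example.local", "shop", "ratings")

policy = IdentityPolicy()
policy.allow(frontend, reviews)
policy.allow(reviews, ratings)

print(frontend.uri)
print(policy.is_allowed(frontend, reviews))   # allowed explicitly
print(policy.is_allowed(frontend, ratings))   # never allowed, so denied
```

Because the policy is keyed on identities rather than IP addresses, it keeps working when a service is rescheduled onto a different host.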
Key Capabilities and Top Use Cases for Service Mesh
Today, the service mesh space is getting increased attention. Some of the key capabilities of service mesh include:
- Inventory and visibility: Providing insight into which services are running, who is talking to whom, and how services depend on one another.
- Performance management: Here performance means response time, resource utilization, and ultimately correlation of app performance with business metrics. Through service mesh, an organization can set certain performance metrics to ensure that resources are distributed and used in an optimal fashion among services and that specific operational metrics are met.
- Security policy management: Service mesh provides the ability to define and manage policies based on identities, e.g., who can talk to whom. Additionally, you can also apply organizational policies to govern the interaction between services.
- Traffic management: With a mesh network, it’s fairly easy to regulate traffic between services using route rules. For example, Istio exposes a set of APIs that allow you to set fine-grained traffic rules. This also includes automatic routing policies that can make service requests more reliable when the network faces adverse conditions.
The best part of using service mesh is that you can dynamically change the policies without changing any of the application code.
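A route rule that splits traffic between versions of a service is one common example of a policy that can change at runtime. The following is a minimal sketch of the idea, not Istio's API: `WeightedRouter`, the version names `v1` and `v2`, and the weights are all hypothetical. The point is that shifting traffic is a configuration update on the router, while the code that calls `pick()` stays the same.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


class WeightedRouter:
    """Toy route rule: picks a service version in proportion to its weight."""

    def __init__(self, weights: Dict[str, int], seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._versions: List[str] = []
        self._weights: List[int] = []
        self.update(weights)

    def update(self, weights: Dict[str, int]) -> None:
        # Replaces the rule at runtime; callers of pick() are unaffected.
        if not weights or any(w < 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and non-negative")
        if sum(weights.values()) == 0:
            raise ValueError("at least one weight must be positive")
        self._versions = list(weights)
        self._weights = [weights[v] for v in self._versions]

    def pick(self) -> str:
        return self._rng.choices(self._versions, weights=self._weights, k=1)[0]


router = WeightedRouter({"v1": 90, "v2": 10}, seed=42)
canary = Counter(router.pick() for _ in range(10_000))

# Promote v2 to receive all traffic without touching the calling code.
router.update({"v1": 0, "v2": 100})
promoted = Counter(router.pick() for _ in range(10_000))

print(canary)
print(promoted)
```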
For an organization, some of the top use cases for service mesh include:
- Service discovery: Often organizations do not know which services are running in their infrastructure — this issue is exacerbated in a microservices environment. Service mesh provides service-level visibility and telemetry, which helps an organization with service inventory and dependency analysis.
- Operation reliability: The telemetry data that service mesh provides lets you see how well a service is performing: how long it takes to respond to service requests, how many resources it is using, and how often it is used. This lets you spot issues as they arise and correct them before they impact the larger application environment.
- Fine-grained traffic governance: An organization may desire blocking of specific content or specific URLs or sub-URLs. With service mesh, you can configure the mesh network to perform these fine-grained traffic management policies without going back and changing the application. This includes services within a specific mesh, as well as ingress and egress traffic to and from the mesh.
- Secure service-to-service communications: With the concept of universal service identity, it is feasible to enforce mutual TLS for service-to-service communications. In addition, you can enforce service-level authentication using either TLS or JSON Web Tokens (JWT).
- Trust-based access control: Instead of governing access based on static attributes such as user identities, IP addresses, or access control lists, service mesh extracts real-time host and network telemetry data, on which you can base dynamic trust-based access control decisions. For example, you can specify a policy that a service request can only be granted based on the location where the request came from, or that a Certificate Signing Request (CSR) can only succeed if the requester passes the health check.
- Chaos engineering and testing: Service mesh includes specific functions to perform fault injection and test the resiliency of your services. For instance, Istio lets you inject specific delays in service responses and test how the application as a whole behaves. Injecting delays is a chaos engineering technique and has been shown to help increase resiliency of the overall systems.
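The fault injection use case can be sketched in a few lines. This is a simplified model under stated assumptions rather than Istio's implementation: `FaultPolicy`, `with_faults`, and `call_with_timeout` are hypothetical names, and latency is simulated as a number carried in the response instead of an actual sleep so that the example runs instantly. It shows how a delay injected in front of one service surfaces as a timeout in its caller.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

# (status code, body, simulated latency in seconds)
Response = Tuple[int, str, float]
Handler = Callable[[str], Response]


@dataclass(frozen=True)
class FaultPolicy:
    """Hypothetical fault rule: delay and/or abort a percentage of requests."""
    delay_seconds: float = 0.0
    delay_percent: float = 0.0   # 0 to 100
    abort_status: int = 503
    abort_percent: float = 0.0   # 0 to 100


def with_faults(handler: Handler, policy: FaultPolicy, rng: random.Random) -> Handler:
    """Wrap a handler so faults are injected outside the service's own code."""
    def wrapped(request: str) -> Response:
        if rng.random() * 100 < policy.abort_percent:
            return policy.abort_status, "fault injected", 0.0
        status, body, latency = handler(request)
        if rng.random() * 100 < policy.delay_percent:
            latency += policy.delay_seconds
        return status, body, latency
    return wrapped


def call_with_timeout(handler: Handler, request: str, timeout_seconds: float) -> Tuple[int, str]:
    """Caller-side view: a response slower than the timeout becomes a 504."""
    status, body, latency = handler(request)
    if latency > timeout_seconds:
        return 504, "upstream timeout"
    return status, body


def ratings(request: str) -> Response:
    return 200, "5 stars", 0.05


rng = random.Random(7)
delayed = with_faults(ratings, FaultPolicy(delay_seconds=7.0, delay_percent=100.0), rng)
aborted = with_faults(ratings, FaultPolicy(abort_percent=100.0), rng)

normal_result = call_with_timeout(ratings, "GET /ratings/1", timeout_seconds=3.0)
delayed_result = call_with_timeout(delayed, "GET /ratings/1", timeout_seconds=3.0)
aborted_result = call_with_timeout(aborted, "GET /ratings/1", timeout_seconds=3.0)

print(normal_result, delayed_result, aborted_result)
```

Running the caller against the delayed and aborted variants is a quick way to check whether it degrades gracefully or passes the failure on to its own callers.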
One Mesh to Rule Them All — The Promise of Service Mesh for Multicloud
This year at DockerCon, the message of supporting multicloud was heard loud and clear. Not only did Docker come out with slick multicloud support, but there are also many third-party capabilities designed to support multicloud workloads and easy movement between different cloud infrastructures.
Service mesh is ideally suited for multicloud, as it offers a single abstraction layer that obscures the specifics of the underlying cloud. Organizations can set policies with the service mesh, and have them enforced across different cloud instantiations.
Istio, as a service mesh, provides strong multiplatform features. Istio’s Mixer adapters abstract away the infrastructure backend information behind a single, consistent API that performs logging, monitoring, ACL checking, and other functions. Note these features can be within a single cluster, across different clusters, or even across different cloud platforms.
For example, suppose you have five clusters globally and you run your application in three of them. To minimize cost and maximize performance, you may want real-time visibility to determine which three clusters should run your application and how to distribute the load evenly. These are the tasks that Istio can perform elegantly, automatically, and without human intervention.
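The five-cluster scenario can be reduced to a small decision procedure. The sketch below is illustrative only and does not describe how Istio makes this decision: the cluster names, telemetry values, and the scoring rule (a weighted sum of p99 latency and hourly cost) are invented for the example. It picks the three best healthy clusters and spreads replicas across them as evenly as possible.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClusterTelemetry:
    """Hypothetical per-cluster telemetry snapshot."""
    name: str
    p99_latency_ms: float
    cost_per_hour: float
    healthy: bool = True


def choose_clusters(
    telemetry: List[ClusterTelemetry],
    count: int,
    latency_weight: float = 1.0,
    cost_weight: float = 1.0,
) -> List[ClusterTelemetry]:
    """Return the `count` healthy clusters with the lowest combined score."""
    candidates = [c for c in telemetry if c.healthy]
    if len(candidates) < count:
        raise ValueError("not enough healthy clusters")
    ranked = sorted(
        candidates,
        key=lambda c: (latency_weight * c.p99_latency_ms + cost_weight * c.cost_per_hour, c.name),
    )
    return ranked[:count]


def spread_evenly(total_replicas: int, clusters: List[ClusterTelemetry]) -> Dict[str, int]:
    """Distribute replicas so that cluster counts differ by at most one."""
    base, extra = divmod(total_replicas, len(clusters))
    return {c.name: base + (1 if i < extra else 0) for i, c in enumerate(clusters)}


snapshot = [
    ClusterTelemetry("us-east", p99_latency_ms=40, cost_per_hour=10),
    ClusterTelemetry("us-west", p99_latency_ms=55, cost_per_hour=8),
    ClusterTelemetry("eu-west", p99_latency_ms=70, cost_per_hour=12),
    ClusterTelemetry("ap-south", p99_latency_ms=120, cost_per_hour=6),
    ClusterTelemetry("sa-east", p99_latency_ms=30, cost_per_hour=9, healthy=False),
]

chosen = choose_clusters(snapshot, count=3)
placement = spread_evenly(10, chosen)
print([c.name for c in chosen])
print(placement)
```

In a real mesh the snapshot would be refreshed continuously from telemetry, so the same procedure could be rerun whenever conditions change.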
The Current State of Service Mesh
Today, the open source landscape for service mesh includes Linkerd, Istio, and Conduit. For more information on the details of the different service mesh networks, check out this great article on The New Stack that compares the different projects.
Both Istio and Conduit have good support for Kubernetes. Google, Lyft, and IBM are the initial entities behind Istio. Istio’s strong integration with Kubernetes, nice traffic management features, and its promise of true cloud-agnostic management are helping it gain strong momentum in the cloud native community.
However, today it is not trivial to install and manage Istio. One of the big challenges is translating an organization’s complex management needs into Istio policies, which is not a straightforward process. This is why startups like Tetrate, founded by Istio project principals, are gaining attention, not only because of their expertise in helping organizations set up and run Istio, but also because of their promise to deliver technology that enriches mesh functionality, automates mesh operations, and achieves central management across clusters and clouds.
Zack Butcher, an engineer with Tetrate, delivered two Istio-related sessions at this year’s DockerCon, including an Istio training session. Butcher noted that the industry is hungry for a tool like Istio that offers “simple yet powerful abstractions to solve complex problems.”
We expect more innovations to enter the service mesh space. Over the next 18 months, we should see many improvements, including…
- Further integration between data planes and control planes: Tighter integration between the data plane and the control plane will lead to better performance and potentially richer controls.
- More API development for telemetry and control: We should see new APIs emerge to offer deeper telemetry, easier queries and more sophisticated control of services. Controls like fine-grained data access — which service can access which piece of data — are capabilities that will emerge with new offerings in the service mesh space.
- Coverage of specific cloud environments: There is a growing demand for Istio to deepen support for specific cloud production environments, such as Azure Container Service and OpenShift.
Microservices deliver powerful business benefits, but represent the wild west of application deployment and operations. Operations and Security teams often lose visibility and control when the organization adopts microservices. With service mesh, the hope is that the organization can continue to harvest the automation and efficiency benefits of microservices, but at the same time attain much-needed transparency and management from an operations standpoint.