Microservicing with Envoy, Istio and Kubernetes
When organizations talk about microservices, they are talking about a vehicle for building business-agile IT systems: systems that enable a business to change more quickly, build new functionality, experiment and stay ahead of disruptors and competition.
The industry tends to romanticize microservices, and often for good reason, but the truth is there are a lot of hard parts to microservices. From a technology perspective, building microservices means building distributed systems. And distributed systems are hard.
Kubernetes is a great deployment backbone. Thanks to platforms like Kubernetes and OpenShift, container deployment has become a boring part of our infrastructure in recent years. The exciting parts, unfortunately, happen when services actually try to communicate and work together to accomplish some business function. As we start building service architectures or applications that communicate over the network, we have to tackle some nasty distributed-systems problems. As the saying goes, we cannot ignore the fallacies of distributed computing.
So, what happens when we send a message to a service? That request gets broken down into smaller chunks and routed over a network through a series of hops, control points, and firewalls.
We deal with the fallacies of distributed computing because of this “network.” Applications communicate over asynchronous networks, which means there is no single, unified understanding of time. Each service lives with its own understanding of what “time” means, and that understanding likely differs from that of other services. More to the point, these asynchronous networks route packets based on the availability of paths, congestion, hardware failures and so on. There is no guarantee that a message will reach its intended recipient in bounded time. The network does what it wants.
This uncertainty makes it difficult, and often impossible, for a caller to tell whether a service has failed or is merely slow, let alone to determine the cause.
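A small Python sketch can make this ambiguity concrete. It is an illustration only: `slow_service`, `dead_service` and `call` are hypothetical names, and the delays are arbitrary. One simulated service is healthy but slow, the other never answers, and a caller with a deadline sees exactly the same thing from both.

```python
import asyncio


async def slow_service():
    """Hypothetical healthy service that simply takes 200 ms to answer."""
    await asyncio.sleep(0.2)
    return "ok"


async def dead_service():
    """Hypothetical crashed service: the reply never arrives."""
    await asyncio.Event().wait()  # waits forever; nobody ever sets the event


async def call(service, timeout):
    """Call a service with a deadline and report only what the caller can observe."""
    try:
        return await asyncio.wait_for(service(), timeout)
    except asyncio.TimeoutError:
        return "timeout"


async def main():
    slow = await call(slow_service, timeout=0.05)
    dead = await call(dead_service, timeout=0.05)
    return slow, dead
```

With a 50 ms deadline, both calls come back as a timeout; nothing in the result tells the caller which service was slow and which was gone.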
As we move to services architectures, we push the complexity into the space between our services. The application-networking concerns that need to be solved include:
- Service discovery
- Load balancing
- Rate limiting
- Thread bulkheading
- Circuit breaking
These are all horizontal concerns and apply to services regardless of implementation. It shouldn’t matter that the service was written in Java, Go, Python or Node.js; we expect them to all behave the same when solving for these resiliency problems. Moreover, how the application networking is implemented should be transparent to applications. When network and application problems do occur, it should be easy to determine the source of the problems. This sounds great in theory, but how do we go about doing this?
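To see what is at stake, consider just one item from the list above, circuit breaking, implemented the traditional way inside the application. The following is a minimal Python sketch rather than a production library; the class name, parameters and thresholds are all illustrative assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and let one trial call run.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Every service, in every language, would need an equivalent of this logic, configured and behaving the same way.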
What if we could implement this functionality once, in a single spot, and let any language/framework/implementation use it?
Enter the Service Mesh
A service mesh is a decentralized, application-networking infrastructure between your services that provides resiliency, security, observability, routing control and, most importantly, insight into how everything is running. A service mesh comprises a data plane, through which all traffic flows, and a control plane that manages the data plane. Proxies make up the data plane, and the traffic between applications flows through these proxies. The control plane is responsible for managing and configuring the proxies to route traffic, as well as for enforcing policies at runtime.
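The split between the two planes can be sketched as a toy, in-process Python model. This is not how Envoy or Istio are implemented (real proxies receive their configuration over the network); the `Proxy` and `ControlPlane` classes and the service names are hypothetical and exist only to show who owns which responsibility.

```python
class Proxy:
    """Data plane: sits next to one service and decides where its calls go."""

    def __init__(self, name):
        self.name = name
        self.routes = {}  # service name -> list of endpoint addresses

    def apply_config(self, routes):
        # The proxy never edits routes itself; it only accepts pushed config.
        self.routes = {service: list(eps) for service, eps in routes.items()}

    def route(self, service):
        endpoints = self.routes.get(service)
        if not endpoints:
            raise LookupError(f"{self.name}: no route to {service}")
        return endpoints[0]


class ControlPlane:
    """Control plane: holds the desired config and pushes it to every proxy."""

    def __init__(self):
        self.proxies = []
        self.routes = {}

    def register(self, proxy):
        self.proxies.append(proxy)
        proxy.apply_config(self.routes)

    def set_route(self, service, endpoints):
        self.routes[service] = list(endpoints)
        for proxy in self.proxies:
            proxy.apply_config(self.routes)
```

Request traffic only ever touches `Proxy.route`; the control plane is involved when configuration changes, not on each request.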
The service mesh is a paradigm that has emerged to help make service communication boring. It lets us push application-network functions down into the infrastructure in a decentralized way and with minimal overhead, while retaining the ability to control, configure and monitor application-level requests, which tackles some of the issues above.
As services architecture becomes more heterogeneous, it becomes more difficult (or impractical) to restrict service implementations to specific libraries, frameworks or even languages. With the evolution of the service mesh, we’re seeing some of these resilience patterns, like circuit breaking, implemented as language/framework-independent solutions in the infrastructure.
With the service mesh, we’re explicitly separating application-network functions from application code and business logic, and we’re pushing them down a layer into the infrastructure, much as we have already done with the lower layers of the networking stack, such as TCP.
Meet Envoy Proxy
With a proxy, you can abstract this functionality into a single binary and apply it to all services, regardless of what language they are written in, so that all of a service’s traffic runs through one consistent point. Again, these proxies make up the data plane in a service mesh. This in turn:
- Enables heterogeneous architectures.
- Removes application-specific implementations of this functionality.
- Consistently enforces these properties.
- Correctly enforces these properties.
- Provides opt-ins as well as safety nets.
Envoy is a great example of such a proxy. Originally built at Lyft, Envoy is a high-performance proxy that provides the foundation for a service mesh. It runs alongside the application and abstracts the network by providing common features in a platform-agnostic manner. When all service traffic in an infrastructure flows through an Envoy mesh, it becomes easy to visualize problem areas via consistent observability, tune overall performance and add features in a single place.
Service proxies like Envoy can help push the responsibility for resilience, service discovery, routing, metrics collection and so on down a layer below the application. Otherwise, we are left hoping that the various applications will correctly implement these critical functions, or depending on language-specific libraries to make this happen.
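As a rough illustration of what "a layer below the application" means, the Python sketch below wraps an arbitrary upstream call with retries and basic metrics. It is an assumption-laden toy, not Envoy's behavior: `SidecarProxy`, its `max_retries` parameter and the metric names are all hypothetical, and the upstream is any callable that raises `ConnectionError` on failure.

```python
from collections import Counter


class SidecarProxy:
    """Toy proxy: adds retries and metrics around an upstream the caller cannot see."""

    def __init__(self, upstream, max_retries=2):
        self.upstream = upstream        # callable taking a request, returning a response
        self.max_retries = max_retries  # extra attempts after the first one
        self.metrics = Counter()

    def handle(self, request):
        attempts = self.max_retries + 1
        for attempt in range(attempts):
            self.metrics["attempts"] += 1
            try:
                response = self.upstream(request)
            except ConnectionError:
                self.metrics["failures"] += 1
                if attempt == attempts - 1:
                    raise  # out of retries: surface the error to the caller
            else:
                self.metrics["successes"] += 1
                return response
```

The calling application simply invokes `handle`; it contains no retry loop and no metrics code of its own, which is the point of moving these functions into the proxy.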
The proxy architecture provides two key pieces missing from most services-architecture stacks: robust observability and easy debugging. Having these tools in place allows developers to focus on another important aspect that requires their attention: the business logic.
Meet Istio Service Mesh
Istio is a natural next step for building microservices: it moves language-specific, low-level infrastructure concerns out of applications and into a service mesh, enabling developers to focus on business logic. It serves as the control plane that configures a set of Envoy proxies. Although Istio was originally written to support Kubernetes, it is not tied to Kubernetes and can be run on any platform, including in a hybrid architecture across multiple platforms.
The project was initially sponsored by Google, Lyft and IBM, and uses an extended version of the Envoy proxy, which is deployed as a sidecar to the relevant service in the same Kubernetes pod. It has garnered attention in the open source community as a way of implementing the service mesh capabilities. These capabilities include pushing application-networking concerns down into the infrastructure: things like retries, load balancing, timeouts, deadlines, circuit breaking, mutual TLS, service discovery, distributed tracing and others.
One of the most important aspects of Istio is its ability to control the routing of traffic between services. With this fine-grained control of application-level traffic, we can improve resilience by routing around failures or routing to different availability zones when necessary. More importantly, we can control the flow of traffic to new deployments, which reduces the risk of changing the system.
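The idea behind controlling traffic for a deployment can be sketched as a weighted choice between versions of a service. Istio expresses this kind of split declaratively in its routing configuration; the Python below only illustrates the underlying logic, and the function name, version labels and weights are hypothetical.

```python
import random


def pick_version(weights, rng=random):
    """Pick a service version at random, in proportion to its configured weight.

    `weights` maps a version label to a non-negative weight, for example
    {"v1": 90, "v2": 10} to send roughly 10% of requests to the new version.
    """
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def share_of(version, weights, requests, rng):
    """Simulate `requests` routing decisions and return the share sent to `version`."""
    hits = sum(pick_version(weights, rng) == version for _ in range(requests))
    return hits / requests
```

Shifting the weights gradually, for instance from 90/10 to 50/50 to 0/100, rolls out a new version while limiting how many requests are exposed to it at each step.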
Istio enables several higher-order cluster semantics, including:
- Service observability
- Graduated deployment and release
- Policy enforcement
- Cluster reliability
- Chaos testing
- Fleet configuration
- Strong security options
There will be blurry lines. In a service mesh, we’re saying that our applications should be aware of application-network functions, but that those functions should not be implemented in application code. There is something to be said for making the application smarter about what exactly the service mesh layer is doing. It’s likely that we’ll see libraries and frameworks building in some of this context.
For example, if the Istio service mesh trips a circuit breaker, retries some requests, or fails a request for a specific reason, it would be nice for the application to get more context about what happened. We would need a way to capture this and communicate it back to the service. Another example would be propagating tracing context (distributed tracing such as OpenTracing) between services, and having this done transparently. What we may see are thin, language-specific libraries that make applications and services smarter and allow them to take error-specific recourse.
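Tracing is a good example of where such a thin library would sit. The proxy can start and report spans, but only the application knows which outgoing call belongs to which incoming request, so it has to copy the tracing headers across. The Python sketch below shows that step under stated assumptions: the function name is hypothetical, and the header list is the set of request-ID and B3 headers that Istio's documentation has asked applications to forward, which should be checked against the version you run.

```python
# Assumed header names; verify against the documentation for your mesh version.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def trace_headers_to_forward(incoming):
    """Return only the tracing headers from an incoming request.

    Header names are matched case-insensitively and returned in lowercase,
    ready to be merged into the headers of an outgoing request.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

A service would call this once per incoming request and add the result to every outgoing call it makes while handling that request; everything else about tracing can stay in the mesh.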
Either way, we’re just now starting to see Envoy and Istio deployed into production with Kubernetes and Red Hat OpenShift, and feedback so far has been positive. In some cases, Istio is even helping those just coming into the space leapfrog two or three generations of microservices work. So while we’re still in the early phases, there are many different ways to set up these technologies in the way that works best for your application.