Linkerd’s Little Secret: a Lightning Fast, Service Mesh Focused Rust Network Proxy

KubeCon + CloudNativeCon sponsored this post, in anticipation of KubeCon + CloudNativeCon EU, Aug. 17 – 20, virtually.

A service mesh like Linkerd can deliver critical features such as transparent mutual TLS, gRPC load balancing, blue-green deploys, and golden metrics. But like all abstractions, these features come at a cost. Some of this cost is human in nature: the more complex the service mesh, the more effort required to operate it successfully. Some of the cost is system cost: a service mesh consumes CPU and memory, and introduces latency to the application.
Linkerd’s goal is to minimize this cost by being the smallest, fastest service mesh for Kubernetes (a claim that has been verified by third parties). But just how does it achieve this feat? In this article, we reveal Linkerd’s secret sauce: a lightning-fast Rust proxy called, simply, Linkerd2-proxy. Unlike general-purpose proxies such as Envoy, NGINX, and HAProxy, the open source Linkerd2-proxy is designed to do only one thing and do it better than anyone else: be a service mesh sidecar proxy.
In fact, we believe that Linkerd2-proxy represents the state of the art for secure, modern network programming. It is fully asynchronous and written in a modern, type-safe, memory-safe language. It makes full use of the modern Rust networking ecosystem, sharing foundations with projects such as Amazon’s Firecracker. It has native support for modern network protocols such as gRPC, can load-balance requests based on real-time latency, and can perform protocol detection for zero-configuration use. It is fully open source, audited, and widely tested at scale.
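To give a flavor of what this kind of asynchronous Rust networking looks like, here is a minimal, hypothetical sketch of a TCP forwarding proxy built on Tokio, one of the libraries in the ecosystem Linkerd2-proxy draws on. The listener and upstream addresses are placeholders, and this illustrates the style only; it is not Linkerd2-proxy’s actual code.

```rust
use tokio::io;
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> io::Result<()> {
    // Listen for inbound connections on a placeholder port.
    let listener = TcpListener::bind("127.0.0.1:4140").await?;

    loop {
        let (mut inbound, _peer) = listener.accept().await?;

        // Serve each connection on a lightweight async task, not an OS thread.
        tokio::spawn(async move {
            // Forward to a placeholder upstream address.
            match TcpStream::connect("127.0.0.1:8080").await {
                Ok(mut outbound) => {
                    // Shuttle bytes in both directions until either side closes.
                    if let Err(e) = io::copy_bidirectional(&mut inbound, &mut outbound).await {
                        eprintln!("proxy error: {}", e);
                    }
                }
                Err(e) => eprintln!("failed to connect upstream: {}", e),
            }
        });
    }
}
```

Because each connection runs on a cheap asynchronous task rather than a dedicated operating-system thread, per-connection overhead stays tiny, which is part of what makes this style of proxy so light on memory and CPU.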
But things weren’t always this way. In fact, Linkerd2-proxy started out as something of a gamble. In 2018, the Linkerd team made the difficult call to rewrite Linkerd, moving away from the JVM-based “Twitter Stack” of Scala, Netty, and Finagle that powered Linkerd 1.x. It was clear that the control plane should be written in Go, the lingua franca of the Kubernetes ecosystem. But what about the proxy? Should Linkerd 2.0 be built on top of Envoy? NGINX? Something else?
As we evaluated the options, we ended up going in a different, riskier direction: we decided that if we really wanted to build the fastest, smallest service mesh, none of these options would do. What we really needed was a new proxy, specific to the service mesh use case. And it should be built in Rust.
We took this path for three reasons:
- Security. Service mesh data plane security is paramount. We knew that the proxy would be responsible for highly sensitive information, such as customer PII and data subject to HIPAA and PCI requirements. Rust’s memory-safety guarantees allowed us to avoid the whole class of common memory vulnerabilities, and the CVEs that come with them, that could otherwise turn the proxy itself into a major security liability (see the short example after this list).
- Minimal footprint, maximum performance. Second only to security, performance and resource cost were critical. We needed to do absolutely everything we could to reduce the latency, memory, and CPU usage of Linkerd’s data plane. We knew that writing a proxy specifically for the service mesh use case would allow us to keep things as lean and mean as possible.
- Simplicity. Reducing the complexity of the system isn’t just a nice-to-have; it’s the core determinant of the human cost of operating a service mesh. We needed Linkerd to “just work”, and that meant avoiding all the configuration, tuning, and operational complexity that are part and parcel of a general-purpose proxy.
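On the security point above, a tiny example (not Linkerd code) shows the kind of bug class Rust rules out at compile time. Once a buffer’s ownership is handed off, the compiler rejects any later use of it, so a use-after-free can never make it into a running proxy. The function name here is purely illustrative.

```rust
// A hypothetical handler that takes ownership of a request body.
fn hand_off(buf: Vec<u8>) -> usize {
    // `buf` is moved into this function; its memory is freed
    // automatically when it goes out of scope here.
    buf.len()
}

fn main() {
    let request_body = vec![0u8; 1024];
    let n = hand_off(request_body);
    println!("handed off {} bytes", n);

    // Uncommenting the next line is a compile-time error
    // ("borrow of moved value"), not a runtime use-after-free:
    // println!("{}", request_body.len());
}
```

In a memory-unsafe language, the equivalent mistake compiles cleanly and becomes a crash, or an exploitable vulnerability, in production; in Rust it never gets past the compiler.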
The choice not to use Envoy, in particular, was a tough call — especially given Envoy’s popularity in the Kubernetes community. However, in the end, Linkerd’s requirements around resource footprint and security were simply too restrictive for Envoy to be a realistic choice. Envoy was a Swiss Army knife, when what we needed was a needle.
Today, Linkerd2-proxy has powered billions of requests at organizations around the world. It passed its third-party, CNCF-sponsored security audit with flying colors, and it sits at the heart of every Linkerd installation, typically consuming just a few megabytes of memory while adding sub-millisecond p99 latency. Perhaps most interestingly, Linkerd2-proxy does its magic sight unseen: like a good implementation detail, most Linkerd users deploy large numbers of Linkerd2-proxy instances and barely know they’re there.
Over the next few weeks, we’ll be sharing a lot more about the inner workings of Linkerd2-proxy and some of the lessons we learned along the way in developing a modern, secure, Rust-based network proxy.
To learn more about service mesh and other cloud native technologies, consider coming to KubeCon + CloudNativeCon EU, Aug. 17 – 20, virtually.