Linkerd 1.0, a Communications Service Mesh for Cloud-Native Apps
With the rising popularity of cloud-native applications, ensuring reliability and performance has become far more complex. To get a handle on these issues, the developers behind the open source Linkerd project advocate a separate communication layer that transparently handles service discovery, load balancing, failure handling, instrumentation, and routing for services.
Linkerd decouples this communication layer from the application code. That means developers can concentrate on writing application logic rather than networking code, and need not concern themselves with these operational aspects, according to William Morgan, CEO of Buoyant, the project’s sponsor company.
Engineers Morgan and Oliver Gould founded Buoyant in 2015 after about five years at Twitter where they worked to vanquish the notorious “Fail Whale.” They found that many of the problems there originated in service-to-service communication, which ultimately would bring the whole site down.
They sought to bring along the reliability tools that originated during Twitter’s move from a monolithic Ruby on Rails application to a new infrastructure. “We didn’t even know what to call it,” Morgan said.
Linkerd (pronounced “linker-DEE”) began from a “cold start” in February 2016, with no users, no adopters and no contributors, Morgan said. A little over a year later, it is being used by companies such as Credit Karma, PayPal, and Ticketmaster and has become a project of the Cloud Native Computing Foundation.
The San Francisco-based company is touting that Linkerd has reached 1.0 after a year spent focused on performance, reliability and scalability.
Key features of the 1.0 release include a service mesh API; end-to-end reliability features such as load balancing, circuit breaking, request routing, deadline propagation, and retry management; security and encryption features such as transparent Transport Layer Security (TLS); distributed tracing and fine-grained instrumentation; and integrations into the cloud-native ecosystem, including the Kubernetes container orchestration software, the Prometheus monitoring tool, and gRPC.
“This is something companies could use as they move to the cloud-native stack. It’s not that useful if you have a big, monolithic application — it’s really only useful if you’re moving to something like Docker and Kubernetes,” Morgan said.
Seeking Consistency, Reliability
Linkerd grew out of Twitter’s Finagle project, an extensible Remote Procedure Call (RPC) system for the JVM designed for high performance and concurrency that implements uniform client and server APIs for various protocols. Built on top of Finagle and Netty, a non-blocking I/O (NIO) client-server framework, Linkerd allows a microservice to request services from a program located on another network without having to know its network details.
Morgan and Gould, however, wanted it to run across any language and any infrastructure, not just Java.
“People have been writing this logic into their applications. Ten years ago, people were writing three-tiered applications — web server, application logic, database. Only the web server and database talked to each other. You’d write this very sophisticated logic for retries, timeouts — it was specific to the two hops,” Morgan explained.
When companies like Facebook and Twitter had to expand this, they’d write it as part of a library, such as Finagle or Google’s Stubby (Google’s internal RPC system, which formed the basis for gRPC). The service mesh is essentially an extension of that.
“We’re saying it should be a separate layer; it should be a proxy you run alongside it. It’s totally decoupled from the application itself. That’s especially relevant in the cloud-native world because of things like Docker, people are writing these polyglot applications that have hundreds of different languages, frameworks, and developers get to pick and choose what they want. With a library, it becomes very difficult to have consistency [across it all].”
Its biggest hurdle, he said, has been to get companies to accept the reasoning behind adding another level of abstraction. In a blog post, he makes the case for Linkerd’s concept of a service mesh, through which all service requests take place.
“In some ways, the service mesh is analogous to TCP/IP. Just as the TCP stack abstracts the mechanics of reliably delivering bytes between network endpoints, the service mesh abstracts the mechanics of reliably delivering requests between services. Like TCP, the service mesh doesn’t care about the actual payload or how it’s encoded. The application has a high-level goal (‘send something from A to B’), and the job of the service mesh, like that of TCP, is to accomplish this goal while handling any failures along the way,” he wrote.
However, the service mesh goes further to provide a previously non-existent level of visibility and control to help operators understand where problems lie.
It can handle complexity by applying dynamic routing rules, choosing the instance most likely to return a fast response, performing retries on another instance, and if that doesn’t work, failing the request rather than adding load with further retries.
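The behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Linkerd's implementation: the `Instance` class, the `send` function, the moving-average latency estimate and the fixed attempt limit are all hypothetical names and choices made for this example.

```python
class Instance:
    """A hypothetical service instance with a moving latency estimate."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler   # callable standing in for a network call
        self.latency_ms = 0.0    # exponentially weighted moving average

    def record(self, observed_ms, alpha=0.3):
        # Blend the new observation into the running estimate.
        self.latency_ms = alpha * observed_ms + (1 - alpha) * self.latency_ms


def send(instances, request, max_attempts=2):
    """Try the fastest-looking instance first, retry once on a different
    instance, then fail rather than add load with further retries."""
    tried = []
    for _ in range(max_attempts):
        candidates = [i for i in instances if i not in tried]
        if not candidates:
            break
        target = min(candidates, key=lambda i: i.latency_ms)
        tried.append(target)
        try:
            return target.handler(request)
        except ConnectionError:
            # Assumption: treat a failure as a very slow response so the
            # instance looks less attractive for later requests.
            target.record(1000.0)
    raise RuntimeError("request failed after %d attempt(s)" % len(tried))
```

The point of the attempt limit is that a struggling backend sees a bounded amount of extra traffic, instead of every caller retrying indefinitely.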
“Large-scale distributed systems, no matter how they’re architected, have one defining characteristic: they provide many opportunities for small, localized failures to escalate into system-wide catastrophic failures. The service mesh must be designed to safeguard against these escalations by shedding load and failing fast when the underlying systems approach their limits,” he wrote.
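One common way to fail fast is a circuit breaker, which the 1.0 feature list names but the article does not detail. The following is a minimal Python sketch of the general pattern, not Linkerd's code; `CircuitBreaker`, `CircuitOpenError`, the failure threshold and the cool-down period are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Shed load: do not touch the unhealthy backend at all.
                raise CircuitOpenError("failing fast; backend marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a backend that is already near its limit, which is what keeps a localized failure from spreading.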
But it also can go further, performing protocol upgrades, dynamically shifting traffic, failing over between data centers and more.
Linkerd normally uses about 100MB of memory on the host server. Morgan has said the goal for the future is to make Linkerd smaller, faster and lighter. To that end, the company announced Linkerd-tcp at the end of March, a lightweight TCP load balancer for occasions when it’s sufficient to simply proxy TCP.
For now, Linkerd supports only HTTP (and HTTPS) requests, functioning as a web proxy. Support for more protocols and support for raw TCP also are in the works.
“A lot of people are confused about why something like this is necessary, and [in my opinion] it really comes down to one thing: sometimes the stuff you want to talk to moves around,” one observer wrote in a Hacker News discussion of Linkerd.
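The observer's point, that callers should address a service by name because its location changes, can be illustrated with a small Python sketch. Everything here is hypothetical: `Registry` is an in-memory stand-in for a real service discovery backend, and `route` uses simple round-robin selection purely for demonstration.

```python
class Registry:
    """Hypothetical in-memory stand-in for a service discovery backend."""

    def __init__(self):
        self._addresses = {}

    def register(self, service, addresses):
        # Replace the known addresses for a logical service name.
        self._addresses[service] = list(addresses)

    def resolve(self, service):
        addresses = self._addresses.get(service)
        if not addresses:
            raise LookupError("no known instances of %r" % service)
        return list(addresses)


def route(registry, service, request_number):
    """Resolve a logical name at request time and pick an address
    round-robin, so the caller never hard-codes a host or port."""
    addresses = registry.resolve(service)
    return addresses[request_number % len(addresses)]
```

Because resolution happens on every request, re-registering a service under new addresses redirects traffic without any change to the calling code.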
Going forward, service mesh is well-positioned to be part of the evolution to serverless architecture, Morgan said.
“That’s still an ambiguous space,” he said. “Cloud mesh smooths the way for moving to a serverless approach — decoupled totally not only from what the hardware looks like … but also from individual long-running services. In the serverless world, we have these functions, and they do something, then go away. Service mesh sets you for managing communication between functions just as much as for managing communication between servers. That’s largely future work for us.”
The Cloud Native Computing Foundation is a sponsor of The New Stack.
Feature Image: “Waiting a call” by Karlis Kadegis, licensed under CC BY-SA 2.0.