Behind the Scenes of Lyft’s New Application Service Mesh, Envoy
Last January, Lyft engineers shared how the ride-sharing service moved its monolithic applications to a service-oriented architecture (SOA) by way of Envoy, a home-grown, self-contained proxy server and communications bus. Envoy was open sourced last September. At the recent Microservices Virtual Summit, Matt Klein, software engineer at Lyft, shared more about the mechanics of how the company deployed Envoy.
Envoy, Klein explained, is a service mesh with an out-of-process proxy that accepts all of the application traffic. “It basically makes the network transparent to applications,” he said.
Originally, Lyft had a fairly simple but monolithic architecture, built on MongoDB, PHP/Apache applications, and an Amazon Web Services Elastic Load Balancer. It was a distributed system, but it offered no observability into how these components spoke to each other.
Envoy started with the edge proxy, he explained, because most microservices systems need some type of L7 edge proxy as they scale. “You are taking traffic, and then you have to parse that traffic and send it to various backend systems,” he said. Unhappy with the feature sets of existing edge proxies, and in particular with the limited observability they provide, the team decided to roll their own.
“We decided to run Envoy, just as a TCP proxy, on each monolithic server,” Klein explained. By doing this, they were able to collapse the connections coming into each local Envoy and limit the number of connections going into their Mongo database, which is notorious for not handling large numbers of connections well.
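The connection-collapsing idea can be sketched in a few lines of Python: many callers share a small, capped pool of upstream connections, the way a local Envoy caps connections into MongoDB. This is an illustration of the technique, not Envoy's actual C++ implementation, and the class and method names are hypothetical:

```python
from collections import deque

class UpstreamPool:
    """Caps the number of connections opened to an upstream (e.g. Mongo)."""
    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.idle = deque()   # connections available for reuse
        self.total = 0        # connections ever opened upstream

    def acquire(self):
        if self.idle:
            return self.idle.popleft()      # reuse an idle connection
        if self.total < self.max_connections:
            self.total += 1
            return f"conn-{self.total}"     # stand-in for a real socket
        raise RuntimeError("upstream at capacity; caller must wait or retry")

    def release(self, conn):
        self.idle.append(conn)              # return the connection for reuse

# 100 sequential callers are served by a single upstream connection.
pool = UpstreamPool(max_connections=2)
for _ in range(100):
    c = pool.acquire()
    pool.release(c)
```

The key property is that downstream concurrency no longer translates one-to-one into upstream connections; the pool size is the ceiling, no matter how many clients arrive.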
Deploying such a large piece of architecture is hard work, Klein cautioned, and it doesn’t happen overnight. They launched Envoy in incremental steps, letting each step show its value. Rolling out just this edge layer provided huge benefits in terms of debugging production incidents.
The Envoy dashboard shows not just the basic edge load balancer data that most operations dashboards show; it goes much deeper, Klein said. “We have requests per second to each upstream cluster or service. We have failures on a per-service basis.”
This observability led to the realization that they could do a lot more. They parsed the L7 MongoDB traffic, which gave them incredible stats, Klein said. This led to putting call sites into the actual queries, which they then parsed out at the Envoy layer.
The next step was adding a rate limit filter into Envoy. All of this gave them a huge operational benefit, by being able to see and understand what is going on in their stack. And, Klein said, they were able to prevent some pretty bad failure scenarios.
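Conceptually, a rate limit filter is a token bucket in front of the upstream. Envoy's real filter consults an external rate limit service; the standalone sketch below only illustrates the mechanism, and the names are hypothetical:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate                  # tokens added per second
        self.capacity = burst             # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # request would be rejected (e.g. HTTP 429)

bucket = TokenBucket(rate=1, burst=2)
results = [bucket.allow() for _ in range(5)]  # only the burst gets through
```

Placed in the proxy rather than in each application, a filter like this protects a struggling backend uniformly, without every service team reimplementing it.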
Changing the Service Discovery Paradigm
At this point, Lyft almost had a mesh but was still using internal load balancers to actually do service discovery.
After a lot of evaluation, the team realized that, from a microservices perspective, service discovery is eventually consistent, said Klein. Once they realized that hosts don’t need a consistent view, just coverage, the engineers found that it would be easy to build a “dead simple API,” he said.
The data, then, is eventually consistent. It may be five to ten minutes before things converge, but consistency will be reached. In two and a half years, they have not had a single incident based on this system, Klein said, while most companies that use fully consistent systems end up having a variety of outages. “So this has been very good to us.”
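A "dead simple," eventually consistent discovery service can be sketched like this: hosts re-register on a loop, entries expire if a host stops checking in, and readers may briefly see stale data, which is acceptable because callers only need coverage, not a perfectly consistent view. The class, routes, and TTL below are illustrative assumptions, not Lyft's actual service:

```python
import time

class DiscoveryService:
    def __init__(self, ttl_seconds: float = 600):
        self.ttl = ttl_seconds
        self.hosts = {}   # service name -> {host: last registration time}

    def register(self, service: str, host: str) -> None:
        """Hosts call this periodically (think: POST to a registration route)."""
        self.hosts.setdefault(service, {})[host] = time.monotonic()

    def get_hosts(self, service: str) -> list:
        """Returns a possibly stale host list; expired entries are dropped."""
        now = time.monotonic()
        live = {h: t for h, t in self.hosts.get(service, {}).items()
                if now - t < self.ttl}
        self.hosts[service] = live
        return sorted(live)

disco = DiscoveryService()
disco.register("rides", "10.0.0.1")
disco.register("rides", "10.0.0.2")
```

The simplicity is the point: there is no consensus protocol to operate, so there is very little that can fail in an interesting way.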
Making the Mesh Magic
Next up: propagation. Lyft already uses IDs for tracing and logging interactions, but Envoy still had to propagate headers.
They went for simplicity, he said, requiring any element that included an Envoy client to use a thin library. This allows them to guide developers in terms of good practices around timeouts, retries, and various other policies.
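The thin-library idea can be sketched as a small wrapper that every service-to-service call goes through: it propagates tracing headers from the inbound request onto the outbound one and applies sane default timeouts and retries, while the local Envoy does the heavy lifting. The function names, header names, and defaults here are illustrative assumptions:

```python
# Headers the thin library is responsible for propagating end to end.
TRACE_HEADERS = ("x-request-id", "x-client-trace-id")

def outbound_headers(incoming: dict, extra=None) -> dict:
    """Copy only the tracing headers from the inbound request."""
    headers = {k: v for k, v in incoming.items() if k in TRACE_HEADERS}
    headers.update(extra or {})
    return headers

def call_service(incoming_headers: dict, timeout: float = 0.25, retries: int = 1) -> dict:
    """Hypothetical wrapper; a real one would issue the request via the
    local Envoy with these defaults applied rather than return a dict."""
    return {"headers": outbound_headers(incoming_headers),
            "timeout": timeout,
            "retries": retries}

# Tracing headers survive the hop; unrelated headers (like cookies) do not.
req = call_service({"x-request-id": "abc-123", "cookie": "secret"})
```

Because every call funnels through one wrapper, changing a default timeout or retry policy is a one-line change rather than a campaign across every service.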
The next step is to run 100 percent of their traffic through Envoy. About a year in, Envoy covered about 90 percent of the traffic, but to get the real benefit, it has to be everywhere. So they began a months-long burndown process, which Klein described as “a slog.”
The Lyft engineers needed to work with other teams to fully deploy Envoy. Because Envoy had been running for over a year, developers had already seen the stats coming from it, so they understood the value, making buy-in easy. “Once people see this magic of this service mesh, it’s like a drug,” Klein said. “It’s just a very powerful paradigm. So once people see it, they don’t really want to be without it.”
What’s Next: Versioning the APIs
Once the process was complete, the real benefits started to pay off, he said. Now deploying new features is a breeze, and the stats are amazing.
Envoy’s current APIs are all written in REST/JSON. For version 2, they will be moving to gRPC while still supporting JSON. By moving to gRPC APIs, they will gain bi-directional streaming, said Klein. They are making their APIs more robust.
Now called the “Endpoint Discovery Service” (EDS), the Envoy API will be able to report on host info, such as CPU load, memory usage, and other host data. That will potentially allow a very sophisticated management server to actually use load information dynamically to determine assignments.
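What "using load information dynamically" could look like can be sketched as a management-server function that weights hosts inversely to their reported CPU load before handing assignments back to Envoy. This is purely illustrative; the real EDS protocol is an API between Envoy and a management server, not this function:

```python
def weight_by_load(host_loads: dict) -> dict:
    """Map {host: cpu_load in [0, 1)} to integer load-balancing weights.

    A lightly loaded host gets a high weight (more traffic); a heavily
    loaded one gets a low weight, but never zero, so it stays reachable.
    """
    return {host: max(1, round((1.0 - load) * 100))
            for host, load in host_loads.items()}

# A host at 90% CPU receives roughly a ninth of the traffic that a
# host at 10% CPU does.
weights = weight_by_load({"10.0.0.1": 0.90, "10.0.0.2": 0.10})
```

The interesting shift is architectural: the data plane reports real host signals, and a central management server, not each proxy independently, decides how traffic should be assigned.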
If he were to start from scratch today, said Klein, he wouldn’t have a Jinja-based JSON system with static configurations. But that’s how things developed with the technology available at the time. He’s very excited about the emergence of Istio, which “provides a uniform way to connect, manage, and secure microservices. Istio supports managing traffic flows between microservices,” according to the Istio.io web page.
Istio is really a decoupling of the control plane from the data plane, explained Klein, which he finds very exciting. “There’s so much work that has to go into making a control plane robust, and have proper security and roll back and roll forward,” he said. “And it’s important to decouple that control plane layer from that data plane layer.” So the Istio control plane will basically implement all of the Envoy APIs.
Klein envisions Istio as the vehicle to get Envoy to a much broader set of people. He’s very excited about where they are and about improving the system.
He’s also enthusiastic about the industry move toward microservices. The industry is going to need to figure out how to manage Envoy’s new functionality going forward, he said, and he thinks some of the microservices will remain proprietary. “As you pop higher in the stack, these systems tend to get more and more domain-specific,” he said. “They tend to be built into Lyft infrastructure.” It’s getting harder and harder to have the right abstractions in place. But, he said, “I am very excited about that.”
For an in-depth analysis with full details, watch the talk here:
Feature image by Raphael Schaller on Unsplash.