Matt Klein on the Success of Envoy and the Future of the Service Mesh
I sat down with Matt Klein, creator of the Envoy proxy and software engineer at the ride-sharing service Lyft, during last week's Service Mesh Days in San Francisco. "I greatly underestimated the need for a general mesh," Klein said during his opening keynote at the conference.
I first met Klein in January 2017, at the Microservices Practitioner Summit, where he gave his first public talk on Envoy. He talked about how Lyft moved its monolithic stack to a Service Oriented Architecture (SOA), and how he wrote Envoy as a fast sidecar "communication bus" handling things like rate limiting, load balancing, circuit breaking, service discovery, and active/passive health checking.
Today, Envoy is one of the open source projects sponsored by the Cloud Native Computing Foundation (CNCF), and one of three to "graduate" to the status of full production readiness, alongside Kubernetes and Prometheus. It is being used by Microsoft, eBay, Google, Airbnb and Amazon Web Services. And Klein himself is on the CNCF's Technical Oversight Committee.
Varun Talwar, founder of Tetrate and co-creator of the Istio service mesh (often used with Envoy), said Envoy is one of the first products, if not the first, to gather data on services instead of IP addresses, and I didn't realize that.
I think that has a lot to do with why it's been adopted. I mean, it's the first communication bus you talk about that is completely cloud native and not dependent on hardware…
I'm not going to say that Envoy is the first thing to do layer 7, or application, load balancing. There are actually plenty of cases that have come before, particularly from the library perspective. For example, gRPC had already been out doing some client load-balancing stuff; obviously, Twitter had Finagle, and Netflix had a whole suite of technologies around Hystrix.
It's not that Nginx and HAProxy could not do some amount of layer 7 load balancing. They certainly could. I think Envoy came along for some of the reasons I talked about in my talk. I think it's become very popular for a couple of different reasons, but it has allowed people to gather application-layer information and do routing, load balancing, and health checking in an application-centric way, in a more accessible manner than has previously been possible.
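The application-layer routing and active health checking Klein describes can be sketched with a minimal Envoy configuration. This is an illustrative fragment, not Lyft's setup; the cluster name, addresses, and paths are hypothetical:

```yaml
# Minimal sketch: L7 path-based routing plus active HTTP health checking.
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              # Routing on the HTTP path is an application-layer decision
              - match: { prefix: "/api" }
                route: { cluster: api_service }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: api_service
    type: STRICT_DNS
    load_assignment:
      cluster_name: api_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: api.internal, port_value: 9000 }
    # Active health checking: probe an HTTP endpoint, eject unhealthy hosts
    health_checks:
    - timeout: 1s
      interval: 5s
      unhealthy_threshold: 3
      healthy_threshold: 2
      http_health_check: { path: "/healthz" }
```

Because the proxy sees the HTTP request itself, not just TCP packets, routing and health decisions can be made per-path and per-service rather than per-IP.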
I’m not going to claim that it’s an easy thing, but in the right setup it can be utilized to provide really robust information and features from an application perspective.
Right. As we were listening to the panel, I was sitting next to Redbeard from Red Hat, and he said, “I tell people this all the time. It’s like don’t start using it until you’re at the place where it becomes useful.”
Totally agree. People are always asking me about service mesh and technology, and I actually feel exactly the same way as the panelists. I always counsel people: only take on the complexity that you need. Start with the monolith, start with a simple database, start with whatever you need to get your product working, and only take on the complexity that you need.
I think we have a tendency to both over-complicate things in technology, and we also have a big tendency—it’s human nature—to underestimate how much things cost, and I don’t mean dollar cost, I mean time cost.
People are terrible about estimating not just the initial implementation time but the bug fixing time, the maintenance time, the support time. It’s huge.
They should have a class on that in code school.
Because nobody in my entire career, which is now almost 30 years, has ever been able to accurately estimate how long an implementation will take.
Even extremely experienced people or technically strong people do a terrible job of not just time estimation but what I call TCO, total cost of ownership. By that I mean including not just development time but maintenance, support, bug fixing, the whole thing. People don't estimate that well.
Can you talk a bit about the process of moving a CNCF open source project like Envoy to graduation?
So there are three stages from the CNCF project perspective now. There's what they call Sandbox, then incubation, and then graduation.
I will say that, from the TOC perspective, I think we are thinking about refining some of the requirements to move between stages. But at a high level, Sandbox is a place for neutral governance with very low requirements to get in. The foundation is not going to provide a huge amount, but it's a place where a project can get in easily, we can see if the project takes off, and there's a place for mutual collaboration.
I think the idea behind incubation is a Sandbox project that has gotten some traction and is seeing some production use. But I would say the big difference, at a very high level, between incubation and graduation is that there's still a chance an incubating project might fail, right? Maybe it's primarily backed by one company that's venture-funded, or maybe there's really only one developer doing all of the coding and maintenance.
It's open source but it may not be, I would say, long-term sustainable if something happens. Whereas I think the idea behind graduation is that if a project becomes graduated — of course, things can come out of fashion and technologies can die — there should be no people-related reason why that project should cease to function. Meaning if a company goes out of business, the project should continue. If a primary developer goes away, it should continue.
What lessons have you learned going through these different stages? I mean, the last two years have been like a whirlwind.
Yeah. To be perfectly honest with you, Envoy has grown so much and so fast that I didn't really give a lot of thought to the stages, just because we're so busy. I think sometimes projects are trying really hard to grow. In many ways I've had the opposite problem, where it was growing so fast that it was very hard to keep up. I've said this multiple times now, but 2017 was a tough year for me. It was very stressful. The project was exploding, I was trying to scale myself and the other maintainers and get other companies on board, and obviously in 2017 Google came online in a big way.
In 2018, now, we have tons of companies that are contributing, and active maintainers and contributors from all over the place. So, to be honest, I just didn't give a lot of thought to moving between the stages.
Can you share a couple of the most interesting and surprising use cases that people have come up with?
There are so many. In the last six months, I found out that eBay, without ever really talking to me, had gone off with a team of 10 people and basically replaced its entire edge serving infrastructure with Envoy. They had built tools to translate their — I don't know if it was F5 or NetScaler — configurations to Envoy configurations. Pretty incredible, actually, so that's just one example. We see tons of startups now, in security and observability and all types of things, that are building their products on top of Envoy.
As we said during the user panel, Envoy is associated with service mesh, but it's not service mesh-specific. It's a proxy, so people use it for API gateway cases, or for middle proxy cases, or for client load balancing cases. It's used all over the place.
If you actually look at our GitHub page, we call it a Cloud Native Proxy. We don’t really call it a Service Mesh Proxy, right? It tends to be used in that way because it is a very flexible system, it can be API-driven and extended and a bunch of other stuff, but I don’t personally think of it as being tied to service mesh.
Okay, great. So, looking to the future — there's a panel later today talking about the future of service mesh from a user perspective — but from your perspective, where do you see Envoy headed?
I think we've had such phenomenal growth, it's been incredible. The adoption is so incredible that, and I can't believe I'm saying this, I think we are well on our way to Envoy, not necessarily the products built on top of it, but Envoy itself, becoming ubiquitous.
What I actually think will happen in a five-year timeframe is that many people will interact with Envoy, but they won't know that they're interacting with Envoy. They may be using something like Fargate or AKS or a Google Cloud function, and under the hood it's very likely to use Envoy to provide a bunch of features — timeouts and retries and service discovery and all those things — but they'll just be built into the platform and the platform docs. They won't know that it's Envoy.
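Features like the timeouts and retries Klein mentions are typically surfaced as per-route configuration when a platform embeds Envoy. A hedged sketch of what that looks like in Envoy's v3 route configuration (cluster name and values are illustrative, not from any particular platform):

```yaml
# Illustrative route fragment: the platform fills this in on the user's behalf.
route:
  cluster: api_service      # hypothetical upstream cluster
  timeout: 3s               # overall request timeout
  retry_policy:
    retry_on: "5xx,connect-failure"
    num_retries: 2
    per_try_timeout: 1s
```

A managed platform can generate this kind of fragment from its own higher-level settings, which is exactly why end users never need to see Envoy itself.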
So Envoy becomes, like I talked about in the talk, a generic piece of plumbing that people do all kinds of things with. That's actually where I see it going: it becomes ubiquitous but mostly hidden from the way that most people interact with it. In the same way, in the five- to 10-year timeframe, I don't think a lot of people are going to interact with Kubernetes either. It's plumbing; it schedules compute. As much as I think that we're very far away from … I don't like the term "serverless." I do believe in the vision, which is that ultimately people want to run their applications, want to talk to databases, and want to make network calls.
If we can give them those capabilities without their having to worry about the scheduling, the plumbing, the networking, of course they're going to use that. So if I look out into that timeframe, Envoy gets hidden, like Kubernetes gets hidden. I think a lot of these things become lower-level implementation details.
Is there any work being done on standards to bridge the different service mesh systems?
Not that I know of currently. I know that Istio is doing something around their mesh control protocol. I think there's increasing chatter about: what are our Envoy APIs, the xDS APIs? What does it mean if things other than Envoy start using them?
This is actually coming up now — it was mentioned during one of the talks — because gRPC has its own built-in load balancing, and there are plans to actually replace that with the Envoy APIs, so that a single control plane can tell the gRPC clients to do X, Y or Z, and then Envoy can do X, Y or Z.
But in that scenario, it's not just Envoy that's using these APIs, so we'll probably have to be a little more rigorous about versioning and deprecation and a bunch of other things than we've been previously.
Is that something that the CNCF can work with?
Probably not. I think that's a project thing that we're going to have to figure out, and some people have even asked, "Well, would you ever consider going to the IETF with the Envoy APIs?"
I don't think we're opposed to it. Again, even in that type of engineering I tend to take the philosophy of only taking on the complexity that you need, right? So if people don't need us to have IETF standardization, I'd rather not do it, because there's a lot of overhead involved. If people need it, then we should talk about it.
I think some people think that my life is a lot more glamorous and glorious than it actually is, just in the sense that maintaining or — and for your article, I'm using air quotes right now — "managing" an open source project is chaos, because it's like controlled anarchy, right?
The Cloud Native Computing Foundation is a sponsor of The New Stack.