KubeCon+CloudNativeCon: Service Mesh Battle Stories and Fixes
Honeycomb is sponsoring The New Stack’s coverage of KubeCon+CloudNativeCon North America 2020.
As more organizations implement service meshes, they are finding what works and what needs more work, and they are creating new management practices around this knowledge. A few tried-and-tested best practices were detailed last month during KubeCon+CloudNativeCon.
“There’s a lot to say about each of these service meshes and how they work: their architecture, why they’re made, what they’re focused on, what they do when they came about and why some of them aren’t here anymore and why we’re still seeing new ones,” Lee Calcote, founder of Layer5, explained during his talk with Kush Trivedi, a Layer5 maintainer, entitled “Service Mesh Specifications and Why They Matter in Your Deployment.”
Service mesh is increasingly seen as a requirement to manage microservices in Kubernetes environments, offering a central control plane to manage microservices access, testing, metrics and other functionalities. One-third of the respondents in The New Stack survey of our readers said their organizations already use service mesh. Among the numerous service mesh options available, Envoy, Istio, Linkerd and Kuma are but a few.
How to Avoid Service Mesh Hiccups
Ride-sharing platform and Uber competitor Lyft created and open sourced the Envoy proxy a few years ago to help achieve better control and management of the services and containers running within Kubernetes clusters. Envoy has subsequently emerged as a leading data plane proxy, an essential component of a service mesh.
During their talk “Safely Deploying a 100K line Envoy YAML Configuration to Production,” Lisa Lu, a research fellow at Stanford Law School and former Lyft software engineer, and Jyoti Mahapatra, a networking team software engineer at Lyft, discussed how to avoid common configuration issues when implementing and using Envoy. They also discussed how some of the guardrails built into Envoy can help mitigate software production release and performance problems.
Considering Lyft’s large and expansive cloud native infrastructures, Lu and Mahapatra had firsthand knowledge about managing Envoy at scale.
“As the number of services and routes that Lyft has grown on some sidecars, we have configurations that are upwards of 100,000 lines of YAML, which makes maintenance and modification extremely complex and risky,” Lu said.
Lu and Mahapatra described best practices for automating testing and conflict validation processes “so that your service owners can iterate quickly and independently,” Lu said. “This will also prevent your Envoy operators from being bogged down with code reviews,” she said.
Human error represents a major source of service mesh bottlenecks and outages. When a user overlooks a service or routing conflict in Envoy, it can, as Lu described, “derail” a service launch or “deprecation.” Lu showed how validating Envoy’s bootstrap configuration can avoid the issue. One of the fixes involves running the Envoy binary in validation mode, Lu said. The process “takes your binary and it takes your bootstrap config and it tries to boot up Envoy,” and “it will go through the server initialization process as far as it can, and if there are no errors, it will exit successfully,” Lu said.
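As a rough illustration of how such a check might be wired into a CI step, the sketch below wraps the Envoy binary's documented `--mode validate` option in a small Python helper. It is not Lyft's tooling: the helper names, the default binary name `envoy` and the injectable `run` parameter are assumptions made for this example.

```python
import subprocess
from typing import Callable, List


def build_validate_command(envoy_binary: str, bootstrap_path: str) -> List[str]:
    """Build the command line for Envoy's validation mode.

    `--mode validate` and `-c` are documented Envoy command-line options;
    the binary name and config path are supplied by the caller.
    """
    return [envoy_binary, "--mode", "validate", "-c", bootstrap_path]


def validate_bootstrap(
    bootstrap_path: str,
    envoy_binary: str = "envoy",  # assumption: Envoy is on the PATH
    run: Callable = subprocess.run,  # injectable so the helper can be tested
) -> bool:
    """Return True if Envoy exits successfully in validation mode."""
    command = build_validate_command(envoy_binary, bootstrap_path)
    result = run(command, capture_output=True, text=True)
    return result.returncode == 0
```

A CI job could call `validate_bootstrap("bootstrap.yaml")` for each generated config and fail the pull request when it returns `False`.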
A config load check tool, a standalone binary that can be used for all bootstrap configs within an Envoy-managed Kubernetes cluster, checks that all the values and the fields, as defined by the proto and the JSON schema, are valid. Lu said both checks are run on all pull requests as part of continuous integration (CI).
These tools “make reviewing these changes so much easier because, as a reviewer, you already know that the config is a valid Envoy config,” leaving the reviewer free to “ensure that the change does what it should,” Lu said.
Service mesh has been compared to DNS in reverse in that DNS offers a vertical-like check of network traffic, while the service mesh extends laterally to ensure continuity between the different services and sidecars within a cluster.
Mahapatra described an Envoy router check tool created to help prevent router-related glitches, or even outages in extreme cases. In a “high flux change scenario a route mistakenly added at the top of the list… can disrupt all traffic,” for example, Mahapatra said. The router check functionality helps to mitigate the risk by allowing the user to run unit tests, check field limitations, add code coverage constraints and test complex routing configurations based on header match, runtime and cluster conflicts, he said.
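The risk Mahapatra describes follows from the fact that Envoy evaluates routes in order and uses the first match. The sketch below is a deliberately simplified Python model of that behavior, using prefix matching only. It does not use the router check tool's real input format, and the route tables, cluster names and test cases are all hypothetical.

```python
from typing import List, Optional, Tuple

Route = Tuple[str, str]  # (path prefix, destination cluster)


def pick_cluster(routes: List[Route], path: str) -> Optional[str]:
    """Return the cluster of the first route whose prefix matches the path."""
    for prefix, cluster in routes:
        if path.startswith(prefix):
            return cluster
    return None


def run_route_tests(routes: List[Route], tests: List[Tuple[str, str]]) -> List[str]:
    """Check each (path, expected cluster) pair and collect failure messages."""
    failures = []
    for path, expected in tests:
        actual = pick_cluster(routes, path)
        if actual != expected:
            failures.append(f"{path}: expected {expected}, got {actual}")
    return failures


# Hypothetical route table: specific prefixes first, catch-all last.
ROUTES: List[Route] = [("/rides", "rides"), ("/users", "users"), ("/", "default")]

# Hypothetical unit tests that pin down the intended routing.
TESTS = [("/rides/123", "rides"), ("/users/7", "users"), ("/health", "default")]

# The same table with a broad route mistakenly added at the top of the list.
BROKEN_ROUTES: List[Route] = [("/", "catch_all")] + ROUTES
```

Running `run_route_tests(ROUTES, TESTS)` returns no failures, while the same tests against `BROKEN_ROUTES` all fail, because the new first route captures every path before the more specific routes are considered.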
Lyft’s contributions to Envoy often, if not mostly, address problems that Lyft developers face themselves. This was the motivation behind the development of the router check tool for Envoy, which didn’t always have the capability to calculate coverage, or have test support for runtime values and flags, for example, Lu said.
Meanwhile, “there are still a bunch of features that users want,” Lu said. “So there’s definitely plenty of room to increase the kind of behavior that can be tested with the router check tool.”
Service Mesh to Grow a Grocery Chain
San Antonio-based grocery store chain H-E-B faced a surge in demand for curbside grocery pickup in the wake of the COVID-19 pandemic that began last year. While the retail chain had already been developing an application that allowed customers to order online and then pick up their groceries, the explosion in demand showed that the monolithic software platform designed to manage the service was woefully inadequate.
“We started to run into challenges as we scaled up the business: struggling to deliver quickly and lots of risk associated with change, and ultimately, reliability issues,” Justin Turner, senior engineering manager for H-E-B, said. “So, that put us on our modernization journey to break into microservices.”
During their talk “How H-E-B Curbside Adopted Linkerd During a Pandemic,” Turner and H-E-B Senior Software Engineer Garrett Griffin described how their DevOps team opted for Linkerd to support their observability and other needs to run and deploy applications and updates for the curbside grocery pickup application.
“The hypothesis was that a [service mesh] would help us solve some of the challenges we were running into as we built our microservices and the operational muscles around supporting them,” Turner said.
For testing, the DevOps team used Linkerd to verify what would happen as they scaled the pods to hundreds of instances, to see if the proxy would still be injected. “While doing this [test], we ended up discovering a connection issue completely unrelated to Linkerd,” Griffin said. “However, we were still able to remediate it and keep on going.”
To check for potential traffic distribution issues when scaling service pods, the DevOps team determined through testing that, with Linkerd and other tools, traffic could be directed without interference from scaling, Griffin said.
“Our big fears were around the proxy injector and the control plane,” but in the event that “they were not able to scale, or deploy new versions, they were still working,” Griffin said. “This brought us relief in that we knew we would have time to fix the issue, if this were to occur in production.”
Interoperability Is Key as Service Meshes Come and Go
Organizations will likely look to use more than one API service layer and service mesh for their clusters. This is why interoperability, and thus specifications, are critical for control planes as well. During his talk, the previously mentioned “Service Mesh Specifications and Why They Matter in Your Deployment,” Calcote asked rhetorically:
“How many specifications, how many standards are there that have come to the rescue, so to speak, for understanding and interoperating with the various service meshes that are out there?”
A service mesh can be used for testing router performance, service latency and other variables. However, determining service mesh performance in an apples-to-apples way can be challenging. When studying “published results from some of the service meshes [from providers] that do publish results about performance… what you’ll find is that they’re probably using an environment that isn’t necessarily like yours,” Calcote said. “They’re also using different statistics and metrics to measure [their service meshes] … and it doesn’t help.”
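To see why differing statistics make published results hard to compare, consider the minimal Python sketch below. The latency samples are invented purely for illustration and do not describe any real service mesh. The names `mesh_a`, `mesh_b` and `percentile` are hypothetical, and the percentile uses the simple nearest-rank method.

```python
from statistics import mean
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Integer arithmetic for ceil(p * n / 100) avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented latency samples in milliseconds, for illustration only.
mesh_a = [2.0] * 98 + [100.0, 120.0]  # usually fast, with a slow tail
mesh_b = [5.0] * 100                  # slower on average, but consistent

# Two reports on the same data can point in opposite directions.
a_wins_on_mean = mean(mesh_a) < mean(mesh_b)
b_wins_on_p99 = percentile(mesh_b, 99) < percentile(mesh_a, 99)
```

With these made-up samples, both flags are true: the first hypothetical mesh looks better by mean latency and the second looks better by 99th percentile latency, so two vendors reporting different statistics could each claim the lead.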
Service Mesh Performance (SMP) was created in an attempt to establish a way of comparing the performance of different service meshes. “The SMP was born in combination with engaging with a few of those different service mesh maintainers and creating a standard way of articulating a performance of a mesh,” Calcote said.
In addition to the service mesh itself, the variables under consideration include the number of clusters, the workloads, the types of nodes, the control plane configuration and the use of client libraries, all of which affect performance.
“What costs more, what’s more efficient and what’s more powerful: These are all open questions that SMP assists in answering in your environment,” Calcote said.
KubeCon+CloudNativeCon is a sponsor of The New Stack.