Service meshes have been getting quite a bit of attention, and with good reason. By providing reliability, security, and observability at the platform layer, service meshes can play a mission-critical role in Kubernetes applications. But tales of adoption are mixed: some practitioners report shying away from adopting a service mesh due to its apparent complexity, while others report getting one up and running with apparent ease. So which is it? Are service meshes too complex to be worth the effort, or ready for adoption today?
In this article I wanted to focus on Linkerd, the Cloud Native Computing Foundation service mesh (and category pioneer) known for its emphasis on simplicity. In an increasingly crowded service mesh landscape, Linkerd is unique both for this less-is-more approach and for its use of a dedicated, Rust-based “micro-proxy” at the data plane layer. The Linkerd website lists quite a few organizations running it in production, so I set out to talk to some of them and hear what their experience has been like. Why Linkerd, and how did it compare to more complex service meshes such as Istio, the current market leader in this space?
For this post, I interviewed two DevOps professionals from two different organizations. We discussed their journey running Linkerd in production and gleaned some interesting takeaways in the process. David Sudia is a Senior DevOps Engineer at GoSpotCheck, a mobile task management app for field service teams, where he leads the Ops team. David gave an insightful keynote at KubeCon+CloudNativeCon NA 2020, “More Power, Less Pain: Building an Internal Platform with CNCF Tools.”
We discussed their service mesh journey thus far, and they were happy to share their experiences. While we spoke separately, their stories have a few striking similarities. They both:
- Had similar requirements from a service mesh
- Tried Istio first but found it to be overly complex
- Stumbled upon the Linkerd booth at KubeCon and have been converts ever since.
Now, let’s get into the details of their service mesh story.
Because Istio is the most widely known service mesh, both tried it first. However, they quickly found it to be overly complex and challenging to use on many fronts.
Sudia recalls the setup requiring multiple Helm chart installs and various manual steps to deploy it into the cluster. The process took over a day — a big drawback for Sudia and his small Ops team, who support much larger Dev & QA teams. They didn’t have the time to “manage” a service mesh tool. He does point out that Istio recently has taken steps to simplify its architecture and make it more focused.
Andersen’s first attempt to install Istio on a test Kubernetes cluster broke the cluster. He had to rebuild it from scratch. After finally installing Istio, he wasn’t too impressed with the metrics it provided off the bat. The UI seemed outdated as well, and he almost gave up the idea of using service meshes altogether.
Coincidentally, both were at KubeCon+CloudNativeCon 2019 and stumbled upon the Linkerd booth. They liked what they saw and decided to give Linkerd a try.
That very night, Andersen went back to his hotel room — he was eager to give Linkerd a try. He installed Linkerd on a dev cluster and, to his surprise, got the first instance up and running with just a single command. He added a Linkerd proxy to a Kubernetes namespace and, in a matter of minutes, was able to see the traffic and communication between services.
Sudia’s experience was similar. The fact that Linkerd could be installed from the command line in a few minutes really impressed him, too. Unlike Istio, with its steep learning curve, Linkerd was something Sudia felt he could play around with and easily get a feel for.
Within a week, Sudia’s team deployed Linkerd to a dev cluster. A staging cluster was up and running within week two. The team was prepared to react to incidents, but it turned out to be a smooth and uneventful two weeks, allowing them to deploy to production confidently. Sudia and his team found Linkerd to be intuitive and easy to get started with. Getting all the benefits of a service mesh without the complexity was a key factor in their decision to adopt it.
Linkerd User Interface
Sudia and Andersen’s primary motivation to adopt a service mesh was gaining observability into inter-service communication. Not only does Linkerd provide the right metrics, but it also visualizes them in an easily digestible way without any fuss.
According to Sudia, the dashboard is one of the best parts of Linkerd. No additional setup is required. The team just logs in and sees key metrics such as request rate, error rate, request duration, and total responses. And because the UI is so intuitive, he didn’t even need to write an onboarding process or schedule a training session; all it took was a quick walkthrough. From day two, the team was able to troubleshoot communication issues with accuracy. It was “one of the smoothest onboarding processes” they’ve had with any tool, Sudia claimed.
For Andersen, Linkerd’s “Tap” feature that traces requests between services stuck out. Seeing what’s going on in real-time without any additional setup was particularly convenient for him.
Just because there is a lot of talk around the complexity of one tool, it doesn’t mean the entire category is complex. When it comes to service meshes, you’ve got a lot of options. Istio’s complexity is probably due to the additional features it provides. Linkerd, on the other hand, took a minimalistic approach which translates into a lot more simplicity. There are likely use cases where it makes more sense to use Istio, and we know there are many happy Istio users. But we also can’t overlook all the complaints about its complexity.
In which group are you? What makes sense for your use case? When adopting infrastructure tooling, doing the extra research pays off. Don’t jump right into it. Identify your requirements and research your options first.
Development and QA teams
For Sudia and Andersen, the top requirement from the service mesh was the ability to observe service-to-service communication within their distributed applications. Not only did this benefit the Ops team, it also made the lives of their developer and QA counterparts a lot easier.
Sudia explained that, without having to set up instrumentation for the most common metrics, their dev team can now simply “strip code out of their applications.” That’s because critical RED (rate, error, duration) metrics are provided by default. This also resolved another issue: developers often gave slightly different names to the same metric (e.g. ‘request received total’ and ‘total requests received’), separating metrics that should have been aggregated. All this allowed them to get apps out the door faster, and enabled teams to speak the same language.
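To make the RED idea concrete, here is a minimal Python sketch. It is not how Linkerd computes its metrics; it only illustrates the three numbers and the naming problem Sudia describes. The `Request` record, the `red_metrics` helper, the choice to count 5xx responses as errors, and the sample counter values are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape, for illustration only)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond, in milliseconds


def red_metrics(requests, window_seconds):
    """Compute rate, error rate, and p95 duration over one observation window."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_rate": 0.0, "p95_ms": None}
    # Assumption of this sketch: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: the smallest sample with >= 95% of samples at or below it.
    rank = math.ceil(0.95 * total)
    return {
        "rate_rps": total / window_seconds,
        "error_rate": errors / total,
        "p95_ms": latencies[rank - 1],
    }


sample = [Request(200, 12.0), Request(200, 15.0), Request(503, 40.0), Request(200, 9.0)]
metrics = red_metrics(sample, window_seconds=2)
print(metrics)

# The naming problem: two hand-instrumented services count the same thing
# under slightly different names (made-up values).
hand_rolled = {"request_received_total": 120, "total_requests_received": 80}
seen_by_one_name = hand_rolled.get("request_received_total", 0)  # misses one service
actual_total = sum(hand_rolled.values())
print(seen_by_one_name, actual_total)
```

A query against either hand-rolled name sees only part of the traffic. When the mesh emits the same metrics for every service, there is one name to aggregate on and no instrumentation code to maintain in each application.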
Andersen saw the biggest benefits of a service mesh when running QA tasks. The ability to gauge load after deployments was particularly useful and greatly improved debugging and troubleshooting. Specifically, Linkerd’s tracing feature was great for this.
mTLS and Security Certificates
Security is a mission-critical aspect of software that must underpin every other decision. As such it was top of mind for both Sudia and Andersen. Both sought to adopt a service mesh to manage security certificates via mutual TLS to encrypt traffic inside clusters.
Sudia’s team typically uses cert-manager to issue Let’s Encrypt certificates and needed to have these certificates rotated every 24 hours. He wanted to avoid complex RBAC policies enforced on a per-container basis as handled by other service meshes including Istio. With a small team, the ability to quickly create a highly secure cluster with mTLS was critical. It took Sudia’s team about 30 minutes to set up mTLS, and most of this time was spent reading docs. This level of simplicity and ease of setting up mTLS is incredibly powerful, says Sudia, especially for a small team like his.
Andersen’s team needed mTLS to securely route traffic between Linkerd meshed clusters. The fact that Linkerd provides automatic certificate generation, which is what they use for east-west traffic, came in very handy. For north-south traffic they use Nginx, their ingress controller. For his use case, Andersen wishes Linkerd had a built-in ingress controller, but admits that an ingress controller is “as big of a project as a service mesh” and understands why it may be best to separate the two.
Community and Support
Whenever Andersen or Sudia ran into issues, they found the Linkerd community to be quite helpful and were able to resolve any issues quickly.
At one point, Andersen had trouble with a caching service that wasn’t working with Linkerd when issuing HTTP sessions. On the Linkerd Slack channel, with the help of the community, he was able to figure out a solution and resolved it within a day. To his delight, a fix for this issue was included in the next Linkerd release, a great example of the project’s maintainers’ responsiveness.
Sudia’s team needed help when facing an issue with Ambassador’s tracing service. Within a day, he was able to find a resolution with the help of the community on the Linkerd Slack. In particular, he appreciates the streamlined Linkerd documentation and attributes it to Linkerd being such a focused product, something that Istio has struggled with.
Not surprisingly, both Sudia and Andersen are big on monitoring even beyond Linkerd. They access monitoring data from multiple sources including Prometheus, Grafana Cloud, Elasticsearch, Rancher, Datadog, Jaeger, and SumoLogic. While their monitoring tool mix varies, they both are on a path of consolidating all their monitoring metrics into a single tool to gain a unified view of all metrics, logs, and traces.
I concluded by asking them for their view on the state of the service mesh ecosystem today and their thoughts on Istio. They echoed the view that Istio tries to do a lot of things, and while that may work for other organizations, they wanted something that was focused, flexible, and checked all the right boxes for them. That’s exactly what they’ve found in Linkerd. When it comes to Istio, they don’t particularly suffer from “FOMO,” and their enthusiasm in sharing their experiences with Linkerd speaks volumes about their support for the project.
I hope Sudia and Andersen’s perspective was insightful and that, by sharing their experiences, you may be better prepared when embarking on your service mesh journey. To hear more from Sudia and the open source projects his team has used to build a Kubernetes platform developers actually enjoy using, check out his keynote at this year’s KubeCon NA.
Disclosure: The author has done some consulting work with Buoyant, which manages Linkerd.
The Cloud Native Computing Foundation and KubeCon+CloudNativeCon are sponsors of The New Stack.