For My Next Trick, I’ll Make a Service Mesh… Disappear!
For this article, I’ll be doing my best David Copperfield impression — no, I won’t make the need for a service mesh disappear like the famous magician once did with the Statue of Liberty. But, like that illusion, we’ll look at things from a different perspective.
If you’re a fan of This American Life like I am, you may have heard the episode in which David Kestenbaum talks about his fascination with that specific trick and how it led him to try magic on his own. In that episode, they talk a little about how the trick was actually done (spoiler alert: the Statue of Liberty didn’t actually disappear). Instead, the production relied on a rotating platform, clever camera angles and the live studio audience (yeah, they were in on it).
Now, if you’re still with me, you’re probably thinking, “Thanks for spoiling my sense of wonder, but what the heck does this have to do with service mesh?” Here’s the comparison I’m trying to draw: we need to apply the same sort of misdirection and methodology to how we incorporate service mesh into our existing workflows. The mesh never really disappears; we’re just looking at it from a different angle. Just like that audience, we’re all in on the trick.
So without further ado and with nothing up my sleeves, let’s make the service mesh… disappear!
Cloud Infrastructure Provisioning and Automation
What if, instead of treating a service mesh like something I deploy into an existing environment, I weave it into a core infrastructure provisioning workflow to build that network in the first place?
Let’s start with a familiar premise that is largely agreed upon: more and more companies want to automate and standardize infrastructure provisioning, both across cloud providers and on-premises. The massive number of downloads we see for HashiCorp Terraform, over 1 million per month, supports this point.
Terraform is probably the most widely adopted infrastructure provisioning tool. The most basic unit that it provisions is cloud infrastructure, but (as many Terraform users will likely tell you) Terraform is used for everything.
As a disclaimer, I’m going to use HashiCorp tools to illustrate my points in this article, but I fully understand that these concepts still apply to other solutions that you and your organization are using. Remember, it’s all about the workflow and the outcome we’re trying to achieve.
In the Terraform Registry, you will find a provider (essentially a plugin that integrates Terraform with another technology), community or official, for just about any technology. The reason is that the story doesn’t end when I provision infrastructure. I have other technologies and tools that I want to incorporate to make the environment ready for use by development teams. As a result, Terraform files grow in complexity, combining multiple providers and modules to build out more robust architectures, all defined in code.
So how does this apply to service mesh?
HashiCorp Consul is a service mesh tool that provides both service discovery and secure connectivity between applications in both cloud and on-premises environments. (Other service mesh solutions could also be provisioned via infrastructure as code, but I wanted to provide you with concrete examples.) From a Terraform perspective, I can use the Helm provider to deploy Consul into my newly created infrastructure and the Consul provider to configure it, and suddenly I have a repeatable and reproducible way to provision new infrastructure with a service mesh deployed.
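To make that concrete, here is a minimal sketch of what the Helm route might look like in Terraform. It assumes the same configuration has already created a Kubernetes cluster and that the Helm provider is pointed at it. The release name, namespace and chart values are illustrative choices of mine, not settings from a real environment.

```hcl
terraform {
  required_providers {
    helm = {
      source = "hashicorp/helm"
    }
  }
}

# Assumption: the Helm provider is already configured to reach the cluster
# created earlier in this configuration (for example through a kubeconfig file).

resource "helm_release" "consul" {
  name             = "consul"                              # illustrative release name
  repository       = "https://helm.releases.hashicorp.com" # HashiCorp's Helm repository
  chart            = "consul"
  namespace        = "consul"                              # illustrative namespace
  create_namespace = true

  # Chart values are passed as YAML; the specific values here are illustrative.
  values = [
    yamlencode({
      global = {
        name = "consul"
      }
      connectInject = {
        enabled = true # turn on sidecar injection for mesh workloads
      }
    })
  ]
}
```

Because the release sits in the same configuration as the cluster, a single `terraform apply` would stand up the infrastructure and the mesh together.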
My operators don’t have to learn a new workflow. In fact, I am still provisioning infrastructure the same way, just “rotating the platform” somewhat so we get the end view we want. This glosses over the complexity of building these solutions, but investing the build time once gains me future value through automation. Further down, I’ll explain who should be responsible for managing that complexity.
Zero Trust Security
When service mesh is discussed, you will often hear about how it improves security at the application layer. If you ask security teams about service mesh, though, they probably won’t show much interest. Service mesh is mostly thought of as a developer tool, but as we discussed in the infrastructure section, we can integrate service mesh into the security workflow as well. This narrative is already part of the conversation and you might have heard it referred to as Zero Trust Security.
The general premise is that I need to plan on having different identity-driven controls. These controls include:
- Human access and authorization
- Machine access and authorization
- Machine-to-machine access
- Human-to-machine access
For human and machine authentication and authorization, SSO products (Okta, Ping, Auth0, etc.) and tools such as HashiCorp Vault are great at securing these workflows. Rather than trusting users with the responsibility of rotating the important credentials that machines use to access other services (e.g. database passwords, certificates, or access tokens), Vault automates this process. Additionally, Vault takes a “no-trust first” approach to providing secrets and access to various entities. The market is noticing the rise of Vault and similar tools, a category known as “Secrets Management.” The CNCF recently released a new radar on this category, showing a general shift away from relying on users to manage credentials and toward automation tools.
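As a small illustration of that approach, a Vault policy grants nothing unless a path is listed explicitly. The sketch below assumes a database secrets engine mounted at `database/` and a role called `orders-readonly`; both names are hypothetical.

```hcl
# Vault policies are deny-by-default: a token can do only what a policy
# explicitly grants.

# Assumption: a database secrets engine is mounted at "database/" and has a
# role named "orders-readonly" (hypothetical name).
path "database/creds/orders-readonly" {
  capabilities = ["read"] # reading this path returns short-lived credentials
}
```

An application holding a token with this policy could request database credentials that Vault generates and expires on its behalf, so no person has to rotate them.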
Authentication and authorization are about protecting the endpoints, which identity brokering and secrets management are great for, but that really isn’t true zero trust, is it? Zero trust networking means that we’re also looking at what happens once that authentication occurs and protecting the actual data being exchanged between applications.
Getting to True Zero Trust with a Service Mesh
Some might say this is all overkill, but threats can come from anywhere. Just look at the recent hacks of Canva, Marriott or any of the other companies in Auth0’s list of the 11 biggest data breaches of 2020. In each case, the breach stemmed from compromised credentials and bad actors gaining access to the network.
When a service mesh is configured with a pattern of zero-trust networking, all network communications for newly registered services are denied by default until authorization is given in the form of a certificate or ACL token. Having that type of policy in place could have helped prevent those breaches by ensuring that even once the attackers gained access to the network, they would not have been able to gain the necessary permissions to access other applications.
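In Consul, this kind of rule can be written as a service intention. The config entry below is a sketch under my own assumptions: it borrows the `count-dashboard` and `count-api` service names from the Nomad example later in this article, allows only the dashboard to reach the API, and denies every other source.

```hcl
Kind = "service-intentions"
Name = "count-api" # the destination service being protected

Sources = [
  {
    # Only the dashboard may open connections to count-api.
    Name   = "count-dashboard"
    Action = "allow"
  },
  {
    # Every other service in the mesh is refused.
    Name   = "*"
    Action = "deny"
  }
]
```

A file like this would typically be applied with `consul config write`. In a cluster where ACLs are enabled with a default policy of deny, the wildcard rule is already the baseline, and listing it here just makes the intent explicit.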
If you’re going to create a true zero-trust environment, a service mesh will help you get there, but it needs to feel like a secret, supporting actor of the workflow rather than being the star of the show.
In the context of the tools I’ve mentioned, if I leverage Vault as my trusted identity source and root certificate authority, how am I distributing and rotating those certificates across my different applications? Paired with a tool like Consul, I can generate and retrieve intermediate certificates from Vault or create access control tokens, using both for authentication and securing the communication with mTLS all in one workflow.
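A rough sketch of that pairing is the Consul server configuration below, which asks Consul to use Vault as the certificate authority for the mesh. The Vault address and the two PKI mount paths are placeholder values I chose for illustration, and the token is deliberately left as a placeholder rather than a real credential.

```hcl
connect {
  enabled     = true
  ca_provider = "vault" # use Vault instead of Consul's built-in CA

  ca_config {
    address               = "https://vault.example.com:8200" # hypothetical Vault address
    token                 = "<vault-token>"                  # placeholder, supply securely
    root_pki_path         = "connect-root"                   # illustrative PKI mount for the root CA
    intermediate_pki_path = "connect-intermediate"           # illustrative PKI mount for intermediates
  }
}
```

With a setup along these lines, the leaf certificates that sidecar proxies use for mTLS chain back to a root held in Vault, so issuance and rotation stay inside one workflow.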
Again, this is glossing over the complexities, but the idea is that I’m thinking about service mesh as a part of my security strategy and trying to achieve an outcome of encrypted and trusted communications anywhere.
Perhaps the biggest motivation to adopt service mesh is the way it can turn your network security, service discovery and progressive delivery practices (e.g. blue/green and canary deployments) into a self-service interface for developers. I agree with Forrester’s David Mooter that “a service mesh should fade out of sight” in the eyes of the application developer, and a lot of the themes I’m covering here are based on what his article lays out.
The general idea is that service mesh should not be a “thing you do.” The reason: it’s not fair to expect application developers to be experts in service mesh in addition to everything else they manage. Just because some developers understand how to manage Kubernetes or HashiCorp Nomad does not mean they have an implicit understanding of service mesh.
So how do we make service mesh available to developers without forcing them to become experts in it? Here’s how — we make the mesh part of the delivery process.
I think the key here is identifying the orchestration methods that the application teams are using. Once you do that, it’s typically a matter of deploying the mesh along with the environment and essentially scripting the necessary additions to the deployment files. At face value, getting an application added to the service mesh is really just a couple of lines of code. This can be a few annotations in a Kubernetes manifest or an additional resource block in a Nomad job.
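service {
  name = "count-dashboard"
  port = "9002"

  connect {                 # opt this service into the Consul service mesh
    sidecar_service {       # have Nomad run a sidecar proxy next to the task
      proxy {
        upstreams {         # reach count-api through the proxy on local port 8080
          destination_name = "count-api"
          local_bind_port  = 8080
        }
      }
    }
  }
}
Example of a Consul sidecar proxy being added to a Nomad job via the service block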
The actual hard work is what comes before the delivery. Earlier, I recommended that you provision your service mesh alongside the environment. If you’re using Terraform to deploy a Kubernetes cluster with Consul, then really all the application engineer needs to do is ensure that the proper configs have been added. The process then stays the same and, if they are working off a shared repo, the manifest is consistent regardless of the changes to the code.
Don’t expect the developers to set up the service mesh before running their apps. Instead, standardize the practice and ensure the environment is ready for them.
Conclusion: Who’s the Magician?
So in this disappearing metaphor, who’s the magician? The Statue of Liberty trick doesn’t happen without David Copperfield designing and choreographing the illusion. It’s the same with a service mesh.
I’ve talked about different workflows and how a service mesh becomes a part of them, but it’s important to remember that someone needs to be the expert. There needs to be someone who is familiar with the concepts, can help augment these workflows, and can make the necessary changes. The biggest challenge we’re seeing in the adoption of service mesh is a lack of skills or resources to implement the tool. If you look at enterprise networks today, they just sort of work: developers, infrastructure and security teams leave the actual networking configurations to networking teams. When it comes to service mesh, there’s a similar expectation, but the owning team’s identity isn’t necessarily defined by traditional org structures.
Having an expert (or team of experts) to implement the mesh in a way that aligns with the evolving workflows of the teams they support is key to a successful service mesh deployment. Organizations should consider this as they build out their plans and make sure that person or team is a part of these discussions. Just like how the Statue of Liberty doesn’t “disappear” without the equipment, the cameras and the audience all helping out, a service mesh can’t fade away without all of your teams working together.