The Hidden Costs of Service Meshes
It’s time: You know you need a service mesh. You have countless microservices and are planning to build out in a cloud native way, securing all the traffic within your cluster – awesome sauce. There are a bunch of service meshes on the market and most of them are free. You want the best one for your needs.
So, how much is that wonderful free service mesh going to cost you? The answer is far more complicated than you might think. Like many things in the cloud and infrastructure-as-a-service worlds, adding a new tool often adds complexity and costs, even if the tool itself is free. Further, deployment and operation costs for service meshes can be hidden, ballooning the bill once you’re already beyond rip-and-replace.
Many examples of unexpected inflation exist. For starters, if each new sidecar injection has more than a 1x impact on CPU requirements in the control plane, a common service mesh problem, costs can escalate quickly as you add new services because each new service requires its own sidecars. Or, let’s say you need to maintain API connections to monolithic applications running outside of the cluster. If your service mesh doesn’t have this capability, then you might need to keep a separate API gateway on top of the service mesh. That alone could double your costs, management time and risks. What if you rely on User Datagram Protocol (UDP) for any part of your application management and control? Unfortunately, most service meshes don’t support this. You might have to refactor your application and switch control methods to shoehorn it into the service mesh – yet another price tag.
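To put rough numbers on the sidecar example, here is a minimal sketch of the arithmetic. Every figure in it (replicas per service, vCPU per sidecar, control plane vCPU per sidecar, price per vCPU-hour) is an illustrative assumption rather than a measurement of any particular mesh, and it assumes the control plane scales only linearly, which is the optimistic case.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_mesh_cpu_cost(services, replicas_per_service, sidecar_vcpu,
                          control_plane_vcpu_per_sidecar, price_per_vcpu_hour):
    """Estimate the monthly cost of the CPU reserved for the mesh itself.

    Assumes one sidecar per pod replica and a control plane whose CPU
    needs grow linearly with the number of sidecars it manages.
    """
    sidecars = services * replicas_per_service
    vcpu = sidecars * (sidecar_vcpu + control_plane_vcpu_per_sidecar)
    return vcpu * price_per_vcpu_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical inputs: 3 replicas per service, 0.10 vCPU per sidecar,
    # 0.02 control plane vCPU per sidecar, $0.04 per vCPU-hour.
    for services in (10, 50, 200):
        cost = monthly_mesh_cpu_cost(services, 3, 0.10, 0.02, 0.04)
        print(f"{services:>4} services: ${cost:,.2f} per month")
```

If the control plane impact is worse than linear, replace the linear term with your own measured curve; the point is to run the numbers before the service count grows.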
Questions to Ask Before Choosing a Service Mesh
The obvious first question is: “Do I need a service mesh?” In many situations, such as when you only have a small number of microservices to manage or don’t have explicit security requirements like end-to-end encryption, a service mesh may be overkill. (If you’re not sure whether you’re ready for a service mesh, read How to Choose a Service Mesh for a six-point readiness checklist.)
Let’s assume you’ve passed that threshold and ascertained that, yes, you do need, and want, a service mesh. Before you get too far down that road, read this cautionary tale from the service mesh battlefield, told by the team at HelloFresh about their own deployment journey.
“While working with Istio over the last few months, we found that even small Istio global configurations can have major knock-on effects elsewhere upstream. In one instance a small global configuration change caused Istio Pilot to push new configuration to…more than 1,000 connected istio-proxies. While the proxies loaded this configuration, each consumed roughly 75mb more memory than usual. While 75mb doesn’t sound like much, when you have dozens of connected istio-proxies on the same worker node, this sharp temporary spike caused some of our Kubernetes worker nodes to be starved of available memory thereby impacting the pods running on the node. Luckily we caught this before deploying to a live environment.”
HelloFresh’s full article is well worth reading. However, it is more than a year old, and Kubernetes and Istio continue to improve regularly, so many of these technical issues might be easier to resolve today. The key takeaway is this: Their service mesh choice had real cost implications, whether in data and CPU usage, in workarounds and additional compute infrastructure, or simply in time spent troubleshooting and fixing. A reasonable question to HelloFresh would be: “Did you need your service mesh to manage all of your services everywhere and, if not, could some of the challenges you experienced have been mitigated by being more selective with your service mesh footprint?”
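The memory spike in the HelloFresh quote is easy to reproduce on paper. The sketch below uses the roughly 75 MB per proxy figure from the quote; the number of proxies per node and the free memory on the node are hypothetical values chosen only to illustrate the check.

```python
def config_push_spike_mb(proxies_on_node, spike_mb_per_proxy=75):
    """Total temporary extra memory on one node while proxies reload config."""
    return proxies_on_node * spike_mb_per_proxy


def node_has_headroom(proxies_on_node, free_memory_mb, spike_mb_per_proxy=75):
    """True if the node can absorb the spike without starving other pods."""
    return config_push_spike_mb(proxies_on_node, spike_mb_per_proxy) <= free_memory_mb


if __name__ == "__main__":
    # Hypothetical node: 36 proxies ("dozens") and 2,048 MB of free memory.
    proxies, free_mb = 36, 2048
    print(f"spike: {config_push_spike_mb(proxies)} MB, free: {free_mb} MB")
    print("headroom OK" if node_has_headroom(proxies, free_mb) else "node at risk")
```

Running the same check with a smaller mesh footprint per node is one quick way to explore the “more selective” option raised above.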
You can gauge the potential hidden costs of your own service mesh options by asking this series of questions:
1. How many container images does it take to run your control plane? How large does each image have to be?
The control plane is non-negotiable; you need it to operate a service mesh. However, the control plane architecture can have substantial consequences, particularly the number and size of container images required to run the control plane. Why does this matter?
- Number: The number of container images required directly relates to how much ownership over your cluster the mesh wants to take, which in turn directly scales costs.
- Size: Simply put, the larger the container, the more resources you have to throw at it.
An additional cost item could be the type of CPU required to maintain the service levels you need for your application. We don’t often think of node sizes or types for microservices, but a heavyweight app in a Kubernetes cluster requires more computational resources just like in the cloud or on bare metal. For development environments, mid-tier CPUs might be sufficient. For true production or even high availability environments, you might need top-tier CPUs.
Remember, your service mesh ultimately runs on hardware, or virtual hardware, and container images determine the requirements of that hardware.
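One way to compare control planes is to add up what their containers ask of the cluster. The following is a rough sketch only: the component names, replica counts and resource requests are made up for illustration, and in practice you would read them from the mesh’s own installation manifests.

```python
from dataclasses import dataclass


@dataclass
class ControlPlaneComponent:
    name: str            # hypothetical component name
    replicas: int        # how many copies run in the cluster
    cpu_millicores: int  # CPU requested per replica
    memory_mib: int      # memory requested per replica


def control_plane_footprint(components):
    """Return total (CPU millicores, memory MiB) requested by the control plane."""
    cpu = sum(c.replicas * c.cpu_millicores for c in components)
    memory = sum(c.replicas * c.memory_mib for c in components)
    return cpu, memory


# Illustrative control plane with three images; all numbers are assumptions.
example = [
    ControlPlaneComponent("discovery", replicas=2, cpu_millicores=500, memory_mib=2048),
    ControlPlaneComponent("telemetry", replicas=1, cpu_millicores=250, memory_mib=512),
    ControlPlaneComponent("certificates", replicas=1, cpu_millicores=100, memory_mib=128),
]

cpu_total, memory_total = control_plane_footprint(example)
print(f"{len(example)} images, {cpu_total}m CPU, {memory_total} MiB memory")
```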
2. What is the capacity of your Ingress controller for your service mesh?
Ingress controllers are commonly used in front of Kubernetes clusters to police, manage and shape ingress and egress (north-south) traffic. However, like service meshes, Ingress controllers are not all created equal. Like traditional data center load balancers, some Ingress controllers have lower capacity for connections or requests per second, which means their containers tip into autoscaling at a lower threshold and therefore more often. Autoscale is the “break glass” of cloud native, and it can break the bank as well.
Yes, Kubernetes is explicitly designed to autoscale, and the service mesh scales along with it, but this should be a last resort. With frequent autoscaling, you might be paying through the nose for spot cloud pricing on a regular basis. Those costs quickly add up. Of course, you could switch to a higher-capacity Ingress controller, but consider this limitation first: not all service meshes support all Ingress controllers. Some pairings are tightly integrated, while in others the way they integrate may add unnecessary latency.
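A quick capacity estimate shows how much the per-replica limit matters. This is a simplified sketch: the peak load, the per-replica capacities and the 70% target utilization are all assumed values, and real controllers should be benchmarked with your own traffic.

```python
import math


def replicas_needed(peak_rps, rps_per_replica, target_utilization=0.7):
    """Replicas required to serve peak_rps while keeping each replica
    at or below the target share of its maximum capacity."""
    usable_rps = rps_per_replica * target_utilization
    return math.ceil(peak_rps / usable_rps)


if __name__ == "__main__":
    peak = 50_000  # hypothetical peak requests per second
    for label, capacity in (("lower-capacity", 10_000), ("higher-capacity", 30_000)):
        print(f"{label} controller: {replicas_needed(peak, capacity)} replicas")
```

With these assumed numbers the lower-capacity controller needs 8 replicas and the higher-capacity one needs 3, and the gap widens every time traffic spikes.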
3. Can your sidecar keep up with your service demand?
The more things change, the more they stay the same. Ironic, isn’t it? In the old world of applications in the cloud, or even monolithic apps, the speed of your proxy directly affected your application architecture. A slower proxy with lower capacity and less scalability might have meant you needed to deploy more proxies. A fast and scalable proxy was, and still is, better for high performance, low latency applications. In Kubernetes, the sidecar contains the proxy.
Some sidecars are written in languages that weren’t designed for networking or high-performance application delivery and traffic management, so there are large differences in performance, and some sidecars tend to tip over at certain traffic thresholds. In other cases, the team that wrote the sidecar may not have had much experience building bulletproof proxies able to handle web-scale traffic and ridiculous spikes in demand. The less performant the sidecar, the more sidecars and services you’ll need to keep up with traffic spikes and general load.
You could also be paying more than expected for a floundering proxy, which might be stopping traffic while still showing up as working fine in your Kubernetes dashboard. A floundering proxy could additionally introduce tail latency, which doesn’t show up in the averages, but can lead to unhappy customers.
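The gap between averages and tail latency is simple to demonstrate. In this sketch the latency samples are invented: 98 requests complete in 10 ms and 2 are stalled by a struggling proxy. The percentile helper uses the nearest-rank method, which is one of several common definitions.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 stalled ones.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

Here the mean stays under 30 ms and the median is 10 ms, yet one request in fifty takes a full second. Dashboards that only show averages would report this proxy as healthy.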
For more on how your data plane can influence the performance of your service mesh, read Your Data Plane Is Not A Commodity.
4. Will you be running multiple clusters or multitenancy?
The cost of multiple clusters can add up very quickly, since each cluster carries a floor rate per hour. If you’re running many services, but want different teams to maintain control of their services, then a cluster per team can break the bank.
That said, multitenancy – isolating resources in larger clusters – has its own challenges. Putting more applications and developer teams into a cluster does look cheaper on the surface, as it allows you to better optimize usage of storage and CPU. However, multitenancy can exert a nasty tax on your DevOps and DevSecOps teams by forcing them to create and maintain additional functionality. For example, with multitenancy, you will likely need to create segmentation between tenants at the namespace level while establishing role-based access control (RBAC). Then, you must consider the potential risks you’ll incur if one team makes changes to networking, security or other configurations, thereby exposing all applications to threats that could bring down the entire cluster.
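The per-cluster floor is worth computing before you hand every team its own cluster. The sketch below assumes a hypothetical flat fee of $0.10 per cluster per hour and a hypothetical $150 per month for the mesh control plane that each cluster must run; worker node costs are left out because they depend on the workloads rather than on the cluster count.

```python
def monthly_floor_cost(clusters, fee_per_cluster_hour,
                       mesh_control_plane_per_month, hours_per_month=730):
    """Fixed monthly cost that is paid per cluster before any workload runs."""
    per_cluster = fee_per_cluster_hour * hours_per_month + mesh_control_plane_per_month
    return clusters * per_cluster


if __name__ == "__main__":
    # All prices are illustrative assumptions, not any provider's price list.
    for clusters in (1, 4, 12):
        cost = monthly_floor_cost(clusters, 0.10, 150.0)
        print(f"{clusters:>2} clusters: ${cost:,.2f} per month before workloads")
```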
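To give a sense of the extra objects multitenancy asks your platform team to maintain, here is a minimal sketch that generates a Namespace, a Role and a RoleBinding for one tenant. The team name, namespace name, role name and the list of permitted resources are all hypothetical choices; a real setup would also need network policies, quotas and mesh-level policy, which are not shown.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def tenant_manifests(team, namespace):
    """Build Namespace, Role and RoleBinding manifests for a single tenant."""
    role_name = f"{team}-developer"  # hypothetical naming convention
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {
            "apiVersion": f"{RBAC_API_GROUP}/v1",
            "kind": "Role",
            "metadata": {"name": role_name, "namespace": namespace},
            "rules": [{
                "apiGroups": ["", "apps"],
                "resources": ["pods", "services", "deployments"],
                "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
            }],
        },
        {
            "apiVersion": f"{RBAC_API_GROUP}/v1",
            "kind": "RoleBinding",
            "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
            "subjects": [{"kind": "Group", "name": team, "apiGroup": RBAC_API_GROUP}],
            "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_API_GROUP},
        },
    ]


if __name__ == "__main__":
    for manifest in tenant_manifests("payments-team", "payments"):
        print(json.dumps(manifest, indent=2))
```

Three objects per tenant is the starting point, not the total. Multiply by the number of teams and add review time for every change to see the operational tax described above.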
5. How many CRDs does your service mesh require?
In Kubernetes speak, CRDs are “custom resource definitions.” In this land of clusters and service meshes, a resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind. A number of resources are included in the default Kubernetes installation, but Kubernetes also allows you to create CRDs for more custom or bespoke functionality.
Most service meshes require the creation of some CRDs in your Kubernetes environment to hook into your clusters and deliver functionality. However, too many CRDs is like too much code. Each CRD adds an API path for every version of the API it serves, which can rapidly create cascading complexity if a service mesh requires many CRDs. Because the number of CRDs is often directly proportional to the number of features implemented by the service mesh, more CRDs can translate into more challenging configuration, often for features you never use, and more maintenance work for both the service mesh and the applications it serves.
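A simple inventory makes the CRD footprint visible. The sketch below summarizes a list of CRD objects shaped like the `apiextensions.k8s.io/v1` schema (`spec.group`, `spec.names.kind`, `spec.versions[].served`). The sample data is invented, with placeholder API groups under `example.com`; in practice you would feed it the parsed JSON output of your cluster’s CRD list.

```python
from collections import Counter


def crd_summary(crds):
    """Count CRDs per API group and the total number of served versions,
    each of which corresponds to an API path the cluster must expose."""
    per_group = Counter()
    served_versions = 0
    for crd in crds:
        spec = crd["spec"]
        per_group[spec["group"]] += 1
        served_versions += sum(1 for v in spec["versions"] if v.get("served", False))
    return per_group, served_versions


# Invented sample data for a hypothetical mesh.
sample_crds = [
    {"spec": {"group": "mesh.example.com", "names": {"kind": "TrafficPolicy"},
              "versions": [{"name": "v1alpha1", "served": True},
                           {"name": "v1", "served": True}]}},
    {"spec": {"group": "mesh.example.com", "names": {"kind": "RetryPolicy"},
              "versions": [{"name": "v1", "served": True}]}},
    {"spec": {"group": "telemetry.example.com", "names": {"kind": "MetricRule"},
              "versions": [{"name": "v1beta1", "served": False},
                           {"name": "v1", "served": True}]}},
]

groups, paths = crd_summary(sample_crds)
print(f"{sum(groups.values())} CRDs in {len(groups)} groups, {paths} served versions")
```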
6. Do you need dedicated staff to run the service mesh? If so, how many people?
As mentioned above, complexity requires more management and more eyes on the tools. Requiring a dedicated staff person whose primary job is keeping your service mesh running smoothly could be a red flag for you. Considering the costs of good Platform Ops and DevOps talent, you will be adding a mid-six-figure expense to your budget right off the bat. This also goes against the entire idea of Kubernetes and service meshes “shifting left” to give application teams and service owners more control over their destiny. Lastly, it indicates that your service mesh choice is complicated, brittle and likely to become a single point of failure.
7. Does your service mesh choice lock you into a specific choice of software or cloud?
But wait: The entire point of Kubernetes is to allow you to move your containerized applications to any environment, right? Alas, this is probably more true for your Kubernetes layer than for your service mesh. Each cloud’s Kubernetes service comes with some dependencies, which makes moving applications between services somewhat challenging. Unfortunately, the dependencies at the service mesh layer tend to be even less forgiving and often come with heavily weighted opinions: how to integrate, or not, with your cluster; how much ownership to take of your entire cluster; whether or not to use new Kubernetes tools for repetitive functions, etc.
While many service meshes support the Service Mesh Interface (SMI) specification, there are differences from cloud to cloud in how they can, and should, be tuned for scaling, security tools and solutions, and treatment of observability. For example, some service meshes have very deep hooks for observability that require significant configuration and even data schema changes to port effectively to other clouds. A good practice is to catalog all the types of lock-in and challenges to portability you might encounter with a service mesh choice. The fewer challenges, the more optionality you preserve. If you truly need to build out to large scale in one or more clouds, you should be aware of the lock-in price you will pay in costs and flexibility.
Conclusion: Know the Hidden Costs
Everything above depends on your specific service mesh plans and the needs of your applications. For most teams, extensively test driving a service mesh prior to deploying in production is a sound idea, as is spending some time in advance defining the use cases you’re hoping to solve with your service mesh choice. That’s an idealized workflow, though: over a dozen service meshes exist, and testing them all could take a significant amount of time. Narrow the field first, and don’t spend time evaluating features you’ll never use. Ask your service mesh questions up front.
Doing a back-of-the-envelope cost benefit exercise on the top service meshes you are considering could provide eye-opening insights. Your finance, operations, HR and management teams will thank you later.
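A back-of-the-envelope exercise can be as small as the sketch below. The two candidate meshes and every number attached to them are hypothetical; the cost categories (infrastructure, extra tools such as a separate API gateway, and dedicated staff time) come from the questions above, and you should replace the inputs with your own estimates.

```python
def annual_mesh_cost(infra_per_month, extra_tools_per_month, staff_fte, fte_annual_cost):
    """Annual cost: 12 months of infrastructure and add-on tooling,
    plus the share of staff time dedicated to running the mesh."""
    return 12 * (infra_per_month + extra_tools_per_month) + staff_fte * fte_annual_cost


# Hypothetical candidates; none of these figures describe a real product.
candidates = {
    "mesh-a": annual_mesh_cost(infra_per_month=1500, extra_tools_per_month=0,
                               staff_fte=0.25, fte_annual_cost=180_000),
    "mesh-b": annual_mesh_cost(infra_per_month=900, extra_tools_per_month=1200,
                               staff_fte=0.5, fte_annual_cost=180_000),
}

for name, cost in sorted(candidates.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,.0f} per year")
```

In this made-up comparison the mesh with the cheaper infrastructure ends up costing more overall, because of the extra gateway and staff time it needs.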