Your Data Plane Is Not a Commodity
BMW makes fine automobiles. They are beautiful driving machines, and nothing about them is cheap. Case in point: replacement tires cost $250 apiece on many models, and you have to buy them from the dealer. Those tires are usually the result of tight collaboration with tire manufacturers and are tailored to fit the drive of the car. You can buy discounted generic aftermarket tires, but that would diminish your performance.
So if you are going to invest a ton of time, effort and engineering hours in a service mesh and a Kubernetes rollout, why would you want to buy the equivalent of cheap tires? In this case, that means a newer and minimally tested data plane written in a language that may not even have been designed to handle wire-speed application traffic. Because, truly, your data plane is where the rubber meets the road for your microservices. The data plane is what will directly influence customer perceptions of performance. The data plane is where problems will be visible. The data plane will feel scaling requirements first and most acutely. A slow-to-respond data plane will slow down every service running on Kubernetes and drag overall system performance with it.
Like tires, too, the data plane is relatively easy to swap out. You do not necessarily need major surgery to pick the one you think is best and mount it on your favorite service mesh and Kubernetes platform. But at what cost?
A Checklist to Guide Your Data Plane Decision
As you build a Kubernetes environment and pick a service mesh to power it, here is a checklist to guide you. It covers key questions to consider when adopting a data plane (e.g., an ingress controller or reverse proxy solution) for Kubernetes, so you can be sure you are getting the performance you need and want. More than anything, ask yourself whether your data plane can handle all the traffic your service mesh and applications are going to throw at it.
1. How Many Years Has the Data Plane Been in Service?
There is a good reason why software such as Linux and MySQL is among the most trusted for running applications in production. It has years of service delivering applications at massive scale in a wide variety of challenging environments. Your data plane should ideally have more than a decade in service. Granted, that pre-dates a lot of Kubernetes deployments. The reality is, however, that the core of a data plane is performing the same work load balancers and reverse proxies have done for years: serving, shaping and securing HTTP traffic. So there is no reason not to stand on the shoulders of giants and deploy core technologies that have stood the test of time and are trusted to run in production by large organizations for the most critical applications.
Even if you are not initially running mission-critical workloads in Kubernetes, you will inevitably get there some day. Having a data plane with enough service time to be battle-tested is key.
2. What Is the Capacity of the Data Plane?
Most data planes publish capacity data for the number of connections or transactions per second, along with resource consumption requirements. This should be table stakes. If the capacity is on the low side, you know you will need to swap out your data plane for one with higher capacity as you scale. Alternatively, you will need to architect a system that scales horizontally, which can increase costs and complexity. As for those capacity numbers, take them with a grain of salt. Basic capacity numbers may tell only part of the story. Resource usage often climbs steeply under load, and those attractive low CPU numbers can evaporate as soon as the data plane begins handling secure (encrypted) traffic.
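To make the horizontal-scaling trade-off concrete, here is a minimal back-of-the-envelope sketch. The function name, the requests-per-second figures and the 25% headroom default are all illustrative assumptions, not numbers from any particular data plane. The idea is simply to derate the published capacity before dividing it into your expected peak.

```python
import math

def instances_needed(peak_rps, rated_rps_per_instance, headroom=0.25):
    """Estimate how many data plane instances a peak load requires.

    headroom is the fraction of the published (rated) capacity you refuse
    to count on, for example to absorb TLS overhead or traffic spikes.
    All inputs are assumptions you supply; nothing here is measured.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_rps = rated_rps_per_instance * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)

# Hypothetical figures: 50,000 rps peak, 8,000 rps rated per instance.
optimistic = instances_needed(50_000, 8_000, headroom=0.0)
cautious = instances_needed(50_000, 8_000, headroom=0.25)
```

The gap between the optimistic and cautious counts is a rough measure of how much a marketing-grade capacity number can understate your real footprint.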
To get a real sense of how a data plane will perform in your specific environment, you need to test it at scale, and you need to talk to other people who have used it in production. Sometimes performance degrades well before capacity thresholds are hit. So check for published capacity validation tests, but also verify for yourself that the data plane can deliver as advertised.
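One way to check whether performance degrades before the advertised capacity is reached is to run a load test at several traffic levels and compare tail latency against a budget. The sketch below assumes you have already collected latency samples per load level with whatever load-testing tool you prefer; the function names, the data layout and the 40 ms budget are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(samples)
    rank = max(1, (len(ordered) * pct + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def first_degraded_level(results, p99_budget_ms):
    """Return the lowest load level whose p99 latency exceeds the budget.

    results maps requests-per-second -> list of latencies in milliseconds.
    Returns None if every tested level stays within the budget.
    """
    for rps in sorted(results):
        if percentile(results[rps], 99) > p99_budget_ms:
            return rps
    return None

# Made-up samples: latency looks fine on average at 5,000 rps,
# but the tail has already blown through a 40 ms budget.
results = {
    1_000: [5] * 100,
    5_000: [5] * 98 + [80, 90],
    10_000: [50] * 100,
}
degraded_at = first_degraded_level(results, p99_budget_ms=40)
```

Looking at the 99th percentile rather than the average matters here, because a data plane can keep a healthy mean while a small share of requests is already slow enough for customers to notice.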
3. Does It Have the Integrations You Need and Want in the Future?
Your service mesh, Kubernetes environment and applications are living, breathing systems. You also are likely to need different types of applications as you grow. Different teams may choose different languages, data stores, application servers and frameworks to deliver their applications. Data planes with few integrations will limit those choices. That can box in your teams in uncomfortable ways.
For example, if a team wants to adopt GraphQL to limit the number of API calls required, then in most instances it will need a data plane that can easily support GraphQL traffic. It also helps to understand how easy it is to add integrations, and whether the company or community that supports the data plane is innovating on integrations at a rapid pace. Both are good indicators of whether a data plane will continue to meet your needs in the future.
4. How Does the Data Plane Instrument and Provide Observability?
Downtime happens; it is inevitable. The real questions to ask are: What is the cost of application downtime, and how does your data plane handle catastrophic failures? In a development cluster, the cost of failure may be minimal to none. Downtime in a production cluster can become very expensive very quickly, more so if your platform can’t provide ready insight into what is happening across it. Downtime can affect other applications, take customers and key business processes offline and lead to security breaches. At the data plane level, it is critical to have the visibility tools to troubleshoot downtime in near real time so you can trace outages and root causes quickly. Just as important is the flexibility to access application and service data quickly and easily without requiring someone who is a master at parsing log files or time series.
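As a rough illustration of what "without a master at parsing log files" can look like, the sketch below assumes the data plane emits structured request records rather than free-form log lines. The record keys `service` and `status` are hypothetical, chosen only for this example; with data in that shape, finding the service behind an outage is a few lines of aggregation.

```python
from collections import defaultdict

def error_rates(records):
    """Return {service: fraction of responses with a 5xx status}.

    records is an iterable of dicts with the hypothetical keys
    'service' (str) and 'status' (int HTTP status code).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        totals[rec["service"]] += 1
        if rec["status"] >= 500:
            errors[rec["service"]] += 1
    return {svc: errors[svc] / totals[svc] for svc in totals}

def worst_service(records):
    """Return the service with the highest 5xx rate, or None if no records."""
    rates = error_rates(records)
    return max(rates, key=rates.get) if rates else None

# Made-up records standing in for structured data plane output.
records = [
    {"service": "cart", "status": 200},
    {"service": "cart", "status": 503},
    {"service": "auth", "status": 200},
    {"service": "auth", "status": 200},
]
suspect = worst_service(records)
```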
5. Can Your Data Plane Dynamically Recover From Catastrophic Failures?
To recover quickly, a service mesh must be able to respond dynamically to infrastructure and application failures. Dynamic is a relative term in the Kubernetes world. We rely on Kubernetes to be our control plane for resiliency, but some things are out of Kubernetes’ hands, such as pod time to availability. In some clouds, dynamic may mean a restart in five minutes. In others, the restart is faster. In Kubernetes, we tend to measure dynamic restart times in seconds. So make sure you understand the actual capabilities of your service mesh and the underlying cloud to quickly recover from failures. The reverse proxy sidecars are arguably the most important component in your mesh. Like those discounted tires, if sidecars go down, your service-to-service traffic stops. All applications running in your clusters will stop working.
Do you trust your sidecar proxy to provide enterprise resiliency and recovery? If the answer is “I am not sure,” you are taking a huge risk with your choice of service mesh.
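One way to replace "I am not sure" with a number is to probe a service at regular intervals during a failure drill and measure how long each outage actually lasted. The sketch below assumes you already have a time-ordered list of probe results; the function names, the tuple layout and the recovery targets are illustrative assumptions, not features of any specific service mesh.

```python
def outage_durations(probes):
    """Return the length in seconds of each completed outage.

    probes is a time-ordered list of (timestamp_seconds, healthy) tuples.
    An outage runs from the first failed probe to the next healthy one.
    An outage still open at the end of the data is not counted.
    """
    durations = []
    down_since = None
    for timestamp, healthy in probes:
        if not healthy and down_since is None:
            down_since = timestamp
        elif healthy and down_since is not None:
            durations.append(timestamp - down_since)
            down_since = None
    return durations

def meets_recovery_target(probes, target_seconds):
    """True if every completed outage recovered within the target."""
    return all(d <= target_seconds for d in outage_durations(probes))

# Made-up probe results from a hypothetical failure drill.
probes = [(0, True), (1, False), (2, False), (4, True), (10, False), (11, True)]
durations = outage_durations(probes)
within_five_seconds = meets_recovery_target(probes, target_seconds=5)
```

Running the same drill on each candidate cloud and mesh gives you comparable recovery figures instead of a vendor's definition of "dynamic."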
Conclusion: Your Data Plane Matters, So Choose Accordingly
None of this is to say that other parts of a service mesh are not essential. If you hate your control plane, and it does a poor job of helping you manage and understand your service traffic and application topology, then your service mesh is not a happy place, either. That said, many people who are considering service mesh choices might want to give equal consideration to the part of their service mesh that touches customers and other services more than anything else. That is the data plane. Putting the right data plane on your service mesh will give you the BMW-like performance you need and a service mesh that is snappy, responsive, reliable and road-worthy in all conditions.