Is Kubernetes the Next Fault Domain?

These days, most application architecture is distributed by default: connected microservices running in containers in a cloud environment. Organizations large and small now deploy thousands of containers every day — a complexity of scale that is almost incomprehensible. The vast majority of organizations depend upon Kubernetes (K8s) to orchestrate, automate and manage all these workloads.
So what happens, then, when something goes wrong with Kubernetes?
A fault domain is the area of a distributed system that suffers the impact when a critical piece of infrastructure or network service experiences problems. Has Kubernetes become the next fault domain?
Contemplating the disaster of a Kubernetes-related application failure is the stuff of DevOps nightmares. But in disaster, there is also opportunity: Kubernetes has the potential to help us have a common operating experience across data centers, cloud regions and even clouds by becoming the fault domain we design our high availability (HA) applications to survive.
Kubernetes as Common Operating System
Many distributed applications need to run as close to users as possible, so let’s say we want to build a three-region cluster.
Without Kubernetes, even in a single cloud, that means managing a fleet of virtual machines and setting up a bunch of scripts on each server so it can self-heal. If a server gets shut down or restarts, we have to write a bunch of Terraform or Ansible (or Puppet or Chef or Pulumi) code to regenerate it.
Then, if we want to be cross-cloud, we have to do all that stuff three different ways! We’ve gotta know the AWS way of doing it. We’ve gotta know the Azure way of doing it. We’ve gotta know the Google way of doing it.
Using Kubernetes, though, the only thing we need to know that’s specific to AWS, Azure or Google is how to get at a Kubernetes cluster … and then how to configure those clusters to provision infrastructure and to communicate with one another, whether that’s via private networking, a VPN, or TLS over the Internet. Once that’s done, the rest of our administration work is largely the same, regardless of where our infrastructure lives.
Kubernetes effectively gives us a common operating system, regardless of where we’re running infrastructure. It’s acting as our OS and abstracting away the complexities of whatever availability zone or region or cloud that we’re running on. We have a common operating language regardless of where we’re deploying it, and we get all the great self-healing capabilities of Kubernetes.
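To make that concrete, here’s a deliberately minimal sketch of what that common operating language looks like: a Deployment that asks for three replicas of a hypothetical web service (the names and image are placeholders). The same manifest applies unchanged to a cluster on AWS, Azure or Google, and Kubernetes reschedules the pods if a node disappears.

```yaml
# Minimal, illustrative Deployment: the same manifest works on any
# conformant Kubernetes cluster, regardless of the underlying cloud.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend                 # hypothetical application name
spec:
  replicas: 3                        # Kubernetes keeps three pods running,
  selector:                          # rescheduling them if a node fails
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: registry.example.com/web-frontend:1.0   # placeholder image
          ports:
            - containerPort: 8080
```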
This is great, but it is also how Kubernetes becomes the fault domain: Because the perimeter of our K8s cluster is now equal to the perimeter of the infrastructure that we sit on top of, we can treat each Kubernetes cluster as if it were a data center or cloud region for HA purposes. So if either the region or the Kubernetes cluster fails, our applications handle that failure the same way.
By making them equivalent, we reduce the number of dimensions that we have to manage from an availability perspective. This dramatically simplifies the distributed application landscape, because the Kubernetes cluster becomes the only fault domain that we have to think through.
The problem is that Kubernetes isn’t really designed to be treated as a fault domain.
The Next-Generation Problem for K8s
This is the next big problem that we now need to solve: For us to be able to easily treat Kubernetes as the fault domain for multiregion/multisite clusters, Kubernetes itself needs to provide a number of additional constructs to facilitate this pattern.
The K8s ecosystem and community have been chipping away at this problem for quite a while. This has led to various ways to purpose-build multiregion solutions for a particular application or application stack, but there is not yet a single unified strategy or solution for this problem area.
The most significant of these bespoke solution areas are networking and security, but there are also needs in the areas of infrastructure, failure recovery, observability and monitoring. Networking is crucial because the clusters need connectivity to one another, and then service discovery and traffic routing across them. And security matters because you need to make sure you don’t have access sprawl, and you need a central trust authority.
Networking
Currently there are cross-cluster communication platforms like Cilium, Tigera’s Project Calico, Submariner and Skupper. Each has pluses and minuses, but none of them seem to be the one-size-fits-all solution that you’d hope would exist.
Load Balancing and Service Discovery
Once the clusters can talk to each other, they need to be able to discover instances of different services running cross-site. And there’s a need for global load balancing that allows users to be routed to the closest available instances regardless of where they enter an application.
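One emerging answer here is the Kubernetes Multi-Cluster Services (MCS) API, implemented by projects such as Submariner’s Lighthouse: exporting a Service from one cluster makes it discoverable in peer clusters under a shared clusterset DNS name. The sketch below assumes an MCS implementation is already installed and that a Service named payments exists; both names are illustrative.

```yaml
# Export an existing Service so that peer clusters can discover it.
# Requires a Multi-Cluster Services implementation (e.g., Submariner).
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: payments        # must match the name of the Service being exported
  namespace: prod
```

Once exported, the service typically resolves from other clusters at a name like payments.prod.svc.clusterset.local; a global load balancer can then sit in front of the per-site entry points to route users to the closest healthy instance.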
Security
If you’re administering a K8s cluster, generally speaking, you are going to have pretty low-level security permissions. You effectively need to have the same level of security permissions in each cluster, and if you’re not really careful about managing them, you can end up with more or fewer permissions than you need in a particular cluster.
Unfortunately, security management in K8s is often still a pretty manual process, and this becomes more difficult the more distributed your application gets.
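Kubernetes RBAC gives you the raw building blocks to scope permissions, but keeping those objects identical everywhere is up to you. As a minimal, illustrative example (the namespace and group names are placeholders), the following grants an on-call group read-only access in one namespace; applying the same pair of objects in each cluster keeps permissions consistent, but nothing in Kubernetes enforces that consistency across clusters.

```yaml
# Read-only access to pods and services in the "prod" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-read-only
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
# Bind the role to a (hypothetical) group from your identity provider.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prod-read-only-binding
  namespace: prod
subjects:
  - kind: Group
    name: sre-oncall
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: prod-read-only
  apiGroup: rbac.authorization.k8s.io
```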
Trust and Identity
Right now, sharing a single trust and identity source across multiple Kubernetes clusters is a somewhat painful exercise, which exacerbates some of the other security issues you might hit when running an application across multiple Kubernetes clusters. This becomes even more important when you have interaction between pods across sites, where an administrator may need to be connected to multiple Kubernetes clusters concurrently for troubleshooting purposes.
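In practice, that usually means an administrator juggling one kubeconfig context per cluster, each with its own endpoint, certificate authority and credentials, switching between them with kubectl config use-context. A trimmed, illustrative kubeconfig (the endpoints are placeholders; CA data and credentials are omitted) looks like this:

```yaml
# Illustrative kubeconfig spanning two clusters. In practice each cluster
# entry carries its own certificate authority and each user entry its own
# credentials -- there is no single shared trust source.
apiVersion: v1
kind: Config
current-context: us-east
clusters:
  - name: us-east
    cluster:
      server: https://k8s-us-east.example.com   # placeholder endpoint
  - name: eu-west
    cluster:
      server: https://k8s-eu-west.example.com   # placeholder endpoint
contexts:
  - name: us-east
    context: {cluster: us-east, user: admin-us-east}
  - name: eu-west
    context: {cluster: eu-west, user: admin-eu-west}
users:
  - name: admin-us-east
    user: {}                                    # credentials omitted
  - name: admin-eu-west
    user: {}                                    # credentials omitted
```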
Infrastructure and Performance
Currently, the Kubernetes primitives that provision pods only let you declare “how much” of something you get, without consideration for how performant that infrastructure is. For example, you can ask for a CPU, or a volume of a specific size, but you can’t request a particular processor or guarantee the performance characteristics of a drive. This means each site has to be carefully tuned and monitored to make sure you don’t have performance hot or cold spots.
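Here’s what that looks like in practice, as a minimal, illustrative sketch (the names and image are placeholders):

```yaml
# You can declare "how much" -- CPU cores, memory, storage capacity --
# but not the performance characteristics of the hardware behind them.
apiVersion: v1
kind: Pod
metadata:
  name: db-node
spec:
  containers:
    - name: db
      image: registry.example.com/db:1.0        # placeholder image
      resources:
        requests:
          cpu: "2"          # two CPUs of unspecified generation and speed
          memory: 8Gi
        limits:
          cpu: "2"
          memory: 8Gi
      volumeMounts:
        - name: data
          mountPath: /var/lib/db
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: db-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi        # a size, but no IOPS or latency guarantee
```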
Failure Recovery
No matter how distributed a system is, there’s still the chance of a disaster that an environment wasn’t designed to survive. A disaster recovery strategy to mitigate this possibility requires applications to reach outside the fault domain to store and retrieve backups, and this is a non-trivial activity both in Kubernetes and in the clouds in general.
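One common pattern, sketched below with a hypothetical backup image and bucket, is a CronJob in each cluster that pushes backups to object storage outside that cluster’s own fault domain. Kubernetes schedules the job, but credentials, restore procedures and cross-region replication of the backup bucket are all left for you to solve.

```yaml
# Illustrative nightly backup job that writes outside the cluster's fault
# domain. The image, command and bucket are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 3 * * *"                  # every night at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: registry.example.com/backup-tool:1.0
              args: ["--destination", "s3://example-backups/us-east"]
```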
Observability and Monitoring
When running multisite applications, it’s important to be able to monitor the health, performance, and behavior of the entire system in a way that allows administrators to intervene before problems occur and do capacity planning to manage increased demands on the system.
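There’s no single built-in answer here either. A typical setup runs a metrics stack in each cluster and then federates or ships the per-site data to a central store; the alert rule below is a minimal sketch that assumes the Prometheus Operator’s CRDs are installed (they are not part of Kubernetes itself), with an illustrative expression and labels.

```yaml
# Illustrative per-site alert: page when nodes in this site stop reporting.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: site-health
  namespace: monitoring
spec:
  groups:
    - name: multisite.rules
      rules:
        - alert: SiteNodesDown
          expr: 'up{job="node-exporter"} == 0'
          for: 5m
          labels:
            severity: page
          annotations:
            summary: A node in this site has stopped reporting metrics.
```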
We are highly aware of this at Cockroach Labs, where Kubernetes is key to CockroachDB and our managed database services. Here is how we use Kubernetes as the fault domain for multiregion or multisite clusters.
Landing the Control Plane
You may not think about the application you’re building as global, but it is. A deployment across two or three sites has the same challenges as a planet-spanning multiregional deployment, so the application must be built with the same architectural primitives.
Unless you have an extremely localized business model, you’re going to be building this way, either now or in the near future. What’s amazing about this is that the more distributed an active-everywhere workload becomes, the less expensive it is to survive any particular failure.
For example, in a traditional two-site disaster recovery scenario, you have to have 2x of everything to be able to continue to operate if you have a data center or region failure. With CockroachDB distributed across three sites, you only need 1.5x of everything to operate without disruption, because even after losing a site, the two remaining sites can absorb the full workload.
The real cost of surviving a site failure goes down even more as you spread an application across additional sites. For example, when spread across five sites, you only need to provision 1.25x the amount of infrastructure to be able to continue operations undisrupted in the case of a site failure.
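The arithmetic behind these figures, under the simplifying assumption that data and load are spread evenly and the system only needs to survive the loss of one site at a time, is that each of the n sites must be able to absorb 1/(n-1) of the total workload:

```latex
\[
  \text{provisioning factor}(n) \;=\; \frac{n}{n-1},
  \qquad
  \frac{2}{1} = 2\times, \quad
  \frac{3}{2} = 1.5\times, \quad
  \frac{5}{4} = 1.25\times
\]
```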
When you’re deploying multiregion/multisite, we recommend a Kubernetes cluster for each site, and then we span CockroachDB across those sites. This is where, lacking this next-generation solution, we had to do a bunch of custom work for our managed CockroachDB service to be able to treat Kubernetes as the fault domain.
We solved this in CockroachDB Dedicated and CockroachDB Serverless by building a control plane to manage it for us. We use the networking in either Google or Amazon to allow routing between Kubernetes clusters in different regions, and then we use the control plane to apply all of the security settings consistently and to check that they stay up to date. It also provides us with the kinds of observability information we need to support hundreds of clusters concurrently and help customers troubleshoot issues they might be experiencing.
The control plane does other things, as well. We created a centralized key management store so administrator keys don’t have to be discreetly shipped to each separate region. We’ve also spent a lot of time thinking about persistence.
Of course, what we’ve built is custom for CockroachDB, just like what anyone else would have to build today to manage these kinds of edges.
As we were initially building the CockroachDB database itself, we talked about Kubernetes constantly, because even in a single-site configuration, pairing CockroachDB with Kubernetes gives the database better resilience characteristics than it has on its own.
These days, if you look at our website, we don’t mention Kubernetes nearly as much. But internally, it’s still totally top of mind: all of CockroachDB Dedicated and CockroachDB Serverless, plus a number of our self-hosted clusters, run in Kubernetes. It’s just that the control plane handles the complexity.
Conclusion
Hybrid, multiregion and even multicloud deployments are becoming not just increasingly common, but also increasingly necessary for businesses needing to scale horizontally, guarantee availability and minimize latency. Kubernetes has the potential to help us have a common operating experience across data centers, cloud regions and even clouds by becoming the fault domain we design our HA applications to survive.
We believe that the best way to do that is to have a Kubernetes cluster in each location, and then have some sort of shared mechanism to wire them together effectively: sharing security configuration information, setting up network routing and putting in place all of the other pieces needed to solve this next-generation problem.
Every Software as a Service company on the planet, and every multinational company as well, has this exact problem. At Cockroach, we see this every day because many of those companies and platforms are our customers.
Inventing the mechanism that allows Kubernetes to be distributed across multiple regions: This is the challenge now for the entire Kubernetes ecosystem and community.