Using Chaos Engineering to Improve the Resilience of Stateful Applications on Kubernetes
Kubernetes seems to be winning, even though the evidence for the use of Kubernetes for stateful workloads is less clear-cut. On the one hand, only 5% of respondents to the Cloud Native Computing Foundation’s most recent survey report that they do not intend to use a Kubernetes storage project — clearly good news for stateful workloads and CNCF projects like OpenEBS. On the other hand, only 14% of respondents say they have used storage projects in production.
What’s holding up the production usage of Kubernetes for data? Users from Optoro, Arista, Comcast and many others have shared their generally positive stories about using Kubernetes for data. The potential and the interest both exist. What are the impediments?
My opinion comes from years of supporting hundreds of Kubernetes users, including our own SRE team, which operates a 24×7 SaaS application that runs Cassandra, the ELK stack and other stateful workloads on Kubernetes. Perhaps surprisingly, I have found that users who proactively break the system (both the underlying platform and the applications themselves) move into production more quickly and achieve better outcomes.
Chaos engineering brings more peace of mind, produces more resilient systems and processes, and accelerates the production use of Kubernetes for data.
In this article, I’ll explain why and how chaos engineering is helpful. Then in a follow-up article, I’ll give some practical examples of how to introduce chaos engineering into your organization.
Building and operating resilient apps is hard. This is especially true for stateful, distributed apps, which might depend upon multiple layers of infrastructure, networks and services — in addition to dependencies on different workloads and application components. Such applications can easily fall into the anti-pattern of a “distributed monolith”, accidentally depending on cloud services and storage systems and other components to a greater extent than intended or understood. These dependencies tend to emerge at the worst possible time, when you’ve strayed off the happy path and find yourself sliding down a cascading outage.
Kubernetes does not yet entirely replace database administration teams. While it provides very good building blocks for running and managing containers, the complexity, care and feeding of stateful applications remain imperfectly expressed in available Operators and CRDs (in part due to the implicit dependencies mentioned above). In the meantime, the low barrier to entry for deploying applications on Kubernetes sometimes results in teams getting in deep before they fully understand the risks of their operations.
For example, consider a PostgreSQL cluster that is replicating asynchronously over a network with significant latency when the primary pod goes down. Without the sort of checks I’ll discuss later, a secondary pod could be promoted to primary via automated election, and that promotion may lose several seconds of data. Even if you have instead delegated replication to your storage system, block storage by itself has no way to track a replication log and hence is not aware of any loss or corruption. You may also have introduced a dependency on a particular storage service or storage system.
While this risk is being mitigated to a large extent today using app-specific operators and using storage solutions that are inherently Kubernetes-native, such as OpenEBS, some challenges remain.
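To make the PostgreSQL example concrete, here is a minimal sketch of the kind of guard that can sit in front of an automated promotion. It is an illustration only, not the behavior of any particular operator: the `Replica` type, the `pick_promotion_candidate` function and the `max_lag_bytes` threshold are all hypothetical names, and the sketch assumes something else has already measured how far each replica trails the last known position of the primary.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    # Assumed input: bytes of replication log this replica had not yet
    # replayed when the primary was last seen.
    replay_lag_bytes: int


def pick_promotion_candidate(replicas: Sequence[Replica],
                             max_lag_bytes: int) -> Optional[Replica]:
    """Return the least-lagged replica, or None if promotion is too risky."""
    if not replicas:
        return None
    best = min(replicas, key=lambda r: r.replay_lag_bytes)
    # Refuse to promote automatically when even the best candidate would
    # lose more data than the operator has agreed to tolerate.
    return best if best.replay_lag_bytes <= max_lag_bytes else None


replicas = [Replica("pg-1", 48_000_000), Replica("pg-2", 512)]
candidate = pick_promotion_candidate(replicas, max_lag_bytes=1_048_576)
print(candidate.name if candidate else "no safe candidate, page a human")
```

Returning no candidate at all is the important design choice here: when every replica is too far behind, the safer outcome is usually a paused failover and a human decision rather than a silent loss of data.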
Of course, network latency causing a partial partition is just one of many points of failure an SRE might have to consider. The diagram below broadly illustrates the various components in a Kubernetes environment that can fail outright, force an eventual failure (as with noisy neighbors and the Linux OOM killer), or fail intermittently in a way that, like the watched pot that never boils, does not show up while you are observing it.
This representation still doesn’t factor in faults at the filesystem level (corrupted blocks, anyone?) and — again — it does not take into account other silent dependencies, perhaps on a shared event bus such as Kafka or other shared services.
Despite all of the above, there is good news. We have seen again and again, from users with the proper processes, technologies and organizational cultures, that one can confidently operate Kubernetes as a data layer and thereby achieve benefits like cost savings and happier, more agile developers. In short, you too can be a success story!
Following are a couple of ways I have seen organizations achieve the necessary resilience and confidence.
1. Engineer Deployments to Correctly Tune Appropriate Parameters
Some specific pointers:
- Expose app-specific health endpoints that can be consumed by external health checkers.
- Err on the side of adding more readiness and liveness probes.
- Tune your resource requests and limits so that apps fall into the “Guaranteed” QoS class when push comes to shove (that is, when eviction occurs on account of overall resource exhaustion on the node).
- Use namespace-level quotas as another method for resource management.
- Use topology-aware scheduling and anti-affinity policies to ensure that apps survive node-level failures across nodes and availability zones.
- Use cloud native and container-attached storage solutions, so that each stateful workload is provided with its own storage controller in order to persist data. This ensures that the storage is better aligned with Kubernetes operating principles and allows smoother upgrades while minimizing the potential blast radius.
- Use application and storage affinity, if possible (solutions like OpenEBS enable this) so that application replicas can consume storage locally without having to go through the network.
- Use specific labels for different roles/replicas, enabling operators and admins to know exactly what a replica does in the context of the app.
- Set up the right termination policies; you may want to tolerate certain taints, define node stickiness, and so on.
- Configure pre-stop and post-start hooks where applicable, to ensure failovers are more meaningful and are well-handled.
- Select the right upgrade strategies (on-delete, rolling, and so forth) based on the nature of the application.
- Enable pod- and node-level autoscaling, but take storage provisioning and app behavior into account so that you avoid situations where a data rebalance can go on forever and introduce further issues.
- Build monitoring and alerting hooks into the app deployments.
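Some of these pointers can be checked mechanically before a deployment ever reaches a cluster. The sketch below classifies a pod spec, represented as a plain Python dictionary, into a QoS class using the rules as I understand them from the Kubernetes documentation: a pod is Guaranteed only if every container sets CPU and memory limits with matching requests, BestEffort if no container sets any, and Burstable otherwise. The `qos_class` function is a hypothetical helper, not part of any Kubernetes client library, and it compares quantity strings literally, so it would treat "500m" and "0.5" as different even though Kubernetes considers them equal.

```python
def qos_class(pod_spec: dict) -> str:
    """Approximate the QoS class Kubernetes would assign to a pod spec."""
    containers = (pod_spec.get("containers", [])
                  + pod_spec.get("initContainers", []))
    any_set = False
    all_guaranteed = bool(containers)
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # A missing request defaults to the limit when one is set.
            request = requests.get(name, limit)
            if limit is not None or request is not None:
                any_set = True
            # Simplification: quantities are compared as literal strings.
            if limit is None or request != limit:
                all_guaranteed = False
    if all_guaranteed:
        return "Guaranteed"
    return "Burstable" if any_set else "BestEffort"


database_pod = {
    "containers": [{
        "name": "cassandra",
        "resources": {
            "requests": {"cpu": "2", "memory": "8Gi"},
            "limits": {"cpu": "2", "memory": "8Gi"},
        },
    }]
}
print(qos_class(database_pod))
```

A check like this could run in a CI pipeline so that a stateful workload which silently drops out of the Guaranteed class is caught at review time rather than during a node-pressure eviction.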
2. Subject the Application and the Infrastructure Underneath to “Chaos”
As the above list makes clear, there are a large number of tunables or configurations for every Kubernetes deployment. Because your Kubernetes platform and the workloads running on Kubernetes are all changing, you cannot simply tune the environment and then set and forget it. Ironically, that applies at least as much if you have outsourced the operation of Kubernetes to your favorite cloud.
Here is where chaos engineering fits. Chaos engineering can both validate your application’s failure handling and gauge the resilience of your Kubernetes clusters and related infrastructure components, not just once, but frequently as part of your deployments and continuously in production.
You can think of chaos experiments as a means to validate and discover your known-knowns (where the impact is predictable), known-unknowns (for example, the results of known failures over a prolonged period of time, or the chain of events they might bring about over time) and unknown-unknowns (mostly worst-case scenarios or multiple-component failures that might not have been accounted for while building the app or while deploying it).
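The basic shape of such an experiment can be sketched in a few lines. This is a generic outline under my own assumptions, not the API of Litmus or any other chaos tool: `run_experiment`, `ExperimentResult` and the toy three-replica "cluster" are hypothetical names used only for illustration. The idea is to confirm a steady state, inject one fault, check whether the steady state still holds, and always undo the fault afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExperimentResult:
    name: str
    steady_before: bool  # was the system healthy before the fault?
    steady_after: bool   # did it remain healthy while the fault was applied?

    @property
    def passed(self) -> bool:
        return self.steady_before and self.steady_after


def run_experiment(name: str,
                   steady_state: Callable[[], bool],
                   inject_fault: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    # Never inject chaos into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(name, steady_before=False, steady_after=False)
    try:
        inject_fault()
        steady_after = steady_state()
    finally:
        rollback()  # undo the fault even if the check itself raised
    return ExperimentResult(name, steady_before=True, steady_after=steady_after)


# Toy stand-in for a three-replica stateful app that needs a quorum of two.
replicas = {"db-0": True, "db-1": True, "db-2": True}


def quorum_ok() -> bool:
    return sum(replicas.values()) >= 2


def kill_one() -> None:
    replicas["db-0"] = False


def restore() -> None:
    replicas["db-0"] = True


result = run_experiment("kill-one-replica", quorum_ok, kill_one, restore)
print(result.name, "passed" if result.passed else "failed")
```

Losing one replica here is a known-known, so the experiment should pass. Replacing `kill_one` with a fault that removes two replicas, or one that only slows a replica down for a long time, is how the same skeleton starts probing the known-unknowns.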
In the next article in this series, I will dive into “stateful chaos” and how Litmus — a cloud-native chaos engineering solution recently contributed to the CNCF by MayaData — helps you to run chaos experiments on stateful applications.