
How a Little Chaos Can Make Kubernetes Much More Reliable

25 Mar 2020 12:00pm, by Andre Newman

KubeCon + CloudNativeCon sponsored this post, in anticipation of KubeCon + CloudNativeCon EU, in Amsterdam.

Andre Newman
Andre is a technical writer for Gremlin where he writes about the benefits and applications of Chaos Engineering. Prior to joining Gremlin, he worked as a consultant for startups and SaaS providers where he wrote on DevOps, observability, SIEM and microservices. He has been featured in DZone, StatusCode Weekly and Next City.

Disruptions are a fact of life in Kubernetes. Hardware failures, Pod outages and resource exhaustion threaten even the most well-managed clusters. And while Kubernetes provides some recovery capabilities, no system is perfect, especially one as complex and dynamic as Kubernetes.

You don’t need to look far to find examples of Kubernetes failing catastrophically. In 2019, Grafana experienced an outage after accidentally preempting Pods belonging to a production deployment. That same year, Target upgraded its development environment’s OpenStack infrastructure, only for the upgrade to cause a cascading failure that automatically provisioned 41,000 new nodes. Even something as simple as shutting down a Pod can cause problems, as Ravelin discovered when Kubernetes continued to send ingress traffic to Pods after sending them a SIGTERM.

With traditional testing, scenarios like these are hard to uncover. Kubernetes components can interact in unpredictable ways, causing emergent behaviors. As deployments grow in size, so too does the number of possible interactions between these components. Site reliability engineers (SREs) need a new approach to testing the resilience of their clusters, and the only way to do it is with a bit of chaos.

What Is Chaos Engineering?

Chaos Engineering is a disciplined and scientific approach to testing systems for failure. It provides a framework for SREs to verify the reliability of their systems, test recovery mechanisms and gain important insights into their applications and infrastructure. SREs can use Chaos Engineering practices to identify risks and possible failure points before they become production outages.

While “chaos” implies disorder and mayhem, Chaos Engineering actually defines a systematic and structured approach. The goal is to help teams understand how their systems respond to failure-inducing situations; simply causing random failures does little except put these systems at risk. Chaos experiments start on a small scale, with components that are non-essential and easily recoverable. Once you have a better understanding of your systems and their recovery mechanisms, you can scale up your experiments to test these mechanisms and ensure they work as intended.

With Kubernetes, it’s easy to make assumptions about how your systems will behave under certain scenarios. For instance, if a node runs out of resources, you’d expect Kubernetes to schedule new Pods onto another node. However, we can’t always trust these assumptions: Kubernetes might instead evict a running Pod, refuse to schedule the Pod due to node taints, or fail to connect the Pod to the service mesh. Chaos Engineering pushes these mechanisms to their limits so that you can observe their response and determine how to make them more resilient.

Which Chaos Experiments Are Useful for Kubernetes?

Chaos experiments can be used to simulate the conditions that lead to a failure, or to create failures directly. Here are some scenarios to consider running against your clusters.

Simulate Load to Test Auto-Scaling Capabilities

One of the most effective experiments you can do is test your cluster’s auto-scaling capabilities. This ensures that your cluster responds quickly and efficiently to changes in demand without causing scheduling errors or evicting Pods.

For example, increasing the load on a Deployment should trigger the Horizontal Pod Autoscaler to scale up the number of Pods in your ReplicaSet. As the cluster approaches its resource limit, the Cluster Autoscaler should automatically provision a new node. If neither of these occurs, consider fine-tuning your Deployment configuration and autoscaling thresholds.
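Before running a load experiment, it helps to write down how many Pods you expect to see. The sketch below estimates that number using the scaling rule described in the Kubernetes documentation for the Horizontal Pod Autoscaler: the current replica count multiplied by the ratio of the observed metric to the target metric, rounded up, with no change when the ratio is within a tolerance (0.1 by default) of 1.0. The function name, parameter names and the default replica bounds are assumptions made for illustration, and the real autoscaler also accounts for details such as unready Pods and stabilization windows that this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Estimate the replica count the HPA should converge on.

    min_replicas and max_replicas are hypothetical defaults; use the
    values from your own HorizontalPodAutoscaler spec.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band, the autoscaler leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Multiply before dividing to limit floating-point rounding surprises.
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothesis for an experiment: 4 Pods at 90% CPU against a 50% target.
print(desired_replicas(4, 90, 50))  # 8
```

If the cluster settles on a noticeably different number during the experiment, or hits the maximum sooner than expected, that gap is the finding to investigate.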

Inject Latency to Test Responsiveness and Upstream Impact

Latency can have a cascading effect on the performance of other services. Even just a 100ms delay in response time can block upstream Pods, cause timeouts and lead to application failures. Latency tests can help you identify the performance limits of your application, tweak your load balancing strategies and optimize your application and network architecture.
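To see why a small delay matters, consider a deliberately simplified model in which an upstream service calls several dependencies one after another and gives up after a fixed timeout. The sketch below adds the same injected delay to every call and reports whether the request would still complete in time. The latency figures, timeout and function names are hypothetical; real services may call dependencies in parallel, retry, or use per-call timeouts.

```python
def request_outcome(base_latencies_ms, injected_ms, timeout_ms):
    """Return (total_ms, timed_out) for sequential calls to each dependency."""
    total = sum(base + injected_ms for base in base_latencies_ms)
    return total, total > timeout_ms


def max_tolerable_injection(base_latencies_ms, timeout_ms):
    """Largest per-call delay (whole ms) the caller absorbs without timing out."""
    headroom = timeout_ms - sum(base_latencies_ms)
    if headroom < 0 or not base_latencies_ms:
        return 0
    return headroom // len(base_latencies_ms)


# Hypothetical service: three sequential calls and a 400 ms timeout.
calls = [40, 60, 80]
print(request_outcome(calls, 0, 400))      # (180, False)
print(request_outcome(calls, 100, 400))    # (480, True)
print(max_tolerable_injection(calls, 400)) # 73
```

In this model, a 100ms delay on each dependency is enough to push a healthy 180ms request past its timeout. A latency experiment checks whether your real services have more headroom than the arithmetic suggests, or less.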

Fail Components to Test Replication and Recovery

Kubernetes can recover from most common component failures, but without testing this functionality, you have no way of knowing what will actually happen. Deliberately causing failure may seem counter-intuitive, but it provides definitive answers as to whether your recovery strategy is working as intended.
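Getting a definitive answer means measuring recovery rather than eyeballing it. The following is a minimal sketch of the observation half of such an experiment: after a failure is injected, it polls until the expected number of replicas is ready again and reports how long that took. The `get_ready_count` callable is an assumption; in practice you would back it with whatever you already use to query the cluster. The function name and default timings are likewise illustrative.

```python
import time


def wait_for_recovery(get_ready_count, desired, timeout_s=60.0, interval_s=1.0):
    """Poll until `desired` replicas are ready.

    Returns the seconds elapsed, or None if the timeout expires first.
    `get_ready_count` is a hypothetical callable supplied by the caller.
    """
    start = time.monotonic()
    while True:
        if get_ready_count() >= desired:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


# Usage with a stand-in for a real cluster query: readiness climbs back to 3.
observed = iter([1, 2, 3])
elapsed = wait_for_recovery(lambda: next(observed), desired=3, interval_s=0)
print("recovered" if elapsed is not None else "did not recover in time")
```

Comparing the measured recovery time against the time you expected turns "Kubernetes should handle it" into a result you can track from one experiment to the next.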

For situations where Kubernetes can’t automatically recover, injecting failure is an opportunity to test your disaster recovery plans. For instance, what happens if your unmanaged cluster exhausts its resources, or a master node goes down, or an engineer accidentally deletes a ReplicaSet? Having these experiences helps you become more adept at responding to high-severity failures when they happen in production.

Start Causing Chaos

The only way to test the resilience of your systems is by running experiments. And while experimenting in testing and staging can yield useful insights, these environments can never truly replicate production. Failing in production sounds like a worst-case scenario, but it’s the only way to really know how resilient your systems are. That said, there are ways to test safely, and one of them is to use canary deployments.

Canary deployments let you deploy a new version of an application alongside an existing release. Kubernetes routes a small portion of production traffic to the canary before rolling it out completely. This offers the best of both worlds by giving you a production environment in which to run chaos experiments, but without placing your entire application at risk. If the canary can’t recover from a failure, Kubernetes can redirect traffic back to the stable deployment. Once you account for the failure and implement a fix, you can deploy an updated canary and repeat the experiment.
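How small that portion of traffic is depends on how the canary is set up. One common arrangement, assumed here, is to run the stable and canary Deployments behind the same Service, so that the canary's share of traffic roughly follows its share of ready Pods. The sketch below works out that share and the number of stable replicas needed to keep the canary under a chosen limit. The function names are hypothetical, the even spread across Pods is an approximation (long-lived connections can skew it), and a service mesh or ingress controller with explicit traffic weights would not need this calculation.

```python
def canary_traffic_share(stable_replicas, canary_replicas):
    """Approximate fraction of requests reaching the canary Pods."""
    total = stable_replicas + canary_replicas
    return canary_replicas / total if total else 0.0


def stable_replicas_for_share(canary_replicas, max_percent):
    """Fewest stable replicas that keep the canary at or below max_percent."""
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be in the range (0, 100]")
    numerator = canary_replicas * (100 - max_percent)
    # Integer ceiling division avoids floating-point rounding.
    return -(-numerator // max_percent)


# One canary Pod limited to roughly 5% of production traffic.
needed = stable_replicas_for_share(1, 5)
print(needed)                           # 19
print(canary_traffic_share(needed, 1))  # 0.05
```

Keeping the share small limits how many users a failed experiment can affect, and scaling the canary to zero replicas sends all traffic back to the stable Pods.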

This cycle of experimentation, observation and implementing fixes will cause your systems to gradually become more resilient. Ultimately, injecting failure should have zero impact on your user experience, but the only way to reach this goal is by starting with small-scale experiments and increasing scope over time.

Building resilient Kubernetes clusters is challenging. Nothing’s predictable in production and failure is a fact of life. Chaos Engineering helps you stay ahead of the unexpected by letting you safely test failure scenarios, detect weak points, improve your recovery strategies and build greater resilience against outages.

To learn more about containerized infrastructure and cloud native technologies, consider coming to KubeCon + CloudNativeCon EU, in Amsterdam later this year.

Cloud Native Computing Foundation, which manages KubeCon + CloudNativeCon, is a sponsor of The New Stack.

