How Chaos Engineering Can Drive Kubernetes Reliability
There’s no doubt that downtime is one of the biggest risks to your company’s success. No system is foolproof, so it’s infinitely better to break your system before it breaks your company (and ideally long before putting it into production).
This is the entire goal of site reliability engineering (SRE), and it’s what drove Netflix to start investing in chaos engineering: purposely, and mostly through automated systems, wreaking both widespread and precisely targeted confusion, disorder, and destruction on distributed systems. Since then, folks have made careers out of unleashing controlled chaos at scale to drive business and software stability.
The idea is to ask questions about your system’s behavior under certain conditions and to safely try them out live, so that you can, collectively with your team, see if there is a real weakness and learn what the right response should be.
So far The New Stack has talked to Loki-like chaos forerunners about the benefits of chaos engineering and the ways to measure its benefits to the business side. Today we are going to talk about the most popular use case of chaos engineering — to shore up your Kubernetes deployments.
How Chaos Engineering Is Driving Kubernetes Reliability
Kubernetes is the market-leading container orchestrator, and in the Kubernetes and cloud-native world, a variety of Kubernetes-native tools are being developed to serve its seemingly boundless popularity. One major area is chaos engineering, which is gaining so much traction that one of the newest chapters of the Amazon Web Services Workshop is about applying chaos engineering to Kubernetes.
However, as its name suggests, chaos engineering for Kubernetes is a bit all over the place. There’s a growing demand for a natural cataloging of the field with a Cloud Native Computing Foundation (CNCF) chaos engineering working group being bootstrapped, in part, to help map out the field of tools.
“Not many tools integrate with each other as well. We need a community to unite them,” CNCF’s Chris Aniszczyk told The New Stack. “The chaos community is still very nascent. Kubernetes now is the distributed system kernel running on all clouds, public and private — and you need an external actor outside Kubernetes.”
This bootstrapped group also hopes to raise awareness among tool makers so their projects can work together on better integrations and more usable chaos workflows.
Aniszczyk said that “Cloud native technologies like Kubernetes empower organizations to build and run scalable applications in modern, dynamic environments. These techniques enable loosely coupled systems that are resilient, manageable, and observable. They allow engineers to make high-impact changes frequently and predictably with minimal toil.
“Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. In CNCF, we believe that proper chaos engineering practices are table stakes in building truly cloud-native systems and support the furthering of the discipline through investigating working groups and projects.”
Aniszczyk continued that it’s about enabling a series of commonplace tests and scenarios that will actually create a failure, so engineers can understand what could happen, if and when things go awry.
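The shape of such a scenario can be sketched in a few lines of Python. This is a hypothetical illustration, not code from any of the tools discussed here; the `kill_pod` and `is_healthy` callables stand in for a real Kubernetes API delete and a real steady-state check (error rate, latency, and so on).

```python
import random

def run_pod_kill_experiment(pods, kill_pod, is_healthy):
    """Kill one random pod, then check the steady-state hypothesis.

    pods       -- candidate victim pod names
    kill_pod   -- callable that deletes a pod (a real tool would call
                  the Kubernetes API here)
    is_healthy -- callable returning True if the service still meets
                  its steady-state hypothesis
    """
    victim = random.choice(pods)
    kill_pod(victim)
    # The experiment "passes" if the system absorbed the failure.
    return {"victim": victim, "survived": is_healthy()}

# Toy usage: a three-replica service should survive losing one pod.
replicas = {"web-0", "web-1", "web-2"}
result = run_pod_kill_experiment(
    sorted(replicas),
    kill_pod=replicas.discard,              # stand-in for a real pod delete
    is_healthy=lambda: len(replicas) >= 2,  # quorum still holds
)
print(result["survived"])  # True: two replicas remain
```

The point of the sketch is the loop, not the stubs: inject one failure, then immediately verify the hypothesis, so that when things go awry for real, the team has already seen the outcome.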
We’ve already covered how Gremlin’s Failure as a Service is being used to break things on purpose for the business, but now let’s dive into another chaos-first tool, Litmus, which zeroes in on stable storage.
Storing and Securing Kubernetes with Chaos
When Evan Powell and Uma Mukkara started MayaData, it was with the purpose of creating end-to-end data agility for Kubernetes, because “real problems come around the disruption of stable applications and databases.”
The MayaData tooling set includes the container-native, open-source OpenEBS for provisioning, storing, and protecting the data around Kubernetes, with data as the foundational layer of the stack. The company also created MayaOnline as the observing and cross-panel control piece.
“You stored and protected the data, but I really want to see what’s happening on the application,” Mukkara said. “Each app is off of bare metal and now on tens if not hundreds of containers,” which makes it all the more necessary that the data be observable.
“Our users can connect those Kubernetes clusters to see how they’re doing,” he continued. “Enterprises are running applications in multiple clouds, so we wanted to give one control panel where they can see everything in one place.”
For example, what’s the total use of storage for one company, across Amazon Web Services, Microsoft and Google?
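That kind of cross-cloud rollup is, at its core, a simple aggregation. The sketch below is hypothetical and uses made-up usage figures; it is not MayaOnline code, just an illustration of the sort of answer a single control panel can give.

```python
# Hypothetical per-cluster usage figures (GiB) as a control panel
# might collect them across clouds; the numbers are invented.
usage = [
    {"cloud": "aws",   "cluster": "prod-us", "gib": 512},
    {"cloud": "aws",   "cluster": "prod-eu", "gib": 256},
    {"cloud": "azure", "cluster": "staging", "gib": 128},
    {"cloud": "gcp",   "cluster": "ci",      "gib": 64},
]

# Total storage footprint for the whole company...
total = sum(row["gib"] for row in usage)

# ...and the same number broken down per cloud provider.
by_cloud = {}
for row in usage:
    by_cloud[row["cloud"]] = by_cloud.get(row["cloud"], 0) + row["gib"]

print(total)     # 960
print(by_cloud)  # {'aws': 768, 'azure': 128, 'gcp': 64}
```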
The MayaOnline cloud-based platform for data operations then exhibits how your stateful applications are doing along with predictive analytics to help operations teams understand and control their data better.
But what if one of those storage sources fails? The MayaData team built an in-house tool that they later realized had public use cases. In May of this year, they publicly launched it as an open-source framework called Litmus, a test tool for validating and hardening containers, with an emphasis on:
- Reliability, including negative, fail-path, and chaos testing.
Litmus is aimed at a variety of audiences:
- Database developers using the chaos API.
- DevOps teams testing for chaos before and after deployment.
- CIOs and software VPs who want to see results.
- Storage providers.
- Database vendors like MongoDB and Cassandra.
Mukkara says that it “introduces chaos and measures the observations and results and puts them nicely into how their application is behaving,” validating and hardening the application during production and even sometimes later.
He went on to describe the MayaData toolset as “a full solution for users running stateful apps on Kubernetes.”
How does this system better address stateful applications running on Kubernetes? Mukkara offered the example of one of the company’s customers, an e-commerce giant.
“One of our users has a huge IT system that’s being moved to Kubernetes and the main problem is: ‘How do I get my databases into a more manageable CI/CD [continuous integration/continuous deployment] platform that’s using containers?’ They have more than 400 developers in the system and what they really need is a way to quickly get a snapshot of the database and to test the changes against the data.”
Only when code changes are marked as high quality can they be pushed into production.
“They can write for example — there’s a 12-node Kubernetes cluster and because they integrate with Litmus, they can pull down a node and see if the application continues to run or not or introduce more chaos like pull down a network while an application is running on the CI/CD pipeline,” Mukkara continued.
This container tooling solved the user’s problem of moving its monolithic application onto a Kubernetes-based, microservices-heavy stack.
“And they are able to make sure that the data is real in the testing, and they are able to get the code changes tested with real data and run through their CI/CD pipeline,” with Litmus and MayaOnline.
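The quality gate Mukkara describes can be sketched as a small loop: deploy the change to a test cluster, inject each fault, and block promotion if the steady-state check ever fails. Everything below is a hypothetical stub (the `kill_node` and `drop_network` functions stand in for real node and network disruption), not Litmus API code.

```python
def chaos_gate(deploy, experiments, is_healthy):
    """Run a change through chaos experiments before promoting it.

    deploy      -- callable that stands up the change in a test cluster
    experiments -- fault injections to run (node loss, network loss, ...)
    is_healthy  -- steady-state check run after every injected fault
    """
    deploy()
    for inject_fault in experiments:
        inject_fault()
        if not is_healthy():
            return False   # change is blocked from production
    return True            # survived every fault: safe to promote

# Toy usage with stubbed faults standing in for "pull down a node"
# and "pull down the network" against a 12-node test cluster.
cluster = {"nodes": 12, "network": True}

def kill_node():
    cluster["nodes"] -= 1

def network_blip():
    cluster["network"] = False   # drop the network...
    cluster["network"] = True    # ...then restore it

def healthy():
    # The app tolerates losing one node as long as the network is up.
    return cluster["nodes"] >= 11 and cluster["network"]

promoted = chaos_gate(
    deploy=lambda: None,                    # no-op stand-in for a deploy
    experiments=[kill_node, network_blip],
    is_healthy=healthy,
)
print(promoted)  # True: the change rode out both faults
```

Wiring a gate like this into the pipeline is what turns chaos experiments from one-off drills into a repeatable pre-production check on every code change.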
OpenEBS automated this process, enabling DevOps staff to suggest provisioning based on better data while saving time.
With these MayaData tools, Mukkara says, “a couple of people should be able to manage a system” that would otherwise take ten to 15 people.
In addition, with OpenEBS the user was able to cut down both the time it takes to get a snapshot of the data and the time to test. Instead of a couple of days, testing can start immediately, empowering developers to kick-start their own pipelines almost instantly.
Everything in the MayaData toolset is about building storage, data protection, and visibility around storage operations and APIs. Mukkara says this all comes back to the eagerness for more chaos engineering tooling.
“Everybody is trying to adopt a new technology and they want to be really sure,” he said.
It’s all about a renewed focus on releasing stable applications that you can be sure will stay that way.