LitmusChaos, a cloud native chaos engineering framework for Kubernetes, has joined the Cloud Native Computing Foundation (CNCF) at the sandbox level, making it the first chaos engineering project to join the CNCF. It is also the second project from MayaData to join the foundation, following last year’s donation of OpenEBS.
Pioneered by Netflix, chaos engineering is a method that tests a system’s resilience against network, application, and infrastructure failures by intentionally causing breakdowns in the production environment. In addition to ensuring resilience, tools like LitmusChaos can help ensure that application systems are being built in a loosely coupled manner that promotes resilience, MayaData CEO Evan Powell explained.
“What I saw chaos engineering being used for was, in a way, keeping the developers honest. You tell your developers to build in a loosely coupled way so that you can survive any failures, and they all say, ‘Yes, that’s what we’re doing, of course, but right now, I just need to push this out into production’,” said Powell. “Well, what chaos engineering does is it literally breaks parts of the infrastructure or the environment, or those dependencies. You said it was going to be loosely coupled. The fact that your system went down because you assumed this particular DNS server would be up, no matter what, that’s on you.”
LitmusChaos provides this service via custom APIs by way of Kubernetes custom resource definitions (CRDs). The tool is extensible, allowing integration with other tools and enabling the creation of custom experiments, which can then be shared on the ChaosHub, a centralized location for developers and Kubernetes site reliability engineers (SREs) to share chaos experiments as Helm charts.
MayaData co-founder Uma Mukkara explained that LitmusChaos experiments are “very, very granular,” which allows users to group them together into far more complex experiments and workflows.
“It’s completely declarative. Kubernetes developers and SREs want to be in control of what they can do with Kubernetes resources. What we try to do is define chaos itself as a custom resource inside Kubernetes. Developers and SREs can define declaratively where to use chaos,” said Mukkara. “You could do it at your infrastructure level, Kubernetes nodes, and other resources inside the node such as memory, CPU, disks, or you can go to the application level. It’s completely customizable.”
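To make the idea of chaos as a declarative custom resource concrete, here is a minimal Python sketch that assembles a ChaosEngine-style manifest as plain data and prints it as JSON. The article gives no schema, so the field names (`appinfo`, `chaosServiceAccount`, `TOTAL_CHAOS_DURATION`), the `litmuschaos.io/v1alpha1` API version, the `pod-delete` experiment name, and the `build_chaos_engine` helper itself are assumptions for illustration and should be checked against the LitmusChaos documentation.

```python
import json


def build_chaos_engine(name, namespace, app_label, experiment, duration_s):
    """Return a ChaosEngine-style manifest as a plain dict.

    Hypothetical helper: the field names below are assumed, not taken
    from the article.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",  # assumed API group/version
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Which application the chaos should target.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account name for the experiment.
            "chaosServiceAccount": experiment + "-sa",
            # One granular experiment, tuned through environment variables.
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_s),
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


engine = build_chaos_engine("nginx-chaos", "default", "app=nginx", "pod-delete", 30)
print(json.dumps(engine, indent=2))
```

Because JSON is also valid YAML, the printed document could in principle be handed to `kubectl apply -f -` on a cluster where the corresponding CRDs are installed. The point of the sketch is only that the target, the experiment, and its tuning all live in one declarative object.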
With the predefined experiments shared in the ChaosHub, users don’t need expertise or even knowledge of the specifics, he explained; they can simply introduce random chaos into the system. The hub has been donated to the CNCF alongside LitmusChaos, and Mukkara said the team hopes that joining the CNCF will attract more partners from the ecosystem to create experiments and share them with the broader community.
Popular charts already include experiments that kill pods and containers, hog CPU, and simulate full disks. As Mukkara mentioned, experiments can be strung together via an integration with Argo, which joined the CNCF earlier this year. In addition, there are already project-specific collections of Chaos charts for OpenEBS, Cassandra, Kafka and CoreDNS, among others. Looking ahead, Mukkara said that adding to these collections is a priority.
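As a rough illustration of stringing granular experiments together, the Python sketch below builds an Argo Workflow document in which each step creates one chaos resource and the steps run one after another. This is not the actual Litmus and Argo integration, which the article does not describe. The experiment names (`pod-delete`, `pod-cpu-hog`, `disk-fill`), the ChaosEngine fields, and the `chaos_step` and `build_workflow` helpers are assumptions for illustration.

```python
import json


def chaos_step(experiment, app_label):
    """Return an Argo 'resource' template that creates one chaos object.

    Hypothetical helper: the embedded ChaosEngine fields are assumed.
    """
    engine = {
        "apiVersion": "litmuschaos.io/v1alpha1",  # assumed API group/version
        "kind": "ChaosEngine",
        "metadata": {"generateName": experiment + "-"},
        "spec": {
            "appinfo": {"applabel": app_label, "appkind": "deployment"},
            "experiments": [{"name": experiment}],
        },
    }
    return {
        "name": experiment,
        # The manifest is passed as a string; JSON is also valid YAML.
        "resource": {"action": "create", "manifest": json.dumps(engine)},
    }


def build_workflow(name, experiments, app_label):
    """Chain the given experiments into one sequential Argo Workflow."""
    leaf_templates = [chaos_step(e, app_label) for e in experiments]
    # In Argo, 'steps' is a list of lists: outer entries run in sequence,
    # entries inside one inner list run in parallel.
    steps = [[{"name": t["name"], "template": t["name"]}] for t in leaf_templates]
    main = {"name": "main", "steps": steps}
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name + "-"},
        "spec": {"entrypoint": "main", "templates": [main] + leaf_templates},
    }


workflow = build_workflow(
    "chaos-sequence", ["pod-delete", "pod-cpu-hog", "disk-fill"], "app=nginx"
)
print(json.dumps(workflow, indent=2))
```

As written, each step only creates its chaos resource and does not wait for the experiment to finish. A real workflow would likely add a completion condition or a verification step between experiments.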
“Our next step is to really collaborate with other communities and the Kubernetes community itself, to write more chaos experiments and workflows,” said Mukkara. “The product roadmap also is going to include a visibility and orchestration portal for Litmus. Monitoring is a very important piece of chaos engineering, so we are building that portal in the open with community-driven discussions.”
The Cloud Native Computing Foundation is a sponsor of The New Stack.