
LitmusChaos 2.0 Expands Resilience Testing on Kubernetes

16 Aug 2021 6:00am

Cute bird experimenting next to a rocket shooting off, with the Litmus logo

For Umasankar Mukkara, CEO of ChaosNative and co-creator and maintainer of the LitmusChaos chaos engineering platform, the trend in chaos engineering is toward being able to willfully engineer faults and identify whether there was an issue in the first place.

LitmusChaos, which was the first chaos engineering project to join the Cloud Native Computing Foundation’s Sandbox program, applies a programmable, declarative approach to chaos testing, with steady-state hypotheses and Litmus probes at different stages of an experiment.
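To make the idea of a steady-state hypothesis concrete, here is a minimal Python sketch of probes being checked before, during and after a fault. It is not Litmus code: `Probe`, `FakeService` and `run_experiment` are hypothetical names invented for this illustration, and the "application" is an in-memory stand-in rather than a real Kubernetes workload.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    """Hypothetical steady-state check, e.g. 'the service is available'."""
    name: str
    check: Callable[[], bool]


class FakeService:
    """In-memory stand-in for an application with several replicas."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.healthy = replicas

    def kill_one(self) -> None:
        # The injected fault: lose one healthy replica.
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        # Revert the fault: bring all replicas back.
        self.healthy = self.replicas

    def is_available(self) -> bool:
        return self.healthy >= 1


def run_experiment(inject: Callable[[], None],
                   revert: Callable[[], None],
                   probes: List[Probe]) -> Dict[str, object]:
    """Evaluate the probes before, during and after a fault."""
    results: Dict[str, object] = {}
    before = {p.name: p.check() for p in probes}
    results["before"] = before
    if not all(before.values()):
        # The steady state does not hold, so injecting a fault proves nothing.
        results["verdict"] = "aborted"
        return results
    inject()
    try:
        during = {p.name: p.check() for p in probes}
    finally:
        revert()  # always undo the fault, even if a probe raises
    after = {p.name: p.check() for p in probes}
    results["during"] = during
    results["after"] = after
    ok = all(during.values()) and all(after.values())
    results["verdict"] = "pass" if ok else "fail"
    return results
```

With two replicas the hypothesis "the service stays available" survives the loss of one; with a single replica the same experiment fails, which is exactly the kind of weakness such a test is meant to surface.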

This week, version 2.0 of LitmusChaos has been released for general availability (GA).

Mukkara said that Version 1.0 was about bringing chaos to the open source world. The company created an operator to support Kubernetes and the open chaos experiments in order to build a community.

The purpose of Version 2.0 is to make chaos engineering more efficient for both individuals and teams, and specifically to enable scalability. “It was never about one single experiment for lots of people. It was about putting lots of different experiences together. Chaos workflows rather than chaos experiments,” Mukkara told The New Stack.

Once adoption was stable, 2.0 became about expanding the platform features to be automatable at scale, based on defining, testing against and measuring outputs from the steady-state hypothesis — which in turn makes the automated chaos experiments more efficient. This allows for an increased set of Prometheus metrics with added filters to be used for instrumenting application dashboards for better observability.

In another step toward efficiency, the platform now allows users to deploy Litmus workflows in their Kubernetes namespaces. The Litmus team observed that many developers were trying to use the same Kubernetes clusters, managing their applications in their namespaces, so the second GA release had to solve the issue of the management of Litmus in multitenant experiments.

So far these early users are most excited about the new chaos workflows that give each developer independence while also allowing users and teams to run multiple experiments together, including:

  • automation of dependency setup
  • creation of complex chaos scenarios with multiple faults either in sequence or in parallel
  • definition of load and validation jobs with chaos injection
  • flexibility to create and run workflows in different ways, via templates, from an integrated hub, and with custom uploads
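The notion of a workflow that chains faults in sequence or in parallel can be sketched in a few lines of Python. This is an illustration of the concept only, not how Litmus implements workflows: `run_workflow` and `make_fault` are hypothetical helpers, and the fault names are just labels that do nothing beyond returning a string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Union

Fault = Callable[[], str]
# A step is either one fault, or a list of faults to run at the same time.
Step = Union[Fault, List[Fault]]


def make_fault(name: str) -> Fault:
    """Build a placeholder fault that only reports that it ran."""
    def fault() -> str:
        return f"{name}: injected"
    return fault


def run_workflow(steps: List[Step]) -> List[List[str]]:
    """Run steps one after another; faults within a list-step run concurrently."""
    results: List[List[str]] = []
    for step in steps:
        faults = step if isinstance(step, list) else [step]
        # max(1, ...) keeps the pool valid even for an empty parallel step.
        with ThreadPoolExecutor(max_workers=max(1, len(faults))) as pool:
            # pool.map returns results in input order, whatever the finish order.
            results.append(list(pool.map(lambda f: f(), faults)))
    return results


if __name__ == "__main__":
    workflow: List[Step] = [
        make_fault("pod-delete"),                                  # step 1: alone
        [make_fault("cpu-hog"), make_fault("network-latency")],    # step 2: together
    ]
    for number, outcome in enumerate(run_workflow(workflow), start=1):
        print(number, outcome)
```

Each step finishes before the next begins, so a scenario can, for example, remove a pod first and only then apply two stresses at once.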

“We all know that chaos is culture. People are more open to consider chaos engineering in their DevOps practice now,” Mukkara said, which is why it’s becoming a multi-functional, cross-organization endeavor.

But this doesn’t mean starting to test in production. Only about 20% of LitmusChaos usage is in production, and that’s after some time. Mukkara says that most chaos experiments are run, at least initially, in the QA or testing environment, where teams are freer to inject faults at a deeper level. One way to reach production chaos is by involving more people in understanding chaos engineering best practices, including game days.

The beta version of LitmusChaos 2.0 has been out for about six months now, allowing the community to thoroughly test and provide feedback. The last month has been just about updating the docs.

Early adopters of this new version include Anuta Networks, a network automation SaaS platform; Orange, the French telecommunications company; and Lenskart, the Indian eyeglass e-commerce giant.

Version 2.0 supports the setup of dedicated control planes, chaos agents and execution of chaos experiments in both cluster-scoped and namespace-scoped modes to help operate in shared clusters with a self-service model.
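One way to picture the difference between the two modes is as a rule about which namespaces a chaos agent may touch. The Python sketch below models only that rule; the `Agent` class and its fields are hypothetical and do not correspond to any Litmus resource or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    """Hypothetical chaos agent with a cluster-wide or single-namespace scope."""
    name: str
    scope: str                       # "cluster" or "namespace"
    namespace: Optional[str] = None  # required when scope == "namespace"

    def __post_init__(self) -> None:
        if self.scope not in ("cluster", "namespace"):
            raise ValueError(f"unknown scope: {self.scope!r}")
        if self.scope == "namespace" and not self.namespace:
            raise ValueError("a namespace-scoped agent needs a namespace")

    def can_target(self, target_namespace: str) -> bool:
        """Cluster scope reaches everything; namespace scope only its own."""
        if self.scope == "cluster":
            return True
        return target_namespace == self.namespace


if __name__ == "__main__":
    admin = Agent("platform-team", scope="cluster")
    dev = Agent("team-a", scope="namespace", namespace="team-a")
    print(admin.can_target("team-b"))  # cluster scope: allowed
    print(dev.can_target("team-b"))    # namespace scope: refused
```

In a shared cluster, giving each team a namespace-scoped agent of this kind is what lets developers run experiments on their own applications without being able to disturb their neighbors.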

Finally, Litmus expands beyond Kubernetes to allow injecting chaos into infrastructure including virtual machines, instances, and cloud disks on Amazon Web Services, Google Cloud Platform, Azure, and VMware, irrespective of whether they host Kubernetes clusters. You can even introduce chaos experiments to bring down bare metal nodes that provide Intelligent Platform Management Interface (IPMI)-based out-of-band access.

Litmus 3.0

What’s on the horizon for Litmus 3.0? If 1.0 was about community and 2.0 was about releasing popular features, the next stage of the roadmap is about contributing many more chaos experiments to the ChaosHub. The team is currently working with the community to identify which experiments are most commonly used, so they can be added to the application level of the open source ChaosHub.

In the future, there will also be GitHub Actions integrated into the platform.

Mukkara asks that you go to GitHub to play around with Litmus 2.0, and tell the team what you think on its Slack community.

What’s next for ChaosNative, the organization behind Litmus? Like many companies, it is still trying to find the right open source business model, noting that open source cannot directly be a business model. The company provides commercial support to enterprises. In the future, it is looking to follow an open core model, within the clear guidelines of the Apache License, Version 2.0, where it will build some enterprise SaaS tooling on top of Litmus.

The second annual Chaos Carnival, a free event for cloud native chaos engineering, will take place on Jan. 27 to 28, 2022. The Call for Speakers is open now through the end of October.
