
LitmusChaos Becomes a CNCF Incubator Project

12 Jan 2022 5:00am, by

Do you want to bring chaos engineering into your cloud and Kubernetes development? In short, do you want to beat up your applications in development before the real world of production gets its chance to knock them around? If so, you’ll be glad to know that the Cloud Native Computing Foundation (CNCF) Technical Oversight Committee (TOC) has pushed LitmusChaos from the CNCF Sandbox to the Incubation level.

LitmusChaos is, of course, an open source chaos engineering platform. It enables you to identify weaknesses and potential outages in infrastructures by inducing chaos tests in a controlled way. It does this by providing you with well-tested, highly tunable, and declarative chaos experiments.

With LitmusChaos you can chain tests either in sequence or in parallel to build a chaos scenario. The workflows themselves are declarative, schedulable, and browsable. You can also run workflow analytics.
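To make the sequence-versus-parallel idea concrete, here is a minimal Python sketch. It is not the Litmus API: the helper names (run_sequence, run_parallel, make_experiment, run_scenario) are hypothetical, and each "experiment" is just a stand-in function that records its name and reports pass or fail. The experiment names are used only as labels.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical model: an experiment is any callable that returns True on pass.
Experiment = Callable[[], bool]


def run_sequence(steps: List[Experiment]) -> bool:
    """Run steps one after another and stop at the first failure."""
    for step in steps:
        if not step():
            return False
    return True


def run_parallel(steps: List[Experiment]) -> bool:
    """Run steps concurrently; the group passes only if every step passes."""
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        results = list(pool.map(lambda step: step(), steps))
    return all(results)


def make_experiment(name: str, passes: bool, log: List[str]) -> Experiment:
    """Build a stand-in experiment that records its name when it runs."""
    def experiment() -> bool:
        log.append(name)
        return passes
    return experiment


def run_scenario(log: List[str]) -> bool:
    """One sequential step followed by two steps that run in parallel."""
    pod_delete = make_experiment("pod-delete", True, log)
    network_latency = make_experiment("pod-network-latency", True, log)
    cpu_hog = make_experiment("pod-cpu-hog", True, log)
    return run_sequence([
        pod_delete,
        lambda: run_parallel([network_latency, cpu_hog]),
    ])


if __name__ == "__main__":
    executed: List[str] = []
    print("scenario passed:", run_scenario(executed))
    print("ran:", executed)
```

In this toy model a parallel group is itself just another step in a sequence, which is one simple way to nest the two styles of chaining inside a single scenario.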


Chaos to the Entire Application

The idea of cloud native chaos engineering is to make certain every component responsible for an entire application’s proper functioning is resilient and capable of sustaining real-life turbulent conditions. LitmusChaos enables you to bring chaos to the entire application, rather than just a specific microservice.

It all started in 2017 as a project to provide simple chaos jobs for Kubernetes. It became a CNCF Sandbox project in 2020. Since then, it has grown considerably, and today it has maintainers from five different organizations across cloud native vendors, solution providers, and end users.

Today, LitmusChaos is already used in production by more than 25 organizations. These include large end-user companies such as Intuit, Lenskart, and Orange, and technology powerhouses such as Red Hat and VMware. Since the beginning of 2021, Litmus operator installations have grown from 50 per day to over 2,000 per day. In short, it's already popular and has a proven track record.

Key to Building Robust Systems

As Chris Aniszczyk, CNCF’s CTO, observed, “Chaos engineering techniques enable organizations to cultivate reliability and robustness into their production environments. This practice will be key to building robust systems and LitmusChaos has already seen success among organizations looking to improve the resilience of their production deployments.”

Specifically, developers such as Jordi Gil, Red Hat senior software engineer, like LitmusChaos a lot. Gil said, “Litmus was our top choice when it came to developing our cloud native chaos scenarios. Its extensive list of experiments, open source nature, and friendly community gave us all the ingredients we needed to successfully complete our goals.”

End users like it too. Samar Sidharth, an Orange lead engineer, said, “Litmus is a great tool that offers out-of-the-box generic chaos tests with different types of probes for performing validations at different times during the experiment, which makes automation easy.”

“The cross-section of personas practicing chaos has grown wider over the past couple of years,” observed Karthik Satchitanand, Litmus’s project maintainer and ChaosNative open source lead. “This has brought forth numerous viewpoints, resulting in features around chaos management, observability & CI/CD integrations. It is also heartening to see developers build their own probes for steady-state hypothesis validation and experiments using Litmus’s BYOC (bring-your-own-chaos) approach.”

Rapid Improvement

The project has also continued to improve at a rapid pace. LitmusChaos 2.0 was released in August 2021. This release brought improved scalability along with new features. These included testing against and measuring outputs from the steady-state hypothesis, and a larger set of Prometheus metrics for instrumenting application dashboards for better observability.

Today, its main components are:

  • Chaos Operator: Built using the Operator SDK framework, it manages the lifecycle of a chaos experiment.
  • ChaosHub: hosts most of the chaos experiments needed for a quick start in chaos engineering.
  • Litmus Workflows: Chaos experiments are chained either in sequence or parallel to build a chaos scenario. The workflows are declarative, schedulable, and browsable. Workflow analytics are also available.
  • ChaosCenter: A centralized control plane to design, schedule, and monitor Litmus Workflows, with the ability to manage chaos across multiple target environments via agents. ChaosCenter supports teaming to facilitate collaboration on chaos scenarios and helps analyze resilience behavior across runs.
  • Litmus Probes: Various probes help users create complete chaos scenarios with automated steady-state validation and remediation actions, close to the real application experience upon failure.
  • Chaos Observability: Litmus exports Prometheus metrics that help highlight and quantify the impact of chaos on applications or infrastructure in real time, via in-house dashboards and external visualization in APM tools.
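The probe idea in the list above, checking a steady-state condition around a fault, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Litmus code: the Probe class, its "when" field, and run_with_probes are hypothetical names, and the "service" is just a dictionary holding a replica count.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    """Hypothetical steady-state check, evaluated around a fault."""
    name: str
    check: Callable[[], bool]
    when: str  # "before", "after", or "both" (labels invented for this sketch)


def run_with_probes(inject_fault: Callable[[], None],
                    probes: List[Probe]) -> Dict[str, bool]:
    """Evaluate probes before the fault, inject it, then evaluate again."""
    results: Dict[str, bool] = {}
    for probe in probes:
        if probe.when in ("before", "both"):
            results[f"{probe.name}:before"] = probe.check()
    inject_fault()
    for probe in probes:
        if probe.when in ("after", "both"):
            results[f"{probe.name}:after"] = probe.check()
    return results


def demo() -> Dict[str, bool]:
    """Toy service with three replicas; the fault removes one of them."""
    service = {"replicas": 3}

    def kill_one_replica() -> None:
        service["replicas"] -= 1

    probes = [
        Probe("min-two-replicas", lambda: service["replicas"] >= 2, "both"),
        Probe("all-three-replicas", lambda: service["replicas"] >= 3, "after"),
    ]
    return run_with_probes(kill_one_replica, probes)


if __name__ == "__main__":
    outcome = demo()
    print(outcome)
    print("hypothesis held:", all(outcome.values()))
```

Here the first probe holds before and after the fault while the second fails afterward, so the overall hypothesis is rejected. That automated verdict is what makes an experiment usable in a pipeline.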

Bring Your Own

LitmusChaos also supports the amusingly named bring-your-own-chaos (BYOC) approach. With this, you can integrate third-party chaos tooling to inject chaos. Going forward, it will also support fault injection types such as IOChaos, HTTPChaos, and JVMChaos.

Looking ahead, LitmusChaos’s roadmap includes a larger set of experiments for both Kubernetes and non-Kubernetes targets, improved observability and integration with other platforms via OpenTelemetry, and more. In addition, its developers foresee working hand in hand with other CNCF continuous delivery and service mesh projects. The idea is to incorporate LitmusChaos completely into the CNCF ecosystem.

LitmusChaos was created for cloud native chaos engineering. But its developers also plan on expanding it to old-style virtual machines and cloud infrastructures such as Amazon Web Services, GCP, Azure, VMware, and even bare metal. In short, if all goes well, it will become a universal chaos engineering tool.

This project is already doing very well. If you’re considering using chaos testing in your cloud-native programs — and you should — LitmusChaos deserves your attention.
