Honeycomb is sponsoring The New Stack’s coverage of Kubecon+CloudNativeCon North America 2020.
“Why do chaos engineering for Kubernetes? It’s because your application resiliency depends on other cloud-native applications.”
This is how Umasankar Mukkara, chief operating officer of cloud native storage software provider MayaData, began his talk at KubeCon+CloudNativeCon last week. He cited a figure that 90% of the average application’s resiliency depends on other applications.
He described chaos engineering as the process of introducing a random fault into a system that is running at a steady state. If the system remains steady, you’re good. If not, you’ve found a weakness.
Mukkara was joined by Sumit Nagal, principal engineer at Intuit. Both are maintainers of LitmusChaos, an open source, cloud native chaos engineering framework for Kubernetes, which entered the Cloud Native Computing Foundation sandbox last June. They presented how Intuit, as a CNCF end user, uses LitmusChaos to manage and orchestrate cloud native chaos experiments, including creating DevOps chaos workflows.
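The steady-state check Mukkara describes can be sketched in a few lines of Python. This is a toy illustration and not Litmus code: the `ToyService` class, its replica counts and the `run_experiment` helper are hypothetical names invented for this example.

```python
import random


class ToyService:
    """Hypothetical stand-in for an application with several replicas."""

    def __init__(self, replicas):
        self.healthy = [True] * replicas

    def is_steady(self, min_healthy=1):
        # Steady state here means: enough healthy replicas to serve traffic.
        return sum(self.healthy) >= min_healthy

    def kill_random_replica(self, rng):
        # The injected fault: take down one healthy replica at random.
        candidates = [i for i, ok in enumerate(self.healthy) if ok]
        if candidates:
            self.healthy[rng.choice(candidates)] = False


def run_experiment(service, rng, min_healthy=1):
    """Return 'pass' if the steady state survives the fault, else 'fail'."""
    if not service.is_steady(min_healthy):
        return "aborted"  # do not inject chaos into an already unhealthy system
    service.kill_random_replica(rng)
    return "pass" if service.is_steady(min_healthy) else "fail"


rng = random.Random(0)
print(run_experiment(ToyService(replicas=3), rng))  # three replicas tolerate one loss
print(run_experiment(ToyService(replicas=1), rng))  # a single replica does not
```

A "fail" result is the useful outcome here: it marks the weakness the experiment was designed to find.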
Litmus’s Declarative Flavor of Chaos Engineering
LitmusChaos provides custom APIs via CustomResourceDefinitions (CRDs) to orchestrate chaos on Kubernetes clusters.
LitmusChaos “works in [a] cloud native, totally declarative way,” Mukkara said, which means it allows you to define chaos like any other custom resource within Kubernetes. This works the same way at the infrastructure level, at the application level and for Kubernetes nodes, as well as for resources inside a node such as memory, CPU and disks.
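To make "chaos as a custom resource" concrete, here is a rough sketch that builds a ChaosEngine-style manifest as a Python dictionary and prints it as JSON. The field names (`appinfo`, `chaosServiceAccount`, `experiments`, `TOTAL_CHAOS_DURATION`) are recalled from the Litmus `v1alpha1` schema and should be checked against the project documentation. The resource name, service account, label and `pod_delete_engine` helper are hypothetical.

```python
import json


def pod_delete_engine(app_namespace, app_label, duration_seconds):
    """Build a ChaosEngine-style manifest for a pod-delete experiment."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # "demo-chaos" is a hypothetical resource name.
        "metadata": {"name": "demo-chaos", "namespace": app_namespace},
        "spec": {
            # Which application the chaos should target.
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account allowed to delete pods.
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_seconds),
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


manifest = pod_delete_engine("default", "app=nginx", 30)
print(json.dumps(manifest, indent=2))
```

Because the experiment is only data, it can be reviewed, versioned and applied like any other Kubernetes object.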
He went on to say that “Litmus provides all that’s required to run chaos engineering at scale across your enterprises.”
This includes the ChaosHub, which allows even the user with limited experience to introduce “off-the-shelf” chaos into their systems, onboarding in three simple steps. It now includes 22 generic experiments:
- Pod delete
- Container kill
- Pod network latency
- Pod network loss
- Pod CPU hog
- Pod memory hog
- Disk fill
- Disk loss
- Node CPU hog
- Node memory hog
- Node drain
- Kubelet service kill
- Pod network duplication
- Node taint
- Docker service kill
- Pod autoscaler
- Service pod — application
- Application service
- Cluster pod — kiam
- Pod IO stress
- Node IO stress
Pod delete is by far the most popular chaos template, while memory and service reliability are also used often.
Mukkara said that LitmusChaos is highly extensible: users can use the Litmus SDK to “bring your own chaos” (BYOC), which they are then encouraged to contribute back to the project.
Litmus uses Argo, the GitOps-oriented cloud native continuous integration/continuous deployment (CI/CD) tool that also joined the CNCF in 2020, to create chaos workflows at scale, allowing the results of different experiments to be consolidated.
Mukkara explained, “Because this entire workflow is configured declaratively you can practice chaos engineering using GitOps.” He added that “you set up a chaos workflow which results in a set of chaos metrics and events, which are uploaded to Prometheus,” a CNCF monitoring tool.
These chaos workflows were initiated by the Intuit team in order to execute chaos while simulating other workload behavior in parallel.
Intuit Applies Litmus Chaos Workflows to a DevOps Pattern
The Intuit Developer Platform has 4,000 software developers with 2,500 services on 230 clusters — and growing. The reliability team, which Nagal leads, has been working with chaos engineering for about three years now.
Nagal and Mukkara began their Litmus proof of concept last February. In October they open sourced the Litmus plugin infrastructure and the Litmus Python and Argo workflow, which covers the Argo workflow itself, performance and chaos with Argo, and the Argo workflow via Jenkins.
At Intuit, the team built a plugin infrastructure in which all of the work is done through custom resources. They use role-based access control to target specific applications and specific Kubernetes namespaces. All of this data is then pushed to various monitoring and observability tools, and execution is handled by the company’s Jenkins pipeline. The chaos operator watches for the custom resources.
The experiments are then embedded within containers. The team writes its chaos experiment tests and puts them in the Argo workflow, which is stored in Git and integrated with Jenkins. Argo then executes the workflow, picking the specified experiment and launching it.
Why use these unique workflows instead of something like a plain YAML file?
Nagal said, “Logically speaking if you really want to execute everything as part of pipeline, many scenarios, it becomes very challenging. So automation was one thing. Now Argo workflow, everything is coming as a one YAML where we can just use one of the parameters to the go-submit.”
He went on to say that since everything is code, the team doesn’t have to maintain different kinds of YAML across its hundreds of clusters. It also fits right into the Intuit CI/CD pipeline with automation and infrastructure as code.
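The "one YAML plus a parameter" idea can be sketched as a single Argo Workflow definition whose experiment name is a workflow parameter. This is an illustration of the pattern and not Intuit’s actual workflow: the container image, the `run-experiment` command and the `chaos_workflow` helper are hypothetical, while the `apiVersion`, `kind`, `arguments.parameters` and `{{workflow.parameters...}}` fields follow the Argo Workflows schema.

```python
import json


def chaos_workflow(experiment, app_namespace="default"):
    """Build one parameterized Argo Workflow that can run any experiment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "chaos-"},
        "spec": {
            "entrypoint": "run-chaos",
            # Default parameter values; they can be overridden at submit time.
            "arguments": {
                "parameters": [
                    {"name": "experiment", "value": experiment},
                    {"name": "appNamespace", "value": app_namespace},
                ]
            },
            "templates": [
                {
                    "name": "run-chaos",
                    "container": {
                        # Hypothetical image and command wrapping the experiment.
                        "image": "example.com/chaos-runner:latest",
                        "command": ["run-experiment"],
                        "args": [
                            "--name",
                            "{{workflow.parameters.experiment}}",
                            "--namespace",
                            "{{workflow.parameters.appNamespace}}",
                        ],
                    },
                }
            ],
        },
    }


workflow = chaos_workflow("pod-delete")
print(json.dumps(workflow, indent=2))
```

Saved to a file, a definition like this could be reused for a different experiment by passing a parameter on submission, for example `argo submit chaos.json -p experiment=pod-cpu-hog` (the file name is made up here), rather than by maintaining a separate file per scenario.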
Nagal went on to list the benefits of the Litmus-Argo workflow, including:
- Cost savings with optimum resource utilization
- Reliability with chaos for performance
- Ability to build complex scenarios
- Ease of self-service, rapid onboarding
- Coverage of the whole lifecycle
He said it also allows you to get not only the statelessness of the chaos but also the statelessness of the performance.
“As this whole execution is happening in a manner that is very predictable, it brings a lot of confidence in the whole set-up,” Nagal said.
The Cloud Native Computing Foundation, KubeCon+CloudNativeCon and MayaData are sponsors of The New Stack.