There Is No Resilience without Chaos
In many ways, chaos engineering is at the stage where disaster recovery was a few years ago. Before disaster recovery became widely accepted by the IT community, a prolonged data center or server outage would typically turn into a catastrophic failure because the disaster recovery system did not function as it should or, even worse, did not exist. More tragically still, the IT team often realized after the incident that there had been obvious signs that an outage was imminent, but they had failed to heed those signs because they did not know where to look. This is where chaos engineering comes into play.
Chaos engineering has emerged as an increasingly essential process for maintaining application reliability, not only in cloud native environments but in any IT environment. Unlike conventional pre-production testing, chaos engineering involves determining when and how software might break in production by testing it in a non-production scenario.
In this way, chaos engineering becomes essential for preventing outages long before they happen. It is defined as an overlap between reliability testing and experimentation with code and applications, both across a continuous integration/continuous delivery (CI/CD) pipeline and during maintenance and management tasks once the software is deployed. Chaos engineering is achieved by obtaining metrics and data about how an application might fail when certain errors are induced through experiments. The end result is continuous resilience across CI/CD and operations.
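The idea of inducing errors and collecting metrics on how an application responds can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen for illustration only: the `FlakyDependency` class, the retry logic and the 30% failure rate do not come from any tool or speaker mentioned in this article, and a real experiment would target actual infrastructure rather than a simulated dependency.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service whose failures can be induced on demand."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate  # fraction of calls that should fail
        self.rng = rng

    def call(self):
        # Inject a fault with the configured probability.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(dep, max_attempts):
    """Application code under test: retry the dependency a bounded number of times."""
    for _ in range(max_attempts):
        try:
            return dep.call()
        except ConnectionError:
            continue
    return None  # every attempt failed


def run_experiment(failure_rate, max_attempts, requests=1000, seed=42):
    """Induce faults at the given rate and return the observed success ratio."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dep = FlakyDependency(failure_rate, rng)
    successes = sum(
        call_with_retries(dep, max_attempts) == "ok" for _ in range(requests)
    )
    return successes / requests


if __name__ == "__main__":
    # Steady state first, then the same workload with faults induced.
    print("no faults, no retries:  ", run_experiment(0.0, max_attempts=1))
    print("30% faults, no retries: ", run_experiment(0.3, max_attempts=1))
    print("30% faults, 3 attempts: ", run_experiment(0.3, max_attempts=3))
```

Comparing the success ratio with and without retries is the kind of metric such an experiment yields: it shows, before production, how much resilience a given mitigation actually buys.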
“Chaos engineering gives us a way to practice our professional skills, especially with a focus on improving time to recovery,” Adrian Hornsby, principal system dev engineer for AWS, said during his keynote, while noting that the average cost per hour of infrastructure downtime is $100,000, according to IDC stats. “It helps us build a culture of resilience.”
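To put the quoted downtime figure in perspective, here is a rough back-of-the-envelope sketch in Python. Only the $100,000-per-hour figure comes from the keynote; the recovery times and the number of incidents per year are made-up inputs, and the function names are hypothetical.

```python
COST_PER_HOUR_USD = 100_000  # average hourly downtime cost quoted in the keynote


def downtime_cost(minutes_to_recover, cost_per_hour=COST_PER_HOUR_USD):
    """Estimated cost of a single outage that lasts the given number of minutes."""
    return cost_per_hour * minutes_to_recover / 60


def yearly_savings(before_minutes, after_minutes, incidents_per_year):
    """Estimated yearly savings from shortening time to recovery per incident."""
    per_incident = downtime_cost(before_minutes) - downtime_cost(after_minutes)
    return per_incident * incidents_per_year


if __name__ == "__main__":
    # Assumed inputs: recovery improves from 90 to 30 minutes, four incidents a year.
    print(downtime_cost(90))          # cost of one 90-minute outage
    print(yearly_savings(90, 30, 4))  # savings across four incidents
```

Under these assumed inputs, a single 90-minute outage costs $150,000, and cutting recovery to 30 minutes saves $400,000 over four incidents.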
How chaos engineering is used to mitigate the threat of money lost during downtime or, worse yet, unrecoverable failure, along with its adoption path and the tools involved, were some of the main themes discussed during Harness’ annual users’ conference, Chaos Carnival, in March.
With cloud native applications on track to reach a total market size of $17 billion by 2028, chaos engineering and proactive resilience testing tools and services will be leveraged to ensure enterprises achieve maximum system availability and developer productivity, Prithvi Raj, technical community manager for Harness and a community lead for CNCF project LitmusChaos, estimates. All told, nine of the top 15 banks in the U.S. are pursuing chaos engineering, and six of the 10 biggest retail companies in the U.S. have already brought in some form of chaos engineering as a practice.
“Potentially, chaos has seen the highest demand from banking, fintech, e-commerce and telecommunication, with other sectors picking up. The cloud native chaos engineering space has grown over tenfold in the last year alone and we are experiencing almost 60,000 installations per month, with an average of at least seven to eight experiment runs per installation per month for just the open source tools out there,” Raj said. “While stats show that demographically the U.S., Canada, India, China and Mexico have the most chaos experiment runs per month, growth in Latin America and Europe has been phenomenal, with Brazil, France and Germany picking up.”
The key takeaways for chaos engineering today and the Chaos Carnival conference that Raj communicated include:
- The true potential of chaos engineering can “be unleashed by embracing the practice and removing negativity.”
- Chaos engineering integrations with CI/CD tools are vital today to keep pace with cloud native dependencies and the continuous system changes that come with developers’ rapid release frequency.
- The struggles of chaos engineering are based on “organizational structure and truth and the road ahead requires perseverance.”
- Documenting chaos engineering is important, as with any other framework, to avoid a trial-and-error way of functioning.
- Security chaos engineering has enabled achieving automated cloud security and “is the need of the hour.”
- The road to building chaos engineering ahead is community collaboration and “working with the CNCF to help open source projects grow.”
One of the main results of chaos engineering is continuous resiliency. In that respect, it is analogous to monitoring and observability, to which chaos engineering is closely related.
In order to achieve observability, astute monitoring is required. It is the act of proper monitoring of systems, software operations, CI/CD pipelines, etc. that leads to the state of observability required to make decisions based on the ability to process the data.
Chaos engineering, when done properly, requires observability. Problems that can cause outages and degrade overall performance can be detected well ahead of time, as bugs, poor performance, security vulnerabilities, etc. become manifest during a proper chaos engineering experiment. Once these bugs and kinks, which can lead to outages if left unheeded, are detected and resolved, true continued resiliency in DevOps can be achieved.
In the event of a failure, the SRE or operations person seeking the source of the error is often overloaded with information. “You’ve got to be able to reduce everything to actionable insights rather than just having every dashboard red and every log browser just scrolling up the screen faster than you can read. And your observability system not only needs to stay up when everything’s down (so you probably want it on some different infrastructure) but it also needs to cope with floods of alerts without failing,” Adrian Cockcroft, a technical advisor, said. “And then to really do that, to show that the way you’re set up works, you need to run these regular chaos engineering experiments.”
Chaos engineering requires experiments and tests. Among the tools available for that, Harness’s LitmusChaos open source project and enterprise versions were discussed during the conference. ChaosNative Litmus Enterprise has helped DevOps and site reliability engineers (SREs) to adopt chaos engineering tools that are self-managed, while the cloud service, ChaosNative Litmus Cloud, offers a hosted LitmusChaos control plane. “On offer are over 50 chaos experiments that LitmusChaos has already developed for you,” Uma Mukkara, head of chaos engineering for Harness, said during his conference keynote. They cover a range of Kubernetes resources, cloud platforms such as AWS, GCP and Azure, and applications such as Cassandra and Kafka. “When it comes to Litmus use cases, you can start continuous chaos testing with Litmus or start practicing game days or start using chaos engineering in your performance engineering testbed or you can start integrating observability and chaos together,” Mukkara said.