Chaos engineering has matured beyond Netflix’s original Chaos Monkey project, but what the practice actually means is still in the eye of the beholder. Sixty-three percent of the more than 400 IT professionals Gremlin surveyed for its “2021 State of Chaos Engineering” report have performed a chaos experiment in a dev or test environment, but “only” 34% have done so in production. Isn’t chaos engineering, by definition, the testing of production systems? Overall market adoption is almost certainly lower than in this self-selecting sample, and even here 40% of respondents have never conducted a chaos attack. For context, a 2020 study found that 26% of site reliability engineers used a chaos engineering tool.
When asked, practitioners said that availability and mean time to recovery (MTTR) top their lists of chaos engineering benefits, with a reduction in the number of pages (a proxy for severe incidents) ranked as less important. While the study collected benchmarking data, there is no way to determine whether chaos engineering actually caused an improvement in these metrics.
We believe organizations that embrace progressive delivery practices are well positioned to make experimental changes in production environments. Whether these are called canary deployments, feature flagging or chaos tests doesn’t really matter. Organizations with poor service availability levels are less likely to have these abilities than organizations with 99.9% or better uptime. The next step is to create a scalable model that lets us test the hypothesis that chaos engineering is worth the investment.
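To put the 99.9% uptime threshold mentioned above in concrete terms, an availability percentage can be converted into a downtime budget. The following is a minimal sketch, not anything taken from the Gremlin report: the function name `downtime_budget_minutes` is hypothetical, and the 30-day month and 365-day year are simplifying assumptions.

```python
# Illustrative arithmetic only: converts an availability target into the
# amount of downtime it permits. Assumes a 30-day month and a 365-day year.

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, period_days: int) -> float:
    """Return the minutes of downtime allowed over period_days
    at the given availability percentage (hypothetical helper)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        per_month = downtime_budget_minutes(target, 30)
        per_year_hours = downtime_budget_minutes(target, 365) / 60
        print(f"{target}% uptime: {per_month:.1f} min/month, "
              f"{per_year_hours:.2f} h/year")
```

Under these assumptions, 99.9% uptime leaves roughly 43 minutes of downtime per month, or just under nine hours per year, which is the budget any production experiment would have to fit within.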
Gremlin is a sponsor of The New Stack.