
3 Key Takeaways About the State of Chaos Engineering

8 Mar 2021 10:15am, by

Matthew Fornaciari
Matt is co-founder and CTO of Gremlin. Previously, he was a senior platform engineer at Salesforce, where he led the charge to bolster the experience of viewing and editing each and every record. Before that he improved the reliability and customer experience of the Amazon Retail website, where he founded the Fatals team which reduced the number of website errors by half in its first year.

I’ve been doing chaos engineering for nearly a decade, dating back to my time at Amazon, where, as an engineering lead, I founded the “Fatals” team. We were responsible for diagnosing shortcomings in code quality and developing innovative tools to analyze and resolve systemic failures across the Amazon platform.

It’s not a mystery why chaos engineering — and SRE more broadly — were largely born out of companies like Amazon, Netflix and Google. These are companies with massive, complex systems and a user base that feels the pain of downtime acutely. Looking at the cost of downtime for the top e-commerce companies, Amazon loses roughly $200,000 for each minute that the website is down.

Today, however, nearly every business is an online business. The pandemic has only accelerated this transformation for many companies. We founded Gremlin five years ago with the mission to make the internet more reliable through both education and tooling, helping customers adopt the practices my co-founder Kolton Andrus and I learned from years of working at places like Amazon, Netflix and Salesforce. We knew that the popularization of the cloud and microservices would mean an increase in complexity for everyone. We like to say that if you want to be like Amazon or Netflix, then you had better be ready to inherit the challenges that come with that scale!

All of the interconnected services at Amazon and Netflix

Simply put, today’s systems are far too distributed and complex for any one engineer or team to fully understand. So how do we respond to this truth, as an industry? I’ve heard many people say, “my system already has enough chaos, we don’t need to add more!” And that is exactly the attitude we need to dispel. Chaos Engineering is not about adding random chaos; it’s about introducing controlled chaos — to validate our assumptions and better understand what actually happens when systems misbehave. Those problems will continue to exist in your system whether you decide to address them or not, so why not proactively prod those problems to manifest via GameDays, during normal business hours, instead of as customer-facing outages at unpredictable times?
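The idea of a controlled experiment can be sketched in a few lines. This is an illustrative Python sketch, not Gremlin’s API: all of the names (`chaotic_call`, `fetch_price`, the latency and failure-rate parameters) are assumptions invented for the example. It injects a known amount of latency or failure into a dependency call to check whether the fallback path actually fires:

```python
import random
import time


def chaotic_call(func, latency_s=0.0, failure_rate=0.0):
    """Wrap a dependency call with *controlled* fault injection:
    a fixed amount of added latency and a known failure probability.
    (Hypothetical helper for illustration only.)"""
    if random.random() < failure_rate:
        raise ConnectionError("injected failure")
    time.sleep(latency_s)
    return func()


def fetch_price():
    # Stand-in for a real downstream service call.
    return 42.0


def fetch_price_with_fallback(timeout_s=0.1):
    """The behavior under test: if the dependency is slow or down,
    we expect to serve a cached value instead of failing the user."""
    start = time.monotonic()
    try:
        # GameDay experiment: inject 200ms of latency, well past our timeout.
        price = chaotic_call(fetch_price, latency_s=0.2)
        if time.monotonic() - start > timeout_s:
            return "cached"  # dependency too slow: degrade gracefully
        return price
    except ConnectionError:
        return "cached"  # dependency down: degrade gracefully
```

Running `fetch_price_with_fallback()` with the injected 200ms delay confirms the fallback fires, which is the assumption the experiment exists to validate; without injection, `chaotic_call(fetch_price)` behaves like the plain call.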

Over the past five years, the Gremlin team has put just as much effort into driving the cultural shift toward more proactive operations as we have into building the tooling to run these experiments safely and securely. We wanted to get a snapshot of how well engineering teams are adopting and understanding chaos engineering, which led us to produce the first-ever State of Chaos Engineering report.

Here are some of my key takeaways:

1) Consistent Chaos Engineering = Higher Levels of Availability: It was great to see this fundamental thesis validated by the market. Certainly, we’ve known for a long time that getting ahead of problems saves companies time and money, and improves their overall reliability. But as with any new discipline, it’s even more important to see repetition and the formation of the habit. The most successful organizations have not only adopted chaos engineering as a practice, but they also execute attacks on a regular basis. 45.9% of companies with availability greater than 99.99% are executing attacks on at least a quarterly cadence.

2) Companies with High Availability Are Early Adopters: Companies that are early adopters of modern practices, such as canary deployments and feature flagging, are the same companies reporting the highest levels of availability (99.9%+ uptime). The tools of particular interest noted in the report were DNS failover/elastic IPs, circuit breakers, and selective deployment rollouts. This highlights that chaos engineering is part of a larger set of tools and processes that high-performing teams are adopting.

3) C-Levels Need to Be More Involved in Resilience Efforts: The fact of the matter is, the engineering culture you incentivize is the culture you will have. So, for example, if you only promote engineers based on product velocity — and not on how well those new features have been tested and can withstand failure — then your engineers will simply not prioritize building with reliability in mind. This was the impetus behind Chaos Monkey at Netflix; engineers knew that at any time, servers could be unplugged, so their systems had to be built to withstand those failures. The following chart should be a call to action for all executives to take a greater interest in the work their teams are doing to make their products more reliable and improve their customer experience.
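Chaos Monkey’s core behavior — picking one instance at random from a fleet to terminate on a schedule — can be sketched in a few lines. This is a simplified illustration, not Netflix’s implementation (the real tool also honors opt-outs, schedules, and per-group termination limits), and the names here are invented for the example:

```python
import random


def pick_victim(instances, seed=None):
    """Choose one instance at random to terminate, Chaos Monkey-style.
    Sorting first makes the choice deterministic for a given seed,
    which is handy when rehearsing an experiment. (Illustrative only.)"""
    rng = random.Random(seed)
    return rng.choice(sorted(instances))


fleet = {"i-0a12", "i-0b34", "i-0c56"}
victim = pick_victim(fleet, seed=7)  # in production: terminate this instance
```

The point is less the selection logic than the incentive it creates: because any instance can be chosen at any time, teams must design services that survive losing one.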

We expect to continue to see broader adoption of the practice of chaos engineering and look forward to uncovering new trends in the next report. The chaos engineering community continues to see new faces and talented engineers evangelizing the discipline. We’d love to hear more about how your team is approaching chaos engineering — and if you need help getting started, don’t hesitate to reach out to me on Twitter!

Feature image via Pixabay.
