Gremlin’s Tammy Bütow on the Business Side of Chaos Engineering

Tammy Bütow was doing chaos engineering long before Netflix and its poo-throwing monkey gave a name to it. About a decade ago, she worked in a bank on a team that performed disaster recovery testing. But instead of continuous automated testing, once a quarter hundreds of people crammed into a room and threw the kitchen sink at their systems. This was, and still is, a requirement for most banking licenses.
A lot has changed over the last ten years, and when Bütow sat down with The New Stack, she had a lot of wisdom to impart. First, let’s summarize a bit more of her ample breaking-stuff-for-good experience before we share more tips from the Empress of Chaos.
“It [your systems] will break. You can’t think that it’s never going to break. It’s better to break it first in controlled chaos.” — Tammy Bütow
In 2009, Bütow was thrilled to break — for the first time on purpose — the mortgage system at National Australia Bank (NAB).
“You quickly learn what you need to fix by breaking systems,” she said. “You learn that it’s OK to learn — it isn’t horrible” or too risky to endeavor to destroy what you’ve built.
Indeed, chaos engineering — experimenting with your distributed system to find and then fix its weaknesses — is one of the best ways to increase uptime, reliability, speed and security of the software you’re releasing.
In 2014, NAB deployed Chaos Monkey to kill its servers 24/7, which made tech news because on-call developers started getting a good night’s sleep. The open-source Chaos Monkey tool, borrowed from Netflix, gave the team confidence as it did a massive migration to Amazon Web Services, and still achieved a reduction in alerts.
In 2015, Bütow headed over to the U.S. and joined DigitalOcean on a special task force team called Tank to reduce incidents. Then she joined Dropbox as a site reliability engineering (SRE) manager doing chaos engineering, with five engineers managing over 6,000 database machines running MySQL.
“You automate everything heavily, but you also really want to reduce the amount of incidents [that] are happening,” Bütow said. Her team saw a ten-times reduction in incidents after implementing chaos engineering.
Finally, last year, she joined Gremlin, the first company dedicated to chaos-as-a-service. Taking it further than Chaos Monkey, which just randomly terminates one system at a time, Gremlin unleashes gremlin agents on your machines or inside your containers.
Those Gremlin experiments, run in production, “will help you unearth weaknesses in your system,” she explained.
What to Do Before Adopting Chaos Engineering
We have already written about how chaos engineering is an effective way to build stability in distributed systems. But how do you get started with it? Bütow warns against jumping right into this shiny thing and offers advice for what you need to do before you leap to attack.
From her perspective as someone who specializes in building reliable systems, she has identified a few things you need to do before adopting chaos engineering.
“I think everybody should use chaos engineering but they need to do the legwork upfront to be ready to do it.” — Tammy Bütow
Chaos Engineering Must-Have #1: A Really Good Incident Management Program
If you don’t have a really good incident management program, you’ll be injecting failures you can’t properly detect or respond to. Bütow, who has worked at a lot of places that didn’t have this in place, says it must include “an ability to detect, diagnose, and mitigate high severity incidents that are customer impacting.”
She continued, “My whole goal is that if you have a really great way to manage incidents, you should have a really fast mean time to detect (MTTD).”
This can only be achieved through automation, with a tool that detects and notifies within the toolset your team is most comfortable with, usually via integrations with Slack and Jira.
“You need some sort of alert and know [what’s] happening to be able to just detect why that’s happening — in under five minutes — [which is] really hard unless you automate it,” Bütow said.
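To make the metric concrete, here is a minimal sketch of how MTTD could be tracked, assuming hypothetical incident records exported from your incident management tool; the field names and timestamps are invented for the example.

```python
# Minimal sketch of tracking mean time to detect (MTTD), assuming hypothetical
# incident records with "started" and "detected" timestamps pulled from your
# incident management tool.
from datetime import datetime

incidents = [
    {"id": "SEV0-101", "started": "2018-03-01T02:14:00", "detected": "2018-03-01T02:17:30"},
    {"id": "SEV1-102", "started": "2018-03-04T11:05:00", "detected": "2018-03-04T11:09:10"},
]

def mttd_minutes(incidents):
    """Average gap between an incident starting and being detected, in minutes."""
    gaps = [
        (datetime.fromisoformat(i["detected"]) - datetime.fromisoformat(i["started"])).total_seconds()
        for i in incidents
    ]
    return sum(gaps) / len(gaps) / 60

print(f"MTTD: {mttd_minutes(incidents):.1f} minutes")  # target: under five minutes
```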
Chaos Engineering Must-Have #2: Really Good Monitoring in Place
Have good monitoring in place. This should go without saying, but you not only need to know when an issue occurs and who is going to fix it; you also need to know what exactly happened.
Bütow says that without monitoring, “You don’t have a way of tracking the weaknesses, and your chaos engineering may show you have problems, but you won’t actually know that problems are happening without good monitoring.”
Chaos Engineering Must-Have #3: A Really Good Idea of the Business (and Financial) Impact
Before doing any chaos testing, know the impact it will have on the business. This gets to the crux of a lot of things we write about here at The New Stack — where culture, technology and business objectives meet. Chaos engineering can’t be isolated to developers. Before embarking on this resiliency action, try to figure out the risk of not doing it and, particularly, the cost of downtime, right down to broken service level agreements and lost customers.
“The other thing I noticed people find really hard to do is to be able to show the business cost and the customer-related cost to an incident,” Bütow explained.
What is the true cost if there’s an eight-hour outage? A two-day outage?
She says this completely depends on your company. Do you have a freemium or pay-only model? If you are providing a service that your customers are paying for, they expect that service to be up and running. How many customers do you have? How long would the service be down, and what would that mean for them?
Bütow gave the example of an airline booking system. It’s not just the loss of new ticket sales, but the cost of rebooking everyone, accommodations, food and drink vouchers, and the loss of trust in a damaged brand. Last year, British Airways was down for ten hours, an outage estimated to have cost it roughly 80 million pounds in lost revenue.
“The hard thing is there’s no quick, fast, easy way to measure the cost of outages,” Bütow admitted.
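There is no single formula, but a rough back-of-the-envelope estimate usually combines lost revenue, SLA credits, remediation costs and churn. The sketch below shows the shape of that calculation with entirely hypothetical figures.

```python
# Back-of-the-envelope downtime cost estimate. All figures are hypothetical
# placeholders; real numbers have to come from your own business.
def estimate_outage_cost(outage_hours,
                         revenue_per_hour,        # lost sales while the service is down
                         sla_credits,             # contractual payouts for broken SLAs
                         remediation_cost,        # rebookings, vouchers, extra support staff
                         churned_customers,
                         customer_lifetime_value):
    lost_revenue = outage_hours * revenue_per_hour
    churn_cost = churned_customers * customer_lifetime_value
    return lost_revenue + sla_credits + remediation_cost + churn_cost

# An eight-hour outage vs. a two-day outage, with made-up numbers:
print(estimate_outage_cost(8,  50_000, 20_000, 100_000, 50, 2_000))    # 620000
print(estimate_outage_cost(48, 50_000, 150_000, 400_000, 500, 2_000))  # 3950000
```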
First and foremost, it’s about having finance and your SRE team sit down together.
She also calls for the industry to share stories for how to measure the impact. She even monitors a Slack community dedicated to chaos engineering.
“The companies that get the most value out of chaos engineering understand the cost of downtime.” — Tammy Bütow
“When you do chaos engineering, it’s most important to focus on your critical services that [are] customer facing,” she said. “I really think that companies should roll it out to their most critical customer-facing service first.”
Everything you do with chaos engineering is focused on reducing outages and incidents. As you work out the business impact, Bütow says, you will develop categories — what she calls SEV-0s. Focus first on the most severe, catastrophic incidents, and then, once you’ve tackled them, move on to the SEV-2s, which cost your company less but may still keep devs up at night.
Whiteboard That Chaos
Gremlin organizes game days with new customers, involving both the tech and finance sides. Bütow says that spending an hour in a room with a whiteboard — drawing out your system, asking what would happen if this or that broke, and then asking what would be affected downstream — is a great way to visualize your systems and plan out your chaos engineering.
She says to think in terms of a blast radius: start small and gradually expand.
“You can learn so much about a system by just taking an hour to try and sit there and draw it out,” Bütow added.
Next Up, Continuous Chaos
Launched in December, the Gremlin software was created to go further than Netflix’s Chaos Monkey. It still checks what happens when things go down, but it also lets you test for other common issues that can hurt your systems and your bottom line, like latency and firmware bugs. Gremlin’s chaos-as-a-service is about continuously — once a day or once a week — throwing 11 different kinds of attacks at your systems, from resource to state and network attacks, so you can create backup plans that work. In Gremlin, you can customize the duration and number of calls, with a default of one call every 60 seconds. You can use the pre-built templated attacks or create your own.
Bütow calls CPU resource consumption the “Hello World” of chaos engineering. She says it’s a better starting point than a shutdown attack, which people may find overwhelming, and many teams don’t yet have the automation in place to replace a terminated instance.
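To show what that “Hello World” looks like in practice, here is a minimal CPU-burn sketch (not Gremlin’s implementation, just an illustration) that pins a couple of cores for a minute so you can watch how your monitoring, autoscaling and alerting respond.

```python
# Minimal "Hello World" CPU attack sketch: burn N cores for a fixed number of
# seconds. This is an illustration only, not Gremlin's implementation; run it
# on a disposable test instance, never blindly in production.
import multiprocessing
import time

def burn_cpu(seconds):
    """Spin in a tight loop until the deadline passes, keeping one core pegged."""
    deadline = time.time() + seconds
    while time.time() < deadline:
        pass  # busy-wait

if __name__ == "__main__":
    cores = 2        # blast radius: start small
    duration = 60    # seconds
    workers = [multiprocessing.Process(target=burn_cpu, args=(duration,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("CPU attack finished -- check your dashboards and alerts")
```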
She offered the example of using chaos engineering on databases, as she did while working at Dropbox. The company ran a master with three replicas underneath it. If she shut down a replica, a clone was created and popped in as a new replica, restoring the full set. If she shut down the master, a replica would be promoted to take its place.
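As a toy model of that failover behavior (with invented names; Dropbox’s real tooling is far more involved), the sketch below shows a killed replica being backfilled by a clone and a replica being promoted when the master goes down.

```python
# Toy model of the failover behavior described above. Names are invented and
# the logic is deliberately simplified; it only illustrates the idea.
class Cluster:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def shutdown(self, node):
        """Kill a node; the cluster heals itself."""
        if node == self.master:
            self.master = self.replicas.pop(0)      # promote a replica to master
        else:
            self.replicas.remove(node)
        self.replicas.append(f"clone-of-{node}")     # a fresh clone backfills the replica set

cluster = Cluster("db-master", ["db-r1", "db-r2", "db-r3"])
cluster.shutdown("db-r2")       # kill a replica: a clone joins, still three replicas
cluster.shutdown("db-master")   # kill the master: db-r1 is promoted, a clone joins
print(cluster.master, cluster.replicas)
# db-r1 ['db-r3', 'clone-of-db-r2', 'clone-of-db-master']
```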
“People are only used to shutdown attacks from Chaos Monkey. There’s a whole array of attacks, and that’s why we’ve pre-built them for people, so they can go in and run them and see the impacts on the infrastructure,” Bütow said.
“Design systems with failure in mind from the beginning.” — Tammy Bütow
By automating this continuous chaos, you can continually work on improving your systems. Bütow says this starts with prioritizing safety and security; only then can you roll out chaos engineering in a controlled and elegant way and really make sure you are testing the right thing.
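One way to picture that continuous, automated chaos is a scheduled job that picks one attack from an approved list and runs it against a small-blast-radius target at a fixed interval. The attack names, target and runner below are hypothetical placeholders, not Gremlin’s API; in practice the hand-off would go to your chaos tooling of choice.

```python
# Sketch of "continuous chaos": on a schedule, pick one approved attack and run
# it against one target. Everything here is a placeholder, not Gremlin's API.
import random
import time

APPROVED_ATTACKS = ["cpu", "memory", "disk", "latency", "packet_loss", "shutdown"]
TARGETS = ["checkout-service-staging"]   # start with a small blast radius

def run_attack(attack, target):
    # Placeholder: hand off to your chaos tooling here.
    print(f"running {attack} attack against {target}")

def continuous_chaos(interval_seconds=24 * 60 * 60):
    while True:
        run_attack(random.choice(APPROVED_ATTACKS), random.choice(TARGETS))
        time.sleep(interval_seconds)  # once a day; adjust as confidence grows

if __name__ == "__main__":
    continuous_chaos()
```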