There are two things that seem to motivate developers: a speedy, self-explanatory onboarding experience and a bit of friendly competition. The new Scenarios feature from chaos-as-a-service provider Gremlin seems to check both boxes.
The Scenarios feature, which launched Thursday at the company’s Chaos Conf user conference, allows customers to test their system’s ability to withstand common cloud outage scenarios. Six ‘Recommended Scenarios’ for site reliability engineers, which run on the company’s chaos testing hosted service, simulate real-world website and software failures out of the box. From traffic spikes to unreliable networks, each template is based on a very public, real-world outage.
Getting Started with Chaos Engineering
Chaos engineering has engineers deliberately inject strain into distributed systems. This prepare-for-the-worst best practice lets teams learn about system weaknesses and address them.
The growing Gremlin team, now nearly 100 people, is made up of ex-FAANG (Facebook, Apple, Amazon, Netflix and Alphabet’s Google) engineers who used to do chaos testing at companies like Amazon, Google and Netflix, said Gremlin’s Director of Product Lorne Kligerman.
The idea for Scenarios drew on their former chaotic lives as well as on input from the customer success and developer advocacy teams. “We know things will fail today. We know things will fail in the future; systems are so complex. But do you know how it will fail?” Kligerman said. “The industry is moving from why chaos engineering to how do we do it?”
Scenarios arose as a way to offer immediate value to Gremlin customers, including those in the Freemium tier. It’s a way to get started with chaos engineering, concentrating on “things that really affect your company.”
“With the Gremlin features you are able to do a lot of custom things — but many of our customers who are just getting started with chaos engineering want immediate value and an easy way to get started,” Kligerman said.
These six use cases are the ones team members most often encountered while strengthening systems at their previous jobs, and the ones now plaguing Gremlin’s customers, particularly in the e-commerce space.
For each Scenario, you can write a description and a hypothesis, and add notes and observations if an incident is detected. Each Scenario comes with a set of checkboxes for logging:
- Were the results as expected?
- Was an incident detected?
- Was the incident mitigated?
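To make that record concrete, here is a minimal sketch of how a team might keep its own log of a Scenario run outside the product. The `ScenarioRun` class, its field names, the `needs_follow_up` rule and the sample hypothesis and note are all hypothetical illustrations; none of it is Gremlin's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScenarioRun:
    """Hypothetical record of one Scenario run (not Gremlin's data model)."""
    name: str
    description: str
    hypothesis: str
    # The three checkboxes described above.
    results_as_expected: bool = False
    incident_detected: bool = False
    incident_mitigated: bool = False
    # Free-form notes and observations.
    notes: List[str] = field(default_factory=list)

    def needs_follow_up(self) -> bool:
        # Assumed rule: follow up when the hypothesis did not hold, or when
        # an incident was detected but never mitigated.
        unmitigated = self.incident_detected and not self.incident_mitigated
        return (not self.results_as_expected) or unmitigated


# Illustrative values only.
run = ScenarioRun(
    name="Unavailable Dependency",
    description="An API call stops responding.",
    hypothesis="Checkout degrades gracefully when the dependency is down.",
    results_as_expected=False,
    incident_detected=True,
    incident_mitigated=True,
    notes=["Retries piled up until the client timeout was lowered."],
)
print(run.needs_follow_up())
```

Keeping the three answers as plain booleans makes it easy to compare repeated runs of the same Scenario over time.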
You run all the recommended Scenarios, uncover issues, fix them, and run the Scenarios again. Typically, you would then schedule the Scenarios to run daily or weekly to guard against regressions.
Kligerman said that Scenarios was built “to be able to prevent really commonplace outages happening to your business and your application so your customers aren’t affected.”
The six templates offered with this release are real-world outages that you’ve probably already seen in the news. Each is focused on common challenges facing companies moving to the cloud or distributed systems.
- Unavailable Dependency: This happened to U.S. department store giant Target back in June, when all the registers went down. This Scenario replicates the failure that occurs when an API call stops responding.
- Unreliable Networks: The increasing popularity of microservices architecture means a reliance on frequent and responsive API calls. How does your system respond when the APIs it’s connected to take longer to respond?
- Traffic Spikes: Also known as a Cyber Monday nightmare. Or last year’s Prime Day. This Scenario lets DevOps teams progressively add CPU load from 10 to 100 percent on selected hosts. It helps you plan for unusually high traffic, fine-tune thresholds, and test failover architectures.
- Region Evacuation: When you rely on a managed service like Google Cloud Platform or Amazon Web Services and a network connection gets cut, do your services fail over to the nearest region? This happened recently over Labor Day, when a power outage at AWS’s US-East-1 region not only fried some hardware but also caused some customer data to be lost.
- Host Failures: What happens if one of your hosts becomes unhealthy, or you are scaling down and aren’t using it anymore? This Scenario automates the failure so you understand how your system reacts to scaling up and down.
- DNS Outage: Sometimes a large part of the internet is simply unavailable. Kligerman says a lot of companies do not have a secondary DNS provider, nor do they have the DNS cached anywhere.
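The progressive load in the Traffic Spikes template can be pictured as a simple ramp. The sketch below is an illustration only: `cpu_load_steps`, `run_ramp`, the 10-point step size and the simulated health check are assumptions made for the example, and no real load is generated and no Gremlin API is called.

```python
from typing import Callable, List, Optional


def cpu_load_steps(start: int = 10, end: int = 100, step: int = 10) -> List[int]:
    """Return CPU load percentages for a progressive ramp, ending at `end`."""
    if not (0 < start <= end <= 100) or step <= 0:
        raise ValueError("expected 0 < start <= end <= 100 and step > 0")
    steps = list(range(start, end, step))
    steps.append(end)  # always finish at the target load
    return steps


def run_ramp(
    steps: List[int],
    apply_load: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> Optional[int]:
    """Apply each load level in turn and return the first level at which
    the health check fails, or None if every level passes."""
    for percent in steps:
        apply_load(percent)
        if not is_healthy():
            return percent
    return None


# Simulated system (assumption): it is unhealthy at 70 percent load or more.
applied: List[int] = []
breaking_point = run_ramp(
    cpu_load_steps(),
    apply_load=applied.append,
    is_healthy=lambda: applied[-1] < 70,
)
print(breaking_point)
```

Stopping at the first unhealthy level keeps the experiment from pushing a struggling system any further than it needs to.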
A lot of these outage examples are associated with Fortune 100 enterprises, but Kligerman warns that smaller companies need automated chaos engineering as much as the big ones.
“Everything is so interconnected. Everyone is using so many managed services, so many SaaS products.”
He went on to say that a smaller company will often rely on distributed offerings for each service. Whether it is authentication, storage or streaming, these companies aren’t building much on their own, which potentially makes them even more vulnerable to these outages.
Beyond the six Recommended Scenarios launching today, Gremlin also offers Custom Scenarios, which companies can model on other outages.
Scenarios also lets you link attacks together in a chain, so you can incrementally increase both the blast radius and the magnitude of any attack or failure. “The practice we teach is to grow the blast radius incrementally, and grow bigger the magnitude of the attack,” Kligerman said.
He says that many people, probably because of the name, think chaos engineering experiments are random. Nothing could be further from the truth.
“We believe in a much more thoughtful approach that includes creating a hypothesis, starting small, and gradually increasing the blast radius to the point where you feel your system is resilient to that particular failure mode or scenario,” Kligerman explained.
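That approach of starting small and widening step by step can be sketched as a chain of stages. Everything here is hypothetical: the `Stage` type, the host names, the choice to raise magnitude fully at each radius before widening, and the toy hypothesis check are assumptions for illustration, not how Gremlin builds or runs Scenarios.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    hosts: Tuple[str, ...]  # blast radius: which hosts are targeted
    magnitude: int          # attack intensity, for example percent CPU


def build_chain(
    hosts: Sequence[str], radii: Sequence[int], magnitudes: Sequence[int]
) -> List[Stage]:
    """Order stages from smallest to largest: at each radius, step through
    the magnitudes, then widen the radius (an assumed ordering)."""
    return [
        Stage(tuple(hosts[:radius]), magnitude)
        for radius in sorted(radii)
        for magnitude in sorted(magnitudes)
    ]


def run_chain(
    chain: Sequence[Stage], hypothesis_holds: Callable[[Stage], bool]
) -> Optional[Stage]:
    """Run stages in order and halt at the first one where the hypothesis
    fails. Returns that stage, or None if the whole chain passes."""
    for stage in chain:
        if not hypothesis_holds(stage):
            return stage
    return None


hosts = ["web-1", "web-2", "web-3", "web-4"]  # hypothetical host names
chain = build_chain(hosts, radii=[1, 2, 4], magnitudes=[25, 50])

# Toy hypothesis (assumption): the system copes while total injected load
# across the targeted hosts stays below 150.
failed = run_chain(chain, lambda s: len(s.hosts) * s.magnitude < 150)
print(failed)
```

Halting at the first failed stage reflects the idea of fixing a weakness at a small scale before attempting a larger one.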
In the future, Gremlin may set up a mechanism for users to share Custom Scenarios. Kligerman says the end goal is to build “a whole ecosystem where everyone is helping each other become resilient.”