Gremlin-Spinnaker Integration for Automating Chaos Engineering
It was the logical place to start, according to Matthew Fornaciari, Gremlin co-founder and chief technology officer. “It just made sense.”
Spinnaker, a platform developed for highly distributed and hybrid systems, grew out of Netflix. Not coincidentally, Gremlin co-founder and CEO Kolton Andrus was a chaos engineer at Netflix and helped build its second generation of chaos engineering tools, known as FIT (Failure Injection Testing).
Fornaciari described chaos engineering as “taking reactive to proactive.” It means deliberately introducing stressful conditions into your systems to discover and fix the things that will break, for example when you scale.
“The whole idea of integrating with CI/CD is not drifting into failure. There’s a lot of moving parts. Applications will continuously change, the infrastructure will continuously change and the whole idea is to integrate them into your CI/CD so you never suffer the same failure twice,” he said.
The New Stack has written about how chaos engineering can build stability in distributed systems. And Gremlin site reliability engineer Tammy Bütow pointed out things you need to do before adopting chaos engineering, such as understanding the cost of downtime for your business and having good monitoring in place.
The integration with Gremlin means Spinnaker users can automate chaos experiments across multiple cloud providers including AWS EC2, Kubernetes, Google Compute Engine, Google Kubernetes Engine, Google App Engine, Microsoft Azure, OpenStack and more.
“You inject a little bit of failure so you can build up a resilience tolerance. You want to start off with a very basic failure model. Boost CPU to 100% or something like that,” Fornaciari explained.
“Then you want to increase the blast radius, is how we describe it. You want to start very small, maybe with one host, then you want to increase as you become more confident. At first you want to run it by hand, then eventually you want to integrate it into CI/CD. You want to make it so that after you increase the blast radius, you can say, ‘Cool, we’re going to run this all the time,’ and make this part of our build process. We’re never going to have this failure again because we’re constantly running this.”
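The progression Fornaciari describes (one basic failure, a small blast radius, a gradual increase, then automation) can be sketched in a few lines of Python. This is an illustrative outline only, not Gremlin's or Spinnaker's API: the names `Stage`, `plan_stages` and `run_experiment`, the doubling schedule and the `cpu-100-percent` label are all assumptions made for the example, and the `inject` and `is_healthy` callbacks stand in for whatever tooling actually launches the attack and checks monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    hosts: int   # how many hosts this stage targets (the blast radius)
    attack: str  # hypothetical label for the failure being injected


def plan_stages(total_hosts: int, attack: str = "cpu-100-percent") -> List[Stage]:
    """Start with one host and double the blast radius until all hosts are covered."""
    if total_hosts < 1:
        raise ValueError("total_hosts must be at least 1")
    stages: List[Stage] = []
    hosts = 1
    while hosts < total_hosts:
        stages.append(Stage(hosts, attack))
        hosts *= 2
    stages.append(Stage(total_hosts, attack))  # final stage covers every host
    return stages


def run_experiment(
    stages: List[Stage],
    inject: Callable[[Stage], None],
    is_healthy: Callable[[], bool],
) -> int:
    """Run stages in order and stop at the first one the system does not survive.

    Returns the number of stages that passed.
    """
    passed = 0
    for stage in stages:
        inject(stage)          # placeholder for launching the real attack
        if not is_healthy():   # placeholder for a monitoring check
            break
        passed += 1
    return passed


if __name__ == "__main__":
    stages = plan_stages(10)
    print([s.hosts for s in stages])  # [1, 2, 4, 8, 10]
    passed = run_experiment(stages, inject=lambda s: None, is_healthy=lambda: True)
    # Only once every stage passes would the experiment graduate to the pipeline.
    print("ready for CI/CD:", passed == len(stages))
```

Under these assumptions, a team would run the early stages by hand and move the experiment into the build pipeline only after every stage passes.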
There’s a certain maturity level required, and many organizations aren’t ready to break things, even if offered the chance to do so on their own terms.
“We’d like to see [organizations] automate these things right away, especially for huge enterprises with microservices, cloud and distributed applications,” said Adam LaGreca, Gremlin director of communications.
“Once you get mature enough to automate these things, it takes a lot of the manual and guesswork out of it. But a lot of people are new to it and start with manual experiments. They’re playing around and understanding how to do chaos engineering before they automate it, but we do think that once you reach a certain maturity, you do want to automate these experiments,” he said.
Yet Chaos Monkey only does one thing: randomly shut down servers. Gremlin’s full toolkit lets users throw nearly a dozen different attacks at their systems, including resource, state and network attacks, so they can prepare fallback plans before such failures happen in production.