
Chaos Engineering: What It Means, Why It Matters

23 Sep 2020 3:00pm

Gremlin sponsored this podcast.

Chaos engineering certainly evokes a lot of interest these days, especially as organizations increasingly rely on widely distributed data infrastructures that can extend across multicloud and on-premises environments, where the risk of failure grows exponentially. But while many agree that chaos engineering involves planning in some way, a widely accepted definition remains elusive.

For Kolton Andrus, CEO and co-founder of Gremlin, chaos engineering “is one of my favorite topics for debate,” and “is what makes chaos engineering sound fun and exciting.”

In this edition of The New Stack Makers podcast, Andrus defines chaos engineering and describes how organizations can make it work for them. Alex Williams, founder and publisher of The New Stack, hosted this episode.


The very idea of chaos — and an IT organization’s embrace of it — can conjure up fear in many. “[Chaos engineering] scares the pants off of some old school folks that aren’t comfortable with that kind of chaos in their environments. And so most people think chaos engineering is randomly breaking things and seeing what happens,” said Andrus. “I think that chaos engineering is thoughtful, planned experiments that teach us about our system and one of the key concepts that goes with that is this idea of the ‘blast radius.’ When we run this experiment, whom might we impact? Because the goal is to prevent outages, not to cause an outage and we never want to inadvertently cause customer pain. We never want to cause an outage because we were being cavalier in our approach.”

Andrus brings a deep background in the subject to the debate. One of the pioneers of chaos engineering, he was heavily involved in helping to avoid service outages before founding Gremlin, first at Amazon and then at Netflix. “When an outage happens, it’s time-intensive and expensive. It’s damaging to your brand,” he explained. “And if you work at a place like Amazon or Netflix, an outage costs hundreds of thousands to millions of dollars and so preventing every outage and preventing every minute of downtime is worth the investment.”

While his work at Amazon was more infrastructure-intensive, his mission at Netflix, as part of the API team, focused on application-level fault injection: injecting failure or delay into a specific service or function, such as those managing customers’ identities, recommendations or recently watched movies.

“What would happen if one of those failed? Well, truthfully, if I can’t get your recently watched movies, I probably shouldn’t just crash the application — we can gracefully degrade and give you a cast list or just not show you that and you can continue on,” said Andrus. “And so that allowed us to go through and be very, very precise about where we wanted to run these experiments.”
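The idea Andrus describes can be illustrated with a minimal Python sketch. It is not Gremlin's or Netflix's actual tooling: `FaultInjector`, `InjectedFault`, `build_home_page` and the two stand-in fetch functions are hypothetical names invented for illustration. The sketch assumes an experiment that fails exactly one named service, and a page builder that drops the affected section instead of crashing.

```python
class InjectedFault(Exception):
    """Raised by the hypothetical injector to simulate a failed dependency."""


class FaultInjector:
    """Hypothetical injector that fails calls only to the named services."""

    def __init__(self, failing_services=()):
        self.failing_services = set(failing_services)

    def call(self, service_name, func, *args):
        # Fail precisely the targeted service; everything else runs normally.
        if service_name in self.failing_services:
            raise InjectedFault(f"injected failure in {service_name}")
        return func(*args)


def fetch_recently_watched(user_id):
    # Stand-in for a real service call (assumption for this sketch).
    return ["Title A", "Title B"]


def fetch_recommendations(user_id):
    # Stand-in for a real service call (assumption for this sketch).
    return ["Title C"]


def build_home_page(user_id, injector):
    """Assemble the page, omitting any section whose service fails."""
    services = {
        "recently_watched": fetch_recently_watched,
        "recommendations": fetch_recommendations,
    }
    page = {}
    for name, fetch in services.items():
        try:
            page[name] = injector.call(name, fetch, user_id)
        except InjectedFault:
            # Graceful degradation: leave the section empty, keep the page.
            page[name] = None
    return page


if __name__ == "__main__":
    experiment = FaultInjector({"recently_watched"})
    print(build_home_page("user-1", experiment))
```

In a real system the handler would need to catch whatever errors and timeouts the actual dependency can raise, not just the injected exception. The point of the sketch is only that the failure is scoped to one service and the rest of the page still renders.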

A main takeaway at Netflix involved the business cases, such as understanding what the customer saw and what “the right behavior for the system is,” he explained. “And then we can go fix things so that when things go wrong, customers don’t see it — and they’re able to do whatever they came to do.”

The core technology infrastructure behind Gremlin’s experiments mainly relies on its agent, while the “future of where we’re going” is helping “people to measure the reliability of their services and to assess the potential risks that happened.” Gremlin’s team will “even run those experiments for them and tell them whether their system behaved correctly, or give them the set of things that weren’t handled correctly, so they have a shortlist of things to go fix and improve,” said Andrus.
