
Use Chaos Engineering to Strengthen Your Incident Response

5 Nov 2021 7:00am, by

Mandi Walls
Mandi is a DevOps advocate at PagerDuty. She is a regular speaker at technical conferences and is the author of the O'Reilly Media white paper 'Building a DevOps Culture.' She is interested in the emergence of new tools and workflows to make the task of operating large complex computing systems more approachable.

Launching new software into production can be a tricky experience. While we strive to make our non-production environments as close to production as possible, there’s always a compromise or two that means the integration environment isn’t exactly like production.

Our testing procedures and other tools help reduce the potential for bugs and problems to make their way into customer-facing environments. We can also improve our team response to incidents in production, so that when something does happen, we’re ready to handle it.

We can add chaos engineering to our workflows to test assumptions about how our applications will behave in production, as well as practice our incident response processes. Julie Gunderson of Gremlin recently joined me on PagerDuty’s Twitch channel to talk about integrating Gremlin with PagerDuty, and how your team can benefit from regular chaos experiments. The recording is now on our YouTube channel for folks who couldn’t join us live.

Chaos engineering is a methodology for deliberately injecting error states into your systems and applications. The definition from the Principles of Chaos Engineering website is:

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

— principlesofchaos.org

Large-scale distributed systems present unique challenges related to concepts like scalability, reliability and dependency management that have only increased over time. Chaos engineering as a practice emerged from Netflix and has reached a point where it is accessible to many teams via commercial tools like Gremlin.

When a team is practicing chaos engineering, they are pushing boundaries on the system. Chaos tests might include slowing down requests on a backend dependency or on a central service like DNS. They might increase the network latency in general, or make components unavailable to give teams a view into what the system will do.
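To make the idea concrete, here is a minimal sketch of what a latency-injection experiment does, written as a Python decorator. This is a toy stand-in for what tools like Gremlin do at the network or host level, and the `lookup_profile` function and its behavior are hypothetical, purely for illustration:

```python
import random
import time

def chaos_latency(delay_seconds, probability=1.0):
    """Wrap a dependency call and inject artificial latency.

    A toy stand-in for a real chaos tool: with the given probability,
    the wrapped call is delayed so the team can observe how the rest
    of the system behaves when this dependency slows down.
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # simulate a slow backend
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical backend dependency used only for this example.
@chaos_latency(delay_seconds=0.05, probability=0.5)
def lookup_profile(user_id):
    return {"user_id": user_id, "name": "example"}
```

The value of the experiment is not the delay itself but what you learn from it: whether timeouts fire, whether callers retry sensibly and whether users see a degraded page or an error.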

If a backend dependency of your application becomes unstable or completely unavailable, what should your application do? Your team can use the outcomes of chaos experiments to determine whether a feature should be turned off in certain incidents, or if improvements can be made to increase its resiliency. Knowing how the application will behave gives you a place to start when making these decisions.
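One common outcome of such an experiment is adding a graceful-degradation path. The sketch below shows one way to do that in Python, using a timeout and a fallback so a slow or missing backend turns off a feature instead of failing the request; the recommendation service and its URL are assumptions for the example:

```python
import urllib.error
import urllib.request

# Hypothetical internal recommendation backend; the URL is an assumption.
RECOMMENDER_URL = "http://recommender.internal/suggestions"

def get_suggestions(timeout=0.5):
    """Call the backend, but degrade gracefully when it misbehaves.

    If the dependency is slow or unavailable, serve an empty result
    (effectively turning the feature off for this request) rather
    than surfacing an error to the customer.
    """
    try:
        with urllib.request.urlopen(RECOMMENDER_URL, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError):
        return b"[]"  # feature off: empty suggestions, page still renders
```

Whether to fall back, retry or fail loudly is a product decision; the point is that a chaos experiment forces the team to make that decision deliberately instead of discovering the default behavior during a real incident.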

There are now a number of good resources for learning chaos engineering. The Principles of Chaos Engineering website is a great place to start, as is the Gremlin website. You can also take a look at Chaos Monkey, the Netflix project that started it all.

What Role Can Chaos Engineering Play in Incident Response?

One of the toughest things about being good at incident response is that to practice, you would need to have a lot of incidents.

No one wants to have a lot of incidents.

So teams introduce practice sessions, maybe calling them GameDays or Failure Fridays. The goal is to create some space in your workflow to practice what the team will do when something happens to the production infrastructure. It's also an opportunity to rehearse responses to the kinds of errors that you know are possible but hopefully will never happen, like the loss of an entire data center, or that are more mundane but unlikely to have come up in testing, like the loss of a disk in the datastore.

By adding chaos engineering to your incident response practices, you don’t have to wait until all the weird things actually happen to know what the systems will do and how you should respond. You can simulate these failures using chaos engineering tools.

Building a better response means lowering your mean time to resolution (MTTR), increasing the availability of your services and preserving your SLOs. Planning an exercise where teams inject chaos experiments into their applications and then respond to the resulting incident helps ensure that teams know how escalation policies will work in a real incident, how to respond to notifications, where to find information about the health of the system, and how to triage and resolve the problem. Teams also benefit from practicing the communications protocols your organization uses during an incident, including deferring to an incident commander or liaising with customer support teams.

If a chaos experiment reveals that an application is particularly vulnerable to certain kinds of problems — for example, slow backends, network latency to the data store and increased requests — future engineering efforts can be prioritized to shore up those weak points before they become troublesome.

Improve Your Alerts and Your Sleep

Combining chaos engineering with incident response workflows helps teams build confidence in their processes, so that when an actual incident happens, they will know what to do. These methods help reduce the uncertainty and complexity that come with large-scale distributed systems, shining a light on the dark places between the pieces of complicated, interconnected systems.

Chaos engineering also gives your team the opportunity to ensure that monitoring and alerting are working correctly and are providing effective information. During a chaos experiment, your team may find some blind spots where critical indicators aren’t covered by existing monitors. You might also find the opposite case, where a configured alert is a red herring and not really indicative of the problem you’re looking at.

Cleaning up alerts can also mean turning down the urgency of some alerts. As your team practices chaos engineering as a regular exercise and improvements are rolled out from the findings, you should revisit the thresholds for your alerts and make sure they are still effective and necessary. Improving tolerance in your applications for slow backends or other components should mean fewer alerts to your team when those components aren’t behaving as expected.

Don’t wake your team up in the middle of the night for incidents you could have known about in advance. Save that disruption for the unknown unknowns in your environment.

Learn More

Chaos engineering has come a long way since Chaos Monkey, and the resources mentioned above, from the Principles of Chaos Engineering site to Gremlin's own materials, can help you get started.

PagerDuty has a few resources to help you get started as well. Improve your reliability and practice your incident response with your teams.