How and Why to Respond to Incidents the Right Way
No matter how well run an application is, eventually it will have problems. The causes vary: code or infrastructure changes, bugs in your application, weird input from users, backhoe incidents, even cosmic rays. Something will go wrong at some point.
What differentiates skilled operators from unskilled ones is how they identify and respond to issues. In this article, I’ll discuss what incidents are, why responding to them matters, how an incident response platform can help and what should be done after an incident occurs.
What Is an Incident?
A common misunderstanding of “incident response” is that it’s really about “outage response,” that is, dealing with times when your application becomes unavailable. While an outage is always an incident, not every incident is an outage. High latencies, unexpected drops or increases in traffic, even a higher volume of customer support tickets could all be incidents, possibly at different severity levels.
A good working definition of “incident” is any situation where performance or availability of your application, or a key business metric, is significantly different from what you’d expect.
This includes outages — obviously, you expect the application to be up — but also the other examples mentioned above.
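To make “significantly different from what you’d expect” concrete, here is a minimal sketch that flags a metric when it strays too far from its recent baseline. It is one possible approach, not a prescribed method: the function name, the sample latency values and the threshold of three standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_significant_deviation(history, current, threshold=3.0):
    """Return True when `current` is more than `threshold` standard
    deviations away from the mean of `history`, in either direction."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unexpected.
        return current != baseline
    return abs(current - baseline) / spread > threshold


# Hypothetical p95 latency samples, in milliseconds.
latency_history = [120, 118, 125, 122, 119, 121, 123, 120]

print(is_significant_deviation(latency_history, 124))  # False: normal variation
print(is_significant_deviation(latency_history, 480))  # True: worth investigating
```

Because the check uses the absolute deviation, it catches unexpected drops as well as spikes, which matches the definition above: a sudden fall in traffic can be an incident just as a latency surge can.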
What distinguishes an incident from an ordinary problem or unexpected behavior is that an incident demands a response. This is a key tenet of modern operations practices: Incidents should occur as infrequently as possible and be dealt with quickly to minimize operational impact.
Why and How Do We Respond to Incidents?
Responding to incidents does two things: It ends the incident as quickly as possible, and it (ideally) prevents the same incident from happening again. A response generally follows these steps:
- Triage: How severe is the incident, and who can fix it? The more severe it is, the more urgently the right person/team must be notified that the incident has occurred.
- Resolve: Once the right people have been notified, they fix whatever is causing the incident to restore normal operation.
- Review: A key step in responding to incidents is reviewing them after they’ve been resolved. Why did the incident occur (a root-cause analysis)? How could it have been avoided? How can we modify our application, infrastructure, alerting or process to make sure that this particular incident can’t happen again?
These steps work together to create a virtuous cycle of fewer and less impactful incidents over time.
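The three steps above can be sketched as a small state model in which an incident has to pass through triage, resolution and review in order. This is only an illustration under assumed names: the `Incident` class, the `Stage` values and the numeric severity scale are hypothetical and do not come from any particular platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    OPEN = "open"
    TRIAGED = "triaged"
    RESOLVED = "resolved"
    REVIEWED = "reviewed"


# Each stage may only advance to the next one; no step can be skipped.
NEXT_STAGE = {
    Stage.OPEN: Stage.TRIAGED,
    Stage.TRIAGED: Stage.RESOLVED,
    Stage.RESOLVED: Stage.REVIEWED,
}


@dataclass
class Incident:
    title: str
    severity: Optional[int] = None  # assumed scale: 1 is the most severe
    owner: Optional[str] = None
    stage: Stage = Stage.OPEN
    action_items: List[str] = field(default_factory=list)

    def _advance(self, expected: Stage) -> None:
        if self.stage is not expected:
            raise ValueError(
                f"incident is '{self.stage.value}', expected '{expected.value}'"
            )
        self.stage = NEXT_STAGE[expected]

    def triage(self, severity: int, owner: str) -> None:
        """Record how severe the incident is and who can fix it."""
        self._advance(Stage.OPEN)
        self.severity = severity
        self.owner = owner

    def resolve(self) -> None:
        """Mark normal operation as restored."""
        self._advance(Stage.TRIAGED)

    def review(self, action_items: List[str]) -> None:
        """Record the follow-up changes agreed on in the review."""
        self._advance(Stage.RESOLVED)
        self.action_items = list(action_items)


incident = Incident("Checkout latency above expected range")
incident.triage(severity=2, owner="payments-team")
incident.resolve()
incident.review(["Add alert on connection pool saturation"])
print(incident.stage.value)  # reviewed
```

Modeling the review as a required final stage means a resolved incident is visibly unfinished until someone has recorded what will change as a result.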
What Features Should an Incident Response Platform Have?
To make resolving incidents as painless as possible, many vendors have released incident response platforms. Here are some key features that such a platform should have:
Integrations
Your incident response platform needs to be integrated with your observability and monitoring infrastructure, as well as your chat/collaboration tools, ticketing system and other business applications. A robust library of integrations is a vital feature for a good incident response system.
Alert Grouping
Alerts that the incident response system deals with will likely come from many sources, sometimes at the same time (such as an AWS CloudWatch alarm and a Splunk Observability Cloud alert). A reliable tool will group these alerts together when they are about the same infrastructure or application, reducing the number of incidents and providing context to the person responding to them.
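As a rough sketch of what grouping might look like, the example below merges alerts that refer to the same resource and arrive close together in time. Real platforms use their own, usually more sophisticated, correlation rules; the `Alert` fields, the resource names and the five-minute window here are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # e.g. "cloudwatch" or "splunk"
    resource: str     # hypothetical identifier for the affected service
    timestamp: float  # seconds since the epoch
    message: str


def group_alerts(alerts, window_seconds=300):
    """Group alerts about the same resource when each one arrives within
    `window_seconds` of the previous alert in its group."""
    groups = []
    latest = {}  # resource -> most recent group for that resource
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest.get(alert.resource)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest[alert.resource] = group
    return groups


alerts = [
    Alert("cloudwatch", "checkout-api", 0, "5xx rate high"),
    Alert("splunk", "checkout-api", 60, "latency above threshold"),
    Alert("cloudwatch", "search-api", 90, "CPU high"),
    Alert("splunk", "checkout-api", 1000, "latency above threshold"),
]

for group in group_alerts(alerts):
    print(group[0].resource, [a.source for a in group])
```

Here the first two `checkout-api` alerts become a single incident with both sources attached as context, while the later `checkout-api` alert falls outside the window and starts a new one.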
Reliable, Flexible Notifications
An incident can’t be resolved if the people who can fix it don’t know about it. An incident response tool must have notifications that can be delivered via multiple methods, to multiple people, all at the same time. These notifications should be configurable in a flexible way so that you can decide what’s appropriate for different severity levels of an incident or for different teams.
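One way to picture flexible notifications is as a policy table keyed by team and severity, where each entry lists who is notified and over which channels. The sketch below is hypothetical: the team names, severity labels, channel names and the `plan_notifications` function are invented for illustration and do not reflect any vendor’s configuration format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    channels: Tuple[str, ...]
    recipients: Tuple[str, ...]


# Hypothetical policy: more severe incidents reach more people, more loudly.
POLICY = {
    ("payments", "critical"): Rule(
        channels=("phone", "sms", "chat"),
        recipients=("payments-oncall", "engineering-manager"),
    ),
    ("payments", "minor"): Rule(
        channels=("chat",),
        recipients=("payments-oncall",),
    ),
}

# Fallback so that no incident goes unannounced.
DEFAULT_RULE = Rule(channels=("email",), recipients=("ops-team",))


def plan_notifications(team, severity):
    """Return every (recipient, channel) pair to notify for an incident."""
    rule = POLICY.get((team, severity), DEFAULT_RULE)
    return [
        (recipient, channel)
        for recipient in rule.recipients
        for channel in rule.channels
    ]


for recipient, channel in plan_notifications("payments", "critical"):
    print(f"notify {recipient} via {channel}")
```

Keeping the policy as data rather than code makes it easy for each team to adjust its own rules, and the default rule guards against an incident that matches nothing.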
Low-Cost Access for Stakeholders
Many stakeholders need to follow incidents without directly responding to them. Many incident response tools charge per user, which makes giving all of these people access expensive. You’ll want to choose a tool that offers free or significantly reduced-price read-only accounts, so that stakeholders can check the state of incidents without being able to respond to them.
What About After the Incident?
Reviewing incidents and making changes as a result of these incident reviews is the secret to reliably and successfully operating modern applications. After an incident is resolved, an incident review (or “post mortem”) is the next step. In this process, a blame-free analysis of what happened with the application, infrastructure and people is performed. The goal is to determine exactly what happened, why it happened and how systems or processes can be changed so that the incident never happens again. A full description of how to conduct an incident review is beyond the scope of this article, but the key takeaway is that these reviews must be performed.
If you don’t make changes after each incident, the incidents will continue to occur, leading to dissatisfied employees, unhappy customers, less reliable applications and ultimately worse outcomes for the business. A robust incident response tool and the right mindset to eliminate incidents are two key factors in operating a well-designed application. Using these tools and performing post-incident reviews will help your application keep operating reliably for years to come.