Better Incident Management Requires More than Just Data

22 Sep 2021 10:00am, by
Cole Potrocky
Cole Potrocky is the co-founder and CTO of Kintaba and was a founding engineer on the Facebook Workplace team.

It’s 1900, and the cobra population is out of control in Delhi. The British Government, outsiders by any definition, brainstorm ways to deal with the issue. They stumble upon an obvious, if macabre, answer: pay people for each cobra head they bring to the crown. Everything goes swimmingly for a while: the cobra population is going down and people are getting paid well for facilitating the decline, perhaps too well. People start breeding cobras to collect a higher bounty. The cobra population explodes. The problem is now worse than ever.

The British Government got caught in a perverse incentive: they started rewarding people for making the problem worse. Perverse incentives are everywhere today, and they arise when a problem is poorly understood. One must ask: how do incentives distort the problem I’m trying to solve?

“There is a quality even meaner than outright ugliness or disorder, and this meaner quality is the dishonest mask of pretended order, achieved by ignoring or suppressing the real order that is struggling to exist and to be served.”

— Jane Jacobs, The Death and Life of Great American Cities

To the uninitiated, all complexity looks like chaos. Real order requires understanding. Real understanding requires context. I’ve seen teams all over the tech world abuse data and metrics because they don’t relate them to their larger context: what are we trying to solve, and how might we be fooling ourselves to reinforce our own biases?

Nowhere is this more true than in the world of incident management. Things go wrong in businesses, large and small, every single day. Those failures often go unreported, as most people see failure through the lens of blame, and no one wants to admit they made a mistake.

Because of that, site reliability engineering (SRE) teams establishing their own incident management process often invest in the wrong initial metrics. Many teams are overly concerned with reducing MTTR: mean time to resolution. Like the British government, those teams rely too heavily on their metrics and fail to consider the larger context. Incidents are almost always going to be underreported initially: people don’t want to admit things are going wrong. If people are judged on their ability to close incidents quickly, they’ll close incidents too early, or declare them too late.

Companies just adopting an incident response strategy should focus on metrics that help normalize failure as a regular component of doing business. Incident count is one of those metrics: paradoxically, you should expect your company’s number of incidents to increase as you begin to embrace a culture of failure and learning.
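
As a rough illustration of the two metrics discussed above, the sketch below counts incidents per month and computes MTTR from a handful of incident records, then shows how closing a single incident early improves the average without improving anything else. The `Incident` record shape, the function names, and the sample data are all assumptions made up for this example; they do not come from any particular incident management tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, assumed for illustration only.
    opened: datetime
    resolved: datetime


def incidents_per_month(incidents):
    """Count declared incidents by (year, month) of their open time."""
    return Counter((i.opened.year, i.opened.month) for i in incidents)


def mean_time_to_resolution(incidents):
    """Average of (resolved - opened); None when there are no incidents."""
    if not incidents:
        return None
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: durations of 1h, 3h and 2h.
incidents = [
    Incident(datetime(2021, 8, 3, 9, 0), datetime(2021, 8, 3, 10, 0)),
    Incident(datetime(2021, 9, 1, 14, 0), datetime(2021, 9, 1, 17, 0)),
    Incident(datetime(2021, 9, 20, 8, 0), datetime(2021, 9, 20, 10, 0)),
]

counts = incidents_per_month(incidents)
mttr = mean_time_to_resolution(incidents)

# Same incidents, but the 3h one is marked resolved after 30 minutes.
gamed = list(incidents)
gamed[1] = Incident(datetime(2021, 9, 1, 14, 0), datetime(2021, 9, 1, 14, 30))
gamed_mttr = mean_time_to_resolution(gamed)

print(counts[(2021, 9)])  # 2
print(mttr)               # 2:00:00
print(gamed_mttr)         # 1:10:00
```

The second MTTR looks better even though the underlying response did not change, which is the kind of distortion a metric can pick up once people are judged on it.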

Three Ways to Actually Improve Incident Response

  • Embrace failure. Early on you need to normalize failure, so watching for an increasing incident count is important. Once you have a track record of actively recording incidents, you can consider measuring MTTR because you’ll have a proper baseline. And you’ll have created a culture where, if a major incident requires a lot of time to reach a resolution, you won’t be so influenced by measuring MTTR that you take a counterproductive action like closing out the incident early.
  • Work to understand the context. No metric works without an understanding of the larger context. The C-suite should practice scuttlebutt and watch incidents take place (or even participate) to understand the pain points of day-to-day incidents, and to understand how metrics rarely show the full picture. As an executive, if your involvement is limited to wanting a report each month that shows metrics like MTTR decreasing, you’ll never actually create a resilient culture.
  • Don’t suffocate the process. When your product isn’t working, it can be tempting to push your responders to fix things more quickly. Build trust with your teams, and acknowledge that they’re composed of human beings who need time and space to solve difficult issues. Micromanaging or over-optimizing during active incidents just increases stress and, paradoxically, reduces your teams’ ability to respond effectively.

No combination of metrics alone can determine your company’s effectiveness at incident response. Data is just the starting point: it informs a hypothesis that company leaders need to confirm by using their eyes and ears. All of this together will help you build a realistic view of your company’s incident response. Remember: your goal as a company isn’t to reduce mean time to resolution; it’s to learn from failure and build a more resilient organization.

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Real.
