For too long in the tech industry, incident management and incident response have been siloed within the site reliability engineering (SRE) team. At first glance, this can seem totally reasonable — their job is to mitigate failure and improve the health of technical systems.
But let’s think about what an incident actually is. A “SEV1” production outage is any event serious enough that you stop whatever you’re doing to focus on it. A classic example of a SEV1 incident in everyday life would be a fire, and a lot of the learnings in this space actually come from fire departments (more on this later). These events need dedicated tooling and processes that span the whole company; they can’t live within a task management system like Jira if your company really wants to get good at incident response.
So why do we keep the authority to declare an incident behind lock and key, within the SRE team? The fact is that when there is a fire anyone should be able to pull the alarm. It’s a huge myth within the tech industry that the majority of incidents are caught within our technical systems and SRE tooling, e.g. monitoring, alerting, AIOps, etc. The truth is that if your technical systems can detect the problem, they very likely should be able to auto-remediate the problem as well. SEV1 incidents, by definition, are unique, black swan events that require human attention and coordination.
According to Atlassian’s The State of Incident Management Report, the lack of coordination across departments is the biggest pain point for organizations when it comes to managing incidents. We have to stop thinking about incident resolution as an SRE-only practice. PR can have an incident; legal can have an incident. Very often, solving these problems requires cross-team transparency and cooperation. It’s also worth noting that even the incidents that do start from an SRE declaration often still require multiple other people and teams to be involved.
Did that outage impact a high-priority customer? Did the marketing website go down? Expecting SREs to not only handle these failures but also coordinate the people process of who needs to be involved is asking too much of them. There should be tooling in place that automates the human response to emergencies, which is what we are building at Kintaba.
So why exactly are most incident response tools focused on SREs?
This is a bit of a complicated question. One of the biggest reasons is that a lot of the best practices within SRE come from Google, which has a strong engineering culture and is a leader in everything from cloud infrastructure to distributed tracing. But the interesting caveat here is that if you actually read their chapters on incident response, they write much more about coordination and people process (e.g. on-call rotations, incident commanders, blameless culture, etc.) than they do about alerts or AI.
But it’s not terribly surprising that the rise of the SRE persona brought with it a bunch of incident response tooling focused on this demographic. These are engineers who are knee-deep in thinking about failure every day, and who have dashboards and alerts to keep them informed about the health of the online systems. In some sense, it feels quite natural for incident response to just be an extension of this machinery.
What’s often overlooked is that even at Google there are high-profile efforts to extend the SRE practice beyond the engineering team, most notably a practice they call CRE: Customer Reliability Engineering. With CRE, Google goes so far as to share pagers with participating customers, allowing the customer, not the engineering team, to define when an incident is occurring, what its severity level is, and, most importantly, when it’s closed.
In fact, if we step outside of the tech industry for a moment and look at industries that birthed many of our incident response practices in the first place — aviation and fire departments — we realize that these very practices were born as cross-functional operations that directly involve not only all parts of the responding entity, but the customers as well. These are folks dealing with failure at the highest stakes. Lives are on the line when things go wrong in these industries. By the time the fire department is called, all other precautions that were put in place have failed and human intervention is absolutely critical.
What the Airlines Can Teach Us
Since the 1950s, the airline industry has defined and shaped best practices for incident management. In fact, what’s meant by “modern incident management” is the predictable and repeatable approach to handling unexpected disasters. As an industry we have taken a lot of learnings from aviation, whether we realize it as technical leaders or not.
One of the most important findings to come out of aviation is that more reported incidents lead to fewer catastrophes. This may seem counterintuitive, and it is a great example of perverse incentives. It’s certainly natural to want the number of incidents to go down, but by trying to push that number down, you may end up closing out incidents before they are actually resolved or, even worse, not declaring them in the first place.
To paraphrase Goodhart’s law, named for economist Charles Goodhart: once a metric becomes a target, it stops being a good measure. So if the incentives are aligned to reduce incidents, we’ll see fewer recorded incidents that follow best practices and more “hidden” incidents. In the worst case, the lack of process means the organization doesn’t learn and the incident repeats in the future, only this time with more dire outcomes.
When we talk about having a positive incident culture, we mean embracing the learnings of aviation to bring more people into the process and lower the barrier to declaration. Instead of being treated as something to avoid, incidents become events to be embraced and learned from. This shift in mindset means that filing more incidents is actually incentivized, and these metrics are never used as a justification for punitive action. This approach has resulted in the safest decade of commercial aviation on record, with over 12 billion passengers traveling and no fatal accidents.
Next Step: Customer Success
Ultimately, I think modern incident management will need to become a company-wide practice within the next five years. Practically, I think the most obvious next step outside of SRE is the customer success team.
Customer success has been a rapidly growing discipline within the tech industry since the mid-2000s. SaaS providers spend loads of money and effort on customer acquisition, but the complexity of their offerings can cause those same customers to eventually churn. In response, many companies began actively targeting at-risk accounts with “dive and catch” teams designed to increase retention by helping customers derive more value from their products. According to McKinsey, in its report titled Introducing customer success 2.0: The new growth engine…
Many businesses created formal customer-success functions to take a more proactive approach to churn reduction. These efforts helped transform customer success into an emerging discipline in the software industry, complete with new tools and methodologies. Companies also created additional roles to support this function, most notably that of customer-success manager (CSM). According to one McKinsey study, the SaaS vendors with top quartile revenues achieved their strong showing by investing more in customer-success initiatives aimed at churn reduction.
The rise of the Customer Success Manager (CSM) is a surprise to no one, especially within the tech startup universe. Losing even a single customer can be devastating to a startup, so business leaders know that an unhappy customer is a big problem. It is also a much stronger indication of a SEV1 than an alert from a monitoring dashboard. As I stated earlier, the truth is that the majority of incidents are customer-centric and are often triggered by the customers themselves, not by our technical systems.
From talking with customer success teams at various organizations, we found that the tooling to respond to these incidents is still quite poor and disparate. CSMs often have to field the customer complaint via email, phone, or Slack, and then escalate it to another team that uses a completely separate tool like Zendesk, where the CSM frequently doesn’t even have a seat. The authority to declare the incident still remains solely within the SRE team, so both the customer and the corresponding CSM struggle to stay up to date with progress on the incident resolution.
The industry needs to shift incident response left, closer to the customer. Why not let the customers declare incidents for themselves? Why should various teams have various tools for coordinating and mitigating the same incident?
We need to move to a place where incident management is considered much broader than SRE tooling. Unfortunately, most of the effort seems to be going towards bringing technical systems closer to incident response. A good example is the swath of observability players that have recently announced incident response solutions. In my opinion, this is the wrong approach. Modern incident management needs to break out of the silo of SRE to be truly effective.
If you’re interested in modern incident response, check out the recorded talks from IRConf — the first-ever conference dedicated to incident response.
Image by Renee Gaudet from Pixabay.