The Benefits of Declaring More Low-Severity Incidents
At many companies, incident is a four-letter word. An incident is an emergency; everyone must drop everything and extinguish the fire. As a result, engineers are only willing to declare an incident once they absolutely have to. Nobody wants to cause a stir.
But what if we took the stigma out of declaring an incident? In reality, “incident” simply means something happened. How your team reacts after that declaration depends on your incident management philosophy and processes.
To put this to the test, in August 2022, we at FireHydrant publicly stated that we would encourage our team to declare more low-severity incidents. We did this with the dual intention of strengthening our incident management practices and getting to know the ins and outs of our product better.
Running More Low-Severity Incidents Is Improving Our Culture
The results were transformative for our philosophy and confidence around incidents. We’ve found that normalizing declaring low-severity incidents increases psychological safety, builds important response muscles and helps surface problems before they become major.
In this blog post, I’ll talk about what we did, how it has changed us so far and what’s next.
Lowering the Barrier to Incident Declaration
We found that opening ourselves to more low-severity incidents was more of a cultural shift than a process change. We took a two-pronged approach to encourage engineers to feel more confident. First, we expanded the scope of what we considered an incident, and second, we created a new severity type to classify them.
This started with a new team philosophy: if something looks weird, just call it an incident, no matter how minor the situation. To facilitate this, we created a lightweight new severity type called investigation with the simplest possible runbook condition: create a Slack channel to capture stream-of-consciousness notes on the subject and monitor what happens next. If it became clear there was a more significant issue, it was easy to escalate because the information was already documented. If severity didn’t evolve, we still had context on the health of our systems for the future.
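For readers who think in code, the workflow above can be sketched roughly as follows. This is only an illustration, not how FireHydrant implements it: apart from the investigation severity named above, every name here (Severity, Incident, add_note, escalate, and the LOW and HIGH levels) is a hypothetical stand-in, and the notes list stands in for the Slack channel.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    # Only "investigation" comes from the post; LOW and HIGH are assumed levels.
    INVESTIGATION = 0
    LOW = 1
    HIGH = 2


@dataclass
class Incident:
    title: str
    # New incidents start at the lightest severity by default.
    severity: Severity = Severity.INVESTIGATION
    # Stand-in for the Slack channel: timestamped stream-of-consciousness notes.
    notes: list[tuple[datetime, str]] = field(default_factory=list)

    def add_note(self, text: str) -> None:
        """Record an observation with a UTC timestamp."""
        self.notes.append((datetime.now(timezone.utc), text))

    def escalate(self, new_severity: Severity) -> None:
        """Raise the severity while keeping every note captured so far."""
        if new_severity <= self.severity:
            raise ValueError("escalation must raise the severity")
        self.add_note(f"Escalated from {self.severity.name} to {new_severity.name}")
        self.severity = new_severity
```

The point of the sketch is that escalation changes only the severity; the notes gathered during the investigation carry over untouched, which is what makes escalating cheap.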
Nine Months on, More Incidents and a Mindset Shift
Nine months later, we’ve declared dozens of incidents and learned a lot. Low-severity incidents currently make up 75% of our overall incidents. That’s 139 incidents whose learnings would have gone uncaptured, had we not instituted this goal. Here are some of the benefits we’ve noted:
Positive Effect on Team Mentality
We’ve seen a shift in the way our team defines an incident. At the beginning of our experiment, team members were resistant to calling an incident for something minor, but over time they got used to the idea.
As a result, big and small incidents are becoming less stressful and scary for the team. We practice our incident management skills with low-severity incidents to be more confident when something bigger comes along and to keep a continuous finger on the pulse of our systems.
Better Time Management
While it may seem counterintuitive to say that declaring more incidents has resulted in better efficiency, we have found that to be true. We realized teams were already doing the work of resolving these “incidents” daily. They just weren’t recognizing it as such.
But by finally calling these incidents “incidents,” we put a name to work that had previously been invisible. Naming it turned it from a distraction into a low-cognitive-load task that was easier to fit into people’s workloads.
Calling out these small tasks as incidents helped keep them small. And unlike a higher-severity incident, a low-severity one doesn’t distract other engineers from their work because they aren’t tagged in unless absolutely needed.
Problem Solving Earlier in the Process
Occasionally, we witness an incident that starts as a low-severity peculiarity but ultimately affects customers, in which case we bump up the severity level. Encouraging more low-sev incidents gives us a venue to surface these issues, saving us time and enhancing the customer experience by decreasing the risk of downtime.
Enhanced Knowledge Management
Previously, if an engineer encountered a small error or problem, they might fix it and move on with their day, telling nobody. What they did and how they did it was never recorded, so looking for trends or learning from the experience was impossible.
Now, if a low-severity incident is declared, there’s documentation. Even if it’s a problem that an engineer can handle solo, the context and the steps they took to resolve it are all recorded in Slack. They can add charts they looked at, alerts they saw, and a running history of everything they think contributed to the problem (even red herrings). The next time something similar occurs, it’s easy to go back and consult the records.
Documentation is critical to enhancing our incident management process because it ensures that information is freely available to those who need it and not siloed away in a few superstar engineers’ heads.
What Comes Next?
We’ve experienced many benefits from putting this philosophy into action. One of those benefits is insight into where we still need to improve. Here’s what we’re focusing on next when it comes to incident declaration at FireHydrant.
We still see that everyone feels like they need to join in when they see a notification of a new “incident.” These team-wide notifications are a distraction, and our goal going forward is to ensure new incident notifications only go to people who need to know. We’ll lean on a new internal status page to keep others up to date and focus on being more explicit about who needs to be involved during an incident.
Role-Focused Response Teams
To help remove that expectation that everyone needs to participate in an incident, we aim to bring people in more explicitly, detailing the role that’s expected of them. By doing this, we can be more direct in reminding observers that there’s no expectation for them to participate, and use our on-call rotation to decide who to hand incidents off to.
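As a rough sketch of that idea, the routing rule could look something like the code below. It assumes a hypothetical model in which each responder is attached to an incident with an explicit role, the on-call engineer is the fallback when nobody has been assigned, and everyone else is treated as an observer who gets no notification. The names Assignment and route_notifications are invented for illustration and do not describe FireHydrant’s product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    person: str
    role: str  # e.g. "commander" or "scribe"; role names are assumptions


def route_notifications(assignments, on_call, everyone):
    """Split the team into people to notify and silent observers.

    Only people with an explicit role are notified. If nobody has been
    assigned yet, the incident is handed to the on-call engineer.
    """
    notify = []
    for a in assignments:
        if a.person not in notify:  # avoid duplicate notifications
            notify.append(a.person)
    if not notify:
        notify.append(on_call)
    # Observers receive no notification; they can follow a status page instead.
    observers = [p for p in everyone if p not in notify]
    return notify, observers
```

With a rule like this, the default for anyone without a named role is silence, which matches the goal of reminding observers that they are not expected to participate.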
We will also continue to revisit our runbooks regularly to strike the right balance between an unstructured free-for-all and an overly prescriptive process. Neither extreme is useful; we want to encourage creative problem-solving within the bounds of structure. One way we’re doing that is by using our platform’s retro commenting features. This allows us to always have a retro without bogging people down with meetings.
Changing Attitudes Toward Incidents Comes From Above
Many small things, done well and consistently over time, lead to the most positive incident management culture. It’s up to leaders to give their teams the space to learn, and to fail, in a low-stakes environment.
When pursuing this goal, the technical change is easy; the hard part is changing employees’ philosophy toward incidents. You can’t force people to put themselves out there; you must show them the way. Demonstrate by example, celebrate those who take on the challenge and showcase the benefits.
The truth is every incident, no matter how small, matters to someone. Leaning into these low-severity incidents helps us grasp the small details about our product and incident management process. It also gives us a deeper understanding of how our customers use our services.
Learn more about maturing your incident management process with our new ebook: How to improve your incident management program in 2023.