
You’re Wasting Your Time with Incident Management

What keeps your systems running? People do. No matter how sophisticated your technology, it will only succeed with a social system designed to support it effectively.
Jan 9th, 2023 10:00am by

There are a lot of incident management tools coming to market. Some have red knobs, some have blue knobs, and they do many things: track incidents, count incidents, classify, categorize and report on incidents. Most provide beautiful graphs to show you the number of incidents that occurred last week compared to last month. They each have bells and whistles that look different, but when you listen, they sound the same.

This chorus of existing tooling focuses on reacting after something breaks and collecting and reporting questionably accurate data on details like services affected and how long the event lasted. So take your pick of the many options; your ROI will be roughly the same. The potential of these tools is limited by their perspective — not because of any features they may or may not have. 

This is not to say being reactive isn’t helpful; we want to respond when things go sideways. However, in our experience, these reactive incident tools often feel more like a chore than a partner in helping you improve outcomes for your organization. Are these tools just helping you find and fix issues, or are they helping you learn, adapt and continuously improve at handling surprises you could not have seen coming?

For a category of tools marketed to help keep your systems running, they all miss an important point: What keeps your systems running? People do. 

No matter how sophisticated your technology, it will only succeed with a social system designed to support it effectively.

Technical systems are becoming more complex, and the pace of change for organizations is only increasing. People in your organization take on the burden of coping with this complexity and change. Maintaining success depends on understanding and improving the environment and the pressures on your people.

Jeli is built from this perspective. Our tooling helps you develop a practice of incident analysis to better understand how work is done and what factors contribute to your organization’s failure or success. The rich insight you gain through incident analysis will help you proactively identify brittleness in your organization and address it before an incident occurs.

We also make it easy — automatically surfacing these organizational insights out of the box in what we call our Socio-technical Learning Center.

Instead of focusing on traditional metrics such as reducing incident count and Mean Time to Recovery (MTTR), we should be thinking about how incident management tooling helps us better understand our people, their challenges and constraints. Your incident management tooling should be able to answer questions such as:

  • Where are people overtaxed and at risk of burnout that could contribute to system failure?
  • Where have you had heavy turnover, leaving critical systems supported by engineers with minimal tenure?
  • What services was the engineer who just left the organization actually supporting versus what you thought they were supporting?
  • What skills do you need to maintain your services? Where in your organization are you losing these skills?
  • Who is frequently being called into incidents when not on call?
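
To make the last question concrete, here is a minimal Python sketch of how it might be answered from incident records. It is an illustration only and does not describe how Jeli or any other tool works. The `Incident` record, its field names, the `off_call_responders` helper, the `min_count` threshold and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    incident_id: str
    responders: tuple[str, ...]  # everyone who joined the response
    on_call: frozenset[str]      # who was scheduled on call at the time


def off_call_responders(incidents, min_count=2):
    """Count how often each person joined an incident while not on call.

    Returns (person, count) pairs for people at or above min_count,
    sorted by count (descending), then by name.
    """
    counts = Counter()
    for incident in incidents:
        # set() guards against a person being listed twice on one incident
        for person in set(incident.responders) - incident.on_call:
            counts[person] += 1
    flagged = [(p, n) for p, n in counts.items() if n >= min_count]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Made-up sample data, for illustration only
incidents = [
    Incident("INC-1", ("ana", "raj"), frozenset({"ana"})),
    Incident("INC-2", ("raj", "lee"), frozenset({"lee"})),
    Incident("INC-3", ("ana", "raj", "kim"), frozenset({"ana"})),
]
print(off_call_responders(incidents))  # [('raj', 3)]
```

In this made-up data, one person joins every incident without ever being on call. A raw incident count would not surface that pattern, which is the kind of signal the questions above are after.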

The success of your next launch is about more than just your applications and infrastructure. It’s about bringing together the right skills, how your teams work together and how engineers balance quality with deadlines. It’s about sleep, hours worked and stress. On the flip side, your next incident will be about these factors as well.

Does your incident management tooling help you understand all of these contributing factors, or are you stuck at an oversimplified root cause?
