Move Away from Manual with Automated Incident Response
As companies ramp up their digitization efforts, a growing volume of incidents is putting extra pressure on teams and putting team health at risk through burnout and attrition. In fact, we saw a 19% growth in critical incidents from 2019 to 2020 on our platform, and initial 2021 data shows that this number has only gone up over the past year.
In addition to the increase in incidents, the timing of these incidents resulted in disruptions to both personal time and focused work time. According to platform data, teams experienced:
- 9% more off-hour interruptions.
- 7% more holiday/weekend interruptions.
- 5% more business-hour interruptions.
This additional strain is even more apparent when you look at our users’ working hours. Based on our data, we saw that users worked an additional two hours per day in 2020, which adds up to nearly 12 extra weeks per year! This is unsustainable and bad for team and responder health in both the short term and the long term.
And there’s a toll here that extends beyond the individual. This extra work only increased attrition in an era commonly referred to as the Great Resignation. As teams lose people and look to hire for these open positions, their workloads only grow. Something must change, and IT leaders are looking to a few strategic initiatives to help.
We asked 700 IT decision-makers about their priorities for 2021 and 73% of technical leaders reported that they’re investing in AIOps and automation to help boost their operational processes and remove the burden of manual work. This is because nine out of 10 respondents said that traditional ITOps approaches are no longer able to keep up with today’s pace and complexity.
To relieve some of the pressure and adapt to this new era, 72% of leaders are ramping up digital transformation efforts and 74% are using DevOps to drive better alignment. The overall purpose of these investments is better, faster incident response with less toil for responders. Another way to think of this is that these aspirations set the stage for the category of automated incident response (AIR).
What Is Automated Incident Response?
Gartner introduced the term automated incident response last year as an evolution of the long-standing definition of incident response. This change reflects the growing need for teams to adopt automation and stems from the increasing complexity in our technology environments.
According to Gartner, “Automated incident response (AIR) solutions automate incident response processes by enabling centralized alert or incident routing. Using a policy or rule-based engine, on-call scheduler, or streamlined collaboration, this can improve operational efficiencies with action-oriented insights.”
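To make the definition more concrete, here is a minimal Python sketch of centralized, rule-based alert routing combined with an on-call lookup. Every name in it (Alert, Rule, route, the rules, the teams and the schedule) is hypothetical and invented for illustration. It is not PagerDuty's API and not part of Gartner's definition, just one way the pieces could fit together.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # e.g. "critical", "warning", "info"
    summary: str


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Alert], bool]
    team: Optional[str]  # None means "suppress, do not notify anyone"


# Hypothetical on-call schedule: team name -> responder currently on call.
ON_CALL = {"payments": "alice", "platform": "bob"}

# Hypothetical routing policy. Rules are checked top to bottom; first match wins.
RULES = [
    Rule("drop-info-noise", lambda a: a.severity == "info", None),
    Rule(
        "payments-critical",
        lambda a: a.service == "checkout" and a.severity == "critical",
        "payments",
    ),
    Rule("catch-all", lambda a: True, "platform"),
]


def route(alert: Alert) -> Optional[str]:
    """Return the responder to notify, or None if the alert is suppressed."""
    for rule in RULES:
        if rule.matches(alert):
            if rule.team is None:
                return None
            return ON_CALL[rule.team]
    return None
```

In this sketch, a critical alert on the assumed "checkout" service goes to whoever is on call for payments, informational alerts never reach a human, and everything else falls through to the platform team.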
At PagerDuty, we’ve been thinking about how automation plays a role in incident response for a long time. In fact, our standard solution has met this definition since 2017. Our original on-call management capabilities provided scheduling. Our IT service alerting and escalation policies enabled centralized alert routing. And our stakeholder notifications and response plays streamlined collaboration across incident response teams. That left us asking the question, “What’s next?”
Expanding the Value of Automation Across More of the Incident Response Life Cycle
Adding automation into the incident response process should be done strategically, at the points where humans are taking on undue burden. We’ve found it most helpful to visualize where automation could ease this burden by asking the types of questions in the graphic below.
Some of these needs must be addressed by humans. After all, retrospectives can’t be completed by machines alone; you need humans there to actually do the learning. But other parts of this process don’t require humans as the first line of defense.
It’s apparent from our research that the traditional manual way of incident response is no longer enough to satisfy customers and is too toilsome and exhausting for responders to maintain. Humans are burning out addressing issues and completing tasks that machines could be resolving without intervention. The gap between the manual and automated processes is becoming more painful.
The manual way involves interrupting humans from whatever they’re doing, whether that’s walking the dog, sleeping, or focusing on the next key project, and asking them to find the root cause all by themselves. But having the right humans at the right time is no longer enough. Teams need to lean on automation to manage this increased pace and complexity.
The new automated way is about preventing humans from being the first line of defense. It introduces ways to leverage machines to shoulder some of the burden and help humans balance critical workloads. And it works in real time or on demand to address a multitude of use cases based on what each team is ready for. Sounds great, so how can teams actually get started?
How Do I Get from Manual to Automated?
The journey to automated incident response is one that can’t be completed overnight. When it comes to automation, it’s important to focus on reducing operational loads to get more done while at the same time increasing organizational speed and innovation. Teams can do this with a crawl-walk-run approach. The key is starting where your organization is today and having a plan for continued maturity.
At PagerDuty, we mark operational maturity across five stages ranging from manual to preventative. One of the most important parts of the journey is understanding where your team and organization currently are on this model. Then you can start by picking a specific area to focus on.
For organizations in the manual and reactive stages, you can identify and enable those in your organization with an affinity for automation. Leaning into automation can feel daunting, so encourage people to use the skills and languages they already have to keep it feeling familiar.
Teams in the early stages of operational maturity should favor action and focus on turning manual documented steps into automated steps. Once you’ve done this, you’ll have pockets of automation across your organization that make your subject matter experts more effective.
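As an illustration of turning documented steps into automated ones, the sketch below encodes a made-up three-step "disk almost full" runbook as Python functions and runs them in order. The runbook, the 90% threshold, the step names and the ctx dictionary are all assumptions for the example. A real version would call your own tooling where the placeholder comments are.

```python
from typing import Callable, Dict, List, Tuple

# A step has a human-readable name and a function that returns True on success.
Step = Tuple[str, Callable[[Dict], bool]]


def check_disk_usage(ctx: Dict) -> bool:
    # Documented step 1: "Check whether disk usage is at or above 90%."
    ctx["disk_full"] = ctx["disk_used_pct"] >= 90
    return True


def rotate_logs(ctx: Dict) -> bool:
    # Documented step 2: "If the disk is full, rotate and compress logs."
    # Placeholder effect: a real step would invoke your log rotation tooling.
    if ctx["disk_full"]:
        ctx["disk_used_pct"] -= 30
    return True


def verify_recovery(ctx: Dict) -> bool:
    # Documented step 3: "Confirm disk usage is back under 90%."
    return ctx["disk_used_pct"] < 90


DISK_RUNBOOK: List[Step] = [
    ("check disk usage", check_disk_usage),
    ("rotate logs", rotate_logs),
    ("verify recovery", verify_recovery),
]


def run_runbook(steps: List[Step], ctx: Dict) -> List[str]:
    """Run steps in order, stop at the first failure, and return a log."""
    log = []
    for name, step in steps:
        ok = step(ctx)
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log
```

Keeping each step as a small named function mirrors the written runbook, so the subject matter expert who wrote the document can review the automation line by line.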
When teams reach the responsive stage, the objective becomes to standardize the incident response process and enable self-service. Standardization helps you build automation that you can reuse across teams and services. Self-service is how you leverage automation for greater value by enabling others to do what previously only your subject matter experts could.
Standardization and self-service distribute the operational load, provide more effective use of resources and enable SMEs to get out from under toil and focus on what moves the business forward. Incidents will be resolved much more quickly because first responders have the tools they need.
In the proactive stage, automation is optimized for real-time work. This means running automation in response to incidents, creating auto-remediation capabilities and removing more of the real-time burden on the teams responding to critical work.
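One way to picture auto-remediation is a handler that tries a known fix first and only involves a human when no fix exists or the fix fails. The Python sketch below assumes a hypothetical incident dictionary, a hypothetical "worker_unresponsive" incident type and a placeholder restart_worker function. None of these come from a real product API.

```python
from typing import Callable, Dict

Remediation = Callable[[Dict], bool]


def restart_worker(incident: Dict) -> bool:
    # Placeholder: a real remediation would call your orchestration tooling
    # and report whether the service recovered.
    return incident.get("restart_succeeds", False)


# Hypothetical mapping from incident type to an automated fix.
REMEDIATIONS: Dict[str, Remediation] = {
    "worker_unresponsive": restart_worker,
}


def handle_incident(incident: Dict) -> str:
    """Try automation first; fall back to a human if it is missing or fails."""
    remediation = REMEDIATIONS.get(incident["type"])
    if remediation is None:
        return "paged_human"
    try:
        fixed = remediation(incident)
    except Exception:
        # A crashing remediation is treated the same as a failed one.
        fixed = False
    return "auto_resolved" if fixed else "paged_human"
```

The fallback path matters as much as the fix itself: automation becomes the first line of defense, while a responder is still reached whenever it cannot resolve the issue.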
People capacity is an organization’s most precious resource. The best way to protect team capacity is to resolve as much as you can without human intervention. It’s not about replacing humans; it’s about augmenting your humans with automation that keeps repetitive or noisy tasks away from them so they can focus on innovating.
When teams can effectively link automation with their incident response processes, they benefit both in terms of fewer total incidents and shorter response times. This ultimately means less firefighting for teams, less burnout, less attrition and more time spent innovating. Doesn’t this truly sound like a breath of fresh AIR?