Take The Human Out of 3 A.M. Incident Responses
Responding to incidents and deciding how to take the right actions when they occur represent one of the major pain points in the DevOps world today.
Not only do many unlucky site reliability engineers and operations team members face getting that 3 a.m. emergency wake-up call or page, but significant amounts of resources, time and effort often continue to be wasted when proper processes and incident response protocols are not in place.
Improving and, especially, automating event responses was one of the major themes of the recent PagerDuty Summit 2022 user conference. The venue also served as a springboard for announcing and discussing a major expansion of PagerDuty Operations Cloud capabilities.
Incident Workflows and Automation Actions
These updates help to ensure that modern work realities and the systems teams rely on are aligned, Sean Scott, chief product officer for PagerDuty, said. Operations Cloud’s new Incident Workflows capabilities help to solve the challenge of how to handle incident response quickly and repeatably, he said.
“Best practices and team learnings have already surfaced regarding what steps responders should follow in various cases, such as creating an incident-specific Slack channel or sending a stakeholder update. But how do teams ensure that those steps are followed in the heat of a high-severity incident?” Scott told The New Stack. “Some teams might have a wiki page with a checklist, but remembering to bookmark and then check the wiki becomes a point of failure in the process, especially when onboarding new team members. With Incident Workflows, teams can define the steps they want to be followed in different cases and what should trigger them.”
Oftentimes, an incident is classified with a certain priority or urgency. The process should leave no ambiguity about the action that needs to be taken. When such events occur and Incident Workflows is implemented, a checklist of steps is followed automatically, “the same way every time,” Scott said.
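The idea of a priority-triggered checklist can be illustrated with a minimal Python sketch. This is not PagerDuty's API or implementation; the priority labels, step functions and the `WORKFLOWS` table are all hypothetical names chosen for illustration, and the steps only record what they would do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    title: str
    priority: str  # hypothetical labels such as "P1" or "P2"
    log: List[str] = field(default_factory=list)


# A step is any callable that acts on the incident.
Step = Callable[[Incident], None]


def create_chat_channel(incident: Incident) -> None:
    # Stand-in for opening an incident-specific chat channel.
    incident.log.append(f"created channel for '{incident.title}'")


def send_stakeholder_update(incident: Incident) -> None:
    # Stand-in for notifying stakeholders.
    incident.log.append("sent stakeholder update")


# Hypothetical mapping from priority to the ordered steps to run.
WORKFLOWS: Dict[str, List[Step]] = {
    "P1": [create_chat_channel, send_stakeholder_update],
    "P2": [create_chat_channel],
}


def run_workflow(incident: Incident) -> List[str]:
    # Run every step registered for this priority, in order.
    for step in WORKFLOWS.get(incident.priority, []):
        step(incident)
    return incident.log
```

Because the steps live in one table rather than in a wiki checklist, every incident of a given priority goes through the same sequence, and changing the process means editing the table rather than retraining responders.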
The Automation Actions capability solves the challenge of who activates automation and how, Scott said. For example, running a simple network or database test is often part of troubleshooting a problem, but the permissions and know-how needed to run such a test often sit with a specialist team.
“During an incident or customer support interaction, escalating to a specialist team to run a routine test takes time and adds to the cost of the incident in terms of increased duration and more people working on the issue,” Scott said. “With Automation Actions, first responders and customer service agents can take action directly and run an automated diagnostic test, for example, to validate or troubleshoot an issue.”
Event Orchestration is also a major component of the Operations Cloud release. “Teams can even define a logical flow that triggers a step, like running a test or restarting a node, without any human intervention,” Scott said. “This accelerates incident resolution and reduces the number of teams and cost that are involved.”
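A rough sketch of what such a logical flow could look like follows. It is an illustration only, not PagerDuty's rule syntax: the event fields (`summary`, `severity`, `host`), the `RULES` list and the action functions are assumptions, and the actions return a description instead of touching any real system.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]


def restart_node(event: Event) -> str:
    # Stand-in for an automated remediation step.
    return f"restart requested for {event['host']}"


def run_diagnostic(event: Event) -> str:
    # Stand-in for an automated diagnostic test.
    return f"diagnostic started on {event['host']}"


# Hypothetical ordered rules: (condition, action). The first match wins.
RULES: List[Tuple[Callable[[Event], bool], Callable[[Event], str]]] = [
    (lambda e: e.get("summary") == "node unresponsive", restart_node),
    (lambda e: e.get("severity") == "critical", run_diagnostic),
]


def orchestrate(event: Event) -> str:
    # Evaluate rules in order and trigger the first matching action.
    for matches, action in RULES:
        if matches(event):
            return action(event)
    return "no automated action; route to a responder"
```

Only events that match no rule fall through to a human, which is the sense in which a step can run without intervention.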
PagerDuty said Incident Workflows streamlines the chain reactions that occur during an incident and makes the response process faster and more consistent. Its benefits include:
- Workflows that are easy to design and activate with no-code capabilities.
- Automated sequences for common incident actions.
- Customizable workflows that enable more consistent responses across the organization.
The benefits of Automation Actions, which orchestrates automated diagnostics and remediation steps, include:
- Immediate automated actions, triggered either by PagerDuty’s Event Intelligence or manually by responders, that run diagnostics to investigate status and gain context, or that directly initiate runbook automation to remediate an incident.
- Allowing customer service agents to validate customer issues by running automated actions directly from the PagerDuty application in the Salesforce Service Cloud, thus reducing resolution time and the number of incidents escalated to back-end teams.
PagerDuty’s Frank Emery on @PagerDuty’s platform’s incident response automation: “50% of knowledge reduction that happens is done with suppression, where people use event orchestration to target flows where they know events aren’t really adding value.” @thenewstack #PDSummit22
— BC Gain (@bcamerongain) June 10, 2022
As part of Event Orchestration, events and incidents that add little value can be automatically suppressed or downgraded so that more important incidents and tasks are given priority. Known as “suppression,” this capability should help to alleviate a major pain point that operations teams have struggled with.
“This is what event rules are built to do originally, and what Event Orchestration really cranks up to 11. And so I would say maybe out of 50% of all knowledge reduction that happens is actually done with suppression, where people are using Event Orchestration to target different sorts of flows where they know events aren’t really adding value,” Frank Emery, senior product manager for PagerDuty, said during a summit talk. “What’s interesting here is what you can do with the basic tier vendor orchestration … You can start to really weed out some of those noisy events that are not adding value for your team.”
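Suppression of this kind can be pictured as a filter in front of the alerting path. The sketch below is a simplified illustration under assumed conditions, not PagerDuty's implementation: the event fields and the `is_noise` rule are hypothetical examples of flows a team might already know add no value.

```python
from typing import Dict, List, Tuple

Event = Dict[str, str]


def is_noise(event: Event) -> bool:
    # Hypothetical suppression rule: informational events and events
    # from a staging environment are assumed not to need a responder.
    return event.get("severity") == "info" or event.get("env") == "staging"


def triage(events: List[Event]) -> Tuple[List[Event], List[Event]]:
    # Split incoming events into those that should alert and those suppressed.
    actionable = [e for e in events if not is_noise(e)]
    suppressed = [e for e in events if is_noise(e)]
    return actionable, suppressed


events = [
    {"summary": "disk full", "severity": "critical", "env": "prod"},
    {"summary": "deploy finished", "severity": "info", "env": "prod"},
    {"summary": "test failure", "severity": "error", "env": "staging"},
    {"summary": "high latency", "severity": "warning", "env": "prod"},
]
actionable, suppressed = triage(events)
```

In this made-up sample, two of the four events are suppressed and never reach a responder; suppressed events are kept in a separate list rather than discarded so they stay available for later review.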