How AI and Automation Can Improve Operational Resiliency

When asked to define “operational resiliency,” Dormain Drewitz, vice president of platform advocacy at PagerDuty, recalled a recent conversation she had with Sam Newman, the noted O’Reilly author and technology consultant, about this very topic.
Newman cited the timeless sentiment of “Tubthumping,” the 1997 global smash by Chumbawamba about surviving a boozy night at the pub: “I get knocked down/but I get up again.”
Resiliency, Drewitz said in this episode of the New Stack Makers podcast, is “about having the ability to bounce back and recover.”
But, she added, “to Sam’s point, it’s more than just a sort of technical backup and recovery type of conversation. It has to also manifest in organizational recovery.”
True resiliency, she said, also means surviving incidents with your collective willingness to take risks intact: “You are able to deal with issues when they arise and you don’t let that stop you from still trying to move fast.”
She spoke to Heather Joslyn, host of this episode of TNS Makers, about the role AI and automation can play in dealing with incidents, and how they can help to build and maintain operational resiliency.
This conversation was sponsored by PagerDuty.
How to Avoid ‘Squeezing the Balloon’
With teams being asked to become more productive than ever due to economic pressures on their organizations, automation aimed at developers — including automation fueled by the new wave of generative AI, such as code completion tools — has become more widespread.
But making developers more productive can create what Drewitz called a “squeezing of the balloon problem”: moving bottlenecks from one part of the organization (developers) to another (operations).
“We face a test in our maturity around DevOps, when suddenly you’re gonna give the people ‘throwing things over the wall,’ so to speak — they’ve now got a forklift, and they’re throwing more,” she said.
What can automation coupled with AI do to improve the productivity of team members who aren’t frontend developers?
By looking at the entire value chain, Drewitz said, organizations can identify areas where AI coupled with automation can help. It takes “thinking about what happens after things get shipped to production and the teams that are involved there,” she said.
Steps that are always performed in certain types of incidents (checking whether the Kubernetes API is responding, closing a particular port, checking whether the database is loading, and so on) can be automated. “Those are all opportunities to automate because they’re small, but they add up in terms of the blast radius, from an interruption and productivity loss perspective,” Drewitz said.
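To make the idea concrete, here is a minimal Python sketch of how routine first-response checks like these might be scripted. It is an illustration only, not PagerDuty's implementation: the function names, hosts, and ports are hypothetical, and a TCP connection test stands in for a real Kubernetes API or database health check.

```python
import socket
from typing import Callable, Dict


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def run_diagnostics(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises is recorded as failed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


if __name__ == "__main__":
    # Hypothetical endpoints (assumptions): replace with real hosts and ports.
    checks = {
        "kubernetes_api_reachable": lambda: tcp_port_open("127.0.0.1", 6443),
        "database_port_reachable": lambda: tcp_port_open("127.0.0.1", 5432),
    }
    for name, ok in run_diagnostics(checks).items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Each check is small on its own, but running them together from one script means the responder starts an incident with the basic facts already gathered.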
Using AI to Draft Postmortems
PagerDuty’s AI-powered operations platform is using generative AI to help automate repetitive tasks, with the idea of freeing engineers to tackle the causes of incidents and restore service more quickly.
For instance, users can now create automated runbooks to handle the repetitive troubleshooting tasks that must be carried out during an incident.
And, Drewitz said, PagerDuty now uses generative AI to draft status updates during an incident. “You still get to look over and say, ‘Yes, that looks right,’ or ‘I would change that here.’ But not having that blinking cursor, when you’re in the midst of an incident? … it’s valuable for folks.”
It can also draft an incident postmortem report. “It can take time to go back and do that, go back and relive this battle we just fought and write it down,” Drewitz said. “And if you pause and come back to it, then you may not have everything fresh in your mind anymore.”
Having an operations platform that can simply generate a postmortem draft at the push of a button helps save time and stress. “It’s so much easier to review and edit and approve something that’s been drafted,” she said. “Then you don’t have to be starting from scratch.”
Listen to the full episode for more on how automation and AI can improve incident management.