PagerDuty sponsored this post.
The digital world can be pretty unforgiving. After two years immersed in online services, consumers now expect a seamless experience from their providers — whether it’s online banking or video streaming. And they expect that when problems do occur, they’re fixed rapidly. The challenge is that the IT world is still largely running at ticket speed. That means when incidents occur, users must submit a ticket and eventually a member of the IT team will pick it up. It’s manual, reactive, inefficient and not nearly fast enough in today’s digital-centric world.
Organizations instead need to move their digital operations at machine speed. This means adopting process automation to accelerate and streamline incident response, preserving the bottom line and customer loyalty.
The Problem with IT Incident Response
Global IT spend has grown more than 5% in 2022, to reach nearly $4.5 trillion, driven largely by “high expectations for digital market prosperity,” according to Gartner. Yet as more revenue and customers shift online, the stakes for incident response continue to rise. Today, it’s more important than ever that issues be tackled rapidly, ideally before customers have even noticed something is wrong. But troubleshooting and incident resolution are increasingly challenging for IT operations teams as digital infrastructure becomes more complex and interdependent. Both operational incidents and demand for innovation are surging, pulling developers in two different directions.
The sheer volume of data that responders must collect, analyze and act on is becoming unmanageable without a high degree of automation. Responders must sift through the noise to prioritize what matters, and who should respond, all while orchestrating across multiple siloed systems, which may require manual rule configurations and other time-consuming steps.
In this context, ticket speed is woefully inadequate. At best, an ITOps responder will be able to get to work on a ticket as soon as it comes in. But running digital operations at human speed isn’t sustainable — or quick enough. The answer is workflow automation to accelerate and improve incident management outcomes and free up team members for innovation.
Intelligence and Automation
What does this mean in practice? It means adopting capabilities such as runbook automation (RBA), which captures and automates some of the most common, repetitive tasks used to diagnose and remediate recurring incidents. For example, teams can automate the diagnostics that check disk, memory and CPU health and gather logs. With RBA, first responders can handle more incidents without needing to escalate, resolving incidents faster and ensuring that subject matter experts (SMEs) can work uninterrupted on improving the user experience. This can save time and money and improve staff satisfaction.
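To make the idea concrete, here is a minimal sketch of a runbook as an ordered list of diagnostic steps, written in Python. It is an illustration under stated assumptions, not how any particular RBA product works: the names StepResult, check_disk, run_runbook and needs_escalation, and the 10% free-space threshold, are hypothetical.

```python
import shutil
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    """Outcome of one diagnostic step (hypothetical structure)."""
    name: str
    ok: bool
    detail: str


def check_disk(path: str = "/", min_free_ratio: float = 0.10) -> StepResult:
    """Flag the volume if less than min_free_ratio of it is free.

    The 10% default is an illustrative threshold, not a recommendation.
    """
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    detail = f"{free_ratio:.0%} free on {path}"
    return StepResult("disk", free_ratio >= min_free_ratio, detail)


def run_runbook(steps: list[Callable[[], StepResult]]) -> list[StepResult]:
    """Run every step in order; a crashing step is recorded, not fatal."""
    results = []
    for step in steps:
        try:
            results.append(step())
        except Exception as exc:  # keep going so the responder sees all findings
            name = getattr(step, "__name__", "step")
            results.append(StepResult(name, False, f"step failed: {exc}"))
    return results


def needs_escalation(results: list[StepResult]) -> bool:
    """Escalate only when at least one diagnostic step reports a problem."""
    return any(not result.ok for result in results)
```

A first responder could call run_runbook([check_disk]) and read the results before deciding whether to page an SME. Memory, CPU and log-gathering steps would follow the same shape but depend on the platform, so they are left out of this sketch.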
Two other critical drivers of machine speed digital operations are automated noise reduction and alert grouping. Noise reduction is essential in a world where a major incident like a core dependency failure could generate thousands of alerts, many of which could be irrelevant or duplicates. By using AIOps tools based on machine learning, organizations can silence the alerts that require no response, filtering out anything irrelevant, nonessential or duplicated. These tools can pause flapping incident notifications for a certain period of time while responders work on a serious incident. They can also automatically group alerts based on alert content, time period, past groupings and any additional custom thresholds.
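The grouping idea can be sketched with a simple rule, shown here in Python. This is a deliberately basic, rule-based stand-in for the machine learning approach described above: it groups alerts that share the same content and arrive close together in time. The Alert fields, the group_alerts name and the five-minute window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """A simplified alert (hypothetical fields)."""
    service: str
    summary: str
    timestamp: float  # seconds since some reference point


def group_alerts(alerts, window_seconds=300.0):
    """Group alerts with the same service and summary.

    An alert joins an existing group when it arrives within
    window_seconds of that group's most recent alert; otherwise
    it starts a new group.
    """
    groups = []
    latest_group = {}  # (service, summary) -> index of most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.summary)
        index = latest_group.get(key)
        if (index is not None
                and alert.timestamp - groups[index][-1].timestamp <= window_seconds):
            groups[index].append(alert)
        else:
            latest_group[key] = len(groups)
            groups.append([alert])
    return groups
```

With one notification per group rather than per alert, a burst of identical alerts from a failing dependency reaches the responder once. A production tool would also weigh past groupings and custom thresholds, which this sketch does not attempt.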
As these tools learn more about the digital operations of an organization, they could eventually handle incidents in a completely automated fashion, from diagnostics and system health checks to triggering self-healing actions.
Driving Faster Coordination
Customer service teams can also play a critical role in incident response by taking the pressure off SMEs. According to research, the volume of interruptions increased 4% from 2019 to 2020. The share of users being interrupted varies with company size, with 46% of users at very small companies being interrupted compared with 30% of enterprise users.
These interruptions can be costly for companies that are saddled with manual “ticket speed” processes and tools that are siloed from those used by IT and engineering teams. Many are forced to pass issues up to the next tier without useful context, adding time to each inquiry. What’s more, a lack of two-way communication means the service agent has no confidence they’ve escalated to the right subject matter expert and no way of receiving status updates.
However, things are changing with more organizations embedding their customer service teams into incident resolution lifecycles so that they own a case from beginning to end. That means they’ll delegate tasks, listen to subject matter experts and proactively coordinate a response. But to do this effectively, they need automated and unified toolsets to orchestrate and scale a rapid response.
These toolsets will automatically surface historical context when an incident occurs, including technical monitoring data and information from customer calls and other systems of record. They will enable automated escalation of some incidents according to predefined policy, along with bidirectional communication so that staff can quickly mobilize and activate a response. And they’ll leverage machine learning to flag issues proactively for next-level customer service and differentiated VIP support.
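Escalation according to a predefined policy can be pictured as an ordered list of tiers, each with a time limit. The Python sketch below is a simplified illustration, not a description of any vendor's escalation engine; the Tier structure, the current_responder function and the responder names and timeouts in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One level of an escalation policy (hypothetical structure)."""
    responder: str
    timeout_minutes: int  # how long this tier has to acknowledge


def current_responder(policy, minutes_unacknowledged):
    """Return who should be notified after the incident has gone
    unacknowledged for the given number of minutes.

    Each tier holds the incident for its timeout; once every timeout
    has expired, the incident stays with the last tier.
    """
    if not policy:
        raise ValueError("policy must contain at least one tier")
    elapsed = 0
    for tier in policy:
        elapsed += tier.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier.responder
    return policy[-1].responder
```

For example, with a policy of Tier("support-agent", 10), Tier("on-call-engineer", 15) and Tier("engineering-manager", 30), an incident left unacknowledged for 12 minutes would sit with the on-call engineer. Expressing the policy as data is what lets a tool apply it without a human deciding whom to page.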
Running at Machine Speed
As with the IT use case, running at machine speed is all about accelerating and improving incident response to keep customers happy and reduce the risk of employee stress and burnout. At a time when more workers than ever are reconsidering their roles and careers, organizations ignore that risk at their peril.
But moving at machine speed isn’t just a matter for customer support and IT teams. Stay tuned for the next part of this two-part series when we’ll be looking at its impact on other functions across the business.
Featured image via Pixabay.