
Managing Complexity and Avoiding Chaos in Digital Operations

One enterprise may use thousands of digital services, making critical incidents inevitable. Proactive, real-time response is essential.
Aug 9th, 2021 7:19am by
Featured image via Pixabay.

Sean Scott
As chief product officer of PagerDuty, Sean is responsible for its multiproduct digital operations management platform. He has more than 20 years of experience in the technology industry, with the majority of that time at Amazon. Sean holds a bachelor’s degree in computer science and an M.B.A. from the Red McCombs School of Business, both from the University of Texas at Austin.

The pandemic has pushed many organizations over a technology tipping point, forever changing the way business is done and highlighting how few companies are truly equipped to meet the demands of this new digital reality.

A single enterprise today might rely on thousands of digital services to deliver the digital experiences key to retaining customers. In this context, critical incidents are inevitable. An incident could be a service issue such as downtime, with a major impact on customers and the business. Or it could involve a crucial delivery whose status must be assured at multiple touchpoints.

This is why proactive, real-time response is essential. When it is done right, issues are fixed before they affect any customer. Yet many organizations are still encumbered by manual incident routing and inefficient cross-team collaboration. This leaves them trapped in a vicious circle of constant firefighting, which damages brand reputation, employee well-being and productivity.

To provide insight into how organizations are coping with these pressures, PagerDuty has produced its first State of Digital Operations report. The report is based on an analysis of data compiled on our platform between January 2019 and April 2021. The platform ingests around 30 million events per day, filtered into around 1 million alerts and over 55,000 critical incidents.

The Cost of Critical Incidents

Critical incidents surged by 19% from 2019 to 2020, and the volume will continue to rise as organizations accelerate digital transformation projects and technology stacks become more complex. This is true across almost every industry vertical. However, critical incident volume was highest over that period in travel and hospitality, and in telecoms (20%). This is understandable given the large number of cancellations in the former and the huge pressure from remote work on the latter.

The time spent fixing issues also adds to the cost of an incident. Even though PagerDuty customers are resolving individual incidents faster than before, critical incident volume is rising at such a rate that the total time spent fixing problems is still increasing. If the average IT operations engineer or developer in the United States is paid $50 an hour, and it takes 1.2 responders around 2.1 hours to resolve a critical incident, we’re looking at $126 per incident in hourly costs alone ($50 × 1.2 × 2.1). An average of 105 critical incidents per month in 2020 works out at nearly $159,000 per organization per year. (That’s before incorporating lost revenue or the impact on staff productivity and morale.)
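The labor-cost estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the four input figures come from the paragraph, while the function and variable names are illustrative and not part of any PagerDuty tool.

```python
# Inputs taken from the article's example.
HOURLY_RATE_USD = 50           # assumed average pay per responder hour
RESPONDERS_PER_INCIDENT = 1.2  # average responders per critical incident
HOURS_TO_RESOLVE = 2.1         # average hours to resolve one incident
INCIDENTS_PER_MONTH = 105      # average critical incidents per month in 2020


def cost_per_incident(rate: float, responders: float, hours: float) -> float:
    """Labor cost of one incident: rate x responders x hours."""
    return rate * responders * hours


def annual_cost(per_incident: float, incidents_per_month: float, months: int = 12) -> float:
    """Labor cost over a year at a constant monthly incident volume."""
    return per_incident * incidents_per_month * months


per_incident = cost_per_incident(HOURLY_RATE_USD, RESPONDERS_PER_INCIDENT, HOURS_TO_RESOLVE)
yearly = annual_cost(per_incident, INCIDENTS_PER_MONTH)

print(f"${per_incident:,.0f} per incident")  # $126 per incident
print(f"${yearly:,.0f} per year")            # $158,760 per year
```

Swapping in an organization's own pay rate, responder count and incident volume gives a rough lower bound, since lost revenue and morale costs are left out.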

The Human Impact

The human impact of growing incident volumes is too often overlooked. It’s causing employee burnout and churn, and our analysis revealed that more than a third of users worked two extra hours per day in 2020 versus 2019. That’s 12 more weeks of work over the course of the year. The total number of interruptions (a push notification to a mobile phone, an SMS or a phone call) also rose by 4% over the period. The problem is particularly acute in very small businesses, where 46% of users are interrupted each month versus 30% of enterprise users.

We also saw a 9% increase in “off-hour” interruptions (between 6 and 10 p.m. Monday to Friday, or between 8 a.m. and 10 p.m. over the weekend) and a 7% increase in holiday and weekend interruptions. These interruptions cause responders annoyance, inconvenience and stress. When cross-referencing interruptions data with information about users who left their jobs, it’s not surprising that users departing their roles experienced a higher-than-average off-hour incident load: one every 12 days, compared to one every 15 days for other users.
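For teams that want to track this in their own data, the off-hour windows quoted above can be turned into a simple check. This is a minimal sketch that assumes timestamps are already in the responder's local time and treats each window as including its start and excluding its end. The function name is illustrative, and holidays are not handled.

```python
from datetime import datetime, time

# Windows as described in the article (local time assumed).
WEEKDAY_WINDOW = (time(18, 0), time(22, 0))  # 6 p.m. to 10 p.m., Monday to Friday
WEEKEND_WINDOW = (time(8, 0), time(22, 0))   # 8 a.m. to 10 p.m., Saturday and Sunday


def is_off_hour(ts: datetime) -> bool:
    """Return True if a notification at `ts` falls in an off-hour window."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    start, end = WEEKEND_WINDOW if ts.weekday() >= 5 else WEEKDAY_WINDOW
    return start <= ts.time() < end


print(is_off_hour(datetime(2021, 8, 9, 19, 0)))   # Monday evening
print(is_off_hour(datetime(2021, 8, 9, 14, 0)))   # Monday afternoon
print(is_off_hour(datetime(2021, 8, 14, 9, 0)))   # Saturday morning
```

Counting how often each responder gets a True result per month is one way to spot the uneven off-hour load described here.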

Taking Full Ownership

Amid the increasing pressure and rising incident volumes, there is some good news. The longer customers use the PagerDuty platform, the more adept they become at relieving the burden on technical teams and unlocking faster mean time to acknowledge (MTTA) and resolve (MTTR).

Accountability is key to driving this kind of operational excellence. The best-performing teams take ownership and acknowledge issues quickly, even if it takes time to investigate and resolve them. The percentage of critical incidents acknowledged (Ack%) tends to rise over the lifetime of a customer account as operational maturity improves. Typically, this starts out at just 16% after three months, then soars to 82% after 60 months. That means compounding gains in collaboration efficiency and protected revenue for organizations the longer they use PagerDuty.
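To make these metrics concrete, here is one way Ack%, MTTA and MTTR might be computed from a list of incident records. It is a sketch under stated assumptions, not PagerDuty's actual calculation: the `Incident` record and its field names are hypothetical, both times are measured from incident creation, and unacknowledged or unresolved incidents are simply left out of the corresponding mean.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    created: datetime
    acknowledged: Optional[datetime] = None
    resolved: Optional[datetime] = None


def ack_percent(incidents: List[Incident]) -> float:
    """Share of incidents that were acknowledged, as a percentage."""
    if not incidents:
        return 0.0
    acked = sum(1 for i in incidents if i.acknowledged is not None)
    return 100.0 * acked / len(incidents)


def mean_minutes(incidents: List[Incident], attr: str) -> Optional[float]:
    """Mean minutes from creation to the timestamp stored in `attr`."""
    durations = [
        (getattr(i, attr) - i.created).total_seconds() / 60
        for i in incidents
        if getattr(i, attr) is not None
    ]
    return mean(durations) if durations else None


t0 = datetime(2021, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=60)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=120)),
    Incident(t0, None, t0 + timedelta(minutes=90)),  # resolved, never acknowledged
    Incident(t0),                                    # still open
]

print(ack_percent(incidents))                   # 50.0
print(mean_minutes(incidents, "acknowledged"))  # MTTA in minutes: 10.0
print(mean_minutes(incidents, "resolved"))      # MTTR in minutes: 90.0
```

Tracking these three numbers by account age is what would reveal the kind of maturity curve described above.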

Strategic use of technology is also important to continuous improvement. The use of chat tools such as Slack and Microsoft Teams to distribute and coordinate real-time work (aka “ChatOps”) improves collaboration and visibility. As a result, we’ve seen ChatOps usage increase by 22% over the past year. Mobile adoption is also driving the productivity of incident responders. Regardless of size, organizations with higher mobile adoption rates saw 40% to 50% better MTTA than those with lower rates.

The Journey Starts Here

As companies accelerate digital transformation and embrace modern technology stacks, they will be under more pressure than ever to deliver on customer expectations. Complexity and noise show no signs of slowing. The rise in critical incidents that require real-time work to resolve is placing stress on responders.

However, help is here. Operational processes and intelligent platforms like PagerDuty can reduce the burden on teams and empower the enterprise to unlock faster MTTA and MTTR. Businesses need a sustainable way to allocate resources to manage digital operations. Not only will this keep digital services running continuously, it will also help retain both customers and employees.

Even as the complexity of technology drives ever-increasing incidents, the people managing that digital infrastructure remain at the core. Companies that can mature their approach to digital operations will balance the workload for teams. This will keep those teams happy and productive while helping to deliver the highest quality digital experiences.

None of this can happen at the flick of a switch. But with the right foundational technology providing unified visibility, intelligence and control, organizations will be well on their way.

TNS owner Insight Partners is an investor in: Pragma.