DevOps / Machine Learning / Security / Sponsored / Contributed

AIOps Isn’t Just a Pipe Dream, but the Tools You Use May Be

26 Jan 2021 9:33am, by

Michael Cucchi
Michael is the vice president of product at PagerDuty. He has over 20 years of engineering, product management, and marketing experience in the high-tech and software industries. Prior to PagerDuty, Michael was the vice president of software products at Cognizant, where he drove strategy, funding and go-to-market methodology across a portfolio of 15 software-as-a-service offerings generated from a startup incubator he helped design. He has also held leadership roles at Pivotal, Akamai and Riverbed, and ran IT operations for a major federal government data center in research and special programs. At PagerDuty, Michael creates and drives the company's overall product and ecosystem positioning, product strategy, community advocacy and competitive intelligence.

In recent years, IT teams have been excited by the promise of AIOps. Faced with the burdens of rising IT complexity and growing alert fatigue, IT operations and development teams are spending more time than ever responding to incidents with digital services — which is cutting into their capacity to innovate.

COVID-19 has increased the pressure, with the number of digital incidents doubling in recent months. This has strained IT teams, who are entering crisis mode: moving from one emergency to the next and spending hours, sometimes even days, in “war room” scenarios.

This type of chaotic response is problematic because IT teams rely on manual diagnosis and remediation and lack the insights required to predict, pinpoint and resolve incidents efficiently, or even to identify the right people to act on them. In a recent study of ITOps and DevOps professionals, 61% said they lacked the insights and data to aid rapid, effective incident response, while 50% said they were hampered by disparate data from multiple monitoring tools.

We see this play out regularly in the real world. During a recent onboarding, a new customer told us how they had suffered through multiday troubleshooting sessions involving over 400(!) people. To make matters worse, not one of those 400 people was the right person to solve the problem. To create capacity to innovate, IT teams need tools that help them manage incidents more effectively and that can be leveraged across an ever-growing complex of technology and people.

AIOps Can Create Capacity for Innovation

This is where AIOps can help. Gartner has summed up AIOps as the use of big data and machine learning to automate IT operations, helping to accelerate the identification and resolution of IT issues. IT teams are increasingly excited by the promise of AIOps because today’s systems, applications and people are highly distributed and constantly changing, generating volumes of data so massive that humans cannot manually sift through and understand them. That hampers both detection and resolution of problems, and manual analysis certainly can’t be done in real time. AIOps can help you find the needle in this growing haystack, and do it instantly.

AIOps tools help to make sense of complex environments by integrating with applications and systems across an organization’s IT ecosystem. This enables AIOps platforms to proactively detect issues around the clock by correlating and clustering digital signals into actionable insights, making it easier for teams to predict incidents and ensuring that when responders are interrupted, it’s for a good reason. This helps accelerate root cause analysis and identify exactly who or what is at fault. From there, the technology can trigger automated actions like restarting a server, clearing logs, or reverting bad deploys, empowering IT teams to spend their time on mission-critical, business-differentiating work instead of mundane tasks or responding to false alarms. Or at least, that’s the dream.
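
To make the correlation and remediation steps concrete, here is a minimal Python sketch. It is an illustration under simple assumptions, not how any particular AIOps product works: events from the same service are merged into one incident when they arrive within a fixed time window, and a keyword lookup suggests an automated action. The `Event` and `Incident` types, the `RUNBOOK` mapping, the action names and the five-minute window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    service: str
    timestamp: float  # seconds since an arbitrary start
    message: str


@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)


def correlate(events, window_seconds=300.0):
    """Merge events from one service that arrive within window_seconds of the previous one."""
    incidents = []
    latest = {}  # service name -> most recent incident for that service
    for event in sorted(events, key=lambda e: e.timestamp):
        current = latest.get(event.service)
        if current and event.timestamp - current.events[-1].timestamp <= window_seconds:
            current.events.append(event)
        else:
            current = Incident(service=event.service, events=[event])
            incidents.append(current)
            latest[event.service] = current
    return incidents


# Hypothetical runbook: keyword found in an event message -> automated action name.
RUNBOOK = {
    "disk full": "clear_logs",
    "unresponsive": "restart_server",
    "error rate": "revert_deploy",
}


def suggest_action(incident):
    """Return the first matching runbook action, or None to escalate to a human."""
    for event in incident.events:
        for keyword, action in RUNBOOK.items():
            if keyword in event.message.lower():
                return action
    return None


events = [
    Event("checkout", 0, "Error rate above 5%"),
    Event("checkout", 60, "Latency p99 above 2s"),
    Event("checkout", 120, "Error rate above 8%"),
    Event("search", 90, "Host unresponsive"),
    Event("checkout", 4000, "Disk full on /var/log"),
]
incidents = correlate(events)
for incident in incidents:
    print(incident.service, len(incident.events), suggest_action(incident))
```

Here five raw events become three incidents, so a responder would be interrupted three times instead of five. A real platform would replace the fixed window and keyword table with learned models, which is exactly where setup effort tends to accumulate.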

AIOps Tools Have Failed to Deliver

AIOps tools are meant to make sense of complex IT environments and help IT teams resolve incidents more efficiently. But to make an impact, they need to provide immediate value to users, and realizing that value can’t take more effort than the work the tools aim to reduce. If it does, companies are better off investing in data scientists and customizing their own event rules and remediation solutions.

According to research from PagerDuty, 41% of ITOps and DevOps professionals have invested in an AIOps solution. But those same professionals say existing AIOps tools have not yet delivered on the promised benefits. That is because the ROI on AIOps takes a long time to materialize and existing solutions are very complex to configure. Of those who had implemented an AIOps solution, 54% said it required a lengthy services engagement. That is certainly a warning sign, and it helps explain why AIOps has not been more widely adopted.

The delay in time to value, let alone the large effort or expense needed to realize that value, isn’t sustainable. For AIOps to deliver on its potential, organizations need to adopt platforms that can quickly support IT teams and help them reduce the burden of digital operations without requiring large upfront investments of time and effort. To do that, a platform needs to be able to self-configure, access rich and deep data sets, and bring together your business, your people and your technology.

Living the Dream: PagerDuty AIOps

The PagerDuty platform is built to overcome these barriers. Powered by machine learning and automation, PagerDuty’s AIOps solution delivers ready-to-use capabilities that power and improve incident response, with as little human effort as possible and with minimal complexity. This enables organizations to minimize implementation time, get quicker time to value and see real ROI.

To enable AIOps and machine learning, you first need a large dataset. Because we have a unique dataset amassed over the last decade across 13,000+ customers, the PagerDuty platform is able to identify and utilize patterns based on anonymized customer data. This data can be used to make intelligent recommendations on how to group alerts to reduce noise and how to continuously improve to prevent problems in the future. Teams are able to reduce alert noise out of the box with the click of a button, without having to spend months training an algorithm or relying on custom rules that can become harder to manage over time.
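
As a rough illustration of what grouping alerts to reduce noise means in practice, the sketch below clusters alert titles by plain string similarity using Python’s standard `difflib` module. This is a deliberately simple stand-in, not PagerDuty’s method, which the article describes as learned from historical data. The `group_alerts` function, the 0.7 threshold and the sample alerts are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two alert titles (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_alerts(alerts, threshold=0.7):
    """Place each alert in the first group whose first member is similar enough."""
    groups = []
    for alert in alerts:
        for group in groups:
            if similarity(alert, group[0]) >= threshold:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append([alert])
    return groups


alerts = [
    "CPU usage high on web-01",
    "CPU usage high on web-02",
    "CPU usage high on web-03",
    "Payment gateway timeout",
    "Payment gateway timeout (retry 2)",
]
groups = group_alerts(alerts)
for group in groups:
    print(f"{len(group)} alert(s): {group[0]}")
```

The five alerts collapse into two groups. The hand-picked threshold is the weak point: set it too low and unrelated alerts merge, too high and nothing groups. Tuning of that kind is the custom-rule maintenance that a model trained on a large data set is meant to avoid.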

AIOps also requires a broad scope of observability sources to ingest data, drive insights, and deliver control to where your teams need it. PagerDuty acts as the central nervous system for an organization’s digital operations, aggregating data across tools into one place through the industry’s broadest set of ecosystem integrations (500 and growing). More critically, that data isn’t just about your technology stack: at PagerDuty, we link human, technology and business process data together. Finally, we can inject our platform insights and controls into your tools and ChatOps apps. That not only helps us predict incidents, but also gives responders all the information they need at their fingertips, wherever they are working and within the tools they use, whether that is Slack, Zoom or Teams, or Zendesk or Salesforce for your customer service or success teams.

Once an issue is identified, PagerDuty brings together the right people with the right information in real time, leveraging AIOps to build and maintain the world’s only cloud native, real-time service directory. This empowers teams with instant access to views of their digital operations and their associated dependencies, enabling our customers to address incidents in minutes or seconds, not hours. Responders aren’t just armed with the right information; they also have automated tools to pull in identified subject matter experts, communicate status to the rest of the business, and even fully automate remediation with tools like Rundeck.
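
The idea of a service directory with dependencies can be sketched in a few lines of Python. The example below is a toy model, not PagerDuty’s service directory or API: the `SERVICES` dictionary, the team names and both functions are hypothetical. It walks the dependency chain of an affected service and returns only the owners of the components that are actually unhealthy, rather than paging everyone.

```python
# Hypothetical directory: service name -> owning team and direct dependencies.
SERVICES = {
    "storefront": {"owner": "web-team", "depends_on": ["checkout", "search"]},
    "checkout": {"owner": "payments-team", "depends_on": ["payments-db"]},
    "search": {"owner": "search-team", "depends_on": []},
    "payments-db": {"owner": "data-team", "depends_on": []},
}


def all_dependencies(service, directory):
    """Return every service that `service` depends on, directly or transitively."""
    seen = []
    stack = list(directory[service]["depends_on"])
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.append(name)
            stack.extend(directory[name]["depends_on"])
    return seen


def responders_for(service, unhealthy, directory):
    """Owners of the unhealthy services among `service` and its dependencies."""
    candidates = [service] + all_dependencies(service, directory)
    return sorted({directory[name]["owner"] for name in candidates if name in unhealthy})


# The storefront is degraded, but health checks flag only the payments database.
print(responders_for("storefront", {"payments-db"}, SERVICES))
```

In this example only `data-team` is paged, even though the symptom first appeared on the storefront. Keeping such a directory accurate by hand is hard as systems change, which is why the article argues for building and maintaining it automatically.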

Getting AIOps right is critical to helping teams address the burdens of modern-day digital operations. The right AIOps solution will free up your critical resources, not consume them. AIOps can cut through rising noise levels, identify complex and ever-changing technologies, and link them to human relationships. It is a critical toolset for becoming predictive and automating incident resolution. Given our increasing reliance on digital services, teams need new ways to manage incidents that keep services up and running. But for AIOps to deliver on its value, organizations need to adopt a new breed of solutions that won’t leave them waiting months for implementation and years for ROI, or heavily tax teams on an ongoing basis. At PagerDuty, we have designed our AIOps solution with self-configuration and ease of use in mind and connected it to our massive data set to help organizations tackle today’s increasingly complex, always-on world, while creating the capacity that teams need to innovate and drive the digital customer experience of tomorrow.

Feature image via Pixabay.
