The following is the second in a two-part series exploring the emergence of AIOps.
It may be a relatively new term, but the flood of alerts and the complexity of modern application stacks are driving many IT teams to adopt AIOps.
A recent survey from the AIOps Exchange illustrated both the scale of the problems AIOps is meant to solve and the level of interest in solving them, with 91% of respondents looking at machine learning-powered tools to make ops teams more productive. That’s even higher than Forrester’s figures for adoption of what it calls “Intelligent Application and Service Monitoring,” with 51% already using such tools and another 21% planning to adopt them within a year.
Interest is so high because some 40% of IT organizations in the AIOps Exchange report see over a million event alerts a day, with 11% receiving over 10 million alerts a day. The fact that a quarter of the organizations have 50 or more monitoring tools in their enterprise may account for the sheer volume, but so do the number of different services and platforms in use, with many enterprises handling both legacy applications and new microservices.
AIOps tools promise to reduce the noise by correlating those alerts together into related incidents, by collecting time-series data, building machine learning models to aggregate them and — in some cases — automating collection of further related telemetry. Topology mapping discovers the relationships between devices, or between applications and resources, and statistical analysis ranges from simpler outlier and anomaly detection — Is one node behaving differently from the others or is the performance of a monitored resource abnormal even if it’s within its normal range? — to more powerful multivariate analysis and dynamic baselining. Dashboards and incident visualizations show performance metrics and event timelines together.
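The statistical techniques involved need not be exotic. As a toy sketch (not any vendor’s implementation), rolling-window outlier detection against a dynamic baseline — flagging a point when it strays several standard deviations from the recent window, rather than past a fixed threshold — fits in a few lines of Python:

```python
from statistics import mean, stdev

def anomalies(series, window=20, threshold=3.0):
    """Flag points deviating from a rolling baseline by > threshold sigmas.

    The 'normal' range is recomputed from the most recent window,
    which is the core idea behind dynamic baselining.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Latency oscillating around 100ms, with one spike at index 25.
latency = [102.0 if i % 2 == 0 else 98.0 for i in range(32)]
latency[25] = 400.0
print(anomalies(latency))  # → [25]
```

Production systems use far more sophisticated models (multivariate, seasonal, topology-aware), but the principle — learn the baseline from the data instead of hard-coding it — is the same.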
They may also be able to detect probable root causes, find the right people to work on a problem, suggest remediations or automate fixes and predict future problems.
Silos and Swamps
What ScienceLogic CEO and founder Dave Link calls “the alert swamp” is only part of the problem IT is dealing with. “When you have dozens of tools, you end up with a data swamp rather than a really clean data lake, because the real data you need fits into many different data stores,” Link warns. That means AIOps tools need to clean and structure data to get it ready to analyze.
By their nature, AIOps tools need to collect data from and automate remediation actions through existing IT operations tools, whether that’s database logs, infrastructure monitoring (networking, storage and compute), APM and the application layer, cloud monitoring, cloud services, orchestrators like Kubernetes (and the microservices running in the containers), or configuration management systems.
But existing operations and monitoring tools are also adding integrations or their own AIOps tools. There’s a Moogsoft AIOps plugin for AWS Systems Manager OpsCenter, and you can connect VirtualWisdom to AppDynamics to get problem detection and remediation for monitored apps. New Relic recently acquired SignifAI and is adding its automated issue correlation and enriched context information for incidents to the New Relic One platform, with integration to common devops tools.
That’s going to help software teams predict and address performance issues, SignifAI co-founder Guy Fighel told us (he’s now general manager of AIOps and vice president of product engineering at New Relic).
“As the complexity of production systems grows, on-call teams need faster and easier ways to resolve incidents. They need assistance and automation that augments (not replaces) their existing incident management teams and workflows, so they can detect, diagnose and resolve issues faster, as well as prevent problems before they occur,” he said. For DevOps and site reliability engineers, he claims that means being able to “detect issues earlier, reduce alert noise, and deliver highly available and reliable software at scale.”
SignifAI, which has been available either as SaaS or as a plugin for incident response systems like OpsGenie and PagerDuty, integrates with around 60 monitoring, incident management and alerting tools. It automatically correlates information from those multiple tools but also allows users to fine-tune the correlation engine.
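The core idea behind that kind of correlation — collapsing many related alerts into a handful of incidents — can be pictured in miniature. This toy Python sketch (the `Alert` fields, grouping key and time window are illustrative assumptions, not SignifAI’s actual engine) groups alerts by service and time proximity:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    timestamp: float  # seconds since epoch
    service: str
    message: str

def correlate(alerts, window=300):
    """Group alerts into incidents: same service, arriving within
    `window` seconds of the previous alert in the group. A crude
    stand-in for the multi-signal correlation real engines perform."""
    incidents = []
    open_groups = {}  # service -> most recent incident for that service
    for a in sorted(alerts, key=lambda a: a.timestamp):
        group = open_groups.get(a.service)
        if group and a.timestamp - group[-1].timestamp <= window:
            group.append(a)  # fold into the existing incident
        else:
            group = [a]      # open a new incident
            open_groups[a.service] = group
            incidents.append(group)
    return incidents

alerts = [
    Alert(0, "checkout", "latency high"),
    Alert(60, "checkout", "error rate high"),
    Alert(90, "search", "disk full"),
    Alert(4000, "checkout", "latency high"),
]
print(len(correlate(alerts)))  # → 3 (four alerts become three incidents)
```

Real correlation engines also key on topology, shared hosts, text similarity and learned patterns, which is what the user-tunable correlation rules adjust.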
AIOps Adoption and Maturity
In its 2019 State of AIOps report, OpsRamp found AIOps tools are commonly used for intelligent alerting, root cause analysis, anomaly detection, capacity optimization and automatic remediation of incidents. But “80% of IT leaders are looking to automate tedious tasks within incidents,” OpsRamp senior vice president Bhanu Singh told us. They also want to bring down costs. “Most of our customers are looking at significantly cutting the cost of level one support, cutting the costs of alerts by 30% or 40%.”
Those costs can be significant, Link told us. “Service disruptions and downtime cost, on average, $300,000 per hour, and as much as $540,000 per hour.”
But what organizations want from AIOps can vary depending on whether you talk to IT teams or developers, he notes. Ops teams want context: overall visibility into their systems, to get a better handle on what the root cause of a problem could be across a full production environment. The dev side of the house wants something much faster: a real-time view of how the application is performing from the end user’s perspective.
ScienceLogic uses a five-stage maturity model to help customers understand where they are on their monitoring and automation journey — starting with just data, ScienceLogic Chief Technology Officer Antonio Piraino explained. “Stage zero is completely human powered: usually decade-old systems with siloed isolated pieces of data where I don’t really understand the context of it, I don’t have great analytics against that: I just have plain vanilla alerting against a hard policy with no automation.”
Many enterprises have been doing the automated data collection he labels stage one for some years, but it’s usually siloed, with separate vendor tools for collecting data about storage and virtualization. Because of costs, typically only 10% of an enterprise environment will use agents to collect application performance management data and that’s usually done by a DevOps team tracking customer experience with the application. “There’s not a great deal of true analytics [at stage one]; just a lot of data coming in and a lot of eventing against a hardline policy, and certainly no remediation,” Piraino explained.
The vast majority of enterprises are at stage two, where they start to consolidate data from those different silos. This usually includes supervised analytics, machine learning using a single metric, alert enrichment by collecting more data about the relevant device or component, and runbook automation. Piraino characterizes this as “I’ve got a lot more data at my fingertips within which to make a decision, but lots of times the actual decision or configuration change orchestration is still manual in nature.”
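The alert enrichment Piraino mentions is conceptually simple: before an alert is routed, look up the affected device in an inventory and attach ownership and dependency context, so the responder sees the blast radius without querying three other tools. A minimal sketch, with a hypothetical in-memory inventory standing in for a real CMDB or discovery system:

```python
# Hypothetical inventory; real tools pull this from a CMDB or
# topology-discovery system rather than a hard-coded dict.
INVENTORY = {
    "db-03": {
        "role": "postgres primary",
        "team": "data-platform",
        "dependents": ["checkout", "billing"],
    },
}

def enrich(alert):
    """Attach device context (owner, downstream services) to a raw alert."""
    context = INVENTORY.get(alert["host"], {})
    return {
        **alert,
        "owner": context.get("team", "unknown"),
        "impacted_services": context.get("dependents", []),
    }

alert = {"host": "db-03", "metric": "disk_io_wait", "value": 0.92}
print(enrich(alert)["impacted_services"])  # → ['checkout', 'billing']
```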
More progressive enterprises are starting to take the leap to stage three by contextualizing data and integrating an ecosystem of tools. “This is where enterprises say okay, we have the data swamp, let’s turn it into a more structured data lake so we have context. So we understand the dependency mapping, we’ve classified business services, we have classified application services, IT services, we’ve put all the various dependencies in. We understand what external SLAs are, we’ve understood what we’re aiming at for internal KPIs. We can start to automate a lot more of the modeling and there’s a lot more unsupervised analytics taking place.”
That may not need powerful algorithms like deep learning, he notes. “A lot of the time people just want real time immediate [information]; tell me what’s wrong right now, tell me the root causes and give me a rich set of data around the problem so we can start to take an automated action, tying in to incident management tools and third party ITSM tools like change management systems.”
Only a few enterprises have already reached stage four, where AIOps becomes more of an advisor. “How do you actually get the system to make recommendations for additional data sources [to consider], for potential future health risks coming down the pipe, and what recommended actions can it take?” That’s based on multivariate analysis, multidimensional analytics and conditional automation, Piraino said. The system “can start to say, ‘if this then this,’ ‘if that then this,’ and what else can I learn before I get to the next stage?”

“Stage five is the fully autonomous cognitive behavioral network that does complete corrective data optimization, all algorithms are self-tuning based on performance and self-learning. There’s a complete closed loop automation of self-healing and nobody ever has to touch anything.” That’s what everyone thinks AIOps will bring them, but no one has that implemented, he maintains; “if somebody says they have this, it’s probably for a single application in a very closed environment.” Getting there, or even to stage four, requires more than technology changes.
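The “if this then this” conditional automation of stage four can be pictured as declarative condition-action rules evaluated against an incoming incident — a deliberately simplified sketch, with hypothetical metric and action names:

```python
# Each rule pairs a condition (a predicate over the incident)
# with a remediation action name. Names here are illustrative.
RULES = [
    (lambda inc: inc["metric"] == "disk_used_pct" and inc["value"] > 90,
     "expand_volume"),
    (lambda inc: inc["metric"] == "error_rate" and inc["value"] > 0.05,
     "restart_service"),
]

def recommend(incident):
    """Return remediation actions whose conditions match. At stage four
    these surface as recommendations; full automation would run them."""
    return [action for condition, action in RULES if condition(incident)]

print(recommend({"metric": "disk_used_pct", "value": 95}))  # → ['expand_volume']
```

Stage five, by Piraino’s definition, would replace the hand-written rule list with self-tuning, self-learning policies closing the loop without human review.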
Prepare for Process Changes
Successful AIOps requires an even bigger cultural change than DevOps, Singh said, because it will need to move beyond IT operations to cover all customer-facing systems. “Your CRM application has to come into play and at some point maybe even the social interactions that your organization is going through with your customers and with your potential audience to really understand your customers’ sentiment and experience.”
That means AIOps will cross even more process and organizational boundaries, far beyond the obvious lines like the networking, storage, security and application teams who might all be involved in understanding a problem and pushing out the necessary changes. “Sometimes a lot of time is lost to back and forth and interacting between teams, because of the context that we are not getting,” Piraino said. The obvious benefits are reducing the time to detect and repair problems, which improves availability and reduces SLA penalties. Longer-term improvements are about reducing incident rates, increasing mean time between failures, making engineers more productive and shifting them from mundane, low-value tasks that can be partly or fully automated to more engaging and satisfying work — and that’s where more fundamental change may be involved.
“AIOps will drive a cultural shift in how people operate and how organizations understand the flow of data and the impact of the data. There will be less debate and maybe more fluid flow of information for people to take action.”
In the long run, AIOps may change company processes rather than simply automating the way things are already done. Link sees that happening when organizations walk through how they handle incidents as part of adopting AIOps. “They look at it from an application perspective, or from the perspective of the storage team, or the architecture team or the cloud team and they immediately recognize that they are out of sync with each other and they’re not necessarily feeding each other with the right details and information,” Piraino said.
That’s even more important for modern applications built using microservices, Piraino said. “You have to really rethink how you’re going to look at the full stack. Developers are starting to realize the application has to understand what the underlying infrastructure is doing at the app layer, but the applications don’t necessarily have that data set. It requires retooling the capabilities for operations to provide those insights to the application developers so that the applications can be infrastructure aware, and that’s what’s driving AIOps.”
New Relic is a sponsor of The New Stack.