Observability and the Misleading Promise of AIOps
AIOps, a term coined by Gartner in 2017, promises to drive better decision making and faster incident resolution by mitigating many of the “people problems” involved in managing complex systems. But any discussion of AI should remember that this technology isn’t magic: AI does much the same tasks that people can, just much faster and more patiently. The hype behind AIOps masks how far it is from meeting its goal. Can AIOps solve the underlying problems that plague Operations? Simply put, no.
The Promise of AIOps
AIOps is a hot new area: Gartner has assigned it to a magic quadrant, and prominent vendors in the space, including Splunk and New Relic, advertise their AIOps platforms. AIOps falls into three categories: observing systems, service management, and automating scripts and runbooks. I’d like to look at the promises and opportunities around observing systems and service management, and at the problems and challenges that come with them.
People adopt AIOps in order to improve operability and reduce human workload. The goal of an AIOps system is to make us better at ensuring that we rapidly deliver high-quality software, and support remediation and repair when systems fail.
A Legacy of Bad Habits
Let’s start with the basics. The DORA metrics are a great way of characterizing how successful a team is at managing software delivery and operations. They measure how quickly software is deployed, how often it succeeds, and how hard it is to recover from any failures. According to DORA’s metrics, high-performing teams have the ability to deploy quickly, to detect whether a deploy has failed, and to repair it rapidly.
To track things like the success of deploys and the quality of releases, we can keep a suite of user-facing signals. For example, we can track the time it takes for pages to load and see whether that is growing or shrinking. Since these numbers can be huge — millions of users, hundreds of pages — we aggregate these into groups, and track only statistics like the 95th percentile of page load time, or the number of error statuses on these pages.
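As a rough illustration of that aggregation step, here is a minimal sketch that reduces raw page requests to a per-page 95th percentile and error count. The request tuples, the `summarize` and `percentile` helpers, and the choice to count only 5xx statuses as errors are all assumptions made for this example, not a description of any particular monitoring product.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Group raw requests by page and reduce each group to a few statistics."""
    by_page = defaultdict(list)
    for page, load_ms, status in requests:
        by_page[page].append((load_ms, status))

    summary = {}
    for page, rows in by_page.items():
        summary[page] = {
            "count": len(rows),
            "p95_load_ms": percentile([ms for ms, _ in rows], 95),
            # Assumption: only 5xx responses count as errors here.
            "errors": sum(1 for _, status in rows if status >= 500),
        }
    return summary


# Hypothetical sample data: (page, load time in ms, HTTP status).
requests = [("/home", 100 + i, 200) for i in range(19)] + [("/home", 2000, 500)]
requests += [("/cart", 300, 200), ("/cart", 350, 200)]

for page, stats in summarize(requests).items():
    print(page, stats)
```

Note that in this made-up sample the single 2,000 ms request on `/home` does not move the 95th percentile at all; it only shows up in the error count. That is the trade-off of aggregating: the summary is cheap to track, but individual bad experiences can disappear into it.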
Despite the simplicity of that description, errors can be subtle. The indicator for a bad deploy might be that a particular page loads more slowly than before, or that a certain code path is more likely to generate errors. Before these very user-visible errors occur, computer systems have other ways to indicate distress: a process can run out of memory, or disk, or use all its CPU; various buffers can fill up as processes get slower.
Over time, teams discover that some number of these signals are correlated with other sources of information. Perhaps they’ve noticed that when a process spins out of control, it burns through all its CPU, locking the system up. Or they’ve found that user response times tend to climb when the back-end database gets backlogged. They might notice that a critical process crashes after logs show that the process can’t write to disk. Reacting to these discoveries, each of these transient signals becomes monitored with an alert or two: a warning that the disk is getting full, that the CPU is overloaded, and that disaster is likely to strike. The team engineers its processes to spit out logfiles full of signals, in the hope that referring to these logs can help diagnose what’s gone wrong.
Monitoring and collecting these types of signals might have been a good idea when software ran on monoliths — when we, as the adage goes, kept all our eggs in one basket, and then hired a team to watch that basket.
In the modern era, though, it simply doesn’t make sense. Our systems are too complex, too unpredictable: a machine briefly runs out of CPU because the garbage collector does a transient run; the logfile chatters due to warnings about a system that no one cares about anymore; a temporary cache eats all the memory and then lets it go. This is the new normal.
Alerts are going off all the time, and there’s no easy way to figure out which alert corresponds to which failure. (Was the disk running out of space related to this user error, or was it caused by something else? Was the fact that the DNS couldn’t be pinged for a little while actually related to the network failures? Does it really matter that it took a half-dozen retries before a network connection could be established?)
The Misleading Promise of AIOps
AIOps, then, seems like a logical next step. Clustering algorithms could bring together spurious alerts, realizing that they have nothing in common, and remove them. Perhaps a sharp-eyed AI system could find correlates across these large and varied datasets: it could notice that user failures happen often when a particular process uses all available memory, and not at other times; or notice that certain user paths are linked to failures.
The promise of AIOps is threefold: an end to alert fatigue, a rapid process for narrowing down the causes of failures, and a way to detect and predict failures before they impact users. The problem with the vision is, very simply, that it is largely intractable for an artificial intelligence system. In fact, AI is precisely the very worst technology for this sort of problem.
That seems like a bold claim; let’s break it down.
AI Needs to Be Trained on Data
The fundamental challenge of any AI system is that it needs to be trained on data. A machine learning system trains to find a boundary between “good” and “bad” examples, and so it does best when it can get a mix from each class. In other words, to train a learning system, we want a population of successes and a population of failures; we can then learn what the boundary between those two groups is.
Unfortunately, that’s not how failing systems work. They tend to largely succeed, until they fail. That means there will be far fewer failure events than success events. Worse, the failures we do have as training examples should be useless: while any reasonably good ops team may not be great at anticipating future unforeseen problems, it can at least fix the last one so that it doesn’t recur. AIOps, by contrast, relies on known failure modes continuing to occur.
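To make the imbalance concrete, here is a small sketch with invented numbers: 10,000 deploys, of which 10 fail. A "model" that has learned nothing and always predicts success still scores 99.9% accuracy while catching none of the failures. The `evaluate` helper and the counts are assumptions chosen purely for illustration.

```python
def evaluate(labels, predictions):
    """Return accuracy and failure recall, where label 1 means 'failure'."""
    total = len(labels)
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    failures = sum(labels)
    caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    return {
        "accuracy": correct / total,
        "recall": caught / failures if failures else 0.0,
    }


# Hypothetical history: 10 failed deploys out of 10,000.
labels = [1] * 10 + [0] * 9990

# A degenerate "model" that always predicts success.
always_ok = [0] * len(labels)

result = evaluate(labels, always_ok)
print(result)
```

Accuracy alone rewards the model for ignoring the rare class, which is exactly the class an on-call engineer cares about.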
DevOps Lives in a World of Anomalies
One possible alternative to the training problem is “anomaly detection.” The concept behind anomaly detection is that the system can identify groups of data that are “normal,” and then find those that are outside that normal range. We can think of it as drawing a circle around the normal events, and then identifying events that fall outside the circle.
That can be challenging in many cases: it’s very difficult to draw identifying circles around “normal” behavior. We might hope that systems would behave consistently, and so anomalies would be the worrisome unusual cases. But innovation means that systems, and system behavior, are changing all the time. It’s far more likely that we’ll draw too small a circle, and identify a great deal of perfectly normal behavior as “anomalous”; or we’ll draw a circle that’s too large, and misidentify anomalies as normal behavior. (More likely than not, we’ll get both types of wrong.)
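The circle metaphor can be sketched almost literally. In the toy example below, each sample is a hand-labelled (CPU %, latency) pair, and a point counts as anomalous when it falls outside a circle of a given radius around a chosen center. The data points, the center, the mixed units, and the `score` helper are all invented for illustration; real detectors are more elaborate, but the trade-off is the same.

```python
import math


def outside_circle(point, center, radius):
    """A point is 'anomalous' if it lies farther than radius from the center."""
    return math.dist(point, center) > radius


# Hypothetical hand-labelled samples: (cpu %, latency in tens of ms).
normal = [(40, 20), (45, 22), (50, 25), (55, 30), (70, 35)]  # last one: a harmless GC spike
bad = [(60, 45), (90, 80)]
center = (50, 25)


def score(radius):
    """Return (false alarms on normal points, missed bad points) for a given radius."""
    false_alarms = sum(outside_circle(p, center, radius) for p in normal)
    missed = sum(not outside_circle(p, center, radius) for p in bad)
    return false_alarms, missed


for radius in (6, 15, 25):
    print(radius, score(radius))
```

With these made-up points, a small radius flags most of the normal behavior, a large radius misses a real problem, and no radius gets both counts to zero, because the harmless GC spike sits exactly as far from the center as one of the bad samples.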
Worse, trivial failures happen all the time — there are non-stop trivial problems in any production system, including those that are user-visible. An alerting system looking for anomalies, even if it improbably manages to draw the appropriate circles, will still fire repeatedly and noisily on anomalies that aren’t actually a problem. If an anomaly detector can’t find only the anomalies that matter, then we never really fix the noisy alerts problem.
Deployments Are Anomalies
It gets worse. Remember that the goal of anomaly detection is to find systems that are behaving unusually, but in user-invisible ways. If the failures were user-visible, we’d pick them up with normal alarms. In a world of continuous deployment, it’s hard to picture level-setting on normal — the entire point of a rapid deployment system is to constantly produce new and, therefore, anomalous behavior.
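A short sketch of why deploys defeat a fixed baseline: the detector below learns "normal" latency from samples taken before a deploy and flags anything more than three standard deviations away. The latency numbers, the threshold of 3, and the `flag_anomalies` helper are assumptions for this example. The post-deploy samples are imagined to be perfectly healthy; the new release simply does more work per request.

```python
import statistics


def flag_anomalies(baseline, samples, z_threshold=3.0):
    """Flag samples whose z-score against the baseline exceeds the threshold."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    return [abs(x - mean) / stdev > z_threshold for x in samples]


# Hypothetical request latencies in ms.
before_deploy = [98, 100, 102, 99, 101, 100, 97, 103]
after_deploy = [128, 131, 130, 129, 132]  # healthy, but the new release is slower

flags = flag_anomalies(before_deploy, after_deploy)
print(flags)
```

Every post-deploy sample is flagged, even though nothing is wrong. Re-learning the baseline after each deploy avoids that, but then a genuinely bad deploy becomes the new "normal" just as easily.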
Fixing Broken Builds Is an Anomaly
The fundamental processes of DevOps are precisely the opposite of what makes an AI system work. AI systems look for patterns that they can recognize and repeat. The problem is, once those patterns are found, DevOps teams try to squash them. This is the so-called “known unknowns” problem. The net effect is that DevOps teams end up working against their AI: every time it figures out what sorts of alerts are important enough to set off a pager, the DevOps team has the audacity to go fix the underlying issue. Worse yet, they then get to write a blameless post-mortem about what a great job they did of ruining the pattern! Talk about a perverse set of incentives.
Do You Trust Your AI?
As we try to figure out how to tune our learning model to distinguish between good and bad releases, we run headlong into the open research problem of “explainable AI.” Right now, research is progressing on how to help AI systems explain what decisions they made and why. This work is very hard because people have a lot of trouble making sense of neural net coefficients, or decision forest weighting parameters.
It’s hard to picture an Ops team trusting a system reassuringly saying “everything’s fine, don’t worry” when they think they see things going wrong — or, worse, running around for the fire-drill of an AI system that has found an anomaly that doesn’t actually matter.
To fix that problem, they’ll likely want to backstop the AI system with another layer of alarms. Can the AI truly figure out when user experience is degraded, or does the Ops team need to monitor what it comes up with? Can a vigilant Ops team feel secure that the AI isn’t removing alarms that actually matter and promoting ones that don’t?
Modeling AI with Grey Matter
Again, we need to remember that AI technology isn’t magic. AI can only help if there really are discernible patterns, some form of signal in the data, and if we think that the AI system can be successfully trained to figure out what’s “interesting” and what isn’t. It’s not obviously true that any of this is the case: there are many, many signals coming in, and very few of them will be relevant.
Could a human look at a few thousand noisy alarms and alerts and, with enough patience, weed out the interesting ones? If there really isn’t a signal to be found, then it’s unlikely an AI will divine it. That’s a problem, because humans have already been going through these signals, and making sense of wobbly, inconsistent signals takes more than a little divination. An oracle asked to interpret your tea leaves is still an oracle, even if it’s OracleBot 9000.
Alternatives to AIOps
The fundamental concept of AIOps is that an algorithm can examine a large number of noisy and irrelevant alerts to divine which few are human-relevant and which are not. But I hope we can start seeing the flaws in that assumption.
What would be a better alternative? In Through the Looking-Glass, the White Knight sings of this problem: “But I was thinking of a plan / To dye one’s whiskers green, / And always use so large a fan / That they could not be seen.”
What we should do, of course, is alert on what matters. When there are too many alerts (ones that aren’t actionable, that are flappy or noisy, or that aren’t directly aligned with user pain), the signal gets lost in the noise. But AIOps encourages users to abandon these good habits. Its pitch is the opposite: send over that flood of noisy, irrelevant signals and AIOps will handle the rest. In doing so, AIOps inadvertently causes an explosion of the very problems it claims to help with.
A more effective approach is linking your alerts to instances where user experience is degraded. Preferably, build an error budget, and keep track of how bad user experience is getting via SLOs. When user experience begins to degrade, use observability tools that provide traces, rich events, and other helpful bits to find out what’s gone wrong and why.
A sure-fire fix for ending alert-fatigue is to only alert on what matters. Focusing on user experience for alerting, by definition, means every alert is worth investigating because it’s a statement about a problem that a user might see.
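The error-budget idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the 99.9% SLO target, the request counts, the rule of paging once half the budget is spent, and the `error_budget` and `should_page` helpers are all invented for this example rather than taken from any specific SLO tool.

```python
def error_budget(slo_target, total_requests, bad_requests):
    """Compute how much of the error budget a window of traffic has consumed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the number of bad requests the SLO tolerates in this window.
    allowed_bad = (1.0 - slo_target) * total_requests
    return {
        "allowed_bad": allowed_bad,
        "consumed": bad_requests / allowed_bad,
        "remaining": allowed_bad - bad_requests,
    }


def should_page(budget, burn_threshold=0.5):
    """Assumed policy: page a human once half of the budget has been spent."""
    return budget["consumed"] >= burn_threshold


# Hypothetical windows of one million requests against a 99.9% SLO.
quiet = error_budget(0.999, 1_000_000, 250)
rough = error_budget(0.999, 1_000_000, 800)

print(should_page(quiet), should_page(rough))
```

The alert condition here is expressed entirely in terms of requests that went badly for users, so when it fires there is, by construction, user pain to investigate. Disk, CPU, and memory readings stay available for diagnosis without being alert sources themselves.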
But That’s the Data I Have
Perhaps, though, it’s impossible to alert on user performance. For example, if your data system collects only time series of performance counters, then it can be hard to correlate that to user experience. But that’s not a problem for AIOps to solve; it’s a problem that calls for the right tool.
Imagine going to a doctor, worried that you might have broken a bone, and the doctor plugs you into an EEG and an ECG before watching you walk around the office. By fusing those sensor readings together, they might recognize that when you step on your left leg, your heart rate spikes, which is consistent with a broken bone. But wouldn’t it be more accurate and efficient to use an X-ray?
The same is true for AIOps. Using an AI system over the wrong signals might have a chance of finding the problems in your system — but proper observability would make a lot more sense.
Is AI Ready for Ops?
Simply put, no. The broader term AIOps confuses outcome and mechanism. Until AIOps comes packaged in the gold body of a 3PO-series protocol droid, the goal shouldn’t be to get AI — it should be to solve the underlying problems. AIOps promises to ensure that alerts are actionable and based on user needs, to monitor the quality of deployments to ensure they are robust and secure, and to be able to flexibly move between different perspectives on data.
None of these things, however, require “AI.” The intended goals of AIOps can be achieved by building alerts on the aspects of your system that matter to your users, by tracking how your deployments do against those user-relevant metrics, and by using tooling that lets you flexibly pivot from these user-visible effects to the rich explanatory data behind them. Find the signal in the noise both by reducing the noise and by using tools that boost your signal.
Full disclosure: The author works for Honeycomb.io, which produces an Observability tool. It doesn’t do AIOps.