New Relic AI: An Antidote to Alert Fatigue
New Relic sponsored this post, which was authored independently by The New Stack.
At its FutureStack New York conference last month, New Relic launched New Relic AI, which stems in part from the company’s acquisition of SignifAI earlier this year. SignifAI offered a superset of technologies: it worked with Prometheus, OpenShift and about 60 monitoring tools. The New Relic service incorporates SignifAI technology. It is now in beta and is slated for general availability in early 2020. New Relic is currently integrating the UI capabilities into the existing backend that the SignifAI team developed.
The New Relic AI service is meant for site reliability engineers (SREs), DevOps and on-call teams, said Guy Fighel, chief technology officer of SignifAI, who is now leading the New Relic AI effort. New Relic AI tailors the different SignifAI technologies into a suite of solutions, mostly re-architected, that provide deeper anomaly detection and predictive analytics on top of the time-series data that New Relic already has, such as application performance monitoring (APM) data, Fighel said.
From the beginning, Fighel and his team recognized an acute problem facing people responsible for monitoring IT environments: alert fatigue.
By the time New Relic acquired SignifAI earlier this year, the service offered:
- Chewie, an API to manage monitoring platforms with the stated intent of reducing noise in a team’s incident management report.
- Integration with the Cloud Native Computing Foundation’s Prometheus and Red Hat’s OpenShift for monitoring visibility and correlations of alerts and metrics to relevant logs and events.
- SignifAI Decisions — a correlation engine for SRE and DevOps teams that uses correlations to provide insights into production systems.
New Relic AI uses SignifAI technology to plug into incident management platforms, Fighel said. It combines information from inside New Relic One with data from the third-party services that SignifAI consumes using its previously developed technologies. With New Relic AI, users can work with that information in their existing incident management platform as well as in New Relic One.
New Relic One represents how the company views its future as what it calls an observability platform. The vision is to deliver context through its user interface, manage logs in one place and offer programmability on the platform. At FutureStack, New Relic unveiled its overall view of observability, which it sees as a way to deepen its reach into the enterprise and to offer advanced tooling such as New Relic AI.
An SRE, for example, tells the AI service which sources to start analyzing. For New Relic alerts, that is a matter of checking the policies. APIs from incident management platforms, or pre-authenticated APIs from other services, can be integrated to sync automatically with the service and start sending data to the New Relic API. Users get more context, and the pages they receive are enriched and streamlined.
In terms of signals, every single type of data that is collected gets classified in New Relic AI. The platform looks at the historic SRE golden signals such as latency, saturation, error rates and availability. The subcomponents are also classified to help determine what is potentially causing the incident storm.
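To make the classification step concrete, here is a minimal Python sketch of sorting incoming alert descriptions into the signal categories named above. The keyword map and the `classify_signal` function are hypothetical illustrations; the article does not describe how New Relic AI actually classifies data, and a production system would presumably rely on learned models rather than keyword matching.

```python
# Hypothetical keyword map: an assumption for illustration only,
# not New Relic AI's actual classification logic.
SIGNAL_KEYWORDS = {
    "latency": ("latency", "response time", "slow"),
    "saturation": ("cpu", "memory", "disk", "queue"),
    "errors": ("error", "exception", "5xx"),
    "availability": ("down", "unreachable", "health check"),
}


def classify_signal(description: str) -> str:
    """Return the first signal category whose keywords appear in the text."""
    text = description.lower()
    for signal, keywords in SIGNAL_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return signal
    return "unclassified"


if __name__ == "__main__":
    for message in ("CPU over 80 percent on web-1", "Host unreachable"):
        print(message, "->", classify_signal(message))
```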
All of this data is then correlated to reduce the amount of noise, and the resulting incident is enriched with the probable root cause. For example, the service looks at which components are affected. Other information is then added that shows an anomaly, such as a spike in a specific metric. The user is sent one specific incident with all of the information or insight needed to provide context.
GE Digital is using New Relic AI to find ways to reduce the noise from too many alerts, said Boris Grinberg, a monitoring product leader and DevOps manager of global monitoring at GE Digital, in an interview at FutureStack. Grinberg is satisfied with the results so far. He’d like to see the New Relic system gain the ability to make recommendations, especially for New Relic users who have less experience. It’s more effective to scale the technology than the people, he said. Recommendations would be especially helpful in continuing to reduce the alerts that people now receive and that can take a lot of time to resolve. Some of those alerts may turn out not to be issues at all once more qualified engineers analyze them.
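As a rough illustration of how many alerts can collapse into one incident, the following Python sketch groups alerts that hit the same component within a short time window. The `Alert` and `Incident` types, the `correlate` function and the five-minute default window are all assumptions made for this example; the article does not say which grouping rules New Relic AI applies.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    component: str    # affected component (hypothetical field)
    message: str


@dataclass
class Incident:
    component: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300.0):
    """Group alerts on the same component that arrive within
    window_seconds of the previous alert in that group."""
    incidents = []
    open_by_component = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = open_by_component.get(alert.component)
        too_old = (
            incident is not None
            and alert.timestamp - incident.alerts[-1].timestamp > window_seconds
        )
        if incident is None or too_old:
            # Start a new incident for this component.
            incident = Incident(component=alert.component)
            incidents.append(incident)
            open_by_component[alert.component] = incident
        incident.alerts.append(alert)
    return incidents
```

In this sketch, four raw alerts spread across two components and two bursts of activity would reach the on-call engineer as three incidents rather than four pages, and the reduction grows with the size of the alert storm.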
“Noise pollution is a killer for productivity,” Grinberg said. “This is the number one challenge enterprises are going through.”
Machine Learning Means Context
Context in New Relic AI is based upon machine learning, as seen, for example, in dynamic titles. Incidents have historically received static titles, with each incident getting a different name. A CPU may be over 80 percent, or a connection to a server or a container may fail, and each of those incidents would receive its own title. Now that problems are aggregated, the system crafts the title dynamically.
In New Relic AI, the incidents are dynamically classified to include the context. The classification is part of the meta capability of the system. New Relic AI uses an expert system to decide titles and evaluate different signals. Inside the title, the system auto-populates the most relevant entities that are involved in a particular incident.
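The idea of a dynamic title can be sketched in a few lines of Python. This example builds a title from the most common condition and the most frequently affected entities in an aggregated incident. The `dynamic_title` function and its frequency heuristic are assumptions for illustration; the article says New Relic AI uses an expert system for this, and its rules are not described.

```python
from collections import Counter


def dynamic_title(alerts, max_entities=2):
    """Build one title for an aggregated incident.

    alerts is a list of (entity, condition) tuples. The title names the
    most common condition and the most frequently affected entities.
    """
    if not alerts:
        return "No alerts"
    entity_counts = Counter(entity for entity, _ in alerts)
    condition_counts = Counter(condition for _, condition in alerts)
    top_condition = condition_counts.most_common(1)[0][0]
    top_entities = [e for e, _ in entity_counts.most_common(max_entities)]
    title = f"{top_condition} on {', '.join(top_entities)}"
    remaining = len(entity_counts) - len(top_entities)
    if remaining > 0:
        title += f" (+{remaining} more)"
    return title


if __name__ == "__main__":
    print(dynamic_title([
        ("web-1", "High CPU"),
        ("web-1", "High CPU"),
        ("web-2", "High CPU"),
        ("db-1", "Connection failure"),
    ]))
```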
“How do you correlate?” Fighel asked. “Do you correlate based on time? Do you correlate based on similarity? Do you base it on different classifiers? So we are applying multiple different techniques. We have an algorithm to choose what is the best technique to automatically correlate. Later, before we go GA, we will expose all of those capabilities to the users. So as a user, you can actually craft your own logic.”
Crafting logic will mean the user may actually add to the New Relic AI correlation engine. It will adapt according to the inputs of the user. Through its dropdown UI, the user may define criteria such as CPU, network connectivity and frequency or period of time. The engine will then take the results and update the model based upon the inputs.
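A user-defined correlation rule of this kind might look something like the Python sketch below, which combines a signal type, a period of time and a frequency threshold. The `CorrelationRule` type and `rule_matches` function are hypothetical; the feature had not been released at the time of writing, and the sketch covers only the matching of criteria, not how the engine would update its model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrelationRule:
    signal: str            # e.g. "cpu" or "network" (hypothetical labels)
    window_seconds: float  # period of time to look across
    min_occurrences: int   # frequency threshold within that period


def rule_matches(rule, events):
    """Return True if at least rule.min_occurrences events of
    rule.signal fall inside any span of rule.window_seconds.

    events is a list of (timestamp, signal) tuples.
    """
    times = sorted(t for t, signal in events if signal == rule.signal)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it fits the time limit.
        while times[end] - times[start] > rule.window_seconds:
            start += 1
        if end - start + 1 >= rule.min_occurrences:
            return True
    return False
```

A rule such as `CorrelationRule("cpu", 60, 3)` would then fire only when three CPU events land within the same minute, which is one way the dropdown criteria described above could translate into logic.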
New Relic is opening up its AI platform to allow users to add their inputs. At FutureStack, the company cited how customers are seeing an 80 percent drop in noise from alerts.
The New Relic AI story relates to trends in AI such as research into natural language processing (NLP). AIOps is about the data and how it can be used to better optimize operations, and that requires NLP.
New Relic’s peers in the market include such companies as BigPanda, Splunk and Sumo Logic. There are several others in the space, each following the core tenets that call for analyzing lots of data in a manner that provides operations with less noise and more efficiency. New Relic looks at AIOps through the lens of the user, in particular the roles that SREs play and the machine data that is coming in such torrents.
The Cloud Native Computing Foundation and Red Hat are sponsors of The New Stack.
Feature image: Guy Fighel.