New Relic Takes on Configuration Errors with Infrastructure Monitoring
A misconfigured software release was blamed for a four-hour outage at the New York Stock Exchange last July, just one example of a troubling rise in configuration-related incidents, according to Jim Stoneham, New Relic vice president of product.
The company just unveiled New Relic Infrastructure to provide real-time visibility into the IT stack and its operations. It’s the product of the company’s acquisition last November of Opsmatic, a San Francisco-based startup focused on monitoring critical configuration changes.
“If you look at traditional server monitoring tools, they’re capturing metrics, CPUs, the disk, the network — you can see the health of what’s going on, but you’re not really seeing below the surface. You’re not seeing what’s changing, what’s a time bomb waiting to go off,” said Stoneham.
Customers have tried to use configuration management tools for this, but that requires describing everything on the host to make sure the whole surface area is covered, and people often neglect to include OS versions, kernel settings and other details, he said.
“People have also tried log analytics tools, but there, you have to tell the correct logs, and when you do a query it can take minutes to get results back. So when you’re building a picture of what’s going on, it generally takes more time than you have when you’re suffering from an outage. With NYSE, you’re dealing with hundreds of millions of dollars in transactions,” he said.
New Relic Infrastructure places a lightweight software agent on every host and immediately begins gathering data that can affect that instance in the cloud or data center, including software packages, user sessions, processes and kernel settings. It detects changes while still monitoring metrics such as CPU, disk and network to watch how that host performs and how it changes. It fits into an end-to-end suite of services including app performance monitoring and server monitoring. All the data goes into New Relic Insights, the company’s cloud-based data platform and analytics engine.
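To make the idea of change detection concrete, here is a minimal sketch of how two inventory snapshots of a host could be compared to produce change events. It is an illustration only, not New Relic's agent or API: the `Snapshot` class, the `diff_snapshots` function and all sample package names and version strings are assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical point-in-time inventory of a single host."""
    packages: dict = field(default_factory=dict)         # package name -> version
    kernel_settings: dict = field(default_factory=dict)  # setting key -> value


def diff_section(section, old, new):
    """Return one event per key that was added, removed or changed."""
    events = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before == after:
            continue  # nothing changed for this key
        if before is None:
            kind = "added"
        elif after is None:
            kind = "removed"
        else:
            kind = "changed"
        events.append({"section": section, "key": key, "kind": kind,
                       "before": before, "after": after})
    return events


def diff_snapshots(old, new):
    """Compare every inventory section of two snapshots of the same host."""
    return (diff_section("packages", old.packages, new.packages)
            + diff_section("kernel_settings", old.kernel_settings,
                           new.kernel_settings))


# Made-up sample data: one package upgraded, one removed, one added,
# and one kernel setting changed.
before = Snapshot(packages={"openssl": "1.0", "nginx": "2.0"},
                  kernel_settings={"vm.swappiness": "60"})
after = Snapshot(packages={"openssl": "1.1", "curl": "3.0"},
                 kernel_settings={"vm.swappiness": "10"})

for event in diff_snapshots(before, after):
    print(event)
```

A real agent would also need to timestamp each event and ship it to a central store so it can be lined up against CPU, disk and network metrics; the sketch covers only the comparison step.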
SignalFx software engineer Rajesh Raman recently spoke with The New Stack’s Alex Williams about the importance of analytics in monitoring. “People who are monitoring and running infrastructures need to have insight into that infrastructure, and how their applications are running,” Raman said.
A recent survey by monitoring vendor SevOne found that traditional monitoring tools aren’t cutting it for most IT organizations — only 11 percent of the 322 IT execs polled were satisfied with their existing tools. They said they don’t completely trust the data (84 percent), aren’t satisfied with scalability (86 percent) and feel their tools don’t support their strategic initiatives, such as hybrid cloud, IoT, and software-defined networking.
The use cases for New Relic Infrastructure include:
- Triage for any combination of cloud, containers, or traditional servers.
- Avoiding downtime and incidents, providing faster mean time to detection (MTTD) and resolution (MTTR) with dashboards and alerting that are driven by tags and metadata from cloud, automation tools or custom attributes. It provides native monitoring for AWS EC2 and Docker tags.
- Zero-day response: Its search tools help find vulnerable software quickly, rather than in hours or days. Users can search events by location or type across their infrastructure.
- Resource optimization: It provides insight into whether you should add or remove nodes in a cluster to save money.
- Visibility into team actions: By identifying every change to a system and who made it, the system “shines a very bright light” on every action taken on that environment.
“We have dashboards for each cluster or host, but you can also pull out a quick at-a-glance view of every cluster. So if you’ve got tens of thousands of hosts across hundreds of clusters, you can see what’s working and what’s not,” Stoneham said.
“I think it’s important to allow people to see their infrastructure in the way it’s actually laid out in their data centers,” said Stoneham. “I’ve got a web tier, I’ve got an east and a west coast data center, I’ve got a database I’m running in three clusters and the product is designed to very quickly define those clusters using tags. And we take in tags from systems like AWS or automated systems like Puppet and Chef and allow them to very quickly build a view of their world. This is usually seen on a big monitor in an ops bullpen because it provides an at-a-glance view of the health of where the system is.”
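The tag-driven cluster view Stoneham describes can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the product's data model: the host records, the `role` tag key, the `healthy` flag and both function names are hypothetical.

```python
from collections import defaultdict


def group_by_tag(hosts, tag_key):
    """Group host records into clusters by the value of one tag."""
    clusters = defaultdict(list)
    for host in hosts:
        # Hosts that lack the tag land in a catch-all bucket.
        clusters[host["tags"].get(tag_key, "untagged")].append(host)
    return dict(clusters)


def cluster_summary(hosts, tag_key):
    """Build an at-a-glance health count for each cluster."""
    summary = {}
    for name, members in group_by_tag(hosts, tag_key).items():
        healthy = sum(1 for h in members if h["healthy"])
        summary[name] = {"hosts": len(members),
                         "healthy": healthy,
                         "unhealthy": len(members) - healthy}
    return summary


# Made-up hosts; in practice tags might come from a cloud provider or from
# automation tools such as Puppet or Chef.
hosts = [
    {"name": "web-1", "tags": {"role": "web", "datacenter": "east"}, "healthy": True},
    {"name": "web-2", "tags": {"role": "web", "datacenter": "west"}, "healthy": False},
    {"name": "db-1", "tags": {"role": "db", "datacenter": "east"}, "healthy": True},
    {"name": "misc-1", "tags": {}, "healthy": True},
]

summary = cluster_summary(hosts, "role")
for cluster, counts in summary.items():
    print(cluster, counts)
```

Grouping by `datacenter` instead of `role` would give the east and west coast view from the same records, which is the appeal of defining clusters by tag rather than by fixed host lists.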
Customers have said that in the past when they had a zero-day vulnerability, they had to stop what they were doing and spend days checking versions running across all their hosts across their hybrid infrastructure.
“It should be as simple as doing a quick search. When you have this live view of what’s running across all your hosts, it’s very easy to look at the data and answer questions like this,” he said.
If there’s a zero-day vulnerability, such as an OpenSSL problem, a search might show 12 hosts running one version of OpenSSL and eight running another.
“From here, you can grab the host list and go patch to the new version,” he said.
This search capability also can be used to find errant services or things that have been deployed that you’re worried about. It’s very much like a search engine for your infrastructure.
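The OpenSSL search can be approximated with a short sketch, assuming a live inventory is available as a mapping from host name to installed packages. The inventory layout, the `hosts_by_version` function and the placeholder version strings are assumptions for illustration; the product's actual query interface is not described here.

```python
from collections import defaultdict


def hosts_by_version(inventory, package):
    """Map each installed version of a package to the hosts running it."""
    result = defaultdict(list)
    for host, packages in inventory.items():
        version = packages.get(package)
        if version is not None:  # skip hosts without the package
            result[version].append(host)
    return {version: sorted(names) for version, names in result.items()}


# Made-up inventory mirroring the example: 12 hosts on one OpenSSL version,
# eight on another, plus one host that does not have OpenSSL installed.
inventory = {f"host-{i:02d}": {"openssl": "version-a"} for i in range(12)}
inventory.update({f"host-{i:02d}": {"openssl": "version-b"} for i in range(12, 20)})
inventory["host-20"] = {"nginx": "version-x"}

found = hosts_by_version(inventory, "openssl")
for version, names in sorted(found.items()):
    print(version, len(names), "hosts")

# The host list to patch is whichever version turns out to be vulnerable.
to_patch = found["version-a"]
```

The same function covers the errant-service case: pass a different package name and any host that appears in the result is running it.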
As teams become more collaborative, the insight into team actions can be valuable. If there’s a problem and someone’s made various changes recently, you know who to talk to about it.
“It’s not meant to be punitive, but just to make sure everybody understands that it’s visible and allows everyone to be on the same page about what’s happened recently,” Stoneham said. “I find it kind of improves the culture and behavior.”
New Relic is a sponsor of The New Stack.