Picture this: You’re sitting in front of a bank of monitors, watching dashboards from eight different systems. You allow yourself a smile because everything is green, across all systems. And Bam! The production database goes down.
The familiar crisis routine kicks in, and after several hours of troubleshooting and angry clients, managers and CEOs, you discover that a recent update to PagerDuty is not completely compatible with a slightly older version of NewRelic, which is scheduled to be upgraded next week. You want to write up a postmortem with the fix you so painstakingly discovered, but a new crisis takes your attention.
There has to be a better way, thought Guy Fighel, chief technology officer and co-founder of SignifAI. A year ago, he gathered a team of TechOps professionals who had tired of managing multiple systems across multiple monitors, each system with its own alerts and tracking different things, and of spending most of their time putting out fires.
There’s so much they wanted to do to improve the system, wrote Fighel in a blog post, but they could never get there because they were constantly in crisis mode.
As Captain Picard says, if you’re on red alert every day, then red alert means nothing.
Fighel looked at the rise of the cloud and machine learning and thought, ‘What if we create a monitoring system on top of all these existing monitoring systems? What if this system ties all of them together into one huge data set so we can track downtime across systems and collate the event data, the log data and the metric data to give us ways to predict outages? And then use machine learning and machine intelligence to capture and store the postmortem information?’
Fighel and SignifAI CEO JP Marcos cherry-picked a team of TechOps engineers, all of whom had experience with monitoring in complex stack environments, and began to build the system they wanted. The goal was to free their time from the drudgery of monitoring systems and the nightmare of crisis management, on-call nights and weekends. They wanted to be able to do fun stuff like fine-tune the environments or make them run like a super-tuned sports car. And maybe keep up with all the system updates on a sane schedule.
“We essentially built a tool for ourselves,” said Marcos. “How can I cut through the data to what’s important? How can I reduce MTTR [mean time to repair], how can I understand what happened right away in a language I can understand, and once I figure out what it is and fix it, how can I capture that knowledge in a streamlined way in a direct language that I will actually do.”
The first rule, said Marcos, was to make it easy. Abstract away as much of the complexity of running multiple systems as possible. The second rule was to make it easy. Capture post-mortems in four screens. The last rule was to make it easy. Run all of the systems off one screen, showing only the most high-level data, with drill-downs into individual monitoring systems. And so it is.
Setup in 20 Minutes
SignifAI pulls data from IT assets through their APIs. The initial setup is wizard-driven, a simple set of steps where the customer selects which systems they want to add. The entire process, including the data download, takes about 20 minutes, the company claims.
The software currently supports more than 60 integrations, including Slack, NewRelic, PagerDuty, GitHub, AppDynamics, Amazon Web Services and Datadog.
There are two ways the system connects to the data, Marcos explained. One function, the Active Inspector, looks inside the monitoring data from all the systems; another, called Web Collector, listens to alerts.
The control center aggregates alerts from all the connected systems into ‘issue cards,’ which are sorted by criticality, source or type of alert, depending on how the system is set up. There are two types of alerts: blue, which is an issue happening right now, and yellow, which is an ‘insight’ (e.g., you will run out of memory in x container in five days).
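As a rough illustration of the issue-card idea, here is a minimal Python sketch. SignifAI's implementation is proprietary, so every name here (`IssueCard`, `CardKind`, `sort_cards`, the numeric `criticality` scale and the sample cards) is a hypothetical stand-in, not the product's data model.

```python
from dataclasses import dataclass
from enum import Enum


class CardKind(Enum):
    ISSUE = "blue"      # something happening right now
    INSIGHT = "yellow"  # a prediction, e.g. memory running out in five days


@dataclass
class IssueCard:
    source: str       # hypothetical: monitoring system that raised the alert
    kind: CardKind
    criticality: int  # hypothetical scale: higher means more urgent
    summary: str


def sort_cards(cards, key="criticality"):
    """Order cards for display by criticality, source, or kind of alert."""
    if key == "criticality":
        return sorted(cards, key=lambda c: c.criticality, reverse=True)
    if key == "source":
        return sorted(cards, key=lambda c: c.source)
    if key == "kind":
        return sorted(cards, key=lambda c: c.kind.value)
    raise ValueError(f"unknown sort key: {key}")


# Illustrative cards only; not real alerts.
cards = [
    IssueCard("Datadog", CardKind.INSIGHT, 2, "container x out of memory in five days"),
    IssueCard("PagerDuty", CardKind.ISSUE, 5, "production database unreachable"),
    IssueCard("NewRelic", CardKind.ISSUE, 3, "elevated error rate"),
]
ordered = sort_cards(cards)  # most critical card first
```

Calling `sort_cards(cards, key="source")` or `key="kind"` gives the other two orderings the article mentions.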
The control center shows only very high-level data, which cuts out the noise generated in other systems by too much information. If you are interested in specifics, drilling down is a click away. This leaves the control center screen clean and easy to follow.
Meet SAM, Your New Team Member
One of the advantages of the product, said Marcos, is SAM, which stands for SignifAI Augmented Member. Marcos sees the automated artificial intelligence as a team member for monitoring groups.
“Machines have perfect memory,” said Marcos. “A year from now your engineers might not remember exactly what happened, but the computer does.”
SignifAI is constantly correlating data behind the scenes to look for problems that potentially threaten uptime. SAM correlates and applies predictive algorithms to huge volumes of log, event and metric data from every component across all systems in real time.
Machine learning enables fast root cause analysis and provides answers and insights that would take a team of human engineers much, much longer to reach, Marcos said: a few seconds instead of days.
How SignifAI Works
By breaking down the data silos across your company, SAM intelligently looks across and down into the stack by transforming time series, event and log data into a uniform data set. It then applies a series of analytics following a logical framework that is informed by, and closely resembles, that of a human expert, said Marcos.
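To make the "uniform data set" idea concrete, here is a minimal Python sketch of normalizing a metric sample, an event and a log line into one record type. The `Record` schema, the three `from_*` helpers, the input formats and the sample values are all assumptions made for illustration; the article does not describe SignifAI's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical common shape for metrics, events and logs."""
    timestamp: datetime
    source: str
    kind: str               # "metric", "event" or "log"
    name: str
    value: Optional[float]  # only metrics carry a number
    text: str


def from_metric(source, name, epoch_seconds, value):
    # Assumes metrics arrive as (name, Unix timestamp, number).
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return Record(ts, source, "metric", name, float(value), "")


def from_event(source, name, iso_time, text=""):
    # Assumes events carry an ISO 8601 timestamp with a UTC offset.
    return Record(datetime.fromisoformat(iso_time), source, "event", name, None, text)


def from_log(source, line):
    # Assumes log lines look like "<ISO timestamp> <LEVEL> <message>".
    stamp, level, message = line.split(" ", 2)
    return Record(datetime.fromisoformat(stamp), source, "log", level, None, message)


# Illustrative inputs only. Once normalized, all three can share one timeline.
records = sorted(
    [
        from_metric("Datadog", "memory.used_pct", 1496318400, 91.5),
        from_event("GitHub", "deploy", "2017-06-01T11:55:00+00:00", "release pushed"),
        from_log("app-server", "2017-06-01T12:01:30+00:00 ERROR connection refused"),
    ],
    key=lambda r: r.timestamp,
)
```

With everything in one shape and on one clock, the deploy, the memory spike and the error can be read as a single sequence rather than three separate dashboards.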
When companies set up alerts, they are looking for known information, he explained. For example, when you track server memory usage, that’s considered “known data” in SignifAI because you know what you are looking for.
What SignifAI does is look for the unknown information: the information you can’t know because you don’t know it’s there. By looking at all the data available and capturing data as you go, the computer can look across and into your full stack and mine correlations you didn’t know existed, such as the correlation between an upgrade in one system and breakage in another.
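One very simple way to surface that kind of cross-system link is to count how often a change in one system is followed shortly by a failure in another. The Python sketch below does only that; it is a naive temporal co-occurrence count with made-up sample data and a hypothetical `co_occurrences` helper, not SignifAI's proprietary algorithm, and co-occurrence alone does not prove cause.

```python
from collections import Counter
from datetime import datetime, timedelta


def co_occurrences(changes, failures, window=timedelta(hours=1)):
    """Pair each change with failures in *other* systems within `window` after it.

    `changes` and `failures` are lists of (system_name, datetime) tuples.
    """
    pairs = []
    for changed_system, changed_at in changes:
        for failed_system, failed_at in failures:
            delay = failed_at - changed_at
            if failed_system != changed_system and timedelta(0) <= delay <= window:
                pairs.append((changed_system, failed_system))
    return pairs


# Illustrative data only, loosely following the article's opening scenario.
changes = [("PagerDuty", datetime(2017, 6, 1, 9, 0))]
failures = [
    ("NewRelic", datetime(2017, 6, 1, 9, 20)),   # 20 minutes after the change
    ("NewRelic", datetime(2017, 6, 1, 15, 0)),   # outside the one-hour window
    ("PagerDuty", datetime(2017, 6, 1, 9, 5)),   # same system, so ignored
]
counts = Counter(co_occurrences(changes, failures))
```

Pairs that keep recurring across many changes are candidates for a closer look by a person or a more careful statistical test.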
Applying machine learning to metrics, users’ behaviors, feedback and human experts allows SAM to continuously adapt and improve its results. Throughout this process, said Marcos, it also captures the team’s knowledge and makes it automatically available when the content is needed. “This allows a TechOps team to get fast access to accurate answers, predictive insights and leverage a growing knowledge base in order to faster address and prevent issues affecting system uptime,” he said.
Capturing postmortem data has traditionally been hard in a TechOps environment because teams are usually moving from one crisis to the next. Marcos said SignifAI makes this easy as well. When SAM surfaces an issue, capturing the resolution is a quick four-step process.
On the “Issue Card,” there is a button called “I fixed it.” This opens a series of four windows. The first asks, “What was the root cause?” You click on one of a fixed list of common root causes. The next window asks what systems were affected. The third window asks if any scripts were used. The last window is free text, to capture in natural language any context or things you will need to know if those conditions happen again.
All of these conditions are combined, and if they are met again in the future, the software surfaces the information automatically, said Marcos. This is where the machine’s perfect knowledge and memory come in handy. SAM will create an issue card and alert the user when similar problems occur, along with suggestions on how to address the issue.
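The four-window capture and later recall can be pictured with a small Python sketch. Everything in it is hypothetical: the `ROOT_CAUSES` list, the `Resolution` and `KnowledgeBase` names, and the matching rule (a past fix is suggested when all the systems it affected show up again) are assumptions for illustration, since the article does not say how SignifAI matches conditions.

```python
from dataclasses import dataclass

# Hypothetical fixed list shown in the first window.
ROOT_CAUSES = ("incompatible upgrade", "resource exhaustion", "configuration change")


@dataclass
class Resolution:
    root_cause: str              # window 1: picked from ROOT_CAUSES
    affected_systems: frozenset  # window 2: which systems were affected
    scripts: tuple               # window 3: any scripts used in the fix
    notes: str                   # window 4: free-text context


class KnowledgeBase:
    def __init__(self):
        self._resolutions = []

    def record(self, resolution):
        """Store a resolution captured through the four windows."""
        if resolution.root_cause not in ROOT_CAUSES:
            raise ValueError(f"unknown root cause: {resolution.root_cause}")
        self._resolutions.append(resolution)

    def suggest(self, systems_in_alert):
        """Return past resolutions whose affected systems all appear in the new alert."""
        current = set(systems_in_alert)
        return [r for r in self._resolutions if r.affected_systems <= current]


# Illustrative entry based on the article's opening scenario.
kb = KnowledgeBase()
kb.record(Resolution(
    root_cause="incompatible upgrade",
    affected_systems=frozenset({"PagerDuty", "NewRelic"}),
    scripts=(),
    notes="PagerDuty update clashed with an older NewRelic version.",
))
suggestions = kb.suggest(["PagerDuty", "NewRelic", "Slack"])
```

A production system would presumably match on far richer conditions than system names, but the shape is the same: structured answers at fix time, looked up automatically when similar conditions return.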
The system is constantly learning. “We learn from your feedback. We learn from all the text that is captured from Slack, from PagerDuty, from all the systems. When a ticket is open, and when a ticket is closed, we learn,” Marcos said.
The learning will be happening across clients as SignifAI expands. If, using the first example, the machine learns that the PagerDuty upgrade has an impact on older versions of NewRelic, SignifAI can look for those conditions across all companies using PagerDuty and NewRelic and send alerts to those companies as well. “That’s how you get the benefits of acceleration,” Marcos said. “We call it Institutionalize Knowledge Transfer.” And that’s essentially another way of learning.
Marcos said since SignifAI is only interested in real-time events, it focuses only on metadata. Raw data is dumped after 30 days. The libraries and
SignifAI runs as a set of microservices in Kubernetes, but Marcos was coy about divulging more information about the closed-stack system because of the proprietary algorithms.
“A big benefit of machine intelligence is that it has perfect memory and can match the cause and resolution of an issue from the past with an issue happening right now,” said Chris Amen-Kroeger, head of ad engineering at Pinterest. As a result, the software helps make “the transition from a team that responds to alarms to one that proactively resolves issues.”