Loom Systems Adds a Human Touch to AI for Root Cause Analysis
Mobile advertising platform Taptica used multiple monitoring solutions for its proprietary data-driven mobile ad targeting system. Using predefined thresholds, its operations staff constantly defined new alerts and modified its dashboards to stay ahead of production issues.
Yet as the company grew, it needed to reduce expenses and resolve issues faster. Enter Loom Systems, a company designed to automate root cause analysis using artificial intelligence on log data and provide recommendations for resolution.
Loom Systems, founded in 2015, is another company whose founders previously worked with the Israel Defense Forces. It has its headquarters in San Francisco and engineering in Israel.
Its product is another approach to applying AI to log analysis, one that augments the analysis with a knowledge base of crowdsourced data so it can respond the way humans do. It offers Big Data analysis but does not require its users to be data scientists, according to CEO Gabby Menachem.
Of course, the problem is sifting through all the data generated by log monitoring systems. Midsize- to enterprise-level businesses don’t want to hire math majors or data scientists to sift through data on operations; they just want their current staff to be better at using advanced analytics tools, Menachem said.
“With no math knowledge or AI knowledge or tweaking data models, you can enjoy advanced analytics and come up with issues that might affect your business or your users and solve it in minutes,” he said.
The solution does not require any data preparation for ingestion, manual configuration of thresholds or manual recalibration. All data is dynamically aggregated and correlated in real time to detect hidden and emerging issues between applications and services.
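To make the contrast with predefined thresholds concrete, here is a minimal sketch of a baseline that is learned from the data itself. It is not Loom's algorithm, which the company has not published in this detail; the function name `rolling_anomalies`, the window of 20 points and the cutoff of three standard deviations are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_anomalies(values, window=20, k=3.0, min_sigma=1e-9):
    """Return indices of points that deviate from a rolling baseline.

    Hypothetical helper: a point is flagged when it lies more than
    `k` standard deviations from the mean of the previous `window`
    points. No fixed threshold is configured by hand.
    """
    history = deque(maxlen=window)  # the most recent `window` values
    flagged = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            # min_sigma keeps a perfectly flat baseline from hiding a spike.
            sigma = max(stdev(history), min_sigma)
            if abs(v - mu) > k * sigma:
                flagged.append(i)
        history.append(v)
    return flagged


if __name__ == "__main__":
    # A metric that alternates between 10 and 11, with one spike to 50.
    series = [10, 11] * 15 + [50] + [10, 11] * 5
    print(rolling_anomalies(series))
```

Because the baseline moves with the data, the same code works for a metric that hovers around 10 and one that hovers around 10,000, which is the practical appeal of not setting thresholds by hand.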
You can connect your log monitoring tools to it, then “go off and eat ice cream,” according to product vice president Dror Mann in a 2016 interview.
It can be implemented as SaaS or on-premise and agent-based or agentless. You can put an agent on your servers and connect directly to Loom or you can connect to the systems already collecting your logs, such as Splunk or Elastic. If you already have an agent, there’s a wizard in Loom to connect to that.
“With SaaS, we have a server in AWS or Azure with your name on it, then you just have to pinpoint the logs in our direction. You can use our Syslog or our agent if it’s a Microsoft environment. We learn the structure, we learn your baselines and highlight items of interest after 90 minutes of work,” he said.
The company says it provides a 45 percent reduction in mean time to resolution.
In effect, it’s an automated parser, Menachem said, with intelligence built on top. The underlying database is the Druid column-oriented distributed data store. It uses Elasticsearch for a search function and it saves graphs in Graphite.
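As a rough illustration of what an automated parser does with unstructured logs, the sketch below masks the variable parts of each line so that lines with the same shape collapse into one template. This is a generic technique, not a description of Loom's parser, and it does not touch Druid, Elasticsearch or Graphite; the names `to_template` and `template_counts` and the three masking rules are assumptions.

```python
import re
from collections import Counter

# Order matters: mask IP addresses and hex values before bare numbers,
# otherwise their digits would be replaced piecemeal.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line):
    """Replace variable tokens in a log line with placeholders."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def template_counts(lines):
    """Count how often each template occurs across the given lines."""
    return Counter(to_template(line) for line in lines)


if __name__ == "__main__":
    logs = [
        "Timeout after 30 ms",
        "Timeout after 45 ms",
        "User 42 logged in from 10.0.0.7",
    ]
    for template, count in template_counts(logs).items():
        print(count, template)
```

Once lines are reduced to templates, the count of each template over time becomes a metric that statistical models can be applied to.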
Beyond statistical models associated with anomalies, it builds on Loom Chief Technology Officer Ronny Lehmann’s previous work at Biocatch, a company focused on analyzing keystrokes and human behavior to prevent fraud.
Its results are based on what humans would find interesting in the logs, Menachem said.
“IT operations data is very noisy, even regular behavior… can look really anomalous. You see a lot of spikes and stuff like that. As a person working in IT, when you look at metrics, you see them in different time spans, you look at them from a periodic view so you know that that signal, though anomalous, would be something a human would find interesting. That’s a statistical way to do that.
“I’d say we can suppress about 95 percent of the noise… the noise suppression we do comes from these kinds of algorithms. The rest would be from heuristics that we’ve put into the system — and we have over 70 of those — developed through our work with DevOps people indicating what they do when they see these kinds of alerts. [We learned] how companies interact with logs when they have problems. How do they investigate and triage? We built this method into the program …then we built two feedback loops that pair the number of alerts per day with the number of operators you have,” he said.
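One way to read the idea of pairing the number of alerts per day with the number of operators is as an alert budget: only as many alerts are surfaced as the team can realistically handle. The sketch below shows that reading only. Loom's actual feedback loops are not described here, and the function `select_alerts`, the `(name, score)` alert format and the default of 10 alerts per operator are assumptions.

```python
def select_alerts(alerts, operators, alerts_per_operator=10):
    """Keep the highest-scoring alerts that fit the team's daily budget.

    `alerts` is a list of (name, score) tuples, where a higher score
    means the alert is judged more interesting. Hypothetical helper.
    """
    budget = max(0, operators * alerts_per_operator)
    # sorted() is stable, so equally scored alerts keep their input order.
    ranked = sorted(alerts, key=lambda alert: alert[1], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    todays_alerts = [
        ("disk latency", 0.9),
        ("login retries", 0.2),
        ("queue depth", 0.7),
        ("cache misses", 0.5),
    ]
    # One operator who can handle two alerts a day.
    print(select_alerts(todays_alerts, operators=1, alerts_per_operator=2))
```

A smaller team gets a shorter, higher-confidence list, while a larger team can afford to look at lower-scoring alerts as well.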
It makes it really easy for users to provide feedback to help improve the system.
“You can save your tribal knowledge in the system and thus save precious time and increase the efficiency of your Tier 1 support engineers,” Mann said.
Competing products tend to alert on problems that are already happening, Menachem said, while Loom differentiates itself through its ability to predict issues and find blind spots.
The cloud logging market has become crowded, with machine learning a key piece. Entrants include Sumo Logic; Rapid7, which bought out Logentries; Graylog; Loggly; Papertrail, which was acquired by SolarWinds; LogDNA; Anodot; MoogSoft; and Sematext.
Logz.io also applies machine learning to logs, then directs you to online discussion threads about the issue. By monitoring your reaction to its suggestions, it adjusts its algorithm to provide better insights in the future.
Vendors such as BMC, CA, Cisco/AppDynamics, Datadog and others that are building back-end big data systems and employing machine learning are shaping the conversation around performance monitoring, according to 451 Research analyst Nancy Gohring.
In an evaluation of Loom Systems, she notes that it ingests both structured and unstructured data, most commonly from a log management system such as Splunk or from an application performance management system. It uses machine learning in every part of its system and is easy for users to set up and manage. However, being an add-on to other log-management tools is a downside for the many companies that already have too many tools to manage. She predicts it will be an acquisition target for other vendors seeking to shore up their machine learning capabilities for IT operations.