Cribl Tackles Observability Costs, ‘Agent Fatigue’

30 Mar 2020 2:05pm

As IT organizations seek observability into their operations, they deploy more systems for logging and monitoring, creating a firehose of data that must be managed. Along with that, costs spiral out of control.

That’s the problem San Francisco-based Cribl is tackling. Its LogStream platform creates what it calls an observability pipeline that can ingest machine data from any source; parse, restructure, and enrich data in flight; then route that data to an appropriate destination.

That might include culling duplicative or unimportant data and routing other data to cheap storage.
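As a rough illustration of the parse, enrich, and route idea described above, here is a minimal Python sketch. It is not Cribl's implementation or API: the function names, the `pipeline` tag, and the destination names `analytics` and `cheap_storage` are all hypothetical, and the routing rule is an assumption chosen only to show the flow.

```python
import json
from collections import defaultdict


def parse(raw_line):
    """Parse a raw JSON log line into a dict; wrap anything else as a message."""
    try:
        event = json.loads(raw_line)
        if isinstance(event, dict):
            return event
    except json.JSONDecodeError:
        pass
    return {"message": raw_line}


def enrich(event):
    """Add context in flight. The 'pipeline' field is a made-up example."""
    return {**event, "pipeline": "demo"}


def route(event):
    """Pick a destination name. Both names are hypothetical."""
    if event.get("level") in ("error", "warn"):
        return "analytics"
    return "cheap_storage"


def run_pipeline(raw_lines):
    """Cull exact duplicate lines, then parse, enrich, and route the rest."""
    seen = set()
    destinations = defaultdict(list)
    for raw in raw_lines:
        if raw in seen:
            continue
        seen.add(raw)
        event = enrich(parse(raw))
        destinations[route(event)].append(event)
    return dict(destinations)


lines = [
    '{"level": "error", "message": "disk full"}',
    '{"level": "error", "message": "disk full"}',
    '{"level": "debug", "message": "heartbeat"}',
    "plain text line",
]
result = run_pipeline(lines)
```

In this toy run the duplicate error line is dropped, the remaining error goes to the analytics destination, and the low-value lines land in cheap storage.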

“Customers have a lot of different tools. They’ll have a time-series database, they’ll have a log store, often multiple log stores, but there’s no connective tissue between all those things. They are, you know, everything that they buy is largely built to only work with the thing that they bought. And so they have a problem that I call agent fatigue, which is, you know, every time they want to try something new, a vendor comes along and says, ‘Hey, just throw my agent out there,’” explained Clint Sharp, Cribl co-founder and CEO.

That leaves large organizations trying to manage a multitude of agents across multiple systems, leading to what Cribl calls “agent fatigue.”

“These are companies that are already onboarding terabytes of data a day into big data storage systems for analysis for operational and security purposes. They’re already struggling under the cost of that,” he explained.

In a blog post, he describes an observability pipeline this way:

“A streams processing engine which can unify data processing across all types of observability (metrics, logs and traces), collect all the data required, enrich it, eliminate noise and waste, and deliver that data to any tool in the organization designed to work with observability data.”

So he and his co-founders Dritan Bitincka and Ledion Bitincka, all three former Splunk engineers, built a stream-processing engine for logs and metrics that sits inside customers’ existing ingestion pipeline. It enables users to parse and shape events in the stream, no matter the original format, add context, aggregate and more, before sending data to a desired location.

It allows them to reuse their existing investments and deployed footprint of agents to send to multiple destinations. Users can keep a single agent, such as from Splunk or Elastic, and send data, not only to Splunk or Elastic, but to Snowflake or Databricks or just an S3 bucket where they may be keeping the data for compliance.

“We allow them to control cost by working with the data before it ends up in, in whatever destination system they may have,” he said.

“One of the dirty secrets of log data is that, you know, a lot of this data is just noise. It’s not particularly valuable, but it gets sucked up as part of the broader data-collection strategy. And so we give people tooling that allows them to get at that data in the stream and make decisions about what is the best use of that data. Should it just be stored somewhere cheap, like an S3? Should it be reshaped or transformed in order to remove uninteresting information and then sent on to wherever it should end up?”

That can reduce the cost of ingestion and storage in multiple systems.

An Engine with No Schema

LogStream ingests logs and other data from Splunk Forwarder, Fluentd, Elastic Beats, Syslog, StatsD, and other systems. It sits in the middle of your log ingestion pipeline. If you’re using Splunk, it uses a heavyweight forwarder or indexer. In other systems, it may deploy independently or run serverlessly in AWS Lambda.

LogStream is a schema-less engine and it can normalize logs with different schemas.
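To make "normalizing logs with different schemas" concrete, the sketch below maps differently named fields onto one common shape without declaring a schema up front. The alias table and field names are assumptions for illustration only and do not reflect how LogStream does this internally.

```python
# Hypothetical alias table: source field name -> common field name.
FIELD_ALIASES = {
    "msg": "message",
    "log": "message",
    "severity": "level",
    "lvl": "level",
    "hostname": "host",
}


def normalize(event):
    """Rename known aliases and lowercase the level; keep unknown fields as-is."""
    out = {}
    for key, value in event.items():
        out[FIELD_ALIASES.get(key, key)] = value
    if isinstance(out.get("level"), str):
        out["level"] = out["level"].lower()
    return out


# Two events describing the same thing with different field names.
a = normalize({"msg": "login failed", "severity": "WARN", "hostname": "web-1"})
b = normalize({"message": "login failed", "lvl": "warn", "host": "web-1", "pid": 42})
```

Both events end up with the same `message`, `level`, and `host` keys, while the extra `pid` field on the second event passes through untouched.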

From a dashboard, administrators set up routes, set up actions to be performed on specific data sets and choose destinations for the data. There are more than a dozen out-of-the-box functions including redacting personal information, encrypting, sampling, removing fields, and enriching the data with information like GeoIP. After processing, it sends the data to a chosen destination or multiple destinations such as Splunk, Kinesis or S3.
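The kind of function chain described here (redacting, removing fields, sampling) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not LogStream's function library: the helper names, the email pattern, and the 1-in-3 sampling rule are all invented for the example.

```python
import re

# Simplified email pattern, good enough for a demo only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(event):
    """Replace email addresses in string fields with a placeholder."""
    return {
        k: EMAIL_RE.sub("[REDACTED]", v) if isinstance(v, str) else v
        for k, v in event.items()
    }


def remove_fields(*names):
    """Build a function that drops the named fields from an event."""
    def fn(event):
        return {k: v for k, v in event.items() if k not in names}
    return fn


def sample(every_n, predicate):
    """Keep 1 in every_n events matching predicate; pass the rest through."""
    count = 0

    def fn(event):
        nonlocal count
        if not predicate(event):
            return event
        count += 1
        return event if (count - 1) % every_n == 0 else None
    return fn


def apply_functions(event, functions):
    """Run an event through each function; None means the event was dropped."""
    for fn in functions:
        event = fn(event)
        if event is None:
            return None
    return event


functions = [
    redact_emails,
    remove_fields("debug_info"),
    sample(3, lambda e: e.get("level") == "debug"),
]

events = [{"level": "debug", "message": f"tick {i}", "debug_info": "x"} for i in range(6)]
events.append({"level": "error", "message": "mail to bob@example.com bounced", "debug_info": "y"})

processed = [e for e in (apply_functions(ev, functions) for ev in events) if e is not None]
```

Of the seven input events, three survive: two sampled debug events and the error event, with its email address redacted and the `debug_info` field removed from all of them.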

Customers pay based on the volume of data being ingested into LogStream as well as the volume of data being sent to a destination.

“We think LogStream provides a very flexible approach to managing logs such that large organizations can gain much better control over their logging operations in a way that could deliver several important outcomes, including increased security, lower costs and the ability to collect the volume of data that internal users demand,” Nancy Gohring, senior analyst at 451 Research, wrote in a recent report (subscription required).

She added: “Like any vendor that’s part of a developing segment, Cribl faces challenges around educating the market and quickly refining its product.”

The company differentiates on performance (in a blog post it touted being 7x more efficient than Logstash and Fluentd and 55x faster than Apache NiFi) but also on ease of use, according to Sharp. Cribl also recently noted that it had tested at 20 petabytes per day.

“In an open source world, you’re left to building all of this yourself. How’s it going to get deployed? How’s it going to get managed? How do you test these pipelines before they go to production? How do you actually do like a rolling blue/green deployment, for example? You’re left in building all of that kind of on your own,” he said.

Existing open source projects are not yet fully solving the problem without a lot of integration and work, he maintains.

“We have customers who are processing in the hundreds of terabytes a day in terms of data volume. … they’re spending tens of millions a year just on the infrastructure to run these systems. For them, turning off an extra 20% of data may save them three to five million a year on a $20 million infrastructure spend.”
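The arithmetic behind that estimate can be checked with a short calculation. The sketch assumes infrastructure cost scales linearly with data volume, which the quote does not state; under that assumption, the quoted range of three to five million dollars corresponds to removing roughly 15% to 25% of the data.

```python
def annual_savings(infra_spend, percent_removed):
    """Savings in dollars, assuming cost scales linearly with data volume."""
    return infra_spend * percent_removed // 100


spend = 20_000_000  # the $20 million infrastructure spend from the quote

savings = annual_savings(spend, 20)  # the 20% figure from the quote
low = annual_savings(spend, 15)      # assumed lower bound on data removed
high = annual_savings(spend, 25)     # assumed upper bound on data removed

print(f"20% removed: ${savings:,}")
print(f"15% to 25% removed: ${low:,} to ${high:,}")
```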

Meanwhile, Splunk, which offers a competing product, added to its microservices and cloud infrastructure capability with the acquisition of SignalFx. Other commercial tools, including Datadog, Wavefront, and AppOptics, make it easier to combine monitoring, logging, and request tracing for unified insights into the performance of distributed systems, as OpsRamp Director of Marketing Deepak Jannu pointed out in a previous TNS post.

It’s becoming an increasingly crowded field. In its Market Monitor report, 451 Research forecast that monitoring will become the biggest segment in the container market by 2022.

Feature image: “Magnified” by Paul Haahr. Licensed under CC BY-SA 2.0.
