
Lightstep’s ‘Change Intelligence’ Promises Faster, Smarter Distributed Tracing

4 Feb 2021 1:25pm, by

Observability provider Lightstep has upgraded its platform engine to offer application-performance analysis that addresses what the company sees as widespread shortcomings in current observability and monitoring tools and processes.

Faced with an explosion of logs and data, administrators need meaningful insights and actionable results, and the Change Intelligence feature was designed to deliver them by taking into account the rapidly changing variables in today’s IT environments. Change Intelligence, which the company now describes as the “engine” of Lightstep’s observability platform, is meant to provide straightforward analysis while performing enormously computationally difficult tasks under the hood. These tasks largely draw on time-series database (TSDB) capabilities designed by Lightstep engineers, many of whom designed Monarch for Google.

“Change Intelligence specifically addresses this, the single most important question in all of monitoring and observability: What caused that change?” Ben Sigelman, Lightstep’s CEO and co-founder, a co-creator of the OpenTelemetry project and a former Googler, told The New Stack. “It takes the insights buried in the firehose of telemetry data, both metrics and traces, and makes them accessible wherever changes occur.” These changes could occur in core monitoring dashboards and alerts, in CI/CD, or via programmatic APIs and integrations with other DevOps tools, Sigelman said.

The main idea is to make core monitoring actionable, which Sigelman says is “incredibly important.”

“Have you ever looked at an anomaly and had to breeze past it because diagnosis would be too hard? This is especially true for classic infrastructure metrics like CPU or heap usage spikes — now Change Intelligence allows you to quite literally just click on any anomaly and be brought to a guided analysis of possible changes throughout your distributed application that could explain that change,” Sigelman said. “This supercharges and really redefines what monitoring can and should be. And at the same time, it makes observability more accessible and more contextualized for SREs and DevOps engineers who don’t have the time to become ‘observability experts.’”

As mentioned above, the motivation behind Change Intelligence’s development was to address issues DevOps teams face with observability and monitoring. DevOps teams “have been struggling” for two main reasons. First, conventional monitoring tools, such as dashboarding and alerting, are separated from special-purpose observability tools. Second, and as a result, “it is too difficult to apply observability insights to the everyday problems uncovered by said conventional monitoring,” Sigelman said.

“Since DevOps engineers are stuck in their monitoring tooling anyway, they inevitably try to solve ‘observability problems’ with their conventional monitoring tools, and that doesn’t work, because monitoring tools are okay at revealing when critical symptoms change, but they’re terrible at explaining why those changes occur,” Sigelman said. “Teams end up with literally thousands of dashboards and catastrophically large monitoring bills, and yet they still can’t reliably explain why charts show sudden deviations or why alerts are firing.”

A typical use case would be an engineer who is paged following an increase in user-facing errors. Without Change Intelligence, Sigelman said, the engineer might start manually scrolling through dashboards looking for other suspicious behavior and searching through logs for patterns.

“Change Intelligence automates this process by providing a full-stack analysis of both the requests with errors and those without and uses that analysis to pinpoint probable causes,” Sigelman said. “The engineer can then consider these causes (with associated evidence) and rollback a deployment, scale-out a service, or find an expert that can help — whatever they need to do to mitigate the problem.”
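
To make the idea of comparing requests with errors against those without more concrete, here is a minimal Python sketch. It is not Lightstep’s implementation, which the company has not detailed here. The function name, the request format and the scoring rule (the share of failed requests carrying an attribute minus the share of successful ones) are all assumptions chosen for illustration.

```python
from collections import Counter


def rank_suspect_attributes(requests):
    """Rank (key, value) attributes by how over-represented they are in failed requests.

    Hypothetical helper: each request is a dict with a boolean "error" flag
    and an "attrs" dict of tags. This is an illustrative heuristic only.
    """
    errors = [r for r in requests if r["error"]]
    ok = [r for r in requests if not r["error"]]
    if not errors or not ok:
        return []  # nothing to compare against

    # Count how many requests in each group carry each (key, value) pair.
    err_counts = Counter(pair for r in errors for pair in r["attrs"].items())
    ok_counts = Counter(pair for r in ok for pair in r["attrs"].items())

    ranked = []
    for pair, n_err in err_counts.items():
        err_share = n_err / len(errors)
        ok_share = ok_counts.get(pair, 0) / len(ok)
        ranked.append((pair, err_share - ok_share))

    # Highest score first: attributes far more common among failures.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up sample data: every failing request runs the hypothetical "v2" build.
sample_requests = [
    {"error": True, "attrs": {"service": "checkout", "version": "v2"}},
    {"error": True, "attrs": {"service": "checkout", "version": "v2"}},
    {"error": False, "attrs": {"service": "checkout", "version": "v1"}},
    {"error": False, "attrs": {"service": "checkout", "version": "v1"}},
    {"error": False, "attrs": {"service": "checkout", "version": "v2"}},
]

ranked = rank_suspect_attributes(sample_requests)
print(ranked[0][0])  # ('version', 'v2')
```

In this toy data the service tag appears on every request, so it scores zero, while the version tag separates failures from successes and rises to the top. A production system would presumably weigh far more signals, including metrics and deployment events, than this simple difference of shares.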

The key point is “there’s almost always a change that leads to these sorts of problems: again, either a change in the software, the infrastructure or the workload,” Sigelman said. “But there are also hundreds or thousands of other changes happening in production systems every day, so understanding which changes matter is not something that’s easy for humans to do,” he added. “There are often a few experts in each org that have a strong enough intuition that they can do it: our motivation behind offering Change Intelligence is really about bringing the benefits of that intuition to every developer that’s feeling this pain.”
