Why Did Grafana Labs Need to Add Adaptive Metrics?
High cloud costs come up again and again as a pain point in conversations about the challenges of cloud native architectures and Kubernetes. A major concern for organizations, even after a successful transition to cloud native, is an unexpected rise in operations costs. Ironically, one way to mitigate these costs is through observability, which can itself be expensive when organizations rely on it to improve application productivity, operational efficiency and security.
In the observability space, the surge in metric data is a major culprit behind cloud native costs. A surge in redundant metrics, which often arrive in spikes following an incident or misconfiguration, wastes storage, computing power, memory, analytics and other expensive cloud resources. The issue is described as high cardinality. In the general sense, Merriam-Webster defines “cardinality” as the number of elements in a given set. In the context of observability, cardinality refers to the count of values associated with a specific label.
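To make the definition concrete, here is a minimal sketch that counts label cardinality over a handful of series. The metric name, label names and values are hypothetical sample data, and the helper functions are illustrative rather than part of any Prometheus or Grafana API.

```python
from collections import defaultdict

# Hypothetical sample data: each dict is the label set of one time series.
series = [
    {"__name__": "http_requests_total", "method": "GET", "pod": "api-1"},
    {"__name__": "http_requests_total", "method": "GET", "pod": "api-2"},
    {"__name__": "http_requests_total", "method": "POST", "pod": "api-1"},
    {"__name__": "http_requests_total", "method": "POST", "pod": "api-2"},
    {"__name__": "http_requests_total", "method": "GET", "pod": "api-3"},
]


def label_cardinality(series):
    """Return the number of distinct values seen for each label."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def series_count(series):
    """Return the number of distinct label sets, i.e. distinct series."""
    return len({tuple(sorted(labels.items())) for labels in series})


cardinality = label_cardinality(series)
total = series_count(series)
print(cardinality, total)
```

In this sample the `pod` label has three values and `method` has two, and every new pod adds series for each method it serves, which is how a single label can inflate the total series count.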
Prometheus is a popular open source monitoring tool for cloud native environments, and its metrics data are often scrutinized as a way to better manage cardinality, given the abundance of metrics that are crucial for observability. This pain point was felt at Grafana Labs, which is almost universally known for its Grafana panels. In response, Grafana Labs recently introduced Adaptive Metrics, which aims to reduce cardinality and, consequently, cloud costs, and made it available to Grafana Cloud users.
This reduction in cardinality is achieved by automating the process of identifying unused time series data and eliminating it through aggregation, which decreases the number of metric series. By reducing the number of series, or cardinality, Adaptive Metrics is designed to help organizations optimize cloud expenses. The automation is also meant to make the collected data easier to interpret, supporting meaningful observations and decision-making that lead to actionable insights.
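As a rough illustration of what aggregating away an unused label means, the sketch below drops one label and sums the values of the series that collapse together. It is not Grafana's implementation. The sample data and the `aggregate_away` helper are hypothetical, and summing is only one possible aggregation, suitable for counter-style values.

```python
from collections import defaultdict


def aggregate_away(samples, drop_label):
    """Sum sample values across series after removing one label."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels identify the lower-cardinality series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[kept] += value
    return [(dict(kept), value) for kept, value in totals.items()]


# Hypothetical samples: (label set, current value) for four series.
samples = [
    ({"__name__": "http_requests_total", "method": "GET", "pod": "api-1"}, 10.0),
    ({"__name__": "http_requests_total", "method": "GET", "pod": "api-2"}, 5.0),
    ({"__name__": "http_requests_total", "method": "POST", "pod": "api-1"}, 2.0),
    ({"__name__": "http_requests_total", "method": "POST", "pod": "api-2"}, 3.0),
]

aggregated = aggregate_away(samples, "pod")
for labels, value in aggregated:
    print(labels, value)
```

Four series become two once `pod` is removed, while the per-method totals are preserved for anyone who only queries by `method`.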
Reducing cardinality is a standard problem for data scientists to solve, involving the evaluation of the contribution of individual values to the prediction accuracy for the target variable, Torsten Volk, an analyst at Enterprise Management Associates (EMA), told The New Stack. For example, in observability, the target variables often are app performance, user experience, cost and resiliency. To reduce cardinality, the software can simply apply standard techniques such as principal component analysis, target mean encoding and binning. These calculations combine or eliminate values based on their contribution toward accurately predicting the target variables, Volk said.
“For example, instead of tracking exact numbers in milliseconds for response time, you may not lose any prediction accuracy by translating these numbers into percentiles. Or instead of tracking each individual value of a data stream, e.g. for memory usage, the algorithm might look at historical data and determine that you will get the same predictive accuracy by analyzing averages at the minute or even 10-minute level,” Volk said. “This is not a trivial challenge, as in certain cases prediction accuracy may significantly benefit from sub-second level measurement values, while for other cases aggregating these same measurements over 60 minutes may give you the same level of accuracy.”
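The time-based aggregation Volk describes can be sketched as a simple downsampling step that averages raw readings into fixed windows. The readings, the 15-second interval and the `downsample_mean` helper are assumptions made for illustration, and whether the averaged series keeps enough predictive accuracy is exactly the judgment call he points to.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points, window_seconds):
    """Average (timestamp, value) points within fixed-width time windows."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical memory readings in MiB, one every 15 seconds for two minutes.
points = [(t, 100.0 + (t // 15) % 4) for t in range(0, 120, 15)]

per_minute = downsample_mean(points, 60)
print(len(points), "raw points ->", len(per_minute), "averaged points")
```

Changing `window_seconds` to 600 would give the 10-minute level mentioned in the quote, at the cost of hiding any variation inside each window.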
As mentioned above, Grafana first developed adaptive metrics to address its own cardinality challenges. “Prometheus has become hugely popular for good reason, but when there’s rapid adoption within an organization, unpredictable growth and cardinality can be a real challenge. We’ve felt this pain ourselves at Grafana Labs. We were spending quite a lot of money running our own Prometheus monitoring for Grafana Cloud, as one of our clusters had grown to over 100 million active series,” Tom Wilkie, CTO for Grafana Labs, told The New Stack. “Adaptive Metrics was the solution we built for this problem. And we knew that in this current macroeconomic climate when budgets are tightening and people are gasping at $65 million observability bills, a feature that helps you cut some unnecessary costs in a flexible, intelligent way would be incredibly valuable to our users, just as it has been for us.”
As Wilkie explained, open source changes “the relationship between vendor and customer, because they can always go run it themselves. We look at our relationships with our customers as long-term partnerships, so we want to do what’s right by them (proactively lowering their bills) even if this means less growth for us in the short term,” Wilkie said. “With features like Adaptive Metrics, we are making the case that it’s always more cost-effective to use Grafana Cloud, even compared to running the OSS yourself.”
In a blog post, Grafana Labs’ Archana Kesavan, director, product marketing, and Jen Villa, senior group product manager, Databases, described how Grafana’s Adaptive Metrics capability analyzes “every metric coming into Grafana Cloud” and compares it to how users access and interact with the metric. In particular, they wrote that it looks at whether each metric is:
- used in an alerting or a recording rule.
- used to power a dashboard.
- queried ad hoc via Grafana Explore or Grafana’s API.
To answer the first two questions, it analyzes the alerting rules, recording rules and dashboards in a user’s hosted Grafana. To answer the third, it looks at the last 30 days of a user’s query logs. With these three signals, Adaptive Metrics determines whether a metric is unused, partially used, or an integral part of a user’s observability ecosystem:
- Unused metrics. There has been no reference made to the metric based on any of those three signals.
- Partially used metrics. The metric is being accessed, but it has been segmented with labels to create many time series, and people are only using a small subset of them.
- Used metrics. All the labels on that metric are being used to slice and dice the data.
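The three categories above can be approximated with a small decision function. This is a simplified sketch, not the logic Grafana Labs uses: the `classify_metric` function and its inputs are hypothetical, and it compares label names rather than the individual time series that the blog post describes.

```python
def classify_metric(all_labels, referenced, used_labels):
    """Classify one metric as 'unused', 'partially used' or 'used'.

    all_labels:  set of label names present on the metric
    referenced:  True if any rule, dashboard or logged query mentions the metric
    used_labels: set of label names those rules, dashboards and queries rely on
    """
    if not referenced:
        return "unused"
    if all_labels <= used_labels:
        # Every label on the metric is needed by at least one consumer.
        return "used"
    return "partially used"


labels = {"method", "pod"}
print(classify_metric(labels, False, set()))
print(classify_metric(labels, True, {"method"}))
print(classify_metric(labels, True, {"method", "pod"}))
```

Under these assumptions, a metric that is only ever queried by `method` is a candidate for aggregating away `pod`, while a metric that nothing references could be aggregated more aggressively.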
“Our initial tests in more than 150 customer environments show that on average, Adaptive Metrics users can reduce time series volume by 20%-50% by aggregating unused and partially used metrics into lower cardinality versions of themselves,” Kesavan and Villa wrote.