Top Ways to Reduce Your Observability Costs: Part 2
This is the second in a two-part series. Read Part 1 here.
Organizations are constantly trying to figure out how to balance data value versus data cost, especially as data is rapidly outgrowing infrastructure and becoming costly to store and manage. To help, we’ve created a series that outlines how data can get expensive and ways to reduce your data bill.
Last time, we included a primer on “what is cardinality” before offering two tips on how to reduce observability costs: using downsampling and lowering retention. Before we jump into the last two tips on cost reduction — limiting dimensionality and using aggregation — we’ll do another quick primer, this time on classifying cardinality.
Classifying Cardinality: A Primer
When it comes to cardinality in metrics, you can classify dimensions into three high-level buckets to consider the balance between value and cardinality.
High value — These are the dimensions you need to measure to understand your systems, and they are always or often preserved when consuming metrics in alerts or dashboards. An example is including service/endpoint as a dimension for a metric tracking request latency. There’s no question that this is essential for visibility to make decisions about your system. But in a microservices environment, even a simple example like this can end up adding quite a lot of cardinality. When you have dozens of services, each with a handful of endpoints, you quickly end up with many thousands of series even before you add other sensible dimensions such as region or status code.
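The multiplication behind that explosion is easy to see with a back-of-the-envelope calculation. All counts below are illustrative assumptions; note that a latency metric is usually a histogram, so each bucket is its own series:

```python
# Cardinality arithmetic: a metric's series count is the product of the
# distinct values of every dimension attached to it. Counts are assumptions.
services = 48      # "dozens of services"
endpoints = 8      # "a handful of endpoints" per service
buckets = 12       # request latency is typically a histogram; each bucket is a series

regions = 3
status_codes = 6

# service/endpoint (plus histogram buckets) alone is already thousands of series...
base_series = services * endpoints * buckets
# ...and each additional "sensible" dimension multiplies that count.
total_series = base_series * regions * status_codes

print(base_series)   # 4608
print(total_series)  # 82944
```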
Low value — These dimensions are of more questionable value. They may not even be intentionally included, but rather come about because of the way metrics are collected from your systems. An example is the instance label in Prometheus, which is automatically added to every metric you collect. Although in some cases you may be interested in per-instance metrics, for a metric such as request latency on a stateless service running in Kubernetes, you might never look at per-instance latency at all. Having it as a dimension does not necessarily add much value.
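As a concrete sketch of why the instance label can often be dropped for a stateless service, the snippet below sums per-instance series into one series per remaining label set, much as a Prometheus aggregation like `sum without (instance) (...)` would. Series keys and values are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-instance series for one metric.
raw_series = {
    # (service, endpoint, instance) -> request count
    ("checkout", "/pay", "10.0.0.1:9090"): 120,
    ("checkout", "/pay", "10.0.0.2:9090"): 98,
    ("checkout", "/pay", "10.0.0.3:9090"): 101,
    ("checkout", "/cart", "10.0.0.1:9090"): 40,
    ("checkout", "/cart", "10.0.0.2:9090"): 35,
}

aggregated = defaultdict(int)
for (service, endpoint, _instance), value in raw_series.items():
    aggregated[(service, endpoint)] += value  # the instance dimension is summed away

print(len(raw_series))   # 5 stored series
print(len(aggregated))   # 2 series after dropping "instance"
```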
No value (useless or even harmful) — These are essentially anti-patterns to be avoided at all costs. Including them can result in serious consequences to your metric system’s health by exploding the amount of data you collect and causing significant problems when you query metrics.
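To make the anti-pattern concrete, here is a small sketch (names and traffic are hypothetical) of how an unbounded label such as a per-request ID creates a new series on every observation:

```python
# Anti-pattern sketch: an unbounded label value mints a new series per
# observation, so cardinality grows without limit as traffic grows.
request_ids = [f"req-{i}" for i in range(10_000)]  # hypothetical traffic

bounded_series = set()
exploded_series = set()
for rid in request_ids:
    bounded_series.add(("checkout", "/pay"))        # fixed labels: always 1 series
    exploded_series.add(("checkout", "/pay", rid))  # +1 series per request

print(len(bounded_series))   # 1
print(len(exploded_series))  # 10000
```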
Now for the good stuff: Our final two tips on how to reduce observability costs.
Keeping Costs Low: Dimensionality and Aggregation
Each team must continuously weigh the cost of observing its service or application against the value of the insights observability provides. This sweet spot will be different for every service: services with higher business value can justify capturing more dimensions, at higher cardinality, with better resolution and longer retention than others.
This constant balancing of cost and derived value also means there is no easy fix. There are, however, some things you can do to keep costs in check.
1. Limit Dimensionality
The simplest way of managing the explosion of observability data is by reducing which dimensions you collect for metrics. By setting standards on what types of labels are collected as part of a metric, some of the cardinality can be farmed out to a log or a trace, which are much less affected by the high cardinality problem. And the observability team is uniquely positioned to help teams set appropriate defaults for their services.
These standards may specify which metrics carry which labels, moving higher-cardinality dimensions like unique request IDs to the tracing system to unburden the metrics system.
This is a strategy that limits what is ingested, which reduces the amount of data sent to the metrics platform. This can be a good strategy when teams and applications are emitting metrics data that is not relevant, reducing cardinality before it becomes a problem.
2. Use Aggregation
Instead of throwing away intermediate data points, aggregate individual data points into new summarized data points. This reduces the amount of data that needs to be processed and stored, lowering storage cost and improving query performance for larger, older data sets.
Aggregation can be a good strategy because it lets teams continue to emit highly dimensional, high cardinality data from their services, and then adjust it based on the value it provides as it ages.
While tweaking resolution and retention is a relatively simple way to reduce the amount of data stored by deleting data, it doesn't do much to reduce the computational load on the observability system. Because teams often don't need to view metrics across all dimensions, a simplified, aggregate view (for instance, without a per-pod or per-label level) is good enough to understand how your system is performing at a high level. So instead of querying tens of thousands of time series across all pods and labels, we can make do with querying the aggregate view, with only a few hundred time series.
Aggregation is a way of rolling data into a more summarized, but less dimensional state, creating a specific view of metrics and dimensions that are important. The underlying raw metrics data can be kept for other use cases, or it can be discarded to save on storage space and reduce the cardinality of data if there is no use for the raw unaggregated data.
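As a sketch of such an aggregate view (shapes and names are illustrative), per-pod series are rolled up into one series per service, shrinking what a dashboard has to query:

```python
from collections import defaultdict

def build_view(raw: dict) -> dict:
    """Roll per-pod series up into one series per service."""
    view = defaultdict(float)
    for (service, _pod), value in raw.items():
        view[service] += value  # drop the per-pod dimension
    return dict(view)

# Assumed shape: 3 services x 200 pods = 600 raw series.
raw = {(f"svc-{s}", f"pod-{p}"): 1.0 for s in range(3) for p in range(200)}
view = build_view(raw)

print(len(raw))   # 600 series to query without the view
print(len(view))  # 3 series in the view; the raw data can be kept or discarded
```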
There Are Two Schools of Aggregation: Streaming vs. Batch
With streaming aggregation, metrics data streams in continuously, and the aggregation is done in memory on the ingest path before results are written to the time series database. Because data is aggregated in real time, streaming aggregation is typically meant for information that's needed immediately. This is especially useful for dashboards, which query the same expression repeatedly every time they refresh. Streaming aggregation also makes it easy to drop the raw unaggregated data to avoid unnecessary load on the database.
Batch aggregation first stores raw metrics in the time series database, then periodically fetches them and writes back the aggregated metrics. Because data is aggregated in batches over time, batch aggregation is typically used for larger swaths of data that aren't time sensitive. Batch aggregation cannot skip ingesting the raw unaggregated data, and it even incurs additional load, since raw data that was just written has to be read back and rewritten to the database, adding query overhead.
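A minimal batch-aggregation sketch, with illustrative names: raw series land in the database first, then a periodic job reads them back and writes aggregated series alongside them, which is the extra read/rewrite load described above.

```python
from collections import defaultdict

database = {
    # raw per-pod series are ingested and stored first
    ("checkout", "pod-a"): 10.0,
    ("checkout", "pod-b"): 10.0,
    ("checkout", "pod-c"): 10.0,
}

def batch_aggregate() -> None:
    rollup = defaultdict(float)
    for (service, _pod), value in database.items():   # extra read of raw data
        rollup[service] += value
    for service, value in rollup.items():
        database[(service, "__aggregate__")] = value  # write-back overhead

batch_aggregate()
print(database[("checkout", "__aggregate__")])  # 30.0, alongside the raw series
```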
The additional overhead of batch aggregation makes streaming better suited to scaling the platform, but real-time processing can only handle so much complexity; batch processing can deal with more complex expressions and queries.
Rethink Observability, Control Your Costs
Before you adopt a cloud native observability platform, be sure it will help you keep costs low by enabling you to understand the value of your observability data and to shape and transform that data based on need, context and utility. Get more from your investment, too, with capabilities that let you delegate responsibility for controlling cardinality and growth, and continuously optimize platform performance.
The cloud native Chronosphere observability platform does all this and more. It helps you keep costs low by identifying and reducing waste. It also improves engineers’ experience by reducing noise. Best of all, teams remediate issues faster with Chronosphere’s automated tools and optimized performance.