Top Ways to Reduce Your Observability Costs: Part 1
This is the first of a two-part series.
Organizations are constantly trying to figure out how to balance data value versus data cost, especially as data is rapidly outgrowing infrastructure and becoming costly to store and manage. To help, we’ve created a series that outlines how data can get expensive and ways to reduce your data bill.
This article covers how cloud native architecture increases data growth, what cardinality is and how you can curb data costs.
The Cloud Native Observability Challenge
Companies of all sizes are rapidly moving to cloud native technologies and practices. This modern strategy offers speed, efficiency, availability and the ability to innovate faster, which means organizations can seize business opportunities that simply aren’t possible with a traditional monolithic architecture.
Yet moving to an architecture based on containers and microservices creates a new set of challenges that, if not managed well, will undermine the promised benefits.
- Exploding observability data growth. Cloud native environments emit a massive amount of monitoring data — somewhere between 10 and 100 times more than traditional VM-based environments. This is because every container/microservice is emitting as much data as a single VM. Additionally, service owners start adding metrics to measure and track more granularly to run the business. Scaling containers into the thousands and collecting more and more complex data (higher data cardinality) results in data volume becoming unmanageable.
- Rapid cost increases. The explosive growth in data volume and the need for engineers to collect an ever-increasing breadth of data have broken the economics and value of existing infrastructure and application monitoring tools. Costs can unexpectedly spike from a single developer rolling out new code. Observability data costs can exceed the cost of the underlying infrastructure.
As the amount of metrics data being produced grows, so does the pressure on the observability platform, increasing cost and complexity to a point where the value of the platform diminishes. So how do observability teams take control over the growth of the platform’s cost and complexity without dialing down the usefulness of the platform? This article describes the trade-offs between cost and value that can come with investing in observability.
Cardinality: A Primer
To understand the balance between cost and insight, it’s important to understand cardinality. This is the number of possible ways you can group your data, depending on its properties, also called dimensions.
Metric cardinality is defined as the number of unique time series that are produced by a combination of metric names and associated dimensions. Each unique combination counts toward the total, so the more combinations that are possible, the higher a metric’s cardinality. Here’s a delicious practical example: purchasing fine cheese.
Understanding Data Sets
If your only preference is that the cheese you buy is made of sheep’s milk, your data would have just one dimension. Analyze 100 different kinds of cheese based on that dimension, and you’d have 100 data points, each labeling the cheese as either sheep’s milk–based or not (made from another source).
But then you decide you only want sheep’s milk cheese made in France. That would add another dimension to track for each cheese made of sheep’s milk — the country of origin. Think of all the cheese-producing countries in the world — about 200 — and you can understand how the cardinality, or the ways to group the data, can quickly increase.
If you then decide to analyze the data based on the type of cheese, it adds another dimension with many hundreds of possible values for grouping (think of all the different kinds of cheese in the world).
Finally, you decide you want to only consider Camembert, and group Camembert cheese only by whether it was made with raw milk, warm milk or completely pasteurized milk. That’s one more dimension with three possible values. You’d be right in thinking that, with all these dimensions, the cardinality would be high, even in traditional on-premises, VM-based environments.
A key point: It’s difficult to calculate the overall cardinality of a data set. You can’t just multiply together the cardinality of individual dimensions to know what the overall cardinality is, because you will frequently have dimensions that only apply to a subset of your data.
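To make this concrete, here is a minimal Python sketch that compares the combinations actually observed with the product of per-dimension cardinalities. The cheese records and the function names are hypothetical and exist only for illustration.

```python
from math import prod

# Hypothetical records: each dict is one observation and its dimensions.
# The "treatment" dimension only applies to Camembert.
cheeses = [
    {"milk": "sheep", "country": "France", "type": "Roquefort"},
    {"milk": "sheep", "country": "Spain", "type": "Manchego"},
    {"milk": "cow", "country": "France", "type": "Camembert", "treatment": "raw"},
    {"milk": "cow", "country": "France", "type": "Camembert", "treatment": "pasteurized"},
    {"milk": "cow", "country": "France", "type": "Camembert", "treatment": "raw"},  # repeat
]


def observed_cardinality(records):
    """Count the distinct combinations of dimension values that actually occur."""
    return len({tuple(sorted(r.items())) for r in records})


def naive_product(records):
    """Multiply the number of distinct values seen for each dimension."""
    values = {}
    for record in records:
        for key, value in record.items():
            values.setdefault(key, set()).add(value)
    return prod(len(v) for v in values.values())


print(observed_cardinality(cheeses))  # 4 distinct combinations
print(naive_product(cheeses))         # 2 * 2 * 3 * 2 = 24
```

The naive product overstates the cardinality here because most combinations never occur, and the treatment dimension is only present on a subset of the records.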
With the transition from monolithic to cloud native environments, there’s been an explosion of metrics data in terms of cardinality. This is because microservices and containerized applications generate an order of magnitude more metrics data than monolithic applications on VM-based cloud environments. To achieve good observability in a cloud native system, then, you need to deal with large-scale data and take steps to understand and control cardinality.
[Figure: From 150,000 to 150 million metrics with cloud native architecture. Compares a legacy (virtual machine) environment with a cloud native environment.]
In addition to cardinality, it’s important to understand two other terms when managing data quantity in an observability platform: resolution and retention.
- Resolution is the interval of the measurement, that is, how often a measurement is taken. This is important because a longer interval often smooths out peaks and troughs in measurements, so they may not show up in the data at all. Time precision is an important aspect of catching transient and spiky behaviors.
- Retention is how long high-precision measurements are kept before being aggregated and downsampled into longer-term trend data. Summarizing and collating reduces resolution, trading some accuracy for lower storage cost and better query performance.
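The effect of resolution on spiky behavior can be sketched in a few lines of Python. The latency values and the `average_every` helper below are hypothetical, and averaging is only one of several possible ways to summarize an interval.

```python
def average_every(samples, factor):
    """Average consecutive groups of `factor` samples into a single point."""
    return [
        sum(samples[i:i + factor]) / len(samples[i:i + factor])
        for i in range(0, len(samples), factor)
    ]


# Hypothetical latency in milliseconds, measured once per second for two
# minutes: steady at 100 ms, with a five-second spike to 1,000 ms.
per_second = [100] * 120
per_second[30:35] = [1000] * 5

per_minute = average_every(per_second, 60)

print(max(per_second))  # the spike is visible at one-second resolution
print(per_minute)       # the first minute averages to 175 ms, the second to 100 ms
```

At one-minute resolution the tenfold spike shows up only as a modest bump in the average, which is why recent data usually benefits from a fine resolution.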
Ways to Keep Costs Low: Data Sampling and Retention
Each team has to continuously make accurate trade-offs between the cost of observing their service or application and the value of the insights the platform drives. This sweet spot will be different for every service, as some have higher business value than others, so those services can capture more dimensions, with higher cardinality, better resolution and longer retention than others.
This constant balancing of cost and derived value also means there is no easy fix. There are, however, some things you can do to keep costs in check.
1. Use Downsampling
Downsampling is a tactic to reduce the overall volume of data by lowering the sampling rate of data. This is a great strategy to apply, as the value of the resolution of metrics data diminishes as it ages. Very high resolution is only really needed for the most recent data, and it’s perfectly OK for older data to have a much lower resolution so it’s cheaper to store and faster to query.
Downsampling can be done by reducing the rate at which metrics are emitted to the platform, or it can be applied to stored data as that data ages. In the second case, fresh data has the highest frequency, but more and more intermediate data points are removed from the data set as it ages. It is, of course, important to be able to apply resolution reduction policies at a granular level using filters, since different services and application components across different environments need different levels of granularity.
By downsampling resolution as the metrics data ages, the amount of data that needs to be saved can be reduced by an order of magnitude or more. For example, downsampling from one-second to one-minute resolution leaves 60 times fewer data points to store. It also vastly improves query performance.
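As a rough illustration of downsampling by age, the Python sketch below keeps recent points untouched and averages older points into fixed windows. The function name, its parameters and the sample series are assumptions made for this example, not the behavior of any particular platform.

```python
from collections import defaultdict


def downsample_aged(points, now, max_age, bucket):
    """Keep points from the last `max_age` seconds as they are and average
    older points into `bucket`-second windows.

    `points` is a list of (timestamp_in_seconds, value) tuples.
    """
    cutoff = now - max_age
    fresh = [(ts, value) for ts, value in points if ts >= cutoff]

    # Group every older point under the start time of its window.
    windows = defaultdict(list)
    for ts, value in points:
        if ts < cutoff:
            windows[ts - ts % bucket].append(value)

    aged = [(start, sum(vals) / len(vals)) for start, vals in sorted(windows.items())]
    return aged + fresh


# Hypothetical series: one point per second for one hour.
now = 3600
points = [(ts, float(ts % 7)) for ts in range(now)]

# Anything older than 10 minutes is reduced to one-minute averages.
result = downsample_aged(points, now, max_age=600, bucket=60)

print(len(points))  # 3600 points before
print(len(result))  # 650 points after: 50 one-minute averages plus 600 fresh points
```

The older 50 minutes shrink from 3,000 points to 50, the same 60-fold reduction described above, while the most recent 10 minutes keep full resolution.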
A solid downsampling strategy prioritizes which metrics data (per service, application or team) to downsample and helps determine a staggered aging strategy. Often organizations adapt a week-month-year strategy to their exact needs: keeping high-resolution data for a week (or two), stepping down resolution after a month (or two) and, after a year, keeping a few years of data. With this strategy, teams retain the ability to do historical trend analysis with week-over-week, month-over-month and year-over-year comparisons.
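To get a feel for what a staggered strategy saves, the following Python sketch counts stored points per time series under one possible set of tiers. The tier lengths and intervals are hypothetical values chosen for the arithmetic, not recommendations.

```python
SECONDS_PER_DAY = 86_400

# Hypothetical tiers: (label, days spent in the tier, seconds between points).
TIERS = [
    ("raw", 7, 10),                # first week at 10-second resolution
    ("medium", 30, 60),            # next month at one-minute resolution
    ("long-term", 365 * 2, 3600),  # next two years at one-hour resolution
]


def points_per_tier(tiers):
    """Number of stored points per time series in each tier."""
    return {label: days * SECONDS_PER_DAY // interval for label, days, interval in tiers}


def points_without_downsampling(tiers):
    """Points stored if the whole period stayed at the finest resolution."""
    total_days = sum(days for _, days, _ in tiers)
    finest = min(interval for _, _, interval in tiers)
    return total_days * SECONDS_PER_DAY // finest


tiered = points_per_tier(TIERS)
print(tiered)                              # raw: 60480, medium: 43200, long-term: 17520
print(sum(tiered.values()))                # 121200 points per series
print(points_without_downsampling(TIERS))  # 6626880 points per series
```

Under these assumed tiers, each series needs roughly 55 times fewer points than it would at a constant 10-second resolution, while the long-term tier still supports year-over-year comparisons.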
2. Lower Retention
By lowering retention, we’re tweaking the total amount of metrics data kept in the system by discarding older data (optionally after downsampling first).
By classifying and prioritizing data, we can get a handle on what data is ephemeral and only needed for a relatively short amount of time, such as dev or staging environments or low-business-value services, and which data is important to keep for a longer period of time to refer back to as teams are triaging issues. Again, being able to apply these retention policies granularly is key for any production-ready system, as a one-size-fits-all approach just doesn’t work for every metric.
For production environments, keeping a long-term record, even at a lower resolution, is key to being able to look at longer trends and being able to compare year-over-year.
However, we don’t need all dimensions or even metrics for this long-term analysis. Helping teams choose what data to keep, at a low resolution, and what metrics to discard after a certain time will help limit the amount of metrics data that we store, but never look at again.
Similarly, we don’t need to keep data long term for some kinds of environments, such as dev, test or staging environments. The same is true for services with low business value or non-customer-facing (internal) services. By choosing to limit retention for these, teams can balance their ability to query health and operational state without overburdening the metrics platform.
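Granular retention can be pictured as an ordered list of rules matched against a series’ labels. The Python sketch below is one way to express that idea. The rule format, label names and retention periods are all hypothetical and do not reflect any specific product’s configuration.

```python
from dataclasses import dataclass


@dataclass
class RetentionRule:
    match: dict          # label values a series must carry for the rule to apply
    retention_days: int  # how long matching series are kept


# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    RetentionRule({"env": "dev"}, 3),
    RetentionRule({"env": "staging"}, 7),
    RetentionRule({"env": "prod", "tier": "internal"}, 30),
    RetentionRule({"env": "prod"}, 395),
]
DEFAULT_RETENTION_DAYS = 14


def retention_for(labels, rules=RULES, default=DEFAULT_RETENTION_DAYS):
    """Return the retention in days for a series with the given labels."""
    for rule in rules:
        if all(labels.get(key) == value for key, value in rule.match.items()):
            return rule.retention_days
    return default


print(retention_for({"env": "dev", "service": "checkout"}))    # 3
print(retention_for({"env": "prod", "tier": "internal"}))      # 30
print(retention_for({"env": "prod", "service": "checkout"}))   # 395
```

Putting the more specific rules first lets internal production services get a shorter retention than customer-facing ones, without a one-size-fits-all setting.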
In the next and final installment of this series, we’ll include a primer on classifying different types of cardinality before diving into our last two tips on reducing observability costs: lowering retention and using aggregation. Stay tuned.