Achieving the right outcomes based on the explosion of observability data, without mortgaging your business, is now a trending Twitter topic.
“Paying more for logging and metrics than you pay to run your app still fascinates me,” software engineer Elan Hasson chimed in on a thread about some of the big players in the observability space.
It’s remarkable how common this situation is: an organization pays more for its observability data (typically metrics, logs, traces and sometimes events) than it does for its production infrastructure.
And for what purpose? If these organizations could draw a straight line from more data to better outcomes — higher levels of availability, happier customers, faster remediation, more revenue — this tradeoff might make sense.
But in many cases, this isn’t true. The community agrees — later in the thread, Hasson adds, “Paying more for logging/metrics/tracing doesn’t equate to a positive user experience. Consider how much data can be generated and shipped. $$$. You still need good people to turn data into action.”
I couldn’t agree more.
Cloud native applications and infrastructure are emitting increasing amounts of observability data — according to ESG, 71% of companies believe their observability data (metrics, logs, traces) is growing at a concerning rate — yet outcomes are getting worse, not better.
How do we know?
According to a study from PagerDuty, critical incident volume across the platform rose 19% from 2019 to 2020, and it continues to rise at an ever-faster rate. So if observability data continues to grow at an unsustainable pace while outcomes are getting worse, it’s time to rethink our approach to controlling observability data. Here are four ways we can start to tackle this problem:
Retention: Most companies default to 13 months of retention for all data. But in a modern cloud native architecture, where we deploy multiple times a day and a container is only around for a couple of hours, a huge amount of observability data does not need to be retained for 13 months. One tactic for reducing your data footprint is setting the optimal retention period for each data type. For example, you might only need to keep observability data from your lab environment for two weeks if the environment is torn down and rebuilt every two weeks anyway.
Resolution: This refers to how frequently data is emitted, for example tracking CPU every 10 seconds versus every minute versus every hour. As with retention, one size does not fit all. In a continuous integration/continuous delivery (CI/CD) use case, where deploys are automated, tracking every second or every 10 seconds makes a lot of sense. In contrast, other use cases, such as capacity planning or long-term trend analysis, don’t require per-second data. A small change here can have a big impact: measuring every 10 seconds rather than every minute means six times as much data to produce and store.
Efficient storage: A lot of observability data is time-series data, which means it’s a measurement of something over a period of time. Relational databases, key-value stores and blob stores are not efficient ways to store this data. Instead, you need a storage solution, such as a time-series database, that is purpose-built for this type of data.
Data aggregation: Arguably, this is the most impactful tactic for taming data growth. A common pattern among companies is to emit a high volume of data that also has a lot of dimensions (also known as high cardinality). The goal is to be able to slice and dice your data by those dimensions to quickly home in on where a problem is occurring. This offers a huge advantage, but it also produces a lot of low-value data. By keeping aggregates for the combinations of dimensions that provide useful insights while discarding a large amount of the raw underlying data, organizations can significantly reduce their data footprint.
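To make the arithmetic behind these tactics concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not a description of how any particular observability product works: the function names, the retention values, the label names and the sample data points are all hypothetical. It shows how resolution and retention multiply into stored data volume, and how aggregating away a high-cardinality dimension (here, a per-pod label) shrinks the number of series.

```python
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds: int) -> int:
    """Data points a single time series emits per day at a given resolution."""
    return SECONDS_PER_DAY // interval_seconds


def stored_samples(interval_seconds: int, retention_days: int) -> int:
    """Data points held for one series: resolution multiplied by retention."""
    return samples_per_day(interval_seconds) * retention_days


def aggregate(points, keep):
    """Sum values over the dimensions named in `keep`, dropping all other labels."""
    totals = defaultdict(float)
    for labels, value in points:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical policies: lab data at 1-minute resolution kept for two weeks,
# production data at 10-second resolution kept for roughly 13 months (395 days).
lab_footprint = stored_samples(interval_seconds=60, retention_days=14)
prod_footprint = stored_samples(interval_seconds=10, retention_days=395)

# Hypothetical raw request counts labeled by service, region and pod.
# The pod label is the high-cardinality dimension in this example.
raw_points = [
    ({"service": "checkout", "region": "us-east", "pod": "a1"}, 120.0),
    ({"service": "checkout", "region": "us-east", "pod": "b2"}, 80.0),
    ({"service": "checkout", "region": "eu-west", "pod": "c3"}, 50.0),
    ({"service": "search", "region": "us-east", "pod": "d4"}, 300.0),
]

# Keep only the dimensions that are useful for slicing; per-pod detail is discarded.
by_service_region = aggregate(raw_points, keep=("service", "region"))
```

In this toy example, four raw series collapse into three aggregated ones. In a real system, where pods are replaced many times a day, dropping a per-pod dimension would typically remove far more series than that.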
In an era of massive observability data growth, the organizations that can efficiently use their data to drive positive outcomes will come out on top. While organizations don’t need to tackle all four of these approaches at once, these actions will set a solid foundation for achieving this objective.
PagerDuty is a sponsor of The New Stack.
Feature Photo by John Moeses Bauan on Unsplash.