Grappling with Observability Data Management
Most observability environments use a range of data stores for the different data they need to ingest and analyze. Each plays a unique role, with different pros and cons.
A constant refrain from our customers is the need to merge their sprawling monitoring and observability environments. Many want a single tool or pane of glass acting as the sole source of truth. While the desire for consolidation is understandable, infrastructure environments are composed of a range of best-of-breed software, each with its own benefits and challenges. A key factor in this fragmentation is the range of underlying data stores used to manage the terabytes of observability data flowing into these platforms daily. Each optimizes for different types of data and analysis.
Listed below are some of the most common data stores presently in use in enterprise organizations.
Time Series Databases
As the name implies, time-series databases (or TSDBs) optimize for time series data — things like stock tickers, medical monitoring device data, weather data, and so on. In the observability space, TSDBs are commonly used to store time-stamped metrics about application or system performance. These metrics are collected at fixed time intervals, called the resolution.
Why not use a relational or non-relational database, like PostgreSQL, Oracle, Cassandra or MongoDB? You certainly can, but the optimizations that TSDBs offer make them ideal for metrics storage and analysis. Unlike more general-purpose databases, TSDBs offer time-aware analysis functions, filtering, compression and summarization. The trade-off is specialization: TSDBs are optimized to store metrics, not for full log parsing and retention.
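To make the idea of resolution and time-aware summarization concrete, here is a minimal sketch in plain Python of the kind of downsampling a TSDB performs natively. The function name and data shapes are illustrative, not any real TSDB's API:

```python
from collections import defaultdict

def downsample(points, resolution):
    """Summarize (timestamp, value) samples into fixed-width time buckets.

    A toy illustration of time-aware summarization: each bucket covers
    `resolution` seconds, and we keep the average value per bucket. A real
    TSDB does this internally, alongside compression and retention policies.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - (ts % resolution)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# CPU utilization samples collected every 10 seconds (the raw resolution),
# rolled up to a coarser 30-second resolution
samples = [(0, 40), (10, 60), (20, 50), (30, 90), (40, 70)]
print(downsample(samples, 30))  # {0: 50.0, 30: 80.0}
```

The same rollup in a general-purpose database requires you to build the bucketing logic yourself, which is exactly the convenience a TSDB sells.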
Object Stores
On the opposite end of the spectrum, object stores are incredibly popular data stores today because of their flexibility. Unlike traditional file systems that store data hierarchically, object stores keep data as, well, objects. The store itself doesn’t know what’s in the object, meaning you can put whatever you want in there. Object stores have become the backbone of several data management concepts, like data lakes and cloud data warehouses.
Object stores are ideal in the observability space, because you can store the vast amounts of data that observability requires at one-eighth the cost of block or file storage. If you’re using an observability pipeline or similar infrastructure, you can replay your data from object stores — effectively giving you a full Kappa architecture with low cost and high flexibility.
The power of object stores is their flexibility, but it’s also one of the downsides. Getting visibility into what’s in those objects and why they exist is often difficult without a robust metadata management practice. You still must manage that data, even if you store it cost-effectively. Like many things in IT, there’s no free lunch.
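A toy in-memory sketch illustrates both points: the store treats data as opaque bytes in a flat key space, so user-supplied metadata is the only handle you have for knowing what an object is and why it exists. The class and keys here are hypothetical; real stores like Amazon S3 follow the same model conceptually:

```python
class ObjectStore:
    """Toy in-memory object store: a flat key space mapping keys to
    opaque bytes plus user-defined metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # The store never inspects `data`; it just keeps the bytes.
        self._objects[key] = (data, metadata or {})

    def find(self, **tags):
        # Without metadata discipline, answering "what's in here and why?"
        # means downloading and inspecting every object.
        return [k for k, (_, meta) in self._objects.items()
                if all(meta.get(t) == v for t, v in tags.items())]

store = ObjectStore()
store.put("logs/2024/01/app-01.gz", b"...", {"source": "nginx", "env": "prod"})
store.put("logs/2024/01/app-02.gz", b"...", {"source": "nginx", "env": "dev"})
print(store.find(env="prod"))  # ['logs/2024/01/app-01.gz']
```

The `find` call only works because the writer tagged each object at put time; that discipline is the metadata management practice the paragraph above describes.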
Indexed Log Analytics Platforms
These platforms, like Elastic or Splunk, are ideal for investigating performance problems or errors across a large corpus of data, primarily through a search-based interface. Indexed log analytics platforms consume event-based log data, like access logs or server logs. With their own proprietary indexing and storage formats, and by stripping low-value elements from ingested data, they perform exceptionally well at search.
The downside of indexed log analytics platforms is the amount of infrastructure these systems require. Ingesting tens or even hundreds of terabytes into these systems every day is common, resulting in massive infrastructure budgets to support their data growth challenges. The other downside is the pricing model. These platforms commonly charge based on the amount of data they ingest daily, limiting their usefulness when budgets fail to grow at the same rate as the data you want to put in them.
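The core idea behind these platforms is the inverted index: map each term to the events containing it, so a search looks up terms instead of scanning every event. This is a deliberately simplified sketch; real platforms use far more sophisticated (and proprietary) on-disk formats:

```python
from collections import defaultdict

def build_index(log_lines):
    """Build a minimal inverted index: term -> set of line numbers."""
    index = defaultdict(set)
    for lineno, line in enumerate(log_lines):
        for term in line.lower().split():
            index[term].add(lineno)
    return index

def search(index, *terms):
    # Intersect the posting lists so every term must appear in the event.
    results = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*results)) if results else []

logs = [
    "GET /health 200",
    "POST /login 500 upstream timeout",
    "GET /login 200",
]
idx = build_index(logs)
print(search(idx, "/login", "500"))  # [1]
```

The index is built once at ingest time, which is also why these platforms price on ingest: that is where the heavy lifting, and the infrastructure cost, lives.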
Data Lakehouses
A new data management concept, introduced in 2020, is the data lakehouse — which attempts to combine the elements of a data warehouse with a data lake. Let’s take a step back from the lakehouse and review what makes a data warehouse and data lake different.
A data warehouse is a central repository of information with specialized data structures, composed from data ingested from transactional systems throughout the business. Data warehouses are highly optimized for business intelligence and analytics, and are organized into multiple subject and time domains. They’re great when you have a lot of known data and you want to ask repetitive questions over it — especially if you need to support hundreds or thousands of concurrent users.
If a data warehouse is appropriate when you know the data and questions you want to ask, the data lake is for when you don’t know one or either of those things. The data lake is a system of exploration, or a question development environment. Commonly built on object storage, a data lake ingests any type of data and makes it available for analysis for highly skilled data scientists and data engineers.
The data lakehouse is a composite of these two extremely different ideas, which is what makes it challenging to build and use. By trying to do everything, it’s likely the data lakehouse will perform poorly in times of high demand — like during quarter close, when many reports must be generated and ad hoc analysis is also ongoing.
For infrastructure, operations and SecOps teams, the lakehouse looks appealing because of its purported flexibility — but it’s not optimized for observability data such as logs, metrics, events and traces. It doesn’t have any opinion about how that data should be structured, optimized and consumed. Who wants to write a bunch of Spark or Python code to analyze data, when existing systems will do that for you with search?
Which One Do You Need?
Each of these data stores has benefits and drawbacks because each is optimized for different use cases and interaction patterns. Most of our customers use a mix of indexed log analytics platforms, time-series databases, and object storage. It’s this mix of platforms that causes the pain: each option needs its own data ingestion strategy, often requiring unique skills and tools.
This is where an observability pipeline comes in. Using an observability pipeline to abstract the sources of observability data from its consumers gives our customers flexibility to send data to the right platform, in the right format, at the right time — regardless of the underlying data store.
A common scenario with our customers is sending data from one agent to multiple destinations. For example, metrics from the Elastic agent land in Elastic’s service, while a full-fidelity copy of the data lands in an object store like Amazon S3.
Upgrades are another scenario where an observability pipeline comes into play. When a new version of your TSDB is released, you can send data to both instances and test the updated release with live data. This removes the risk of cutting over with production data before thorough testing has occurred. Sending data to multiple destinations also allows you to test new platforms with live data, without making massive changes to the rest of your infrastructure.
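The routing described above can be sketched as a fan-out: each event is delivered to every destination whose predicate matches. This is a minimal illustration of the pattern, not any pipeline product's API; the sinks here are plain lists standing in for a TSDB, a log analytics platform, and an object store:

```python
def route(event, destinations):
    """Deliver an event to every destination whose predicate matches.

    `destinations` is a list of (predicate, sink) pairs. Because an event
    can match several predicates, one stream can feed a metrics store, a
    search platform, and a full-fidelity archive at the same time.
    """
    for matches, sink in destinations:
        if matches(event):
            sink.append(event)

metrics_sink, logs_sink, archive_sink = [], [], []
destinations = [
    (lambda e: e["type"] == "metric", metrics_sink),  # e.g. a TSDB
    (lambda e: e["type"] == "log", logs_sink),        # e.g. Elastic or Splunk
    (lambda e: True, archive_sink),                   # full fidelity, e.g. S3
]

for event in [{"type": "metric", "cpu": 0.7}, {"type": "log", "msg": "timeout"}]:
    route(event, destinations)

print(len(metrics_sink), len(logs_sink), len(archive_sink))  # 1 1 2
```

Testing an upgrade is just one more entry in `destinations` pointing at the new instance; removing it later touches nothing else in the pipeline.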
Observability data takes several forms, and the platforms analyzing it each have different optimizations and make different assumptions about how that data should be used. There’s no one-size-fits-all solution, which is why pursuing an options-based strategy with an observability pipeline gives you the most flexibility and value from your observability data.