Modern Observability Is a Single Braid of Data
Observing systems has always meant two things: identifying what happened, and what caused it to happen. To do that, we’ve always used logs and metrics. (A few people also used tracing, but as a niche tool for performance analysis.)
Hence, the whole three pillars thing.
The Life We Live
This pillar narrative is fundamentally wrong. Wrong, I tell you!
The problem with “three separate pillars” is how we use them. We don’t use these tools in isolation when observing our systems. To be useful, we have to bring all of them together, like some kind of awkward family group hug.
Here’s what actually happens.
An alert happens because one metric has gone squiggly. So, we load up a dashboard and squint to try and find other metrics that went squiggly at the same time. Literally, that is what we do. In the past I’ve even put a ruler or a piece of paper on the screen to look at what lined up. (A high five and a facepalm if you’ve done this as well.)
Then based on that information, we guess what the problem might be. Crazed shrews that we are, we scurry around scrounging together some logs to try and reconstruct the chain of events while also digging into conf files looking for anything surprising.
But which logs? Which conf files? The metrics dashboard knows very little. We have to infer this information, then look in separate tools that know nothing of the alerts and metrics that we care about.
This is Why We Suffer
The point is, correlating activity across multiple data sources is a precondition for actually solving our problems. This is where the narrative drives off into the woods. The pillars create artificial barriers between our data sources.
It is true that we’ve had to work this way due to the way traditional monitoring and application performance management (APM) tools happen to have been implemented. But that is actually an implementation detail, not canon.
Think of it as a historical accident. We slowly accreted different tools over time and as a result, the data happens to live in several disconnected data stores, which requires a human to link everything together in their brain.
But it doesn’t have to be like this. So let’s pause for a minute, clear our heads and reevaluate this situation from the beginning, using OpenTelemetry as our guide.
Our journey starts with events. Events are a collection of attributes. We want these attributes structured so you can index your events properly and make them searchable. We also want these attributes structured in an efficient manner to avoid sending duplicate information.
Some of those attributes are unique to the event; these are event attributes. The event timestamp, log message, exception details, etc., are all event attributes.
But there are not as many of these as you might think. In fact, most attributes are common to a sequence of events and recording this information over and over on every event would be a waste.
We can pull these attributes out into envelopes that surround the events, where they can be written once. Let’s call these envelopes context.
There are two types of context: static and dynamic. In OpenTelemetry, these context types are called resources and spans.
Static context (resources) is where the event is taking place. These attributes describe where the program is executing and how it was configured: service name, version, region and so on. Any important information in a conf file can become a resource.
Once the program starts, the values of these resource attributes usually do not change. But with span attributes, the values change every time the operation executes.
Dynamic context (spans) is how the event is happening. Request start time, duration, HTTP status and HTTP method are all examples of standard span attributes for an HTTP client operation. In OpenTelemetry, we call these attributes semantic conventions, and we try to be consistent in the way they are recorded. But there can also be application-specific attributes added by application developers, such as Project ID and Account ID.
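To make the envelope idea concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK; the class names (Resource, Span, Event), the flatten helper and the attribute keys are all assumptions made up for illustration. The point is only that shared attributes are written once and each event carries just what is unique to it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Resource:
    """Static context: written once per running program (hypothetical type)."""
    attributes: dict[str, Any]


@dataclass
class Event:
    """Only the attributes that are unique to a single event."""
    timestamp: float
    attributes: dict[str, Any]


@dataclass
class Span:
    """Dynamic context: written once per operation (hypothetical type)."""
    resource: Resource
    attributes: dict[str, Any]
    events: list[Event] = field(default_factory=list)

    def flatten(self) -> list[dict[str, Any]]:
        """Expand every event into a full, searchable record.

        The shared attributes are stored once but can still be joined
        back onto each event when an index needs them.
        """
        return [
            {
                **self.resource.attributes,
                **self.attributes,
                "timestamp": event.timestamp,
                **event.attributes,
            }
            for event in self.events
        ]


# Illustrative values only; the keys are not guaranteed to match the
# official semantic-convention names.
resource = Resource(
    {"service.name": "checkout", "service.version": "1.4.2", "region": "us-east-1"}
)
span = Span(resource, {"http.method": "GET", "http.status_code": 500, "project.id": 22})
span.events.append(Event(1.0, {"message": "cache miss"}))
span.events.append(Event(1.2, {"message": "upstream timeout"}))

records = span.flatten()
```

Each entry in records carries the service name, HTTP status and project ID, even though each of those values was written only once.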
Dynamic context is also where causality comes in — we want to know what has led to what. To do that, we need to add four attributes to our spans: TraceID, SpanID, ParentSpanID, Operation Name.
With that, all of our events are now organized into a graph, representing their causal relationship. This graph can now be indexed in a variety of ways, which we will get to in a bit.
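As a rough sketch of how those four attributes turn a flat pile of spans into a graph, the snippet below rebuilds a tree from parent pointers. The span records are plain dictionaries with made-up IDs and operation names, and build_tree and render are hypothetical helpers, not part of any OpenTelemetry API.

```python
from collections import defaultdict

# Made-up spans from a single trace. Real IDs are long random values.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "GET /checkout"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "charge-card"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "a", "name": "reserve-stock"},
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "b", "name": "POST /payments"},
]


def build_tree(spans):
    """Group spans under their parent; spans with no parent are roots."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent_span_id"] is None:
            roots.append(span)
        else:
            children[span["parent_span_id"]].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return the operation names as an indented outline, parents first."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


roots, children = build_tree(spans)
tree_lines = render(roots[0], children)
print("\n".join(tree_lines))
```

The outline shows what led to what: the payment request sits under the card charge, which sits under the checkout request.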
This event graph leads us to recognize the first artificial and unnecessary separation between two of the three pillars. Tracing is actually just logging with better indexes. When you add the proper context to your logs, you get traces almost by definition.
Now, because tracing has a niche history, perhaps some of you don’t believe me. So try this thought experiment.
Imagine you are investigating an incident. What do you do when you find a relevant log? An error perhaps. What’s the very first thing you want to do? You want to find all of the other relevant logs!
And how do you find all of the relevant logs? Pain and suffering, my friends, pain and suffering.
Think about how much time and effort you put into gathering those logs through searching and filtering; that is time spent gathering data, not time spent analyzing data. And the more logs you have to paw through — an ever growing pile of machines executing an ever increasing number of concurrent transactions — the harder it is to gather up that tiny sliver of logs that are actually relevant.
However, if you have a TraceID, gathering those logs is just one single lookup. Indexing by TraceID allows your storage tool to do this work for you automatically; you find one log and you have all the logs in the transaction right there, with no extra work.
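Here is a small sketch of that single lookup, assuming every log record carries a trace_id field. The records and the in-memory index are invented for illustration; a real storage tool would maintain this index for you.

```python
from collections import defaultdict

# Made-up log records from two interleaved transactions.
logs = [
    {"trace_id": "t1", "message": "request received"},
    {"trace_id": "t2", "message": "request received"},
    {"trace_id": "t1", "message": "payment declined", "level": "error"},
    {"trace_id": "t2", "message": "request completed"},
    {"trace_id": "t1", "message": "request failed", "level": "error"},
]

# Build the index once, at write time.
by_trace = defaultdict(list)
for record in logs:
    by_trace[record["trace_id"]].append(record)

# You find one relevant log...
error = next(r for r in logs if r.get("level") == "error")

# ...and every other log in the same transaction is one lookup away.
related = by_trace[error["trace_id"]]
messages = [r["message"] for r in related]
```

No searching, no filtering by timestamp and hostname, no guessing which lines belong together.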
Given that, why would you ever want “logs” without these “trace” IDs?
Let’s Talk About
Now that we’ve established what events are, let’s talk about events in aggregate.
We look at event, span and resource attributes in aggregate to find patterns.
The value of an attribute could occur too often, or not often enough, in which case we want to count how often these values are occurring. Or the value may exceed a certain threshold, in which case we want to gauge how the value is changing over time. Or we may want to look at the spread of values as a histogram.
We usually call this kind of analysis “metrics,” and we tend to think about it as somehow separate from logs (aka events). But it’s important to remember that every metric emitted correlates with a real event that actually happened. In fact, any metric you might see in a typical dashboard could easily be generated from the standard resource, span and event attributes provided by OpenTelemetry. So, once again, the separation between these pillars is starting to look nebulous.
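To illustrate, the sketch below derives a count, a gauge and a histogram from nothing but span attributes. The span records, attribute keys and bucket boundaries are assumptions chosen for the example, and "latest value" is just one simple way to read a gauge.

```python
from collections import Counter

# Made-up spans, in the order they completed.
spans = [
    {"http.status_code": 200, "duration_ms": 12},
    {"http.status_code": 200, "duration_ms": 48},
    {"http.status_code": 500, "duration_ms": 950},
    {"http.status_code": 200, "duration_ms": 110},
    {"http.status_code": 500, "duration_ms": 1200},
]

# Count: how often does each value occur?
status_counts = Counter(s["http.status_code"] for s in spans)

# Gauge: what is the most recently observed value?
last_duration = spans[-1]["duration_ms"]

# Histogram: how are the values spread across buckets?
bounds = [50, 100, 500, 1000]  # upper bounds in milliseconds (arbitrary)


def bucket(value):
    """Return the label of the first bucket whose upper bound fits the value."""
    for bound in bounds:
        if value <= bound:
            return f"<={bound}"
    return f">{bounds[-1]}"


histogram = Counter(bucket(s["duration_ms"]) for s in spans)
```

Every number here traces back to specific spans, so you can always get from the aggregate to the events behind it.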
It really gets mixed up when we start to look at correlations.
“High latency is correlated with kafka.node => 6”
“Increased error rate is correlated with project.id => 22”
“A caused B”
This type of automated correlation detection is a powerful tool. We want these correlations, because identifying them is often our first real clue and leads us toward identifying the cause and developing a solution.
Correlations may occur between attributes in a span. Correlations may occur between spans in a trace. Correlations may occur between traces and resources.
Therefore, individual metrics are not enough. We have to make comparisons across multiple metrics to find these clues. If our tools are separate, the context will be missing and we will be forced to do this analysis in our heads. That is a lot of work.
But, correlations are objective facts. When events are properly contextualized as traces, they can be analyzed in aggregate and these relationships can be extracted automatically.
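As a toy example of extracting such a relationship automatically, the sketch below asks which attribute values are over-represented among failing spans compared with all spans. The correlated_attributes function, the lift threshold and the sample data are all hypothetical; production tools use more careful statistics, but the shape of the question is the same.

```python
from collections import Counter


def correlated_attributes(spans, is_bad, min_lift=1.5):
    """Return (attribute, value) pairs that are over-represented in bad spans.

    Lift is the share of bad spans with the pair divided by the share of
    all spans with the pair. A lift of 1.0 means no difference.
    """
    bad = [s for s in spans if is_bad(s)]
    if not bad:
        return []
    all_counts = Counter(pair for s in spans for pair in s["attributes"].items())
    bad_counts = Counter(pair for s in bad for pair in s["attributes"].items())
    results = []
    for pair, n_bad in bad_counts.items():
        lift = (n_bad * len(spans)) / (len(bad) * all_counts[pair])
        if lift >= min_lift:
            results.append((pair, round(lift, 2)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Made-up spans: errors only ever occur for project 22.
spans = [
    {"error": False, "attributes": {"project.id": 7, "kafka.node": 1}},
    {"error": False, "attributes": {"project.id": 7, "kafka.node": 2}},
    {"error": True, "attributes": {"project.id": 22, "kafka.node": 1}},
    {"error": False, "attributes": {"project.id": 9, "kafka.node": 2}},
    {"error": True, "attributes": {"project.id": 22, "kafka.node": 2}},
    {"error": False, "attributes": {"project.id": 9, "kafka.node": 1}},
]

suspects = correlated_attributes(spans, is_bad=lambda s: s["error"])
```

In this sample the project ID stands out while the Kafka node does not, which is exactly the kind of first clue described above. Deciding what it means is still up to you.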
Once we’ve reached this point, what is the difference between traces, logs, and metrics? How is there any separation at all?
Tying It All Together
Correlating all of this data is what OpenTelemetry is built to do.
It is true that OpenTelemetry will have interfaces for metrics, traces and logs. However, under the hood, everything is automatically connected because all these instruments share the same context.
Therefore, distributed tracing isn’t a niche tool for measuring latency; it’s a tool for defining context and causality. It’s the glue that holds everything together.
I’m often asked how OpenTelemetry will actually change our practice of observability. What is actually new? And how will it be any different? To answer that, it helps to separate two kinds of information.
Subjective information: based on interpretations, points of view and judgment.
Objective information: fact-based, measurable and observable.
We all know there is going to be a lot of hype around artificial intelligence and observability. The kind of AI we are talking about will not be able to make subjective decisions with any degree of accuracy. In the end, identifying real problems is just one more manifestation of the halting problem. That part is still on you.
However, identifying correlations and fetching all of the relevant information so that you can browse through it — that is all objective decision-making. Computers can totally do that for you!
So, what is OpenTelemetry, really, and how does it represent modern observability?
OpenTelemetry is data structures that enable automated analysis.
How much time do we spend trying to gather data and identify correlations before we can propose hypotheses and verify causations?
The answer is “a lot.” A lot of time. So much time. And saving that time is significant enough to change the quality of our practice.
That is the real difference. No pillars, just a single braid of structured data.