
How Observability Helps Troubleshoot Incidents Faster

With complex modern systems, you have to rely on telemetry tools and follow the data to figure out what went wrong, why and how to fix it.
Mar 30th, 2022 7:00am by Savannah Morgan
Featured image via Pixabay.

It all starts with the dreaded alert. Something went awry, and it needs to be fixed ASAP. Whether it’s the middle of the night and you’re the on-call responder, or the middle of the afternoon and your whole team is working together to ship a bundle of diffs, an incident is extremely disruptive to your business — and often very expensive, making every minute count.

So how can observability (o11y for short) help teams save precious time and resolve incidents faster? First, let’s explore the changing landscape from monitoring to observability.

Debugging Using Traditional Monitoring Tools

Savannah Morgan
Savannah is senior technical customer success manager at Honeycomb. She is passionate about helping users find creative solutions for complex problems. When she is off the clock, Savannah can be found at the park with her family, binge-watching Netflix or spoiling her big pup, Bruce.

The key to resolving an incident quickly is to rapidly understand why things went wrong, where in your code it’s happening, and most of all, who it affects and how to fix it.

Most of us learned to debug using static dashboards powered by metrics-based monitoring tools like Prometheus or Datadog, plus a whole lot of instinct and intuition — the more experience you have, the better you get at guessing what’s happening. This isn’t exactly the most scientific or methodical approach. You can’t use static dashboards to slice and dice, dig deeper into the data or follow a trail of breadcrumbs from the problem to the cause because they are exactly that: static. What you can do is build up a mental library of things that have broken in the past and get good at pattern-matching dashboards from one incident to the next.

“Hmmm,” you might think, “this feels like the caching problem we had last Thanksgiving after the MySQL replicas for this shard started lagging.” So then you dive into your dashboards and dig up the graphs for cache latency and MySQL replication lag, and voila! — hopefully, they confirm your hunch.

When This Works — And When It Doesn’t

This approach works reasonably well when your system repeatedly fails in predictable ways. This makes sense because traditional monitoring tools — dashboards, logs, and metrics — were originally built for monolithic systems with far fewer moving pieces. For a long time, most of the complexity of the application was bound up inside the application code itself, and there were relatively few component types.

In a LAMP stack model, for instance, you generally have a web tier, load balancers, a database and your application code, which you may even treat like a black box to some extent. You would have monitoring checks for each tier (connection counts for the database, requests/errors/latency for your web servers, health stats and perhaps a few more metrics for your apps), and a trusty playbook that covers how to react to each alert and resolve it.

But modern systems are very different. Modern systems are not complicated; they are complex, and they don’t fail over and over again in familiar ways. You are likely to have not one database but many storage systems; your services are likely ephemeral, dynamic, and autoscaled — and there are many of them. You may rely on third-party APIs, serverless or components hosted by other vendors. You cannot possibly write up a playbook to cover every failure scenario. Every time you get an alert, it is likely to be something you have never experienced before and couldn’t have predicted.

You cannot lean on your intuition, nor can you expect to be able to reason about these systems a priori or to have personally experienced these failures before. Instead, you need to learn to rely on your tools. Let your telemetry be your eyes and ears. You need to learn how to ask questions, form hypotheses, use your tooling to swiftly validate them or invalidate them and methodically follow the data to find the answers to your question, step by step, every time.

Instrumentation with Metrics vs. Events

How is this possible? The answer lies in the fast feedback loops, explorability and rich view of the system’s inner life that you can only get with observability. To explain how, we must begin with a high-level overview of instrumentation and how it differs between monitoring and observability.

Monitoring tools are built on top of a data type called the “metric,” a single number with some tags appended to it. For example, here are three custom metrics:
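
(A hypothetical set, shown here in DogStatsD’s wire format; the metric names, values and tags are purely illustrative.)

    billing.requests.received:1|c|#endpoint:/export,status:200
    billing.request.duration_ms:127|ms|#endpoint:/export
    billing.cache.hit:1|c|#region:us-east-1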


When you are instrumenting your code, you might use code like this to create and submit those custom metrics:
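
(A minimal sketch, assuming the Datadog DogStatsD client for Python; the metric names mirror the hypothetical examples above.)

    from datadog import statsd  # assumes the Datadog Python client is installed and configured

    # Count a request, tagged with the endpoint and response status
    statsd.increment("billing.requests.received", tags=["endpoint:/export", "status:200"])

    # Record how long the request took, in milliseconds
    statsd.timing("billing.request.duration_ms", 127, tags=["endpoint:/export"])

    # Count a cache hit, tagged with the region that served it
    statsd.increment("billing.cache.hit", tags=["region:us-east-1"])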


You can draw a graph for any custom metric or combination of custom metrics you define in advance. For example, I could graph the count of requests received over time or graph the average, 50th, 90th, 95th, 99th percentile of the request duration. I could not, however, graph the count of requests received by my user ID, or the duration of the requests for my user; nor could I do something like “statsd.set(‘username’, my.name)” and then use string operators like “prefix matches ch*” or “contains within charity”. I am limited to the metrics I have defined upfront.

Instrumentation for observability works a bit differently. Instead of being based on metrics, o11y is based on arbitrarily wide structured data blobs (aka events), with a single blob per request, per service, as shown below:
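
(A heavily trimmed, hypothetical event; every field name and value here is illustrative, and a real event would be far wider.)

    event = {
        "trace.trace_id": "7e8f9a",
        "trace.span_id": "b1c2",
        "name": "export_tickets",
        "http.method": "GET",
        "http.route": "/export",
        "http.status_code": 200,
        "duration_ms": 1273.4,
        "app.user_id": 20109,
        "app.shopping_cart_id": "cart-8841",
        "db.query_normalized": "SELECT * FROM tickets WHERE user_id = ?",
        "db.duration_ms": 1121.9,
        "infra.hostname": "api-14",
        "build.id": "2022-03-29.3",
    }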


Instead of firing off StatsD metrics at random as you instrument your code, you initialize an empty structured event as soon as the request enters your service, then populate it with any telemetry you want — any parameters, but also anything that seems like it might be useful to your future self in retracing your steps or understanding your code: shopping cart ID, user ID, language internals, environment details, literally whatever you like. The more, the merrier. (If you’re not using a helper tool like Honeycomb, be sure to include the unique request ID and trace ID!) Also, be sure to instrument any database queries, web requests and so forth, capturing the elapsed times, raw queries, normalized queries and responses inside the blob.

Then, when the request is ready to exit or error from the service, it gets bundled up and shipped off to your observability tool as a single wide event. (The sample blob you see above would have hundreds of dimensions in a real, live instrumented system.)
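
(As a rough sketch of that whole pattern, here is what a request handler might look like when instrumented with Honeycomb’s libhoney library for Python. The handler, the field names and the write key are all hypothetical.)

    import time
    import libhoney  # Honeycomb's low-level events library for Python

    libhoney.init(writekey="YOUR_WRITE_KEY", dataset="ticket-api")  # hypothetical dataset name

    def handle_request(request):
        ev = libhoney.new_event()          # one wide event per request, per service
        start = time.time()
        ev.add_field("trace.trace_id", request.headers.get("X-Trace-Id"))
        ev.add_field("http.route", request.path)
        ev.add_field("app.user_id", request.user_id)
        try:
            response = do_the_work(request)  # stand-in for your actual business logic
            ev.add_field("http.status_code", response.status_code)
            return response
        except Exception as exc:
            ev.add_field("error", repr(exc))
            raise
        finally:
            # Bundle everything up and ship one wide event as the request exits
            ev.add_field("duration_ms", (time.time() - start) * 1000)
            ev.send()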

The beauty of gathering data this way is that the only aggregation performed is done around the request path, as experienced by your end user. All that detail gets transformed into rich context for the user, and if you pass the context along as parameters, it will even persist from hop to hop and service to service!

Why Instrumentation Is Fundamental

Metrics have no connective tissue to each other. You may emit 35 different metrics while your request is executing, but you can’t ask questions on the other side, such as, “Hey, are any client browser types significantly slower at executing the same requests?” or “Are the 500 response codes concentrated on a particular endpoint, or are they all coming from one specific user?” Events do have connective tissue. You can slice and dice and ask yourself new questions all day long, using any combination of details you thought to gather in your events.
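
(For example, the 500s question above becomes one ad hoc query over your events. This is a rough sketch in the shape of a Honeycomb-style query specification, reusing the hypothetical column names from the event example earlier.)

    # Count 500 responses, broken down by endpoint and by user, to see whether
    # the errors cluster on one route or come from one specific caller.
    error_query = {
        "time_range": 7200,  # the last two hours
        "calculations": [{"op": "COUNT"}],
        "filters": [{"column": "http.status_code", "op": "=", "value": 500}],
        "breakdowns": ["http.route", "app.user_id"],
        "orders": [{"op": "COUNT", "order": "descending"}],
    }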

Also? You have to aggressively manage the cost explosion of custom metrics, while adding more dimensions to a wide event is effectively free.

So the way you instrument your code and gather up your data is huge. But that’s not all there is to observability. You also have to ship this data into a tool that can process it adequately.

Observability Data = High-Cardinality and High-Dimensionality

The material difference between monitoring and observability tooling comes down to this: Observability tools handle data that is both high-cardinality and high-dimensionality, and they do it by encouraging explorability and experimentation, not static dashboards.

Those are some pretty big words, so let’s start by defining them.

“High-cardinality” refers to the number of unique elements in a set. Imagine you have a dataset with loads of data about 100 million users. Low-cardinality fields, or dimensions, would be the ones with not that many possible values. Like, “number_of_hands” is probably only going to have possible values of 0, 1 or 2, and the only value of “species” is presumably “human”. Other dimensions like “favorite_dessert” or “pairs_of_shoes” could have much higher cardinality, and any dimension that’s a unique ID (like “social_security_number”) will be the highest possible cardinality.

This matters because metrics storage engines are designed to deal with low-cardinality dimensions, and they fall apart when you feed them high cardinality values — they will blow out their keyspace and send you a huge bill, then shortly stop accepting reads and writes altogether. Yet it should already be clear that high-cardinality dimensions are the most useful ones for debugging because they are the most identifiable. It is better to track a problematic request down to a single unique trace_id than down to requests using one of five storage engines.

True observability solutions will happily accept high cardinality dimensions — and not just a token one or two or three. Any dimension should be able to support infinitely high cardinality, or it’s just not observability.

“High-dimensionality” is the sibling of high-cardinality. Think of it this way: The wide structured events that observability is built on are made up of lots of key-value pairs, and cardinality refers to the values (how many of them are you allowed to have per key), while dimensionality refers to the keys (how many of them are you allowed to have per event).

This matters because the wider your events, the more context you are collecting about what is happening and what your user is experiencing. Therefore you can ask more powerful questions, correlate more outliers and ultimately understand far more deeply what is going on.

O11y gathers all the precious context and organizes it around the request path as experienced by your end user. Being able to understand how each user individually experiences your code in real-time and correlating any and all outlier dimensions provides a fundamentally different way of understanding profoundly complex and unpredictable systems (users are definitely a high-cardinality dimension ;)). If traditional static dashboards deliver blunt force like a sledgehammer, o11y tooling is like a scalpel in comparison.

Explorability Defines Observability User Experience

In the upcoming O’Reilly book “Observability Engineering: Achieving Production Excellence,” the authors explain that good instrumentation allows engineers to answer the following questions whenever new code is deployed:

  • Is your code doing what you expected it to?
  • How does it compare to the previous version?
  • Are users actively using your code?
  • Are there any emerging abnormal conditions?

As the authors point out, if you capture sufficient instrumentation in the context of your requests, you can systematically start at the edge of any problem and work your way to the correct answer every single time, with no guessing, intuition or prior knowledge needed — no magic, just science.

Instead of staring at static dashboards, then making an intuitive leap to jump straight to the end (“I know! It smells like a Redis problem”), you instead instrument your code with lots of context and clues, then feed it into an observability tool that lets you slice and dice and explore the telemetry open-endedly.

This allows you to start at the edge (“There’s a latency spike” or “Users are reporting timeouts”) and methodically work your way to the answer by asking one question after another. For example: “Is the latency slowing down across ALL endpoints, or only one endpoint?” Answer: It appears to be just the ticketing /export/ endpoint. “OK, is it across all hosts, or only one host?” Answer: It appears to be across all hosts. “OK, is it timing out for all users or only one user?” Answer: It is only timing out for one user. “OK, can I see a sample trace to see where the time is going?” Answer: I’ve uncovered all the information needed to solve the problem.
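
(Sketched as successive queries over the same event data, again with the hypothetical column names from earlier, each step of that loop changes just one breakdown or filter.)

    # 1. All endpoints, or just one? Break latency down by route.
    q1 = {"calculations": [{"op": "P99", "column": "duration_ms"}],
          "breakdowns": ["http.route"]}

    # 2. All hosts, or just one? Keep the slow route, break down by host.
    q2 = {"calculations": [{"op": "P99", "column": "duration_ms"}],
          "filters": [{"column": "http.route", "op": "=", "value": "/export"}],
          "breakdowns": ["infra.hostname"]}

    # 3. All users, or just one? Same filter, break down by user.
    q3 = {"calculations": [{"op": "P99", "column": "duration_ms"}],
          "filters": [{"column": "http.route", "op": "=", "value": "/export"}],
          "breakdowns": ["app.user_id"]}

    # 4. Pull a sample trace for the outlier user and see where the time goes.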

In short, investing time upfront to instrument code is key to observability. This investment will pay off in spades during an incident, and it will continue to help the team resolve incidents and understand the behavior of their software in the future. You will earn your time back many times over with compound interest.

How Observability Helps Speed up Incident Resolution

I’ll be perfectly candid: o11y doesn’t actually speed up incident resolution time across the board. If you have already experienced this problem before and can simply jump straight to the answer key, that’s always going to be faster than having to work through a problem and debug it. Sorry.

If it consistently takes me one to three minutes to slice and dice through my telemetry and find the origin of a particular error spike or latency burst, that may not feel great if I’m used to intuiting the solution and jumping to the answer within 10 to 30 seconds.

But if you’re used to relying heavily on your intuition, ask yourself how reliable it is these days — and how long it takes for you to fall back on your tools. One to three minutes is better than a frantic, open-ended hour or more, or “all hands on deck” for the whole team.

Plus, you can teach people to use tools. You can’t sync your mental state with a new hire’s, but you can share your bookmarks, comments and documentation.

When your organization has end-to-end observability, and when you are all fluent in your telemetry tooling, there is no need to guess or stab around in the dark — or hope that somebody in your org has already experienced this particular outage and left a clue in the runbook.

An Example from Real Life

In this scenario, we’ll use Honeycomb to walk you through an incident where you have noticed a latency spike and you want to track down the source. Our latency for this service is measured via “duration_ms”, so first, let’s generate a heatmap of that:
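
(Under the hood, that heatmap is a very simple query. Roughly, in the same query-spec sketch used above, with an assumed two-hour window:)

    heatmap_query = {
        "time_range": 7200,
        "calculations": [{"op": "HEATMAP", "column": "duration_ms"}],
    }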


Now, you could start iterating through all the other dimensions to see which ones correlate with the spike in latency, but let’s use BubbleUp as a shortcut instead. It precomputes all dimensions and diffs the ones inside the yellow highlighted box against the baseline outside the box. BubbleUp then sorts these dimensions so you can see all of the ways your selected area differs from the baseline, all with a simple click and drag.

Immediately, we can see exactly what in this heatmap sticks out as a potential deviation from the baseline events. In this case, it’s showing us that trace.parent_id, trace.trace_id, trace.span_id, app.user_id and infra.hostname are different from our baseline. But the one with the most significant difference, and likely causing the most impact, is app.user_id.

Now let’s explore why the app.user_id specifically gives us trouble. One of the things that qualifies Honeycomb as an observability tool, not a static dashboard, is the way you can always explore and dig deeper. The further you drill down, the more you’ll understand what’s going on, how it’s different and what the impact is. To get a closer look, just click on this box and ask Honeycomb to zoom in on the events that include this problematic field.

Honeycomb automatically adjusts the query to include the outlier user_id.

Zooming in even further, you can go to Traces, where you’ll notice that they all come from the ticket API.

After just a few seconds, you now know which user likely experienced the incident, what endpoint they were engaged with and the trace events associated with the issue. That’s huge! You might continue to investigate individual traces to look for exact errors and the spans that have the highest latency.

Modern Systems Need More Modern Solutions

With observability, there’s no need to grep through files hunting for magic strings. Likewise, there is no need to guess; simply follow the trail of data.

Observability allows you to rapidly and easily iterate on the core investigative analysis loop by guiding you through the question-answer-question loop. Additionally, because Honeycomb is an observability tool that handles high-cardinality and high-dimensionality data, you can definitively pinpoint why things went wrong, where in your code it happened, and, most of all, who was affected and how to fix it.

Want to Try Observability out Yourself?

Sign up for a free Honeycomb account and instantly start getting insights from slicing and dicing your data any way you want (once it’s been instrumented, of course). If that’s too much of a commitment, check out Honeycomb Play, which is a self-guided demo that shows how observability can surface issues faster for quicker resolution.

Finally, Honeycomb is thrilled to sponsor The New Stack’s upcoming IRConf on Friday (April 1). Learn more here: https://www.irconf.io/#speakers
