Why Events Are the Critical Telemetry Type You’re Missing

Give events a chance; knowing what changed is essential to identifying and resolving problems.
Dec 14th, 2023 7:28am
Featured image by Burak The Weekender on Pexels.

In a meeting last year with a bunch of senior observability leaders from cloud native companies, I asked everyone to tell me their least favorite telemetry type: metrics, events, logs, traces or whatever. I was pretty confident the dominant answer would be logs. Nothing against logs, but I had recently heard this group express the hot take that “during an incident, if you’ve gone to the logs, you’ve already failed.”

I was wrong. To my surprise, they answered almost unanimously: events. Events were the most despised telemetry type. I followed up by asking why they disliked events so much. Again the answer was nearly unanimous: a lack of definition of what events are and how to use them.

I get it. In researching events, I’ve found four or five different definitions, and no one seems to have nailed down the best way to use them in a troubleshooting workflow.

Since that meeting, our team has spent a lot of time thinking about events and how we can make them useful as a first-class telemetry citizen. The team did extensive research and then got to work building a function to track change events. Just recently, we announced the ability to ingest events in our observability platform.

I want to step back and explore why events are so critical and how they can help.

Events Tell You What Change Caused an Issue

Change is the leading cause of errors. In a steady state, a system should continue to operate consistently for an indefinite period of time. Unfortunately, in a modern DevOps environment, our systems change dozens of times a day. We ship new code, we turn feature flags on and off, we deploy new infrastructure, we scale it up and down and we even change observability solutions. And business doesn’t stand still either; it’s in constant flux based on the time of day, day of the week, season of the year, world events, competition and a million other factors we can’t track.

The only way to stay on top of change is to contextually link your systems so that when you get an alert, you can quickly see what occurred in the same time frame that might have introduced the breaking change. Each of those changes is what we call an event.

Observability UI showing an event alert
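
To make the "same time frame" idea concrete, here is a minimal Python sketch of looking up the changes that happened around an alert. It is only an illustration: the `ChangeEvent` type, the `events_near_alert` function, the window sizes and the sample data are all hypothetical and do not reflect any particular platform's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of one discrete change."""
    timestamp: datetime
    source: str        # assumed label for the tool that reported the change
    description: str


def events_near_alert(events, alert_time,
                      lookback=timedelta(minutes=30),
                      lookahead=timedelta(minutes=5)):
    """Return events inside the window around an alert, most recent first.

    The 30-minute lookback and 5-minute lookahead are arbitrary defaults
    chosen for this example.
    """
    window_start = alert_time - lookback
    window_end = alert_time + lookahead
    in_window = [e for e in events
                 if window_start <= e.timestamp <= window_end]
    return sorted(in_window, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data for illustration only.
alert_time = datetime(2023, 12, 14, 7, 28)
events = [
    ChangeEvent(datetime(2023, 12, 14, 3, 0), "ci-cd", "Deployed checkout service"),
    ChangeEvent(datetime(2023, 12, 14, 7, 10), "feature-flags", "Enabled new pricing flag"),
    ChangeEvent(datetime(2023, 12, 14, 7, 25), "autoscaler", "Scaled API pool down"),
]

for event in events_near_alert(events, alert_time):
    print(event.timestamp.isoformat(), event.source, event.description)
```

With this sample data, the early-morning deploy falls outside the window, so only the autoscaling action and the feature flag change are listed as candidates. Sorting most recent first is a design choice here, on the assumption that the change closest to the alert is usually the first one worth checking.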

What Is an Event Anyway?

An event is a discrete change to a system, a workload or an observability platform. Here are some examples of events and how they might help you troubleshoot an issue:

  • System change: These are the types of changes that most people think about when it comes to events. Examples might be an autoscaling action, a configuration change or a feature flag toggle. These changes can be found by digging into the relevant CI/CD, feature flag or infrastructure management tools, but that takes precious time.
  • Workload change: This is the most common blind spot for organizations. Examples might be onboarding a new customer or a business event like a sitewide sale. Contextualizing your other telemetry data with these events can cut the time lost to unnecessary investigation and Slack chatter when folks are trying to determine why their telemetry suddenly looks different even though there were no relevant system changes.
  • Observability platform change: These events could be an alert firing or being muted. Another example is a new data aggregation rule taking effect that causes the shape of the data to change.
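
The three categories above could be modeled roughly as follows. This is a sketch under stated assumptions, not a real schema: the `EventKind` enum, the `Event` fields, the `group_by_kind` helper and the sample timeline are all hypothetical names invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class EventKind(Enum):
    """The three categories of change described in the list above."""
    SYSTEM = "system"
    WORKLOAD = "workload"
    OBSERVABILITY_PLATFORM = "observability_platform"


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the field names are assumptions."""
    kind: EventKind
    title: str
    timestamp: datetime
    labels: dict = field(default_factory=dict)  # optional free-form context


def group_by_kind(events):
    """Group event titles by category, preserving input order."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.kind].append(event.title)
    return dict(grouped)


# Made-up sample data for illustration only.
timeline = [
    Event(EventKind.SYSTEM, "Autoscaler added 4 nodes",
          datetime(2023, 12, 14, 7, 5), {"source": "infrastructure"}),
    Event(EventKind.WORKLOAD, "Sitewide sale started",
          datetime(2023, 12, 14, 7, 0)),
    Event(EventKind.OBSERVABILITY_PLATFORM, "Aggregation rule took effect",
          datetime(2023, 12, 14, 6, 45)),
    Event(EventKind.SYSTEM, "Feature flag toggled",
          datetime(2023, 12, 14, 7, 20), {"source": "feature-flags"}),
]

for kind, titles in group_by_kind(timeline).items():
    print(kind.value, titles)
```

Tagging each event with a category like this would let a responder see at a glance that, for example, a traffic spike lines up with a workload event rather than a system change.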

System integrations that can create change events

How Do Events Fit in with Other Telemetry Types?

Like any observability signal, events cannot stand alone. They play an important role in the troubleshooting workflow alongside metrics, traces and logs. While metrics can tell you the symptom of a problem and are the primary driver of mean time to detect (MTTD), events can quickly tell you what changed. Alongside tracing, which helps you find the location of the problem, events help you remediate and stop the customer pain. From there, you might dig into the logs to start understanding why the problem happened so that you can get to the root cause and fix the underlying issue.

We call this workflow the three phases of observability: Know about an issue, triage it and then understand it, all while working toward remediation as quickly as possible.

Three phases of observability

Give Events a Chance

I originally called this piece “in defense of events,” and hopefully now you understand why and are open to giving them a chance. They complement and enhance your other telemetry types, making it faster to get critical context on your alerts.

Want to see more? Request a demo to see it in action.
