
Beyond the 3 Pillars of Observability

9 Jun 2021 1:00pm, by Martin Mao
Martin Mao is the co-founder and CEO of Chronosphere, the company redefining monitoring for the cloud native world. He was previously at Uber, where he led the development and SRE teams that created and operated M3, one of the largest production monitoring systems in the world storing tens of billions of time series and analyzing billions of data points per second in real-time. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google.

Gartner defines observability as the evolution of monitoring into a process that offers insight into digital business applications, speeds innovation, and enhances customer experience. Today, the DevOps movement and cloud-native architecture are enabling digital businesses to become more competitive, which is driving a need for great observability.

Before DevOps, engineers rarely thought about operating the systems they built. Engineers now need to think about building systems that are easier to observe. To better understand how observability affects outcomes, engineers should think about the answers to three critical questions:

  1. How quickly do I get notified when something is wrong? Is it BEFORE a user/customer has a bad experience?
  2. How easily and quickly can I triage the problem and understand its impact?
  3. How do I find the underlying cause so I can fix the problem?

Regardless of what instrumentation exists, and what tools or solutions are employed, the ability to answer the above three questions is what observability should be focused on.

What Observability Is Not

Today, there are many who define observability as a collection of data types — the three pillars: logs, metrics, and distributed traces. Rather than focusing on the outcome, this siloed approach to observability is overly focused on technical instrumentation and underlying data formats.

Simply having systems emit all three data types doesn’t guarantee better outcomes. What’s more, many companies find little correlation between the amount of observability data produced and the value derived from this data.

Break Observability Down into 3 Phases

We’re not the first to criticize the three pillars. We agree with much of the critique that others — like Charity Majors and Ben Sigelman — have put out there. Instead of the three pillars of observability, we’ve developed an approach to observability that is focused on the outcomes instead of the inputs, and we call it the three phases. The phases are focused on positive observability outcomes and the steps teams can take to achieve these goals.

The traditional three pillars of observability (logs, metrics, and distributed traces) are outdated and overly focused on technical instrumentation and underlying data formats rather than outcomes.

During each phase, the focus is on alleviating the customer impact — or remediating the problem — as fast as possible. Remediation is the act of alleviating the customer pain and restoring the service to acceptable levels of availability and performance. At each phase, the engineer is looking for enough information to remediate the issue, even if they don’t yet understand the root cause.

Phase 1: Know about the Problem

Knowing an issue is occurring is enough to trigger a remediation. For example, if you deploy a new version of a service and an alert triggers for that service, rolling back the deployment is the quickest path to remediating the issue without needing to understand the full impact or diagnose the root cause during the incident. Introducing changes to a system is the largest source of production issues, so knowing about problems as these changes are introduced is key.
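
As a rough illustration of the rollback example above, the sketch below matches an alert against recent deployments of the same service and suggests the version to roll back to. The `Deploy` and `Alert` records, the `rollback_target` function, and the 30-minute window are all assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    service: str
    version: str
    previous_version: str
    deployed_at: datetime


@dataclass
class Alert:
    service: str
    fired_at: datetime


def rollback_target(
    alert: Alert,
    deploys: List[Deploy],
    window: timedelta = timedelta(minutes=30),  # assumed window, tune per service
) -> Optional[str]:
    """Return the version to roll back to, or None if no recent deploy matches."""
    candidates = [
        d for d in deploys
        if d.service == alert.service
        and timedelta(0) <= alert.fired_at - d.deployed_at <= window
    ]
    if not candidates:
        return None
    # If several deploys fall inside the window, undo the most recent one.
    latest = max(candidates, key=lambda d: d.deployed_at)
    return latest.previous_version
```

A `None` result means the alert cannot be tied to a recent change, which is the cue to move on to triage rather than roll back blindly.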

Keys to success:

  • Fast alerting: Shrink the time between a problem occurring and a notification firing.
  • Scope notifications to just the teams that need to act: Scope the problem and route it to the right teams from the start.
  • Improve signal-to-noise ratio: Ensure that alerts are actionable.
  • Automate alert setup: Automated or templatized alerting can help engineers know about problems without a complicated setup process.
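
To make the last two points more concrete, here is a minimal sketch of templatized alerting with team routing. Everything in it is hypothetical: the `AlertRule` record, the template names, the default thresholds, and the query syntax inside the templates are invented for illustration and do not correspond to a real alerting product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AlertRule:
    name: str
    expression: str   # query in a made-up syntax, for illustration only
    notify_team: str  # the one team that should be paged


# Hypothetical templates; doubled braces are literal braces in str.format.
TEMPLATES = {
    "high_error_rate": "error_rate{{service='{service}'}} > {error_rate_threshold}",
    "high_latency": "p99_latency_ms{{service='{service}'}} > {latency_ms_threshold}",
}

# Assumed defaults so a new service gets usable alerts with no extra setup.
DEFAULTS = {"error_rate_threshold": 0.01, "latency_ms_threshold": 500}


def rules_for_service(
    service: str,
    owner_team: str,
    overrides: Optional[Dict[str, float]] = None,
) -> List[AlertRule]:
    """Expand every template for one service and route it to the owning team."""
    params = {**DEFAULTS, **(overrides or {}), "service": service}
    return [
        AlertRule(
            name=f"{service}:{kind}",
            expression=template.format(**params),
            notify_team=owner_team,
        )
        for kind, template in TEMPLATES.items()
    ]
```

Calling `rules_for_service("checkout", "payments-oncall")` would give a new service the standard set of alerts, already scoped to its owning team, without anyone writing a rule by hand.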

Tools and data:

  • Alerts
  • Metrics (native metrics as well as metrics generated from logs and traces)
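
Metrics generated from logs can be as simple as counting matching lines per time bucket. The sketch below assumes a log format of `<ISO timestamp> <LEVEL> <service> <message>`; that format and the `error_counts_per_minute` helper are assumptions for this example only.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed log format: "<ISO timestamp> <LEVEL> <service> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)")


def error_counts_per_minute(lines: Iterable[str]) -> Counter:
    """Count ERROR lines per (service, minute); lines that don't parse are skipped."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match["level"] != "ERROR":
            continue
        # The first 16 characters of an ISO timestamp, e.g. "2021-06-09T13:00",
        # identify the minute the line belongs to.
        minute = match["ts"][:16]
        counts[(match["service"], minute)] += 1
    return counts
```

The resulting per-minute counts are small and cheap to alert on, even when the underlying log volume is large.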

Phase 2: Triage the Problem

Understanding the scope of an issue can lead to remediation. For example, if you determine that only customers in one experiment group are impacted, turning off that experiment would likely remediate the issue.

To triage issues, engineers need to quickly put the alert in context: how many customers or systems are impacted, and to what degree. Great observability lets engineers pivot the data and shine a spotlight on the contextualized data to diagnose issues.

Keys to success:

  • Contextualized dashboards: Having alerts link directly to dashboards that show not only the source of the alert, but also related and relevant contextual data.
  • High cardinality pivots: Letting engineers slice and dice the data by many dimensions helps them isolate the problem further.
  • Leverage existing instrumentation: It’s not practical to assume that every use case is perfectly instrumented, so it’s important to make use of the instrumentation that already exists and to link those signals together as well as possible for context.
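
A pivot of this kind boils down to grouping the same events by different dimensions and comparing the results. The following is a minimal sketch, assuming each event is a dictionary with an `error` flag plus arbitrary dimensions such as `experiment` or `region`; the event shape and the `error_rate_by` helper are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Mapping, Tuple


def error_rate_by(
    events: Iterable[Mapping[str, object]],
    dimension: str,
) -> List[Tuple[str, float, int]]:
    """Group events by one dimension.

    Returns (value, error_rate, total_events) rows, worst error rate first.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = str(event.get(dimension, "unknown"))
        totals[key] += 1
        if event.get("error"):
            errors[key] += 1
    rows = [(key, errors[key] / totals[key], totals[key]) for key in totals]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

If pivoting by `region` shows a similar error rate everywhere while pivoting by `experiment` shows one group failing, that points at the experiment, which matches the remediation described at the start of this phase.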

Tools and data:

  • Dashboards
  • Metrics
  • Logs

Phase 3: Understand the Problem

Doing a post mortem on an incident is often an exercise in navigating a twisted web of dependencies and trying to determine which service owner you need to work with.

Great observability gives engineers a direct line of sight linking their metrics and alerts to the potential culprits. Additionally, it provides insights that can help fix underlying problems to prevent the recurrence of incidents.

Keys to success:

  • Easy understanding of service dependencies: Identifying the direct upstream and downstream dependencies of the service experiencing the active issue.
  • Ability to jump between tools and data types: For complex issues, you need to move repeatedly between the details in logs and traces and the trends and outliers that metrics show on dashboards, ideally within a single tool.
  • Time to root cause: Sometimes root cause analysis during an incident is unavoidable. In those situations, surfacing probable causes in alert notifications or in triage dashboards reduces the time to root cause.
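
One way to picture the dependency lookup is as a query over caller/callee pairs, which could be derived from trace data. The sketch below assumes such pairs are already available; the `direct_dependencies` helper and the service names in the usage note are made up. It uses the neutral terms "callers" and "callees" because teams differ on which direction they call upstream and which downstream.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def direct_dependencies(
    calls: Iterable[Tuple[str, str]],
    service: str,
) -> Dict[str, Set[str]]:
    """Return the services that call `service` and the services it calls.

    `calls` is an iterable of (caller, callee) pairs. Only direct neighbors
    are returned, not the transitive closure.
    """
    callers = defaultdict(set)
    callees = defaultdict(set)
    for caller, callee in calls:
        if caller == callee:
            continue  # ignore self-calls
        callees[caller].add(callee)
        callers[callee].add(caller)
    return {
        "callers": set(callers.get(service, set())),
        "callees": set(callees.get(service, set())),
    }
```

For an alert on a hypothetical `checkout` service, the callees are the first places to look for a cause, and the callers are the services whose owners may need to hear about the impact.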

Tools and data:

  • Traces
  • Logs
  • Metrics
  • Dashboards

Conclusion

Great observability can lead to competitive advantage, world-class customer experiences, faster innovation, and happier developers. But organizations can’t achieve great observability by just focusing on the input and data (three pillars). By focusing on the three phases and the outcomes outlined here, teams can achieve the promise of great observability.

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: Real.

