Monitoring vs. Observability: What’s the Difference?
Modern distributed applications cannot be effectively monitored by legacy methods, which are based on handling predictable failures. With microservices architecture now the de facto standard for web applications, effective debugging and diagnostics require that the system be observable — that is, its internal state can be inferred by observing its output.
However, the line between observability and monitoring is often blurry for development teams. In this post, we discuss monitoring, observability, and the relationship between the two. We also mention some of the tools you can use to achieve observability.
Organizations that have embraced the DevOps mindset usually start decomposing the application into a microservices architecture, in order to gain operability and reduce repair time if an incident happens. But as their systems become more complex, they must ensure that they can still gain visibility into system failures and react to them in a timely manner.
According to the SRE book by Google, your monitoring system needs to answer two simple questions: “What’s broken, and why?” Monitoring allows you to watch and understand your system’s state using a predefined set of metrics and logs. Monitoring applications lets you detect a known set of failure modes.
Monitoring is crucial for analyzing long-term trends, for building dashboards, and for alerting. It lets you know how your apps are functioning, how they’re growing, and how they’re being utilized. However, the problem with monitoring complex distributed applications is that production failures are not linear and therefore are difficult to predict.
Having said this, monitoring is still an indispensable tool for building and running microservice-based systems. If the monitoring rules and metrics are straightforward and focused on actionable data, they will provide a reasonably good view of your system’s health. Although monitoring may not make your system wholly immune to failure, it will provide a panoramic view of system behavior and performance in the wild, allowing you to see the impact of any failure and consequent fixes.
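To make the idea of a straightforward, actionable monitoring rule concrete, here is a minimal sketch of an error-rate check over a sliding window of recent requests. The class name, window size, and threshold are assumptions chosen for illustration; they are not taken from any particular monitoring product.

```python
from collections import deque


class ErrorRateRule:
    """Fires when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.threshold = threshold
        # Only the most recent `window` outcomes are kept; True means the request failed.
        self.outcomes = deque(maxlen=window)

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold


# Hypothetical traffic: 7 successful requests followed by 3 failures.
rule = ErrorRateRule(window=10, threshold=0.2)
for failed in [False] * 7 + [True] * 3:
    rule.record(failed)

print(rule.error_rate(), rule.should_alert())
```

A rule like this covers a failure mode you already know about (too many errors). It says nothing about why the errors happen, which is where the rest of this post picks up.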
Observability, which originated from control theory, measures how well you can understand a system’s internal states from its external outputs. Observability uses instrumentation to provide insights that aid monitoring. In other words, monitoring is what you do after a system is observable. Without some level of observability, monitoring is impossible.
An observable system allows you to understand and measure the internals of a system, so that you can more easily navigate from the effects to the cause — even in a complex microservice architecture. It helps you find answers to questions like:
- What services did a request go through, and where were the performance bottlenecks?
- How was the execution of the request different from the expected system behavior?
- Why did the request fail?
- How did each microservice process the request?
Observability can be divided into three primary pillars:
- Logs: Timestamped, immutable records of discrete events that can identify unpredictable behavior in a system and provide insight into what changed in the system’s behavior when things went wrong. It’s highly recommended to ingest logs in a structured way, such as in JSON format, so that log visualization systems can auto-index and make logs easily queryable.
- Metrics: The foundation of monitoring, metrics are counts or measurements that are aggregated over a period of time. Metrics can tell you, for example, how much of the available memory a method uses, or how many requests a service handles per second.
- Traces: For an individual transaction or request, a single trace displays the operation as it moves from one node to another in a distributed system. Traces allow you to get into the details of particular requests to determine which components cause system errors, monitor flow through the modules, and find performance bottlenecks.
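As an illustration of the structured logging recommended in the first pillar, the sketch below emits each log record as one JSON object per line using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `checkout` logger name, the `context` key, and the field values are all assumptions made up for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become a `context` attribute on the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment declined", extra={"context": {"order_id": "A-1001", "latency_ms": 182}})
```

Because every line is valid JSON with consistent keys, a log backend can index fields such as `order_id` without custom parsing rules.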
The Relationship Between Observability and Monitoring
Observability and monitoring complement each other, with each one serving a different purpose.
Monitoring tells you when something is wrong, while observability enables you to understand why. Monitoring is a subset of observability and one of its key activities. You can only monitor a system that’s observable.
Monitoring tracks the overall health of an application. It aggregates data on how the system is performing in terms of access speeds, connectivity, downtime, and bottlenecks. Observability, on the other hand, drills down into the “what” and “why” of application operations, by providing granular and contextual insight into its specific failure modes.
While monitoring provides answers only for known problems or occurrences, software instrumented for observability allows developers to ask new questions in order to debug a problem or gain insight into the general state of what is typically a dynamic system with changing complexities and unknown permutations.
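One way to picture "asking new questions" is to keep wide, structured events and slice them by whatever dimension turns out to matter during an investigation. The sketch below is a toy in-memory version of that idea; the event fields, values, and the `breakdown` helper are hypothetical and stand in for whatever query layer your observability backend provides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; every field name and value is made up for illustration.
events = [
    {"service": "cart", "region": "eu", "version": "1.4", "latency_ms": 120, "error": False},
    {"service": "cart", "region": "us", "version": "1.5", "latency_ms": 950, "error": True},
    {"service": "cart", "region": "us", "version": "1.5", "latency_ms": 870, "error": True},
    {"service": "cart", "region": "eu", "version": "1.5", "latency_ms": 130, "error": False},
]


def breakdown(events, dimension, measure):
    """Average `measure` for each distinct value of `dimension`."""
    groups = defaultdict(list)
    for event in events:
        groups[event[dimension]].append(event[measure])
    return {key: mean(values) for key, values in groups.items()}


# Questions nobody planned a dashboard for: is it the new version, or the region?
print(breakdown(events, "version", "latency_ms"))
print(breakdown(events, "region", "latency_ms"))
```

Neither question needed a metric defined in advance. Both are answered from the same raw events, which is the practical difference from a fixed set of monitored values.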
Building a Continuously Observable System
Achieving observability doesn’t have to be difficult. You can begin with a handful of key metrics for your application, such as its CPU, network, and memory usage.
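As a starting point, the sketch below samples CPU time and memory for the current Python process using only the standard library. It is a simplification under stated assumptions: `tracemalloc` only sees memory allocated through Python's allocator, network counters are not covered, and the `sample_process_metrics` helper and the workload are invented for this example.

```python
import time
import tracemalloc


def sample_process_metrics() -> dict:
    """Return a snapshot of CPU time and traced Python memory for this process."""
    current_bytes, peak_bytes = tracemalloc.get_traced_memory()
    return {
        "cpu_seconds": time.process_time(),
        "memory_current_bytes": current_bytes,
        "memory_peak_bytes": peak_bytes,
    }


tracemalloc.start()
before = sample_process_metrics()
workload = [n * n for n in range(200_000)]  # stand-in for real application work
after = sample_process_metrics()
tracemalloc.stop()

print("cpu seconds used:", after["cpu_seconds"] - before["cpu_seconds"])
print("memory growth (bytes):", after["memory_current_bytes"] - before["memory_current_bytes"])
```

In a real service you would take such samples periodically and ship them to a metrics backend instead of printing them.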
System logs are also essential in ensuring a system’s observability. Although logs can grow quickly and become difficult to manage and expensive to store, there are tools that can increase the effectiveness of logging. An example is OpenTelemetry, which is used not only for logging, but also for metric collection and tracing. OpenTelemetry integrates with popular frameworks and libraries as well, such as Spring, ASP.NET Core, and Express.
Tracing makes your observable system more effective and allows you to identify the root cause of an issue in a distributed system. Tracing can be seen as the most important part of observability implementation: understanding the causal relationship in your microservices architecture and being able to follow the issue from the effect to the cause, and vice versa.
Continuous automated observability lets you stay on top of any risks or problems throughout the software development lifecycle. It provides visibility across the entire CI/CD pipeline and your infrastructure, giving you fast feedback on the health of your environment at any time — including in pre-production phases.
Tools for Observability
To manage distributed system infrastructures, you’ll need a dedicated set of tools to visualize your operational states and notify you when a failure occurs. These tools allow you to understand system behaviors and prevent future system problems.
In our white paper “Monitoring vs. Observability”, we describe in some depth the three leading observability platforms: OpenTelemetry, Jaeger and Zipkin.
There are also many tools that can help in instrumenting your application. A good example is our own tool Thundra, which provides an automated plug-and-play solution for observability while keeping the door open for manual instrumentation that’s compatible with OpenTracing, and soon OpenTelemetry.
A Final Note: Observe and Improve
Apps in production fail for various reasons. No matter how much effort you expend, there will always be something that goes wrong. If you don’t effectively instrument your application’s components to be observable, you’ll have a hard time debugging production issues. On the other hand, even an observable system doesn’t have the answers to all issues.
You need to continuously examine the data you have, to determine its usefulness. Observability must have the right data to help you get answers to known and unknown problems in production. You have to constantly adapt your system’s instrumentation until it’s appropriately observable, to the point where you can get answers to any questions needed to support your application at the level of quality you want.