Observability: Working with Metrics, Logs and Traces
Fault tolerance, no single point of failure and redundancy are prominent design principles in modern software systems. But that doesn’t mean errors, degradation, bugs or even the occasional catastrophe don’t happen. The complexity of distributed systems’ architectures makes surfacing incidents more of a whodunit mystery than just looking for a red dot on a screen.
Monitoring software systems is far from novel. The concept of observability first appeared in 1960 but gained traction more recently, as software systems required insight beyond what traditional monitoring provided. Traditional monitoring focuses on individual areas of the software system. It can identify an issue in one part of the system, such as the network, database or server, but can’t track the request life cycle or surface an issue in another part of the system. Observability, in contrast, centers on collecting data from all parts of the system to provide a unified view of the software at large.
The Three Pillars of Observability:
Metrics paint the overall picture of a software system. They monitor baseline performance, pinpoint anomalies when they occur and identify trends. Metrics aren’t one size fits all. Whoever is collecting metrics determines which metrics to collect. Popular metrics that many businesses or developers collect are CPU utilization, network traffic, latency or user signups.
Metrics focus on one area of the system, which makes it hard to track issues across a distributed system. Best practices suggest collecting data at regular intervals and storing metrics as numerical values.
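To make the "regular intervals, numerical values" guidance concrete, here is a minimal sketch in Python. The `MetricPoint` type, the `sample_metric` helper and the readings are all hypothetical names and values invented for illustration; a real collector would read from the operating system or an agent rather than a fixed list.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MetricPoint:
    name: str         # metric name, e.g. "cpu_utilization"
    value: float      # metrics are stored as numerical values
    timestamp: float  # seconds since the Unix epoch


def sample_metric(name: str, read_value: Callable[[], float],
                  interval_s: float, samples: int) -> List[MetricPoint]:
    """Read a numeric value `samples` times, pausing `interval_s` seconds between reads."""
    points = []
    for i in range(samples):
        points.append(MetricPoint(name, float(read_value()), time.time()))
        if i < samples - 1:
            time.sleep(interval_s)  # fixed pause keeps the sampling interval regular
    return points


# Made-up CPU utilization percentages standing in for a real gauge.
readings = iter([41.5, 43.0, 87.2])
points = sample_metric("cpu_utilization", lambda: next(readings),
                       interval_s=0.01, samples=3)
for point in points:
    print(point)
```

Each point carries a name, a number and a timestamp, which is what lets a baseline, a trend or an anomaly be read from the series later.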
Logs are records. They differ from metrics in that they record events, but, similar to metrics, these events are anything deemed important by the business or software. This includes anything from general information, to any event surpassing a certain threshold, to warnings and errors. The historical record created by logs provides insight into issues within a software environment. When an error occurs, the logs show when it occurred and which events correlate with it.
Logs can be helpful or harmful, and striking the balance is delicate. Providing data is why logs exist, but it can’t be just any data. Too much data might overwhelm a system and lead to higher latency. Logs must not reveal sensitive or private data. And then there’s storage. The useful lifespan of a log is short, and its usefulness diminishes as it gets older. Store logs long enough to cover that lifespan, then evict or archive them to reduce storage overhead so they don’t overwhelm a database.
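Two of these concerns, keeping sensitive data out of logs and retiring old entries, can be sketched with Python's standard `logging` module. The `SENSITIVE` pattern, the `RedactFilter` class and the `split_by_age` helper are assumptions made for this example, not part of any particular logging product, and a real redaction rule would need to match the fields your own system considers private.

```python
import logging
import re

# Hypothetical rule: mask anything that looks like "password=..." or "token=...".
SENSITIVE = re.compile(r"(password|token)=\S+", re.IGNORECASE)


class RedactFilter(logging.Filter):
    """Mask sensitive key=value pairs before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # apply %-style arguments first
        record.msg = SENSITIVE.sub(lambda m: m.group(1) + "=***", message)
        record.args = ()  # the message is already fully formatted
        return True       # keep the record, now redacted


def split_by_age(entries, now, max_age_s):
    """Partition (timestamp, message) pairs into entries to keep and entries to evict or archive."""
    keep = [e for e in entries if now - e[0] <= max_age_s]
    expired = [e for e in entries if now - e[0] > max_age_s]
    return keep, expired
```

Attaching `RedactFilter()` to a handler with `handler.addFilter(...)` applies the masking to every record that handler writes, and `split_by_age` can run on a schedule to decide which stored entries have outlived their usefulness.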
Traces track the end-to-end behavior of a request as it moves through a distributed or microservice system. The data collected in distributed tracing brings higher visibility to requests that use multiple internal microservices. Traces provide insight into how a request behaves at specific points in an application, such as proxies, middleware and caches, to identify any forks in the execution flow or network hops across system boundaries.
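The idea behind a trace can be sketched as a set of spans that share one trace ID and point at their parent span. The toy `Span` and `Tracer` classes below are illustrative only and are not the OpenTelemetry API; the operation names are invented.

```python
import time
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None for the root span
    name: str                 # e.g. "cache lookup"
    start_ns: int
    end_ns: int = 0


class Tracer:
    """Toy in-memory tracer; a real tracer would export spans to a backend."""

    def __init__(self) -> None:
        self.finished: List[Span] = []

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        # A child inherits its parent's trace ID; a root span starts a new trace.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        parent_id = parent.span_id if parent else None
        return Span(trace_id, uuid.uuid4().hex[:16], parent_id, name, time.time_ns())

    def end_span(self, span: Span) -> None:
        span.end_ns = time.time_ns()
        self.finished.append(span)


tracer = Tracer()
root = tracer.start_span("GET /checkout")         # request enters the system
cache = tracer.start_span("cache lookup", root)   # first hop
db = tracer.start_span("orders-db query", root)   # second hop
for span in (cache, db, root):
    tracer.end_span(span)
```

Because every span carries the same `trace_id`, a backend can reassemble the request's path, and the `parent_id` links show where the execution flow forked.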
OpenTelemetry (OTEL) is a standardized, open source framework consisting of tools, APIs and SDKs that simplify the collection of telemetry data. Metrics, logs and traces fall under the category of telemetry data. By removing vendor lock-in and making tooling available to all, OTEL aims to drive innovation in the observability space. The result is a wider set of options for developers to use when analyzing their logs, metrics and traces, which makes observability best practices easier to implement. OTEL is an incubating project with the Cloud Native Computing Foundation (CNCF) and resulted from the merger of the OpenCensus and OpenTracing projects. The developer community supports OTEL, and a rich community has sprung up around it.
Breaking free of the restraints of vendor lock-in opens software systems up to numerous database options for storing telemetry data. Of those options, a purpose-built time series database is a strong choice. Observability measures a software system over time, and time series databases store high volumes of data written and queried across ranges of time. Analyzing time series data, like metrics, requires querying data across ranges of time. Those queries are easy to execute with a time series database and difficult for other database types to execute efficiently.
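The access pattern described here, selecting a range of time and then aggregating over fixed windows, can be illustrated with a small in-memory sketch. The function names and latency values are hypothetical, and a real time series database achieves this with time-ordered storage rather than Python lists.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp in seconds, value)


def query_range(points: List[Point], start: int, stop: int) -> List[Point]:
    """Return points with start <= timestamp < stop; `points` must be sorted by time."""
    timestamps = [t for t, _ in points]
    return points[bisect_left(timestamps, start):bisect_left(timestamps, stop)]


def mean_by_window(points: List[Point], window_s: int) -> Dict[int, float]:
    """Average values per fixed window, keyed by each window's start time."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for t, v in points:
        buckets[t - t % window_s].append(v)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}


# Made-up request latencies in milliseconds, sorted by timestamp.
latency_ms = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0), (120, 70.0)]
recent = query_range(latency_ms, 60, 121)
print(mean_by_window(recent, 60))  # {60: 40.0, 120: 70.0}
```

Keeping points ordered by time is what makes the range lookup cheap, which is the property a general-purpose database does not guarantee by default.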
InfluxDB 3.0 and Observability
InfluxDB 3.0, launched in April, makes working with OTEL more accessible than ever. The new database engine was built on top of Apache Arrow and brought many performance improvements over previous versions of InfluxDB. The database now supports unlimited cardinality data without affecting performance, which translates into 100 times faster queries against high-cardinality data. InfluxDB 3.0 ingests and queries data in real time, making it ideal for applications that require real-time analytics, such as observability. Real-time analytics also makes identifying anomalies much faster.
Using InfluxDB for observability raises a few practical questions:
- What schema do we follow?
- How do we convert traces to line protocol?
- How does InfluxDB connect with the larger observability ecosystem?
The InfluxDB team has a new tutorial that includes a full repo and code walkthrough centered around how to collect traces, logs and metrics. The tech stack is InfluxDB 3.0, Jaeger and Grafana.
Software systems are becoming more and more complicated, but identifying and solving bottlenecks, bugs and errors doesn’t have to be. Metrics, used in traditional monitoring, are numerical records of different parts of the system. Logs record historical events. Traces provide visibility into how requests behave as they move throughout the entire system. Just like everything else, telemetry data is just data. Its true value lies in what you do with it.
InfluxDB is purpose-built to work with time series data, and telemetry data is time series data. The creation of OpenTelemetry opened many doors when it came to how developers handled observability practices. Working with OTEL and InfluxDB brings the power of real-time analytics, unlimited cardinality and fast querying to your telemetry data.