How Comprehensive Observability Can Save DevOps from ‘Unknown Unknowns’
Honeycomb sponsored this post.
DevOps teams often deploy and manage cloud native applications across a variety of multicloud and on-premises environments. For many engineers charged with debugging or improving application performance, it may seem that they are staring into the daunting abyss of scattered deployments, with little hope of meaningfully understanding what parts of their code are working and what needs to be fixed. This is why comprehensive observability is now in especially high demand.
According to Honeycomb, comprehensive observability is a method of collecting and analyzing unique, high-cardinality telemetry data in order to provide the full context for any given event, or service request. That data is shipped to a data backend where it can be analyzed to help debug or troubleshoot the many unforeseen and complex issues that arise in cloud native distributed systems, such as Kubernetes.
Modern distributed systems can fail in unique and novel ways, with many of those newly discovered failure modes never to be seen again. Such anomalies can’t be addressed by existing monitoring, APM and logging solutions because these tools are designed to find known or predictable issues. They also don’t drill down past overall aggregate behavior and into individual user experience because they lack the capacity to effectively collect and analyze high-cardinality data, Honeycomb asserts. When operating at scale, standard monitoring metrics such as CPU or memory usage logs are insufficient to debug an issue when it’s difficult to even pinpoint where the code is running within the system.
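The idea behind high-cardinality, event-level telemetry can be sketched in plain Python. This is an illustrative example, not Honeycomb's actual SDK: each request emits one "wide" structured event that keeps unique, per-request fields (user ID, build SHA) alongside coarse measurements, so an engineer can later drill down past aggregates to an individual user's experience.

```python
import json
import time
import uuid

def emit_event(**fields):
    """Emit one structured, wide event per service request.

    High-cardinality fields such as user_id are kept alongside
    coarse measurements, so a single event carries full context."""
    event = {
        "timestamp": time.time(),
        "trace_id": str(uuid.uuid4()),  # correlates this event with related ones
        **fields,
    }
    print(json.dumps(event))  # in production, ship to an observability backend
    return event

# One event per request, rather than pre-aggregated counters:
evt = emit_event(
    service="checkout",
    endpoint="/cart/submit",
    user_id="user-8675309",   # high-cardinality: unique per user
    build_sha="9f2c1ab",      # pinpoints exactly which code was running
    duration_ms=142.7,
    status_code=200,
)
```

Because every field travels with the event, questions like "which build, which endpoint, which user?" can be answered after the fact without pushing new code.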
Observability thus constitutes “the ability to understand what is happening inside of your systems and to do so without having to push new code, using the existing instrumentation and data flowing out of those systems,” Liz Fong-Jones, principal developer advocate for Honeycomb and a member of the governance committee of OpenTelemetry, said during The New Stack’s livestream podcast as part of its KubeCon + CloudNativeCon coverage.
KubeCon Livestream w/Honeycomb https://t.co/lxhwttHIi4
— The New Stack (@thenewstack) November 18, 2020
“A comprehensive observability platform is more than a collection of logs, metrics, distributed tracing data and other analytics,” Fong-Jones said. “It’s proactively measuring your systems, rather than looking only when things are broken.”
In this article, we’ll cover what comprehensive observability should offer organizations working with highly distributed, multicloud environments.
The Evolution of Observability
It wasn’t that long ago that monitoring tools were largely used to detect errors and remediate failures. In this way, the monitoring space was previously about “having static things” and knowing what actions to take when outages occurred, said Jaana Dogan, principal engineer at Amazon Web Services (AWS) and fellow livestream guest. An application that crashed in production, for example, was remediated with predetermined actions, but that approach no longer scales, she said. Today, “the world is becoming extraordinarily large and complicated.”
Instead, observability is a dynamic process involving “trying to figure out what’s going on without being prepared so much for the failure,” Dogan said.
Observability involves analyzing “unknown unknowns versus known unknowns,” Fong-Jones said. “Monitoring was about measuring things that you knew to predict in advance, whereas observability helps you understand how and why.”
Viable observability tools today must proactively measure and analyze systems, and involve many use cases beyond just “break/fix,” Fong-Jones said. “Can you actually instrument your code as you’re writing it the same way you would write unit tests? Can you actually visualize what’s happening inside of your CI/CD pipeline?” she said. “To me, that’s what comprehensive observability means, and a lot of other people, unfortunately, take ‘comprehensive’ to mean implementing traces, logs and metrics, and I don’t think that’s the case at all.”
OpenTelemetry Open Doors
Organizations should be able to adopt a comprehensive observability platform without having to write code or do other heavy-lifting in order to implement the tools within their existing systems. The recently initiated OpenTelemetry project at the Cloud Native Computing Foundation (CNCF) is intended to provide a vendor-neutral framework to foster interoperability among the many options on offer by tool providers. In many ways, OpenTelemetry, also known as OTel, is a model for the future of observability.
“OpenTelemetry enables organizations to export instrumentation for code in a vendor-neutral fashion so they only need to instrument it once,” Fong-Jones said. “So, to me, observability is not about tooling but observability is about what humans are able to do with the tooling.”
“Buying yourself a Jaeger instance does not magically give you observability unless your engineers actually know how to use it,” Fong-Jones said. “So, the use of OpenTelemetry to generate and export that data, and in the backend, for instance, Jaeger, or a proprietary closed source backend [provides] a variety of options that help you really achieve… comprehensive observability.”
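The "instrument once, export anywhere" pattern Fong-Jones describes can be sketched in plain Python. This is a simplified illustration of the idea behind OpenTelemetry, not its real API: application code writes instrumentation against a neutral interface, and the exporter behind it (console, Jaeger, a vendor backend) is swappable without touching that code.

```python
from abc import ABC, abstractmethod

class SpanExporter(ABC):
    """Backend-neutral export interface, in the spirit of OpenTelemetry."""
    @abstractmethod
    def export(self, span: dict) -> None: ...

class ConsoleExporter(SpanExporter):
    def export(self, span):
        print("span:", span)

class InMemoryExporter(SpanExporter):
    """Stand-in for a Jaeger instance or a proprietary backend."""
    def __init__(self):
        self.spans = []
    def export(self, span):
        self.spans.append(span)

def handle_request(exporter: SpanExporter, user_id: str) -> str:
    # Instrumentation is written once; only the exporter varies per backend.
    span = {"name": "submit_order", "user.id": user_id, "status": "ok"}
    exporter.export(span)
    return span["status"]

backend = InMemoryExporter()  # swap for ConsoleExporter() or a real backend
status = handle_request(backend, "user-8675309")
```

The design choice matters because vendor lock-in historically lived in the instrumentation layer; moving the backend choice behind a shared interface is exactly what makes the telemetry vendor-neutral.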
Looks Like A Duck
Limited metrics offered by some vendors certainly do not constitute observability. In some cases, for example, an APM microservices platform for tracing certain stack instances may be useful but does not by itself span the observability needs of many, if not most, organizations.
“Because you call it a duck doesn’t mean it is a duck, and what observability is, is a testament to the power of what Charity Majors, [Honeycomb’s co-founder and CTO] and Honeycomb, have been doing,” said Chuck Daminato, a site reliability engineer at ecobee, a Honeycomb customer.
By using the query capabilities Honeycomb’s platform offers, for example, ecobee was able to detect and analyze some not-so-apparent issues customers might have experienced when signing up and logging on to its platform with mobile devices ahead of a new product launch.
“Honeycomb’s platform really helped us to find some corner cases when the application and business logic didn’t work. We weren’t cognizant of them, either, since they didn’t manifest themselves in ways that really impacted the application [significantly],” Daminato said. “Just being able to say ‘we’re getting this type of error for this type of request for this type of client,’ and the ability to quickly do a query as opposed to thinking from first principles about all the different types of [data and metrics] helped.”
“We were launching a couple of new products and services,” he said, “and I can say without a lie, that if it wasn’t for Honeycomb, then we would not have gotten there in time.”
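The kind of slice-and-dice query Daminato describes, grouping errors by request and client type, can be sketched over structured events in plain Python. The field names and values here are hypothetical, meant only to show why per-event context makes such questions quick to answer.

```python
from collections import Counter

# Hypothetical structured events, one per request (illustrative field names):
events = [
    {"client": "ios", "endpoint": "/signup", "error": "token_expired"},
    {"client": "ios", "endpoint": "/signup", "error": "token_expired"},
    {"client": "android", "endpoint": "/login", "error": None},
    {"client": "android", "endpoint": "/signup", "error": "rate_limited"},
]

# "This type of error, for this type of request, for this type of client":
breakdown = Counter(
    (e["error"], e["endpoint"], e["client"])
    for e in events
    if e["error"] is not None
)
top = breakdown.most_common(1)[0]
# top == (("token_expired", "/signup", "ios"), 2)
```

At production scale this grouping runs in the observability backend rather than in application code, but the principle is the same: because each event carries its full context, no one has to enumerate every possible failure mode in advance.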
Observability for DevOps Collaboration
A viable observability platform offers a number of features that help engineers quickly find the source of incredibly complex problems when they occur and also proactively find ways to improve production systems in highly distributed environments. In practical terms, observability should free software developers from the obligation of writing and managing instrumentation code that they then have to adjust for different on-premises data centers or cloud platforms, Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. “During crunchtime, developers need to be able to focus on finishing their application code and worry about the end user experience instead of having to update and test instrumentation functions.”
Comprehensive observability puts telemetry data in the hands of engineers, along with tools to analyze that telemetry and interrogate their systems in as many ways as necessary, until they get to the root cause of reliability problems, performance bottlenecks and other issues.
“This helps observability platforms to become a collaboration space,” Volk said, “for developers, DevOps teams, SREs and traditional IT operators to continuously enhance application code and operations.”
Modern cloud native distributed systems offer many benefits over traditional monolithic systems. But the complexity introduced by microservice architectures also requires modern tools designed for modern systems. The seemingly daunting abyss of improving and debugging application performance in hybrid and multicloud environments is partly due to the limitations of older generations of tools designed for the needs of monolithic systems. Comprehensive observability makes wrangling that complexity manageable and equips teams to better, and more proactively, understand their cloud native systems.
The Cloud Native Computing Foundation and KubeCon+CloudNativeCon are sponsors of The New Stack.
Feature image by Obelixlatino via Pixabay.