Observability in 2024: More OpenTelemetry, Less Confusion
It has been a chaotic and challenging year for many, if not most, IT teams. Among the challenges are rising cloud costs and mounting pressure to optimize cloud spending, with cost-conscious strategies largely delegated to DevOps teams to implement.
In parallel, the amount of data to manage and monitor has exploded, especially among organizations that have continued to scale, often creating data-management and monitoring requirements that span multicloud and on-premises environments.
Observability has arguably emerged as the great savior of 2023, crucial for navigating the chaos caused by that data explosion. However, with the surge in data, observability has created potential chaos of its own.
The influx of telemetry data, including logs, traces and metrics, underpins decisions about reducing cloud costs, optimizing redundancies, and predicting and solving IT problems before they affect the business. But the sheer magnitude of telemetry data generated has created that much more complexity to manage.
While you can always use your Grafana panel to help consolidate this data into a single console, the demand for solutions that fine-tune which data types are collected and converge them all in a single interface became especially critical in 2023.
“This year has also seen strong economic pressure, which has made cost efficiency top of mind for all,” said Dotan Horovits, principal developer advocate at Logz.io and a cloud native ambassador for the Cloud Native Computing Foundation (CNCF). “High-profile stories of excessive spend on observability have further surfaced the need for innovative solutions to improve the signal-to-noise ratio, as well as more flexible pricing schemes.”
The good news is that the major trends of 2023 involve the rise of standardization, especially the progress of OpenTelemetry as a common standard for implementing various observability tools and processes. This has helped tame the telemetry-data beast by providing a single standard for pulling various observability tools together into a single interface, which, again, is often a Grafana console.
Security obviously remains an ongoing concern, but there is promise, particularly through open source. For example, eBPF has become established as a mainstay for observability in security practices.
Additionally, the emergence of AI and machine learning (ML), including large language models (LLMs), aims to make sense of the observability data explosion. In 2024, the hope is that machines will play a more reliable role in parsing observability data, reducing the burden on human analysis.
Thanks to advancements in OpenTelemetry and other emerging best practices, new tools and tweaks, 2023 marked a year when observability became more accessible, driven largely by open source development.
Observability has transcended its traditional association with monitoring to find bugs and resolve outages. It now extends across different interfaces and tools, demonstrates greater openness and compatibility, and is increasingly used to make forecasts.
These forecasts can cover outages before they happen, cost shifts, resource usage and other variables that previously were much harder to predict and mostly involved trial and error.
Indeed, as of KubeCon + CloudNativeCon 2023 earlier this year, a key development for OpenTelemetry was its support for the three core observability signals: logs, metrics and traces, said Torsten Volk, an analyst at Enterprise Management Associates (EMA). “This means that organizations can now use a single agent to collect observability data across their increasingly distributed, and therefore complex, universe of microservices applications.”
“This could significantly simplify one of today’s most significant pain points in observability: instrumentation. Developers can now benefit from the continuously increasing auto-instrumentation capabilities of OpenTelemetry and no longer have to worry about instrumenting their code for specific observability platforms,” Volk said.
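The single-agent model Volk describes is often realized with the OpenTelemetry Collector. Below is a minimal configuration sketch, not a production setup: it receives all three signal types over OTLP and routes each through its own pipeline, and the `debug` exporter is a stand-in for whichever backend exporter a team actually uses.

```yaml
receivers:
  otlp:            # one agent, one wire protocol for all three signals
    protocols:
      grpc:
      http:

processors:
  batch:           # batch telemetry to reduce export overhead

exporters:
  debug:           # prints to stdout; swap for your backend's exporter

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

Because instrumented applications speak OTLP to this one agent, swapping observability backends becomes a configuration change in the `exporters` section rather than a re-instrumentation effort.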
However, such freedom of choice, owing to a proliferation of tools, has created challenges of its own.
“While I agree that observability has become more open to diverse tools, this has also become a challenge, with the tool sprawl and data silos that it created, which have posed significant challenges this year. Organizations are no longer struggling with getting the right data in. Rather, they struggle with the explosion of data, which is not only costly to store, but also becoming increasingly difficult to ‘find the needle in the haystack,’” Horovits said.
“Observability practice is therefore moving up the stack, from the focus on collecting logs, metrics and other signals, to extracting insights out of the data, across infrastructure and application, across signal types, sources and formats. To address this need, observability tools and vendors are shifting to a holistic observability solution, and are doubling down on the data analytics backend capabilities.”
Indeed, OpenTelemetry’s support and adoption should continue on a strong growth track.
“We see increased adoption among end users and observability vendors, further pushing out proprietary and open source telemetry shipping and scraping agents. I expect this to accelerate now that the logging piece has reached general availability, meaning the project is now generally available across all three pillars of observability (namely logs, metrics and traces),” Horovits said. “Looking beyond the ‘three pillars,’ the focus this year has shifted to the new telemetry signal of continuous profiling. The project is also expanding beyond its backend-centered origins to support customer-side telemetry and real-user monitoring use cases.”
CI/CD exemplifies this shift: the development phase now reveals key metrics, logs and traces, providing observability in ways unimaginable just a few years ago, thanks to the instrumentation OpenTelemetry provides along with other processes and tools. Notably, the widely used DORA metrics are now integral to assessing developer productivity.
“Lack of CI/CD observability results in unnecessarily long lead time for changes, which is a crucial DORA metric measuring how much time it takes a commit to get into production. CI/CD tools today emit various telemetry data, whether logs, metrics or trace data to report on the release pipeline state,” Horovits said. “It is only natural to follow suit with the same observability stack we use to monitor production environments, to also monitor the software release pipelines. Open source tools such as Prometheus, OpenSearch and Jaeger can serve to visualize the pipeline events, metrics and sequence, to diagnose flaky tests, faulty builds or issues in the build environment.”
However, work remains to be done on CI/CD observability.
“The challenge is that although many CI/CD tools emit telemetry, they do not follow any particular standard, specification or semantic conventions,” Horovits said. “This makes it hard to use observability tools for monitoring these pipelines.”
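To make that gap concrete, here is a minimal Python sketch of the normalization layer teams end up writing when each CI/CD tool reports the same facts under different field names. The tool names, payloads and field mappings are hypothetical; the example uses them to compute the DORA lead-time-for-changes metric Horovits mentions (time from commit to production deploy).

```python
from datetime import datetime

# Hypothetical raw events from two CI/CD tools that report the same facts
# under different field names -- the "no common semantic conventions"
# problem described above.
raw_events = [
    {"tool": "jenkins", "sha": "a1b2c3",
     "commitTime": "2023-11-01T09:00:00Z", "deployTime": "2023-11-01T15:30:00Z"},
    {"tool": "gitlab", "commit": "d4e5f6",
     "committed_at": "2023-11-02T08:00:00Z", "deployed_at": "2023-11-02T10:00:00Z"},
]

# Per-tool mapping from a common schema to each tool's field names.
FIELD_MAP = {
    "jenkins": {"sha": "sha", "commit_time": "commitTime", "deploy_time": "deployTime"},
    "gitlab": {"sha": "commit", "commit_time": "committed_at", "deploy_time": "deployed_at"},
}

def parse(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def normalize(event: dict) -> dict:
    """Translate one tool-specific event into the common schema."""
    m = FIELD_MAP[event["tool"]]
    return {
        "sha": event[m["sha"]],
        "commit_time": parse(event[m["commit_time"]]),
        "deploy_time": parse(event[m["deploy_time"]]),
    }

def lead_time_hours(events: list) -> float:
    """Mean commit-to-deploy time across all normalized events, in hours."""
    deltas = [
        (e["deploy_time"] - e["commit_time"]).total_seconds() / 3600
        for e in map(normalize, events)
    ]
    return sum(deltas) / len(deltas)

print(f"mean lead time for changes: {lead_time_hours(raw_events):.2f}h")
```

A shared semantic convention for CI/CD telemetry would make the `FIELD_MAP` translation table unnecessary, which is exactly the standardization work Horovits points to.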
Security, Observability — and eBPF
Security obviously remains a huge concern, especially given the highly distributed and complex structure of Kubernetes, which contributed to an increased incidence of attacks in 2023 and brought more fines and terminations for those deemed responsible. This unfortunate trend is set to continue in 2024. The extended Berkeley Packet Filter (eBPF) showed a lot of promise in 2023, forming the basis for open source security tools that use observability to help prevent attacks.
The efficacy of eBPF stems primarily from its computational efficiency, since it runs within the Linux kernel. However, categorizing eBPF merely as a Linux kernel tool would be misleading: through hooks, its reach extends across the stack of the applications to which it is applied. An effective eBPF-based platform should empower DevOps teams to monitor what should be running in a Kubernetes cluster and provide actionable results when policies are violated or security threats are detected.
Open source continues to lead the way for security. Two open source eBPF tools for security observability that shined in 2023 were Kubescape and Cilium. Kubescape, from Kubernetes security provider ARMO, offers a window of observability covering the life cycle of Kubernetes applications and their updates.
This encompasses IDEs, CI/CD pipelines and clusters for risk analysis, security, compliance, misconfiguration scanning and image scanning. The open source offering also includes hardening recommendations, such as network policies and security policies. Kubescape integrates with a checklist of tools DevOps teams need, such as software bill of materials (SBOM) generation, signature scanning and policy controls. It initiates scans at the beginning of the development cycle, extending across CI/CD and throughout the deployment and cluster management process.
Cilium offers additional capabilities with eBPF to help secure the network connectivity between runtimes deployed on Docker and Kubernetes, as well as other environments, including bare metal and virtual machines. Isovalent, which created Cilium and donated it to the CNCF, and the contributors are also, in parallel, developing Cilium capabilities to offer network observability and network security functionality through Cilium subprojects consisting of Hubble and Tetragon, respectively.
Cilium primarily offers a container network interface (CNI) implementation, which is, in general, the central component of the Kubernetes network stack, said Benyamin Hirschberg, CTO and co-founder of ARMO.
Because Cilium is able to provide better performance, using application-specific logic that enhances container-to-container traffic in Kubernetes clusters, “this project indeed improves its security offering beyond network policies that have been around for many years,” Hirschberg said. “Security observability is a major new development in Cilium, showing that users are seeking solutions in this space.”
Open source communities have a tendency to sort out short-term problems and kick-start projects like Cilium to implement completely new functionality such as Tetragon, which is essentially a runtime threat-detection agent, much like Falco.
Kubescape incorporates eBPF data streams to provide a comprehensive understanding of a cluster’s security posture, Hirschberg said. By combining the kernel-level visibility into network traffic that eBPF provides with findings such as vulnerabilities and misconfigurations, it adds real-time system behavior to the picture.
“This provides a view of reachable vulnerabilities, misconfigurations that can be fixed without breaking applications, attack path detection and even suggests the best network policies and seccomp profiles,” Hirschberg said.
Observability can be used to gather insights for better business decisions and to ease the increasing scrutiny of IT budgets by reducing cloud costs. However, the right tools are essential to make sense of this surge in telemetry data. Here, LLMs and AI have begun to play a role that might prove revolutionary. In 2023, we witnessed the initial indications of how AI applications will come into play, potentially marking the ultimate stage of observability: conducting computational analysis in ways that humans are unable to achieve.
This manifests in two primary ways. First, with good observability tools, the sheer quantity and magnitude of telemetry data can be parsed, categorized and hierarchized, allowing humans to comprehend and use it for business decisions, resource allocations and other determinations. Simultaneously, AI can assess vast amounts of telemetry data, making decisions based on meta-telemetry data and eventually automating the decision-making process, although humans are assumed to provide the final check, at least initially.
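The parsing-and-categorizing half of this need not be exotic. Even a simple statistical filter illustrates how a machine can reduce a large metric stream to the few points worth human attention. The sketch below is plain Python with made-up latency numbers; real systems use far more sophisticated models, but the triage principle is the same.

```python
import statistics

def flag_anomalies(samples, threshold=2.5):
    """Return the indices of points more than `threshold` population
    standard deviations from the mean -- a deliberately simple stand-in
    for machine-driven triage of a telemetry stream."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:  # a flat series has no outliers
        return []
    return [i for i, v in enumerate(samples) if abs(v - mean) / stdev > threshold]

# Simulated request-latency metric (ms): steady baseline, one obvious spike.
latencies = [102, 98, 101, 99, 103, 97, 100, 950, 101, 99]
print(flag_anomalies(latencies))  # only the spike at index 7 is flagged
```

A human then inspects one data point instead of ten (or ten million), which is the burden reduction the AI discussion above is reaching for, at far greater scale and subtlety.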
Another aspect involves AI initiating commands in a no-code, low-code manner, using natural language to describe observability conclusions and desired insights. With the assistance of LLMs, the AI takes over the process. While these tools have yet to emerge, discussions about them were prevalent in 2023. The influx of AI into observability in 2024 therefore promises to be fascinating.
“LLMs will become ‘partners’ of DevOps teams, security engineers and SREs [site reliability engineers], continuously keeping an eye on incoming data, making recommendations for proactive optimization and helping fix problems quickly and with minimal business impact. LLMs have the enviable ability to keep an eye on unimaginably large and complex streams of operations data coming from DevOps, IT and business and can therefore help us humans prioritize actions based on almost complete context,” Volk said. “LLMs will also be able to provide us with the code necessary to automate more and more administrative tasks, allowing humans to focus on setting the policy guardrails for these automations.”
The ability to analyze a single stream of logs, metrics and traces makes data analytics more powerful at detecting those “unknown unknowns,” Volk said.
The “LLM is especially ‘appreciative’ of the extra context provided by OpenTelemetry’s unified handling of all observability data,” Volk said. “In 2024 we can expect more specific and more actionable insights, with a lower risk of missing relevant trends or exceptions.”