Groundcover: Simplifying Observability with eBPF
Where is the sweet spot between gathering and storing the observability data you actually need and getting a reasonable bill at the end of the month?
That’s the conundrum on which Israeli startup groundcover is focused, using eBPF, an extension of the original Berkeley Packet Filter, for monitoring Kubernetes applications.
Most monitoring solutions require too much developer effort to integrate and maintain, not to mention being really pricey, said CEO Shahar Azulay.
“Once you get the data, you get the bill at the end of the month, and you start to just enter kind of a cycle of trade-offs, where [the manager is] telling the developers, ‘…I know you worked hard to get that data. Now let’s kind of tune that back a bit because the cost is too high,’” he explained.
Too Much Observability Data
Chronosphere’s Martin Mao has written about the overwhelming flood of data collected from thousands of microservices in cloud operations, and about how that data gets locked into the features of traditional application performance monitoring (APM) vendors.
There are various approaches to dealing with the onslaught. Axiom, for instance, offers a serverless platform to enable users to cheaply store unlimited amounts of data. Others turn to sampling as a means to analyze the collected data at a reasonable cost.
Meanwhile, observability is considered an optimal use case for eBPF.
SAP Labs’ developer Gaurav Gupta previously called eBPF “Linux’s newest superpower,” for its ability to provide low-overhead tracing inside the kernel itself, offering insight into I/O and file system latency, CPU usage by process, stack tracing and other metrics.
eBPF enables custom programs to run in a sandboxed virtual machine inside the kernel without requiring changes to kernel source code or having to deal with kernel module dependencies.
With a minimal footprint, it allows developers to attach their programs to various types of probes, so the programs run whenever execution reaches those specific points.
eBPF code is executed in the kernel, as opposed to running in “user space” like standard applications. That lets you observe everything running in user space from outside it, rather than relying on tools that themselves operate in user space.
Users of eBPF include Facebook, Cloudflare, Netflix and Azure.
Taking an Outside View
“APM is the problem no matter what, whether you’re running Kubernetes or serverless, or whatever, there’s still a problem with monitoring what you’re doing in production. But Kubernetes kind of makes the problem worse,” according to Azulay.
Kubernetes is good at abstracting details away from developers, he said, but its ability to scale automatically carries a cost that developers don’t begin to understand. And most APM offerings were created before Kubernetes.
“It doesn’t mean that they don’t know how to monitor Kubernetes, but they just don’t speak the language,” he said, adding that “there’s definitely value in creating an APM experience that it’s tailored all the way through to Kubernetes.”
As for what eBPF brings to the table: “What it means for observability is that you can monitor code without being part of the code, basically,” explained Azulay.
“So you can suddenly take an out-of-band approach into monitoring applications. If before you had to get the permission of the developer to be a part of his code, in order to monitor the code. Suddenly, you can do it after the fact, once the code is running in production, without even talking to the developer, by installing a separate agent that uses eBPF. It kind of gives you superpowers or X-ray vision to look at what applications are doing without being part of the application.”
For large organizations, it means rather than having to coordinate the code of each developer, “one DevOps guy or one production engineering guy” can cover observability for the entire company, he said.
The problem with traditional APM vendors, Azulay said, is that the vast majority of the data collected is never used (he puts the figure at 99%), yet everything is sent to costly storage to be analyzed. If your cluster is facing 1 million requests per second, for instance, sampling might take that down to 10,000 requests per second. But if only one request of the original 1 million is noteworthy, it still might be missed.
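To put rough numbers on that sampling example, here is a minimal sketch. It assumes uniform random sampling in which every request is kept independently with the same probability; real samplers vary, and the function names are illustrative only.

```python
def sampling_rate(kept_per_second: int, total_per_second: int) -> float:
    """Fraction of requests that survive sampling."""
    return kept_per_second / total_per_second


def miss_probability(rate: float, noteworthy: int = 1) -> float:
    """Chance that none of `noteworthy` requests is kept.

    Assumes each request is sampled independently at `rate`.
    """
    return (1.0 - rate) ** noteworthy


rate = sampling_rate(10_000, 1_000_000)
print(f"sampling rate: {rate:.2%}")
print(f"chance of missing a single noteworthy request: {miss_probability(rate):.2%}")
```

Under these assumptions, keeping 10,000 of every 1 million requests is a 1% sample, so a lone noteworthy request is far more likely to be dropped than kept.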
Why Store More?
A better option is to store only what you need, which is the groundcover approach.
“Basically, without sending the data anywhere, a lot of the insights are already being digested on the fly. As data flows through our agent, it allows us … to break the trade-off of big cost and visibility depth. So we can price much lower because we make you store much less information,” he said.
Rather than being shipped elsewhere for analysis, Kubernetes data is digested on each server in your cluster. And while Datadog and other vendors take “a very objective kind of perspective of data collection,” which Azulay maintains makes you store more, groundcover offers an opinionated take on what to store, based on experience.
“In a sense, we know what you would like to look at when you’re troubleshooting, and groundcover makes a lot of decisions on what to sample, where to sample raw data and how to create an experience that we understand as developers and ex-users of these solutions,” he said.
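The general idea of digesting data where it is produced can be sketched as follows. This is not groundcover’s implementation: the class names, the 500 ms threshold and the rule for what to retain are all hypothetical, chosen only to illustrate summarizing every request while keeping raw records for the few that look interesting.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int
    latency_ms: float


@dataclass
class EndpointStats:
    count: int = 0
    errors: int = 0
    total_latency_ms: float = 0.0


class EdgeAggregator:
    """Hypothetical in-agent aggregator: summarize everything, retain only outliers."""

    def __init__(self, slow_threshold_ms: float = 500.0):
        self.slow_threshold_ms = slow_threshold_ms  # assumed cutoff for "slow"
        self.stats = defaultdict(EndpointStats)
        self.retained = []  # raw requests judged worth storing

    def observe(self, req: Request) -> None:
        # Every request updates the cheap per-endpoint summary.
        s = self.stats[req.endpoint]
        s.count += 1
        s.total_latency_ms += req.latency_ms
        is_error = req.status >= 500
        if is_error:
            s.errors += 1
        # Only errors and slow requests are kept in raw form.
        if is_error or req.latency_ms > self.slow_threshold_ms:
            self.retained.append(req)


agg = EdgeAggregator()
for r in [
    Request("/cart", 200, 12.0),
    Request("/cart", 200, 15.0),
    Request("/cart", 503, 40.0),
    Request("/pay", 200, 900.0),
]:
    agg.observe(r)

for endpoint, s in agg.stats.items():
    print(endpoint, s.count, s.errors, s.total_latency_ms / s.count)
print("raw requests retained:", len(agg.retained))
```

In this toy run, four requests produce two small summaries and two retained raw records; the storage saving comes from the requests that are counted but never stored.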
The data on the groundcover dashboard is taken live from the customer’s cloud environment, but groundcover itself has no access to the data and does not store it. It’s stored privately in your cloud.
groundcover uses eBPF features such as CO-RE (Compile Once, Run Everywhere) to support the variety of Linux distributions in use. Meanwhile, cloud providers keep pushing new kernel versions to production, enabling the use of new eBPF features soon after their release.
Open Source Tools
Azulay and cofounder Yechezkel Rabinovich founded groundcover in 2021, basing it on frustrations they experienced while working in the Office of the Prime Minister of Israel and other enterprises. Azulay was previously a machine learning manager at Apple.
They raised $20 million in a Series A round last September.
In December, the company open sourced a tool called Caretta that creates a visual network map of the services running in your cluster. It has since gained more than 800 stars on GitHub.
Caretta’s eBPF-based results can be consumed directly as raw Prometheus metrics, or you can integrate them into Grafana.
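As a rough illustration of what consuming raw Prometheus metrics involves, the sketch below parses a small sample in the Prometheus text exposition format into service-to-service edges. The metric name and labels are hypothetical, not Caretta’s actual schema, and the parser is deliberately simplified: it skips lines that carry timestamps, and a real consumer would more likely use a Prometheus client library or a PromQL query.

```python
import re

# Hypothetical sample; the metric and label names are invented for illustration.
SAMPLE = '''\
# HELP example_links_observed Bytes observed between two workloads (hypothetical metric).
# TYPE example_links_observed counter
example_links_observed{client_name="checkout",server_name="payments",server_port="443"} 1842
example_links_observed{client_name="frontend",server_name="checkout",server_port="8080"} 977
'''

# metric_name{optional labels} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric name, labels dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m is None:
            continue  # unsupported line shape (for example, one with a timestamp)
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


edges = parse_metrics(SAMPLE)
for name, labels, value in edges:
    print(f'{labels["client_name"]} -> {labels["server_name"]}:{labels["server_port"]}  {value:.0f}')
```

Each parsed sample corresponds to one edge of a network map: a client, a server, a port and a counter value.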
The company also released Murre, which fetches CPU and memory resource metrics directly from the kubelet on each Kubernetes node.