Can an eBPF Agent in Kubernetes Be the Key to Better Observability?
Israeli startup Groundcover is using a new eBPF observability tool, called the Flora agent, that it says bests other application monitoring tools such as Datadog, OpenTelemetry and New Relic’s Pixie agent when monitoring an application on a Kubernetes node.
The Flora agent outperformed application performance monitoring (APM) competitor Datadog by more than three times, Groundcover stated, adding minimal to zero overhead to the application’s CPU (+9%) and memory (+0%), while Datadog, OpenTelemetry and the Pixie agent added overhead of 249%, 59% and 32% above the CPU baseline, respectively, and 227%, 27% and 9% above the memory baseline.
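The overhead percentages above are relative to a baseline run of the uninstrumented application. As a rough illustration of that arithmetic only, here is a minimal Go sketch; the function name and the sample readings are hypothetical and are not taken from Groundcover's benchmark.

```go
package main

import "fmt"

// overheadPercent returns how far a measured value sits above a baseline,
// expressed as a percentage of that baseline.
func overheadPercent(baseline, measured float64) float64 {
	return (measured - baseline) / baseline * 100
}

func main() {
	// Hypothetical CPU readings in millicores; not figures from the benchmark.
	baseline := 200.0
	withAgent := 218.0
	fmt.Printf("CPU overhead: +%.0f%%\n", overheadPercent(baseline, withAgent))
}
```

With these made-up readings, an application that uses 218 millicores while monitored against a 200-millicore baseline shows an overhead of about 9%.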
“All other solutions but Flora raised the resource consumption of the application dramatically and in an unexpected manner, potentially causing the application to reach CPU throttling that might degrade its performance or even create an out-of-memory (OOM) crash in a limited environment,” CTO Yechezkel Rabinovich stated in a blog post. “Flora also proved to be highly efficient in the total resources it consumed, making it the most cost-effective solution at high scale.”
When combining the resources consumed by the different agents tested and the overhead measured on the monitored application, Flora consumed a total CPU that was similar to the one used by OpenTelemetry and the Pixie agent, but that was 73% less than the CPU consumed by Datadog, the blog post stated. “Additionally Flora consumed 74%, 77% and 96% less memory than Datadog, OpenTelemetry and the Pixie agent, respectively,” it added.
The Flora agent was released in April at KubeCon + CloudNativeCon Europe 2023. It leverages eBPF inside the kernel to access data about the application within Kubernetes.
Run Code Safely in the Kernel
In the past, it was hard if not impossible to get at some of the data that eBPF can access. Developers had to instrument the application in order to get the data, Rabinovich explained. Often companies are still not getting 100% of the data for observability; some are struggling to collect 10% to 15% of observability data, he added.
Observability is generally split into three types of data:
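- Logs (timestamped records of discrete events emitted by applications and infrastructure)
- Metrics (numeric measurements, such as CPU and memory utilization, aggregated over time)
- Traces (which monitor the pathways for interactions, such as end-to-end transactions and what happens between services)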
Shahar Azulay, CEO of Groundcover, said the eBPF approach really makes a difference in large development shops, bringing time to value down to zero.
“Traditional observability platforms require you to change your code,” he said. “Imagine what it does to time to value. We usually come across organizations with, say, 100 developers, so they’re already using different languages and a huge technology stack. To integrate OpenTelemetry, which is the recommendation of the community, or Datadog, you will have to go through each of these teams, each running through their own instructions, and the fit of the specific stack, forwarding all that as a leader of the organization and pushing that to production. That takes weeks.”
With eBPF, one person, usually a DevOps or site reliability engineer (SRE), can “just throw it into immediate installation on the cluster, and you’re covering everything,” Azulay said.
“Suddenly, you can align everyone to the same depth, because you’re observing stuff from the kernel level, not from the application level. And that’s a mind-blowing difference in which door the observability vendor is going to in the organization: instead of the R&D team and the developers, they can go to the infrastructure,” Azulay said.
The test application was a basic HTTP server built in Golang (v1.19) that serves a configurable number of random JSON objects, performs a pre-configured amount of CPU-intensive tasks for each request it receives and returns its response in a plaintext or Gzip format, according to the blog post. The application was then tested in three scenarios: as is (for a baseline), when instrumented according to the relevant documentation of Datadog and OpenTelemetry, and when running on a Kubernetes node alongside New Relic’s Pixie agent and the Flora agent. Prometheus-based CPU and memory utilization metrics were generated for all test cases and were scraped and stored in a VictoriaMetrics database instance.
The infrastructure was a Kubernetes cluster with Node Taints that allowed Groundcover to isolate each deployment test case from the others. Every tested application flavor ran alongside the bare minimum components required for monitoring according to the relevant test case.
Groundcover used a K6 operator to generate the test load, with K6 test objects that executed from each of the separate Node groups. Groundcover used a custom-built K6 image that also exposes Prometheus metrics so it could get metrics from the client side as well for sanity purposes. The results were analyzed in Grafana, through a Prometheus data source integration that queried the deployed VictoriaMetrics instance, according to Rabinovich’s blog post.
CNCF paid for travel and accommodations for The New Stack to attend the KubeCon + CloudNativeCon Europe 2023 conference.