Gauge Your Observability Agent Overhead to Save Costs
We recently looked at how an OpenTelemetry Collector can serve as a filter for monitoring telemetry. It is applicable whenever multiple applications or microservices are involved, particularly for security considerations. As such, the OpenTelemetry Collector falls under the category of observability agents, a category that also includes Fluent Bit, Vector and others.
Observability agents play a critical role in the nuts-and-bolts workings of observability. They typically offer data collection, data processing and data transport, ensuring telemetry data is transmitted accurately and playing a critical role in monitoring system performance. They help identify unknown unknowns, so teams can troubleshoot and mitigate performance issues before they become problems. That's the gold standard of observability functionality.
When used for data collection, an observability agent collects data sent to it from one or many sources. In addition to receiving data, it sends data to an endpoint, such as a Grafana panel for visualization, and it can be configured to collect certain types of logs, traces and metrics.
Initially, you can opt not to use an observability agent if you're already deploying an application that is instrumented to send telemetry data directly to the observability platform. The collectors can be useful when monitoring an application that can't be instrumented, since that's a very common use case as well, Braydon Kains, a software developer at Google, told The New Stack.
Without collector functionality, you'd need to configure monitoring separately for each backend or user, which can be cumbersome. An observability collector instead serves as a single endpoint for all applications and microservices, letting you view and manage them collectively in a consolidated view on a platform like Grafana. While Grafana provides certain alternatives without an OpenTelemetry Collector, the collector significantly simplifies this process.
However, observability agents can consume significant resources. To address this, the agents can themselves be monitored to ensure they don't consume resources excessively and drive up costs. The OpenTelemetry Collector, Fluent Bit, Vector and others are all robust and perform a wide range of tasks, but their comparative performance can vary.
Most popular agents offer Kubernetes filters and processors that fetch metadata from the Kubernetes API to enrich logs and other data. As Kains said during his KubeCon + CloudNativeCon talk, “How Much Overhead? How to Evaluate Observability Agent Performance,” Fluent Bit and Vector are becoming more popular, in addition to OpenTelemetry. “Each agent also has ways to build custom processing if the defaults available don’t meet your needs,” Kains told The New Stack after the conference.
“The biggest challenge with this is mostly that doing anything on a pipeline handling megabytes of data per second will have a multiplicative effect on your overhead. Especially with regex log or JSON log parsing, the effects grow quickly,” Kains said. “If you’re having trouble sending data fast enough, I highly recommend increasing the number of workers or leaning into the threading implementation of the agent where possible.”
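The multiplicative effect Kains describes is easy to demonstrate with an illustrative micro-benchmark (this sketch is mine, not a measurement of any particular agent): even a small per-line parse cost, multiplied by tens of thousands of lines per second, becomes real CPU overhead.

```python
import json
import re
import timeit

# One synthetic log line; a busy pipeline parses thousands of these per second.
line = '{"ts":"2023-11-06T12:00:00Z","level":"info","msg":"GET /api/v1 200"}'
pattern = re.compile(r'"level":"(\w+)".*"msg":"([^"]*)"')

# Time 100,000 parses of each kind. At agent throughput, the per-line
# cost multiplies: microseconds per parse become whole CPU cores.
json_t = timeit.timeit(lambda: json.loads(line), number=100_000)
regex_t = timeit.timeit(lambda: pattern.search(line).groups(), number=100_000)

print(f"json: {json_t:.3f}s, regex: {regex_t:.3f}s per 100k lines")
```

The absolute numbers depend on the machine; the point is that whichever parser you choose, its cost scales linearly with log volume.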
Exporting is one of the only steps in the pipeline that can easily be parallelized, Kains said, since most backends can handle timestamps arriving slightly out of order. Fluent Bit, for example, lets you configure eight workers on an output, creating a thread pool of eight workers sending data simultaneously. If the default processing isn't keeping up, dispatching data to the thread pool and letting one of the workers handle the slower sending step can significantly improve the pipeline's efficiency, Kains said.
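The worker pattern Kains describes can be sketched with a thread pool. Everything here is a simplified illustration: `export_chunk` is a hypothetical stand-in for an output plugin flushing buffered records to a backend, not Fluent Bit's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an output plugin flushing a chunk of log
# records to a backend over the network (the slow, parallelizable step).
def export_chunk(chunk):
    return sum(len(record) for record in chunk)  # pretend "bytes sent"

# Sixteen buffered chunks of 100 records each, waiting to be exported.
chunks = [[f"log record {i}-{j}" for j in range(100)] for i in range(16)]

# Eight workers flush chunks concurrently, analogous to setting an
# output's `workers` option to 8 in Fluent Bit.
with ThreadPoolExecutor(max_workers=8) as pool:
    sent = list(pool.map(export_chunk, chunks))

print(f"exported {len(sent)} chunks, {sum(sent)} bytes")
```

Because each worker sends independently, chunks may arrive at the backend out of order, which is exactly why out-of-order timestamp tolerance in the backend matters.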
How to Test
Organizations will often have to determine independently which agent is best for them and what overhead to expect, Kains said. “The only way is to try running it. If you can replicate your production environment, install the agent, configure it and monitor metrics,” Kains said. “That’s the best way to get an answer.”
If replicating the production environment is challenging, consider using test workloads like a log generator or Prometheus scraping, Kains said. LogBench from AWS is a good log generator for testing log pipelines. For Prometheus scraping, set up a mock server that serves a copy of the scrape text. “If you expect high cardinality scenarios, especially for database metrics, force high cardinality situations to stress test the agent’s performance. If you’re dissatisfied with the evaluation results, consider doing less or offloading work to reduce resource usage. Aggregator nodes and backend processing can also help manage resource usage,” Kains said. “If you encounter unacceptable performance or find a regression, open an issue for maintainers with detailed information, including ways to replicate the issue and relevant performance data such as graphs, CSVs, Linux perf reports or pprof profiles.”
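A minimal log generator along these lines can be written in a few lines of Python. This is a sketch under stated assumptions: the file path, rate parameters and record shape are illustrative, not LogBench's format.

```python
import json
import time

def generate_logs(path, lines_per_sec, duration_sec):
    """Write synthetic JSON log lines to `path` at a steady rate, so an
    agent tailing the file sees a predictable, repeatable load."""
    written = 0
    end = time.monotonic() + duration_sec
    with open(path, "w") as f:
        while time.monotonic() < end:
            batch_start = time.monotonic()
            for _ in range(lines_per_sec):
                f.write(json.dumps({"level": "info",
                                    "msg": f"synthetic log {written}"}) + "\n")
                written += 1
            f.flush()
            # Sleep out the remainder of the second to hold the rate steady.
            time.sleep(max(0.0, 1.0 - (time.monotonic() - batch_start)))
    return written

# Example: point the agent at a file, then generate load against it, e.g.
# generate_logs("/tmp/synthetic.log", lines_per_sec=5000, duration_sec=60)
```

While the generator runs, watch the agent's CPU and memory to see how it copes at that rate, then ramp `lines_per_sec` up until something gives.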
Kains’ team at Google works on the Google Cloud Ops Agent, which merges two agents, using Fluent Bit for log collection and OpenTelemetry for gathering metrics and traces. Behind the scenes, the team maintains a central configuration layer that generates configurations for both OpenTelemetry and Fluent Bit, optimized for users running primarily on plain virtual machines.
A while back, the team was interested to see if OpenTelemetry logging was ready to be used in the Ops Agent in place of Fluent Bit. “This would have allowed us to unify entirely on the OpenTelemetry Collector,” Kains said. “At the time, OpenTelemetry logging was not quite mature enough to stand up to Fluent Bit’s throughput and memory usage, so we opted not to go for it at the time. We haven’t yet updated those benchmarks, so it’s hard to say how those would go today.”
For most users, however, relying on Google infrastructure to benchmark agents would be prohibitively expensive and overly complex. “The benchmarks I ran would not have been reproducible by the community,” Kains said. “This is something I intend to work on over the new year, revising our benchmarking and performance evaluation strategies and technology to be open source and not rely on any Google-specific technologies or infrastructure.”
However, using AWS’s LogBench or even the script that Kains’ team created, it is possible to manually generate logging loads for the agent, then either watch and compare metrics directly through a tool on the VM like htop, or collect the metrics with scripts that gather information from /proc or something similar, Kains said. “I hope to create either guides or tools that can be open sourced to make this benchmarking more accessible to less technical users,” Kains said. “I don’t have any exact plans yet but I hope to have more to say in the coming months.”
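Scripting against /proc can be as simple as the sketch below, which reads the resident set size that htop shows in its RES column. This is my own minimal example (Linux only), not the script Kains' team uses.

```python
import os

def rss_kib(pid):
    """Return the resident set size (VmRSS) of `pid` in KiB, read from
    /proc/<pid>/status, the figure htop reports as RES (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for status_line in f:
            if status_line.startswith("VmRSS:"):
                return int(status_line.split()[1])  # value reported in kB
    return 0  # kernel threads carry no VmRSS line

# Sample our own process as a demonstration; in practice you would pass
# the agent's PID and append a sample to a CSV every few seconds.
print(f"current RSS: {rss_kib(os.getpid())} KiB")
```

Sampling this in a loop while the load generator runs yields a memory-over-time series you can graph and attach to an upstream issue if you find a regression.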