Q&A: Why Observability Data Sampling Can Cost DevOps Teams Time and Money
The shift to highly distributed, cloud native environments continues to require more fine-tuned levels of observability. The pandemic has further accelerated this trend: as development and other DevOps teams become even more geographically dispersed, a centralized system to pool and analyze high-cardinality telemetry data has become that much more critical.
To help DevOps teams extend the reach of their observability tools, Honeycomb, a leading observability provider, has released Refinery, an open source sampling proxy. Refinery further “refines” the high-cardinality data analysis organizations increasingly require. Refinery also complements Honeycomb’s enterprise offerings that include service-level objectives (SLOs), its patented Secure Tenancy, training and other support.
The New Stack recently spoke with Christine Yen, CEO and co-founder of Honeycomb, about how Refinery and Honeycomb’s enterprise offerings can help DevOps teams meet today’s often highly demanding observability needs, along with a number of other challenges. Such challenges include how to meet customer needs by getting access to the right high-cardinality data on an as-needed basis, in a way that remains affordable.
What problems does Refinery solve for DevOps teams struggling with observability? How were they getting by before?
For the vast majority of applications, the incremental cost of collecting observability data is minimal and unobtrusive. But for applications running at scale — applications that generate tens of billions of events per month, or hundreds of billions per year — every byte generated adds up to immense amounts of data and traffic to manage. At that scale, the cost of transmitting and saving every single bit of observability data your system generates dramatically outweighs the benefits that data is intended to provide in the first place. When your stream of observability data turns into a raging flood, there’s a tradeoff to consider: Exactly how much data do you need to achieve the desired result, without breaking other systems, or the bank?
For enterprises that operate at that hundreds-of-billions scale, the reality of most applications is that a vast majority of their events are successful and virtually identical. So they don’t need to keep that entire torrential flood of data. Instead, what they need in order to debug their applications effectively is to keep a representative sample of successful (or “good”) events, against which they can compare the unsuccessful (or “bad”) events. That’s sampling in a nutshell. Sampling their observability data gives them a way to ensure they still get a complete picture of activity in their system, all while keeping costs under control.
But, until now, that sampling approach has been a terrible experience. Most observability tools take one of three approaches to sampling:
- A naive, blunt-force constant sampling approach.
- A black-box, “trust in the vendor’s logic” sampling approach that doesn’t allow input from the customer on what matters to them.
- Or they don’t sample at all, which, at large scale, wastes a lot of money.
At Honeycomb, we believe in providing good defaults, but always empowering our customers to retain control — and that’s true of our sampling strategy with Refinery as well. Our users can specify which traces are less interesting than others and sample them at different rates, capturing rare events frequently and frequent events rarely. They send Honeycomb a fraction of their observability data, and our visualizations will do the math behind the scenes to reconstruct a view of the entire data set from just the samples received.
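The "do the math behind the scenes" part can be illustrated with a minimal Python sketch. This is not Honeycomb's actual implementation; the `sample_rate` field and the event shapes are illustrative assumptions. The idea is simply that each kept event records the rate at which it was sampled, so each one stands in for that many original events:

```python
# Illustrative sketch (not Honeycomb's actual code): each kept event
# carries the rate at which it was sampled, and totals are reconstructed
# by weighting each event by that rate.
def reconstruct_count(sampled_events):
    """Estimate the original event count from a weighted sample."""
    return sum(e["sample_rate"] for e in sampled_events)

# Keep rare errors at 1-in-1 and common successes at 1-in-100:
kept = [
    {"status": 500, "sample_rate": 1},    # every error kept
    {"status": 200, "sample_rate": 100},  # stands in for ~100 successes
    {"status": 200, "sample_rate": 100},
]
print(reconstruct_count(kept))  # prints 201
```

The same weighting extends to averages, percentiles and breakdowns: rare "bad" events survive at full fidelity while the flood of identical "good" events is represented by a handful of weighted stand-ins.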
A core differentiator with Refinery is that it puts you in control and it gives you flexibility, choice and transparency. It’s open source so that anyone can understand how it works; it’s not a proprietary magic box. We give you several intuitive yet sophisticated methods to make the sampling decisions that are right for you; we don’t mysteriously make them for you. With Refinery, you’re always in control of the data you sample and, therefore, how much your observability solution will cost.
How does Refinery differ from Honeycomb’s previous sampling proxies?
Our previous sampling solutions were inflexible — their sampling rules were tightly coupled to the code. Specifically, you had to link our custom dynamic sampling libraries and set configuration at compile time, rather than at runtime. Refinery offers much more flexibility, not just in configuration, but also in deployment and scaling options.
Refinery runs on a customer’s infrastructure. It puts them in total control of their data and saves them money on not just the observability costs, but also network egress costs for all their cloud-based applications.
Can you elaborate on the tradeoff between providing the sampling data that engineering teams require to deliver the best user experiences and the cost associated with access to this high-cardinality data?
Absolutely. This may be answered in response to the intro question, but I’m happy to add more detail. Sampling, in general, is an approach for managing large volumes of data.
No companies operating at scale retain all of their data exhaust for operational and reliability purposes, nor would they want to — you’d need infrastructure for your monitoring and observability tooling that would dwarf your actual production clusters.
Sampling allows customers with high traffic volumes to get a sketch of what’s happening in their systems without incurring the costs of a huge, hi-res picture. Refinery enables customers to get a hi-res view of the details that matter in that sketch. The rise of distributed tracing has normalized, to an extent, the idea of sampling.
For instance, you can’t keep 100% of all traces at Facebook or Netflix scale. So how do you pick the most interesting and representative N percent? But for those of us without Facebook- or Netflix-sized budgets, the optimization problem becomes: How can you get the best sketch for the lowest price?
Refinery takes the conversation one step further: Our users can decouple the application logic itself (in the code) from the logic defining “most interesting” (usually shaped by the characteristics of the production traffic hitting said code), while still customizing it to their heart’s content. Because our product managers can’t know what will be interesting in your traffic patterns as well as you can, we endeavor to provide a solid baseline while allowing you to then make it your own.
Put another way, the conditions that drive changes to a sampling strategy are almost always external to the code itself, unpredictable, and extremely specific to your business. A couple of examples of this occur when:
- One customer among many is suddenly driving an increased amount of traffic.
- Your engineering leadership introduces a performance initiative to eliminate requests over N seconds in duration, which means you then find events lasting N+ seconds extremely interesting.
Refinery allows our customers to tune those parameters in order to keep a handle on the volume of data sent to their monitoring and observability solutions (and thus, a handle on their bills). They can mix and match sampling schemes and rules. They can also test those strategies in dry-run mode so that they don’t accidentally lose any critical data while they’re testing out new approaches.
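The rules-and-dry-run idea described above can be sketched in a few lines of Python. This is a loose, hypothetical model of rules-based trace sampling, not Refinery's actual rule engine or configuration format; the field names (`status`, `duration_ms`) and the rule list are assumptions for illustration:

```python
import random

# Hypothetical rules-based sampler: the first matching rule decides the
# sample rate. Keeping all errors and all slow requests mirrors the
# "N+ second requests are extremely interesting" example above.
RULES = [
    # (predicate, sample_rate): keep roughly 1 in `rate` matching traces
    (lambda t: t["status"] >= 500, 1),        # keep every error
    (lambda t: t["duration_ms"] > 2000, 1),   # keep every slow request
    (lambda t: True, 100),                    # 1-in-100 for everything else
]

def decide(trace, dry_run=False):
    """Return True if the trace should be kept; annotate it with its rate."""
    for predicate, rate in RULES:
        if predicate(trace):
            keep = random.randrange(rate) == 0
            if dry_run:
                # Dry-run mode: record the would-be decision, keep everything.
                trace["would_keep"], trace["sample_rate"] = keep, rate
                return True
            trace["sample_rate"] = rate
            return keep
    return True
```

In dry-run mode, every trace is forwarded but annotated with the decision the rules *would* have made, which is how a team can validate a new strategy without risking the loss of critical data.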
There is much discussion about the so-called “three pillars of observability.” How and why does this concept create duplicate data stores? Why is this unnecessary?
Honestly, my colleague Danyel Fisher, [a principal design researcher for Honeycomb], explains this extremely well in his The New Stack article “How the ‘3 Pillars of Observability’ Miss the Big Picture.”
The three so-called pillars are just three different types of data you might use for observability. The three-pillars discussion misses the point. It’s not about the data. It’s about what you do with that data. When the discussion focuses on three different data types, it’s easy to lose sight of the fact that you’re still just trying to solve the same one problem. Why are you trying to solve that one problem with three different data sets that duplicate data in different data stores managed by different types of tools? Vendors that prefer that definition of observability like it because they have three different tools to sell you.
Observability is about the new and unique ways you can understand what’s happening inside your production services by allowing you to see your services in cohesive ways that separate tools just can’t show you, no matter how much you try to connect them. It’s unnecessary to have one tool where your structured logs live, another tool for your traces, and yet another tool to see other events like metrics. When we talk about the challenges of running at scale, that type of duplication just exacerbates the problem. It’s incredibly inefficient, wasteful, and, most of all, ridiculously expensive for enterprises to maintain it all.
Instead, what we should be doing is focusing on how we solve that one problem with one cohesive set of capabilities. Engineering teams need one solid methodology for finding problems in their production services. And for those times when they start caring about the underlying data types because their spend is growing exponentially, that’s where Refinery comes in. Just like Honeycomb focuses on refining your observability experience, our sampling product helps you refine the experience of managing your observability data at scale.
How and why does Refinery consolidate observability data in a single data store (and thus replace the concept of “three pillars of observability”)?
Refinery isn’t responsible for consolidating observability data. Refinery — and sampling in general — is simply what makes it cost-effective to have a reliable observability experience when you are sending massive loads of data. But on the topic of having a single data store, we recognize that each “pillar” can be computed from some other data source.
At Honeycomb, we essentially see all of the data in those “pillars” as the same. You can take any log data, throw some trace IDs in there to define a hierarchical structure, and essentially make it into a trace. You can compute the value of any metrics from log data or from parts of traces. That’s the red herring of the “three pillars” marketing. All of that data is essentially the same, and each of these “three pillars” can be transformed into another. Asserting that there are three different important data types is a relic of vendors that (conveniently for them) have three legacy products they’re gluing together. They’re more concerned about the three SKUs they’d like to sell than about what problems their customers are trying to solve with said data.
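The "it's all the same data" point can be made concrete with a small Python sketch. The event shape and field names here (`trace_id`, `parent_id`, `duration_ms`) are illustrative assumptions, not any vendor's schema; the point is that one store of wide structured events yields both traces and metrics:

```python
from collections import defaultdict

# Wide structured events ("logs") that happen to carry trace/span IDs.
events = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /home", "duration_ms": 120},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a",  "name": "db.query",  "duration_ms": 80},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "name": "GET /home", "duration_ms": 95},
]

# A "trace" is just the same events grouped by trace ID and linked by parent ID.
traces = defaultdict(list)
for e in events:
    traces[e["trace_id"]].append(e)

# A "metric" is just an aggregate computed over the same events.
root_durations = [e["duration_ms"] for e in events if e["parent_id"] is None]
avg_latency = sum(root_durations) / len(root_durations)
print(avg_latency)  # prints 107.5
```

No second or third data store is involved: the trace view and the metric are two queries over one set of events.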