Open Source Histograms: The Future of Telemetry Monitoring
Latency measurements have become an important part of IT infrastructure and application monitoring. The latencies of a wide variety of events, such as requests, function calls, garbage collection, disk IO, system calls and CPU scheduling, are of great interest to engineers operating and developing IT systems. There are, however, a number of technical challenges associated with managing and analyzing latency data. The volume emitted by a single data source can easily become very large. Data also has to be collected and aggregated from a large number of different sources and stored over long time periods to allow historic comparisons and long-term service quality estimations against service-level objectives (SLOs).
To address these challenges, a compression scheme can be applied to drastically reduce the size of the data before storage and transmission. Histograms are the most accurate, cost-effective technology to enable compression.
Histogram Data Structures
A histogram is a data structure that models the distribution of a set of samples, such as the age of every human on earth. Instead of storing each sample as its own record, it groups samples into buckets, or bins, and stores only a count per bin. This yields significant data compression, which in turn allows for very high metric transmission and ingestion rates, high-frequency, real-time analytics and economical long-term storage. Histograms are also particularly useful in handling the breadth and depth of metric data produced by container technologies such as Kubernetes.
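As a minimal sketch of the bucketing idea, the Python below groups a handful of ages into fixed-width, ten-year bins and keeps only a count per bin. The sample values, the bin width and the function name build_histogram are assumptions made for illustration; production histograms, including the log-linear kind discussed later in this article, choose their bin boundaries differently.

```python
from collections import Counter


def build_histogram(samples, bin_width):
    """Group samples into fixed-width bins keyed by each bin's lower edge."""
    counts = Counter()
    for value in samples:
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(counts)


# Hypothetical sample data: sixteen ages instead of every human on earth.
ages = [3, 7, 12, 18, 21, 25, 29, 34, 35, 41, 47, 52, 58, 63, 70, 88]
age_histogram = build_histogram(ages, bin_width=10)

# Sixteen records collapse into nine (bin, count) pairs.
for lower_edge in sorted(age_histogram):
    print(f"{lower_edge}-{lower_edge + 9}: {age_histogram[lower_edge]}")
```

The storage cost depends on the number of occupied bins rather than the number of samples, so a billion ages would still fit in a few dozen counters.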
At Circonus, we’re passionate about histograms and how valuable they are for engineers and software developers, which is why we donated our histogram technology, OpenHistograms, to the open source community.
The problem is that the telemetry and monitoring industry has no single standard for histograms, and therefore all too frequently, users are leveraging them incorrectly, which has costly consequences. In this article, I’ll share why histograms are needed now more than ever and why the monitoring industry needs to embrace an open source, single-standard histogram technology.
Histograms: Needed Now More Than Ever
When the internet was small and users were not accessing services at high rates, you could more easily store and analyze each individual request and set standards around serving all requests accurately and quickly enough. Today there are many, many more user interactions being generated, collected and analyzed. But even more game-changing is that organizations now have multiple layers of systems, services and applications communicating with each other, generating an overwhelming volume of data, significantly more than users alone could produce. For example, if you’re running a database on a system and you expect your disks to perform operations at a certain speed, this activity alone could generate a million data points per second, which adds up to roughly 86 billion per day.
Now, ensuring that all requests are served fast enough becomes an impractical objective, from both a capability and an economic standpoint. It’s just not worth being perfect. So engineers must analyze the behavior of their systems and determine quantitatively what is good enough. If you’re serving web pages or an API endpoint, how many errors are you allowed to have? How fast do you need to serve requests? The difficulty with asking how fast most requests need to be is that the question has two variables: how fast (measured in milliseconds) and how many (expressed as a percentile).
This is a really hard statistics problem to solve. On top of this, organizations have significantly more data to store. If recording every single transaction is exorbitantly expensive and doing the math of analyzing latencies on every single transaction is also expensive, then engineers need some sort of model that allows them to inexpensively store all of those measurements and answer that question of how many, how fast. The histogram is a perfect model for all of that.
Histograms can collect, compress and store all data points (billions!) and allow engineers to accurately analyze what percentage of their traffic is slower or faster than a certain speed, at low cost and with minimal overhead. Critically, they allow engineers to change both of those variables on the fly, after data ingestion. So instead of saying, “I need 99% of requests to be served faster than one second,” you can start to ask, “What does it look like when I have 98% of requests served faster than 5,500 milliseconds?”
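To illustrate how both variables can be changed after ingestion, here is a small Python sketch that answers “how many” and “how fast” questions from bin counts alone. The latency counts, the 100-millisecond bin width and both function names are hypothetical, and the answers are only as precise as the bin boundaries allow.

```python
def fraction_faster_than(histogram, bin_width, threshold_ms):
    """Fraction of samples in bins that lie entirely below threshold_ms."""
    total = sum(histogram.values())
    below = sum(count for lower, count in histogram.items()
                if lower + bin_width <= threshold_ms)
    return below / total


def latency_at_percentile(histogram, bin_width, percentile):
    """Upper edge of the bin where the cumulative count reaches the percentile."""
    total = sum(histogram.values())
    target = total * percentile / 100
    running = 0
    for lower in sorted(histogram):
        running += histogram[lower]
        if running >= target:
            return lower + bin_width
    raise ValueError("empty histogram")


# Hypothetical stored histogram: bin lower edge in ms -> request count.
latency_histogram = {0: 700, 100: 200, 200: 60, 300: 25, 400: 10, 900: 5}

# Fix "how fast" and ask "how many".
print(fraction_faster_than(latency_histogram, 100, 500))

# Fix "how many" and ask "how fast", for two different percentiles.
print(latency_at_percentile(latency_histogram, 100, 99))
print(latency_at_percentile(latency_histogram, 100, 98))
```

Neither question had to be decided before the data was collected; the same stored counts serve any threshold or percentile chosen later.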
Without histograms, you have to phrase your questions precisely before you start collecting data, and engineers cannot know those questions with specificity and accuracy beforehand. Histograms allow you to store very large volumes of measurements compactly and answer more complex statistical questions after the fact, which is what’s needed in today’s service-centric, rapid release cycle environment.
Histograms Must be Open Source
At Circonus, we’re open source advocates and believe most technology should be open source because it assures users that they can be stakeholders in it. The most important reason we’re passionate about our histogram technology being open source, however, is that users absolutely must have an industry standard for histograms, so that organizations can use a single histogram technology across their monitoring stacks.
If you’re collecting your telemetry using different histograms from different vendors within your monitoring and observability stack (say, telemetry from your cloud provider and telemetry from your APM provider), you cannot merge the data between histograms without losing accuracy, because they use different bin boundaries or different techniques. Unfortunately, users often do merge this data, introducing significant error that carries into the subsequent data analysis. This ends up hurting the operator and the end user.
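The effect can be sketched in a few lines of Python. Two histograms that share one bin layout merge exactly by adding counts, while a histogram with different bin boundaries has to be re-binned first, which means guessing where inside each bin the samples fell. The sample values, the bin widths and the midpoint re-binning rule are assumptions chosen for illustration, not a description of any vendor's behavior.

```python
from collections import Counter


def bin_samples(samples, bin_width):
    """Fixed-width histogram keyed by each bin's lower edge."""
    return Counter((s // bin_width) * bin_width for s in samples)


def rebin_by_midpoint(histogram, src_width, dst_width):
    """Move each bin's whole count to the destination bin holding its midpoint."""
    rebinned = Counter()
    for lower, count in histogram.items():
        midpoint = lower + src_width / 2
        rebinned[int(midpoint // dst_width) * dst_width] += count
    return rebinned


# Hypothetical latencies (ms) observed by two different sources.
source_one = [12, 14, 31, 33, 47]
source_two = [5, 22, 24, 26, 49]

# Same layout on both sides: merging is plain addition and loses nothing.
exact = bin_samples(source_one, 10) + bin_samples(source_two, 10)

# Different layout on one side: its counts must be re-binned before merging.
foreign = bin_samples(source_two, 25)
lossy = bin_samples(source_one, 10) + rebin_by_midpoint(foreign, 25, 10)

print(sorted(exact.items()))
print(sorted(lossy.items()))
```

Both results contain ten samples, but the lossy merge reports five of them in the 10-19 ms bin where the exact merge reports two, so any percentile computed from it is skewed.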
The industry must focus on a single histogram model implementation because it increases compatibility between services and directly benefits the end user. Circonus’ implementation of histograms, Circllhist, has been in use in the industry since 2011. It has been independently tested and evaluated multiple times over the years and consistently deemed superior to other approaches in terms of balancing performance, accuracy, correctness and usability. With the goal of making telemetry interchangeable and mergeable between vendor platforms for all users, we recently released our histogram technology under the Apache 2.0 license to the open source community as OpenHistograms.
Circonus’ OpenHistograms are vendor-neutral log-linear histograms for the compression, mergeability and analysis of telemetry data. Two key differentiating factors for OpenHistograms are that they use base 10, which eases usability, and that they do not require floating-point arithmetic, so you can run them on embedded systems that don’t have floating-point units.
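One way to picture base-10 log-linear binning with integer-only arithmetic is the Python sketch below. It assumes each bin keeps two significant decimal digits plus a power-of-ten exponent, and it handles only positive integers; the function names log_linear_bin and bin_bounds are hypothetical and are not the OpenHistograms API.

```python
def log_linear_bin(value):
    """Return (mantissa, exponent) with 10 <= mantissa <= 99 such that
    mantissa * 10**exponent <= value < (mantissa + 1) * 10**exponent.

    Uses integer arithmetic only; this sketch accepts positive integers.
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive integers")
    mantissa, exponent = value, 0
    while mantissa >= 100:   # drop low-order digits, keep the two leading ones
        mantissa //= 10
        exponent += 1
    while mantissa < 10:     # scale single-digit values up to two digits
        mantissa *= 10
        exponent -= 1
    return mantissa, exponent


def bin_bounds(mantissa, exponent):
    """Integer [lower, upper) edges of a bin; non-negative exponents only."""
    if exponent < 0:
        raise ValueError("fractional edges are outside this sketch")
    scale = 10 ** exponent
    return mantissa * scale, (mantissa + 1) * scale


# Hypothetical latencies in nanoseconds.
for latency_ns in (7, 99, 100, 1234, 86_400):
    print(latency_ns, log_linear_bin(latency_ns))
```

Under these assumptions, 1234 lands in the bin covering 1200 up to 1300: bins are evenly spaced within each power of ten (the linear part) and grow tenfold from one power of ten to the next (the logarithmic part), and the edges are easy to read because they are plain decimal numbers.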
OpenHistograms allow users to seamlessly exchange telemetry between vendor platforms without introducing error. Organizations that are faced with the challenge of digesting and analyzing massive amounts of distribution data can now rely on a consistent, interchangeable and stable representation of that data, a significant capability for monitoring now and in the future.
Time for a Single Standard
The volume of data IT organizations are responsible for collecting and analyzing is growing substantially year over year. As a result, users are increasingly employing histogram technology as a way to measure service quality. A vast majority are merging telemetry data from different vendor histograms, and the resulting output is wrong, even though the error is not apparent. Organizations are drawing inaccurate conclusions about whether they are hitting their SLOs and basing key operational decisions on this data, which can cost them thousands of dollars a year.
Every engineer and app developer should feel confident that they can create a histogram, give it to someone, and know that they can accurately use it. By embracing vendor-neutral, industry-standard histogram technology, users have one source of truth and can rest assured their analysis is accurate.