How Open Source Arrow Helps Solve Time Series Data Dilemmas
Companies and other organizations have been using metrics stored in time series databases (TSDBs) for critical functions such as monitoring, alerting and automating processes. However, they have had a harder time deriving other insights and value from those databases due to limitations imposed by cardinality constraints and specialized query languages.
Now, the evolution of Apache Arrow — a popular open source multilanguage toolbox for accelerated data interchange and in-memory processing — creates new opportunities for improved real-time analytics and time series applications beyond traditional use cases, in areas such as climate modeling, finance and even AI.
Indeed, users of time series databases have historically struggled with high-cardinality use cases, according to Rachel Stephens, an analyst for RedMonk. High-cardinality data sets are those that have a large and often unbounded set of unique possible values in a given field.
For example, take user IDs, which have a large number of possible distinct values, or trace IDs, Stephens told The New Stack. This has historically meant that in infrastructure monitoring use cases, TSDBs were effective for measuring metrics over time, but cardinality limitations didn’t allow for logging or tracing use cases.
Apache Arrow is language-agnostic and defines a columnar data format, which makes it easier to build and query large-scale databases that must transfer and process data in fractions of a second for access by distributed end users.
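Arrow's core idea — storing each field's values contiguously rather than record by record — can be illustrated with a stdlib-only Python sketch (this shows the columnar layout concept, not Arrow's actual memory format):

```python
from array import array

# Row-oriented: each record stored together, as in a traditional row store.
rows = [
    {"host": "a", "cpu": 0.41},
    {"host": "b", "cpu": 0.73},
    {"host": "a", "cpu": 0.52},
]

# Column-oriented: each field stored contiguously, as Arrow does in memory.
# Numeric columns can live in packed, typed buffers.
hosts = ["a", "b", "a"]
cpu = array("d", [0.41, 0.73, 0.52])

# An aggregate touches only the one buffer it needs -- no per-record
# object traversal -- which is what makes vectorized execution fast.
avg_cpu = sum(cpu) / len(cpu)
```

Because every participating system agrees on this layout, data can move between tools without being re-serialized field by field.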
Working with Apache Arrow, InfluxData applied its domain expertise in time series data to address specific requirements, such as compactions for more efficient data storage, while leveraging Arrow's upstream tools and libraries to build a high-performance database.
Arrow, Parquet and Rust
InfluxData also draws upon the Apache Parquet column-oriented data storage format, along with the Arrow in-memory format. Apache Parquet is designed for data storage and retrieval and provides efficient data compression. The company also implemented its new database engine, which writes and reads that data, in the Rust programming language.
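Much of Parquet's compression efficiency comes from columnar encodings such as run-length and dictionary encoding: values within one column tend to repeat, so they compress far better stored together than interleaved with other fields. A stdlib-only sketch of run-length encoding a low-cardinality tag column (an illustration of the principle, not Parquet's actual format):

```python
from itertools import groupby

# A sorted, low-cardinality column -- e.g. a "region" tag -- as it might
# appear inside one column chunk.
region = ["eu"] * 4 + ["us"] * 5 + ["ap"] * 3

# Run-length encoding: store each value once with its repeat count,
# so 12 strings collapse into 3 (value, count) pairs.
rle = [(value, sum(1 for _ in run)) for value, run in groupby(region)]

# Decoding reverses it losslessly.
decoded = [value for value, count in rle for _ in range(count)]
```

The same repetition that makes high-cardinality tags hard to index makes low-cardinality columns cheap to store.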
InfluxData has integrated Arrow into InfluxDB, allowing users to take advantage of columnar data formats and improved analytics. This development enables sub-second query responses with InfluxDB for its time series data platform and storage. As a result, real-time analysis is now possible for monitoring, alerting and analytics on large fleets of devices.
The end result, based on InfluxData’s work with Apache Arrow, Apache Parquet, Apache DataFusion and Rust, is InfluxDB 3.0, a new time series database engine that the company says is much more efficient than its predecessor without being limited by cardinality restrictions.
“We have achieved this by employing optimization techniques like vectorization, predicate pushdowns, aggregate pushdowns, parallelism and more,” Rick Spencer, vice president of product at InfluxData, told The New Stack. “Collectively, these advancements enable you to perform analytics at the leading edge of data processing.”
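Predicate pushdown means filters are applied as close to the stored data as possible, so rows that can never match are skipped rather than read and discarded. A simplified stdlib sketch of the idea, using hypothetical per-file min/max statistics in the spirit of Parquet row-group stats:

```python
# Each "file" carries min/max statistics for its time column, as Parquet
# row groups do. A pushed-down time predicate prunes whole files up front.
files = [
    {"t_min": 0,   "t_max": 99,  "rows": [(t, t * 0.1) for t in range(0, 100)]},
    {"t_min": 100, "t_max": 199, "rows": [(t, t * 0.1) for t in range(100, 200)]},
    {"t_min": 200, "t_max": 299, "rows": [(t, t * 0.1) for t in range(200, 300)]},
]

def query(files, t_lo, t_hi):
    """Sum values for t_lo <= t < t_hi, skipping files the stats rule out."""
    scanned, total = 0, 0.0
    for f in files:
        if f["t_max"] < t_lo or f["t_min"] >= t_hi:
            continue  # pruned by the pushed-down predicate; never read
        scanned += 1
        total += sum(v for t, v in f["rows"] if t_lo <= t < t_hi)
    return scanned, total

# A query over t in [150, 180) reads one file instead of three.
scanned, total = query(files, 150, 180)
```

Aggregate pushdown extends the same principle, computing sums and counts inside the scan instead of shipping raw rows upward.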
Thus, developers can build high-performance databases by leveraging the upstream tools and libraries that Arrow provides, Spencer said.
“InfluxData is a poster child for Apache Arrow, which we used to build InfluxDB’s core engine,” he said.
InfluxData also plans to release a cluster version of InfluxDB 3.0 so developers can run it in their own Kubernetes clusters, Spencer added. “This will give them more flexibility and control over their deployments.”
Exploring the Layers
InfluxData summarized how Arrow, Parquet and Rust support InfluxDB’s new engine this way:
- Rust is a cutting-edge programming language designed for speed, efficiency, reliability and memory safety.
- Apache Arrow is a framework for defining in-memory columnar data.
- Apache Parquet is a column-oriented durable file format.
- Arrow Flight is a client-server framework designed to transport large datasets over network interfaces without significantly impacting performance.
- Apache DataFusion drives the query engine and provides native SQL support.
InfluxData also separated out the compute and storage layers — with Apache Parquet used as the persistence format for the object store — and separate ingest, query and compression layers of compute on top, RedMonk’s Stephens said.
“This ability to work with unbounded distinct values opens up more use cases for time series engines,” she said.
In previous versions, InfluxDB indexed data based on tags. In 3.0, InfluxDB writes data into Parquet files (which have high compression), stored in object storage (which is much more scalable and cheaper than SSD storage), and then queried with a query tier (which is more elastic).
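The storage side of that design can be pictured as time-partitioned Parquet files under object-store keys, with the query tier listing only the partitions a time range touches. A stdlib sketch under a hypothetical one-file-per-hour key scheme (not InfluxDB's actual layout):

```python
from datetime import datetime, timedelta, timezone

def partition_key(ts: datetime) -> str:
    # Hypothetical object-store key: one Parquet file per table per hour.
    return ts.strftime("cpu/%Y/%m/%d/%H.parquet")

def keys_for_range(start: datetime, end: datetime) -> list:
    """List the hourly partitions a [start, end) query must read."""
    cur = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while cur < end:
        keys.append(partition_key(cur))
        cur += timedelta(hours=1)
    return keys

start = datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc)
end = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
# A 2.5-hour query touches only three of the day's 24 hourly files.
keys = keys_for_range(start, end)
```

Because the files sit in cheap object storage and carry no server state, the query tier can scale up or down independently of how much data is retained.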
Additionally, users can now query in SQL in addition to InfluxQL thanks to Apache DataFusion, Stephens said: “This improves ecosystem compatibility and the ability for users to integrate InfluxDB into more upstream communities, as well as into their existing tools.”
Advantages in Scaling
From a CTO’s perspective, InfluxDB 3.0 is of particular interest when working with a database or time series data system while experiencing challenges with scale or cardinality limitations, according to Spencer.
“Many customers come to us when their existing systems no longer meet their needs due to scaling issues,” he said.
“InfluxDB 3.0 provides a purpose-built solution for time series data, allowing organizations to handle large volumes of observability data and the full range of time series data. This means unlimited quantities of metrics, events and traces, providing valuable insights for monitoring and analysis purposes.”
The compatibility with popular libraries and tools is another advantage InfluxDB 3.0 offers, according to the company.
For instance, Pandas, a widely used Python analysis library, has native support for Arrow, and the next version of Pandas will be based on Arrow. This compatibility opens up possibilities for various use cases, Spencer noted, such as machine learning pipelines.
Additionally, the Flight and Flight SQL client-server protocols enable seamless integration with other tools like Dremio, allowing data availability across the organization, Spencer said.
“Developers can start using Arrow by simply grabbing the right libraries and tools,” he said. “For instance, they can use the InfluxDB client library and write SQL queries, which can be converted into Pandas data frames for further analysis in tools like Jupyter Notebook. Signing up for an InfluxDB account is an easy way to get started.”
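The workflow Spencer describes — issue a SQL query, get the result back in a shape a dataframe can be built from — can be sketched end to end with stdlib sqlite3 standing in for the InfluxDB client (the table and column names here are invented for illustration):

```python
import sqlite3

# sqlite3 stands in for a time series SQL endpoint in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?)",
    [("a", 0.4), ("a", 0.6), ("b", 0.9)],
)

cur = conn.execute(
    "SELECT host, AVG(usage) AS avg_usage FROM cpu GROUP BY host ORDER BY host"
)
columns = [d[0] for d in cur.description]

# Columnar dict-of-lists: the shape pandas.DataFrame(result) accepts
# directly, so the result drops straight into notebook analysis.
result = {c: [] for c in columns}
for row in cur:
    for c, v in zip(columns, row):
        result[c].append(v)
```

In practice the client returns the result in Arrow's columnar format already, so no per-row conversion like the loop above is needed.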
Apache Arrow is rapidly becoming a standard for communications between tools used for big data storage and analytics. Initially, adoption was driven purely by Apache Arrow’s performance capabilities.
However, the inherent network effect of the Arrow ecosystem is driving adoption as well, with newer market entrants adopting Arrow to ease integration into existing developer tools and workflows. The result is a win-win for both sides, with InfluxData among the early adopters leading the charge.