What’s ‘Pipeline-Free’ Real-Time Data Analytics?
Organizations across various industries are dealing with massive volumes of data that require extensive analysis and querying to help them better serve their customers. At this scale, the data can involve tens of thousands of metrics and dimensions, and data stores running to several petabytes.
Achieving real-time analytics usually takes a monumental effort to implement the query layer. Many organizations turn to open source tools like Apache Druid or Presto, along with data denormalization in separate pipelines, to ingest diverse data sources for multi-table queries.
However, this process demands significant resources and expertise: teams of engineers are needed for implementation and maintenance, which makes the work slow and costly. Even minor schema changes can require days of effort, creating challenges for large organizations.
“Many people tend to give up on real-time analytics because of the organizational complexities they face when dealing with the software,” Sida Shen, product manager at CelerData, told The New Stack. “It’s the primary challenge they encounter, and it often leads them to dismiss the idea altogether.”
In this article, we explore an alternative approach to data analytics that eliminates the need for traditional data pipelines.
The Limits of Traditional Data Pipelines
Traditional pipelines lack flexibility, making it cumbersome to modify data models or pipelines. Each component adds complexity and increases the possibility of failure. Those components will likely lead to degradation in performance over time, not to mention the high operational costs.
Proper real-time analytics relies on various data transformations and data-cleaning processes. Traditional setups also rely on pre-aggregation, which means performing certain calculations in advance, and on denormalization. (Denormalization means adding precomputed, redundant data to a relational database to improve its read performance.)
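To make the two terms concrete, here is a minimal sketch of what a pipeline typically does ahead of query time. It uses Python's built-in sqlite3 module purely as a stand-in database, not StarRocks, and the table and column names (customers, orders, orders_wide, revenue_by_region) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 75.0), (12, 2, 40.0);

    -- Denormalization: copy each customer's region onto every order row,
    -- producing one redundant "wide" table that needs no join at read time.
    CREATE TABLE orders_wide AS
        SELECT o.order_id, o.amount, c.region
        FROM orders o JOIN customers c ON c.customer_id = o.customer_id;

    -- Pre-aggregation: compute per-region totals before anyone asks for them.
    CREATE TABLE revenue_by_region AS
        SELECT region, SUM(amount) AS total
        FROM orders_wide GROUP BY region;
""")

# Dashboards then read the small precomputed table instead of the raw data.
rows = conn.execute(
    "SELECT region, total FROM revenue_by_region ORDER BY region"
).fetchall()
wide_count = conn.execute("SELECT COUNT(*) FROM orders_wide").fetchone()[0]
print(rows)
```

Both derived tables go stale as soon as orders or customers change, which is why a separate pipeline usually has to keep rebuilding them.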
A “pipeline-free” solution addresses delays in data refreshing, minimizes latency, and reduces the complexity associated with denormalization and pre-aggregation steps, which often introduce time limits and delays in real-time analytics.
“The main advantage of going pipeline-free for real-time analytics is that it becomes much more accessible to a broader range of users, including those who may not be experienced engineers,” Shen said.
“With fewer complexities, organizations can easily manage their data and keep their five tables intact within the database, without resorting to the cumbersome process of pre-joining them into one table. This added flexibility is a significant boon, making the entire data more efficient.”
Pre-aggregation and denormalization are “Band-Aids” needed to enable real-time queries, along with dashboards and data applications across distributed data sources within and outside of the enterprise, according to Torsten Volk, an analyst for Enterprise Management Associates.
“Both practices sacrifice efficiency, flexibility and cost for query performance and simplicity,” Volk told The New Stack. “The more data sources we connect and the more historical data we include, the more we blow up the size of the underlying data stores, and the more SQL we have to write to join database tables.
“This makes it harder to manage and update data pipelines,” he added. “All of these factors dramatically lower the enthusiasm for building real-time data apps and queries, preventing enterprises from enhancing and automating their decision-making capabilities.”
A pipeline-free, real-time analytics alternative can significantly reduce the headaches organizations face during real-time analytics projects.
Using multi-table joins eliminates the denormalization step and streamlines real-time analytics, and it offers a substantial advantage in managing and implementing data pre-aggregation inside the database.
Joins merge data from two or more tables in a relational database into a single result, based on related columns. CelerData describes the joins it offers with open source StarRocks as essential for real-time analytics. Performing joins at query time eliminates a vast number of steps, resources and operational complexities, making real-time analytics more manageable and efficient.
Airbnb recently migrated to StarRocks. With it, Airbnb engineers can maintain the tables in a snowflake schema and perform joins on the fly at query time, according to CelerData.
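As a rough illustration of joining on the fly, the sketch below keeps three normalized tables in a small snowflake-style layout and joins them only when a query runs. The tables (cities, listings, bookings) and the helper nights_by_country are invented for this example and do not reflect Airbnb's actual schema. sqlite3 again stands in for the database, so this shows the query pattern only, not StarRocks' performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities   (city_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE listings (listing_id INTEGER PRIMARY KEY,
                           city_id INTEGER REFERENCES cities(city_id));
    CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY,
                           listing_id INTEGER REFERENCES listings(listing_id),
                           nights INTEGER);
    INSERT INTO cities   VALUES (1, 'FR'), (2, 'JP');
    INSERT INTO listings VALUES (100, 1), (101, 1), (102, 2);
    INSERT INTO bookings VALUES (1, 100, 3), (2, 101, 2), (3, 102, 5);
""")

def nights_by_country(conn):
    """Join fact and dimension tables at query time; no wide table is built."""
    return conn.execute("""
        SELECT c.country, SUM(b.nights)
        FROM bookings b
        JOIN listings l ON l.listing_id = b.listing_id
        JOIN cities   c ON c.city_id = l.city_id
        GROUP BY c.country
        ORDER BY c.country
    """).fetchall()

before = nights_by_country(conn)

# A new booking is visible to the very next query, with nothing to rebuild.
conn.execute("INSERT INTO bookings VALUES (4, 102, 1)")
after = nights_by_country(conn)
print(before, after)
```

Because the result is computed from the base tables each time, there is no intermediate table to refresh when new rows arrive.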
“This current definition of ‘pipeline-free’ refers to freeing data pipelines from the overhead generated by data scientists and analysts working around the limitations of joining data across standard row-based database systems,” Volk said.
“Everyone who has written the SQL code required to pull this off for a few dozen tables knows how hard it is to predict query performance and query cost and also how difficult it is to still understand your own query next week. Eliminating this overhead is a really big deal.”
An Open Source Alternative
As mentioned previously, many organizations struggle with real-time analytics due to the complexities of setting up data pipelines and managing denormalization processes. This often deters them from fully embracing real-time data analysis, leaving them feeling overwhelmed and opting for traditional batch processing solutions.
However, there is a promising alternative that can simplify real-time analytics and make it accessible to a broader audience. By leveraging tools like StarRocks, an open source project created in 2020, organizations can achieve real-time analytics without the need for extensive data pipelines or additional stream-processing tools.
“StarRocks provides built-in functionality to support these operations, eliminating the need for additional tools like Spark Streaming,” Shen said.
Thus far, StarRocks, an online analytical processing (OLAP) database donated to The Linux Foundation in February, has racked up more than 5,000 GitHub stars and 1,200 forks.
StarRocks is a sub-second massively parallel processing (MPP) OLAP database for full analytics scenarios, including multidimensional analytics, real-time analytics and ad hoc queries, according to the project’s documentation.
While CelerData created and largely maintains the project, it is drawing interest from the developer community, with over 1,500 active pull requests, 70 active developers and 624 commits to the main GitHub branch this month.
Indeed, StarRocks has drawn “an active developer community,” Volk said, adding, “These metrics confirm that the StarRocks project is real and that there is significant demand for a database platform that enables real-time analytics right out of the box.”
Pre-aggregation plays a crucial role when using StarRocks. By performing calculations ahead of time, organizations can streamline analytics processes and significantly reduce resource and time consumption. Furthermore, with pipeline-free real-time analytics, it’s more efficient to manage data refreshing; the approach also minimizes latency and delays in data availability.
One of the most significant advantages of adopting this “pipeline-free” strategy is flexibility. Unlike traditional solutions that force organizations to pre-join multiple tables into a wide table, pipeline-free analytics allows them to keep individual tables in the database. This freedom to maintain separate tables and make schema changes without backfilling historical data can prove invaluable for scaling and managing data efficiently.
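The backfilling point can be sketched with a toy comparison. Adding a new customer attribute to a normalized dimension table touches only that table's rows, while the same change on a pre-joined wide table means rewriting every historical row. The example uses sqlite3 as a stand-in with hypothetical tables (customers, orders, orders_wide) and a made-up tier column. It illustrates the general idea only and says nothing about how StarRocks implements schema changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE orders_wide (order_id INTEGER PRIMARY KEY, amount REAL, region TEXT);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EMEA"), (2, "APAC")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, 1 + i % 2, 10.0) for i in range(1000)],
)
# The pre-joined wide table that a traditional pipeline would maintain.
conn.execute("""
    INSERT INTO orders_wide
    SELECT o.order_id, o.amount, c.region
    FROM orders o JOIN customers c USING (customer_id)
""")

# Separate tables: the new attribute lives on the small dimension table.
conn.execute("ALTER TABLE customers ADD COLUMN tier TEXT")
dim_rows_touched = conn.execute("UPDATE customers SET tier = 'standard'").rowcount

# Wide table: every historical order row has to be backfilled.
conn.execute("ALTER TABLE orders_wide ADD COLUMN tier TEXT")
wide_rows_touched = conn.execute("UPDATE orders_wide SET tier = 'standard'").rowcount

# Historical orders pick up the new attribute through the join at query time.
by_tier = conn.execute("""
    SELECT c.tier, COUNT(*)
    FROM orders o JOIN customers c USING (customer_id)
    GROUP BY c.tier
""").fetchall()
print(dim_rows_touched, wide_rows_touched, by_tier)
```

In this toy case the dimension update touches two rows while the wide-table backfill touches a thousand, and that gap grows with the amount of history stored.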
Incorporating StarRocks into real-time analytics empowers organizations to handle massive amounts of data with ease. Whether they are large corporations or small Software as a Service providers, StarRocks adapts to various use cases and data sizes. The end-to-end latency of less than 10 seconds ensures timely and accurate results, making it an ideal choice for organizations seeking efficient real-time data analysis solutions.
Ultimately, by embracing pipeline-free real-time analytics with StarRocks, organizations can streamline their processes, minimize complexity and unlock the full potential of their data analytics endeavors.
“Addressing query performance, query cost, data freshness and complexity is a critical step forward as it makes real-time analytics more accessible for enterprise use cases,” Volk said. “It is all about squeezing the most value out of the data you have ‘lying around anyway,’ and this is what the StarRocks database aims to help you achieve.”