CelerData Upends Real-Time Data Analytics with Dynamic Table Joins
The shift to real-time analytics, infrastructure, and architecture is impacting organizations across industries and use cases whether involving Internet of Things deployments like digital twins or wearables, horizontal concerns like supply chain management, or fraud detection and recommendation engines in AdTech, the need to analyze and act on data with low latency is only increasing.
The most accomplished OLAP databases for such tasks are written in C++ to accommodate these performance needs. Many integrate with streaming data platforms like Apache Flink or Spark Streaming to handle the preprocessing their architectures require for such timely analytics.
Regardless of the particular approach or database involved in such matters, there’s no getting around one simple fact that’s consistently proved determinative for real-time OLAP databases.
In almost all cases, the data that are analyzed is on more than one table.
“Aside from analyzing logs, or analyzing user behavior, and sometimes not even that, for every other scenario you actually need joins,” revealed CelerData product marketing manager Sida Shen. “There’s really not that many scenarios where you don’t need joins.”
CelerData’s real-time, open source OLAP database StarRocks is one of the few options in this space that dynamically performs join operations on tables with low latency data. Because of its architecture, this real-time database is considerably more flexible, swifter, and cost-effective than many of its competitors are — which produces tremendous advantages for users when it’s deployed at an enterprise scale.
According to Shen, StarRocks’ ability to rapidly perform dynamic joins on real-time data is “unique” among OLAP databases in this field. From an architectural perspective, this advantage is largely due to the fact that StarRocks “has a natively built cost-based optimizer,” Shen remarked, which supports scalable join operations. Typically, other OLAP databases can only process single table queries on real-time data and require preprocessing to join tables so organizations can query across them.
Considering the speed and sizes of the data in real-time analytics use cases, preprocessing for joins is “one of the most expensive things you can do with OLAP databases, joining two large tables,” Shen commented. Since StarRocks can join tables on the fly for these low latency use cases, its users avoid those costs and the time spent denormalizing their tables to facilitate joins. “Data lake engines can do joins because they do ETL jobs, but real-time OLAP databases give up on that because it needs a lot of optimization on the query planning side,” Shen explained. “Our architecture supports joins internally.”
Without the capability to dynamically join tables, other OLAP databases for real-time analytics account for this fact with denormalization processing that frequently entails platforms like Spark Streaming or Flink. “Denormalization is when you pre-join your tables together based on your query pattern,” Shen specified. After the tables are joined into a flat table in the preprocessing platform, the latter table is ingested and analyzed in the real-time OLAP database. It’s not uncommon for organizations to generate copious amounts of code for these operations, which may be tenuous.
“This is where it gets very complicated,” Shen admitted. “It’s very difficult to configure, it breaks a lot, and it requires a lot of resources. Just a lot of maintenance, and this is on the cost side, like hardware and man-hour costs.” Moreover, when schema changes arise, there’s a definite possibility of having to redo this preparation work. In that case, “you have to reconfigure the entire pipeline and sometimes you need to backfill all of the data for your flat table,” Shen observed. “Because one thing changes, the whole flat table can change.”
Organizations can avoid such inflexibility, costs, and time preprocessing their tables by employing a real-time OLAP database that joins tables at enterprise scale for instant data analysis. StarRocks’ architecture enables it to support in-memory data shuffling, which helps with joins and complicated aggregation operations. Data shuffling becomes influential in distributed environments in which “one of the challenges is to send the data to the appropriate nodes, so the nodes can get the data and they all do their part,” Shen noted. “Data shuffling is, basically, you shuffle the two. Let’s say you join two tables and shuffle all the data on the join key to all of the nodes.”
This operation allows organizations to perform scalable joins. Without it, users would have to attempt what Shen termed a “broadcast join” that involves replicating a smaller table and sending it to all the nodes. According to Shen, for CelerData’s real-time OLAP competitors, “The most they can do without shuffling is to have a big table join a very tiny table on a cluster that’s not very big. But we can do a big table joining a big table or any other kind.”
Additionally, because StarRocks is based on C++, some of its performance gains — which become palpable when competing with other Java-based query engines like Presto or Trino for directly querying data lakes — are based on its utilization of Single Instruction, Multiple Data (SIMD) instructions. With SIMD, “you process multiple data points with one instruction, so you touch your memory a lot less by executing one query,” Shen said. This increased efficiency is characteristic of OLAP databases predicated on C++; Shen mentioned it’s not possible with JAVA-based options.
The End of Table Denormalization?
A real-time OLAP database that dynamically joins tables whenever organizations specify it has considerable consequences for real-time analytics. On the one hand, it could herald an end to denormalization and the time, effort, and costs denormalization exacts from organizations to pre-join tables according to specific query patterns. On the other, it could signal an era in which there’s much more flexibility for real-time databases to adjust to changes in schema, source data, and business requirements. Either way, this capability could further advance the usefulness of real-time data analytics.