How to Go Pipeline-Free with Your Real-Time Analytics

Shifting from tedious denormalization jobs and allowing data to be queried directly can significantly improve efficiency and flexibility.
Aug 16th, 2023 8:09am by
Featured image for: How to Go Pipeline-Free with Your Real-Time Analytics
Image from SkillUp on Shutterstock.

Query latency poses a significant challenge for real-time analytics. Traditionally, data practitioners have turned to denormalization to mitigate this issue. However, this workaround has significant drawbacks, such as inflexibility, complexity and high costs. Let’s explore an alternative approach, going “pipeline-free,” along with its practical implementation and its benefits for real-time analytics.

What Is ‘Pipeline-Free’ in Real-Time Analytics?

When we refer to a “pipeline,” we are specifically talking about the preprocessing stage in real-time analytics traditionally required for downstream online analytical processing (OLAP) workloads. This often involves expensive operations including real-time denormalization and pre-aggregation for faster query processing. The phrase “pipeline-free” here suggests a shift from this norm, by eliminating these preprocessing stages and allowing data to be queried directly, without requiring time-consuming and resource-heavy transformations.

The Status Quo of Real-Time Preprocessing Pipelines

Query latency is critical to real-time analytics, but resource-intensive JOIN operations make low latency hard to achieve: Not all real-time OLAP databases are optimized to perform them on the fly. To overcome this, data practitioners use denormalization: pre-joining tables into one large table during data preprocessing, which “locks” the data out of its natural snowflake or star schema and into a giant flat table.
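
To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customers, orders, orders_flat) are hypothetical, and SQLite stands in for a real-time OLAP database purely for illustration. It shows that a pre-joined flat table answers the same question as a query-time JOIN, but goes stale as soon as a dimension changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: one fact table plus one dimension table (hypothetical names).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);

    -- Denormalization: pre-join everything into one flat table during preprocessing.
    CREATE TABLE orders_flat AS
        SELECT o.order_id, o.amount, c.region
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")

flat_query = "SELECT region, SUM(amount) FROM orders_flat GROUP BY region ORDER BY region"
join_query = """
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
"""

# Both approaches agree while the flat table is fresh.
flat_result = conn.execute(flat_query).fetchall()
join_result = conn.execute(join_query).fetchall()

# A dimension change reaches the query-time JOIN immediately, while the
# flat table stays stale until the preprocessing job rebuilds it.
conn.execute("UPDATE customers SET region = 'AMER' WHERE customer_id = 1")
stale_flat = conn.execute(flat_query).fetchall()
fresh_join = conn.execute(join_query).fetchall()
```

In this toy example the flat table would have to be rebuilt, or backfilled, before it reflects the new region, which is the kind of rigidity the rest of this section describes.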

Figure 1: Comparing flat-table to star and snowflake schema

Performing denormalization in your real-time analytics pipeline poses the following risks:

  • Reduced flexibility: Establishing data pipelines based on prior query patterns limits the flexibility to respond to changing business requirements.
  • System complexity: Every component added to the system increases its complexity.
  • Higher costs: Stateful stream processing tools, which are crucial for denormalization, demand more engineering effort and are expensive to run and maintain.
Figure 2: The complex and all too common state of real-time pipelines

What began as a measure of convenience has become a roadblock, making real-time analytics complex and expensive.

Going Pipeline-Free

These complexities are leading many teams to explore alternative strategies for their real-time analytics.

Replace Denormalization with JOINs at Query Execution

Denormalization is expensive and makes your analytics rigid. On-the-fly JOINs, however, don’t impose this rigidity. So why aren’t more people using them? JOINs have had a bad reputation for on-the-fly performance for so long that many people haven’t kept up with advancements in the space and don’t realize that this reputation is now largely undeserved. With the right technology, a modern OLAP database can perform JOINs efficiently on the fly. Here’s what to look for:

Parallel Computation

Massively parallel processing (MPP) architecture can greatly improve JOIN performance on large tables by dividing a query into multiple tasks and running them in parallel across separate nodes. MPP architecture offers near-linear horizontal scalability, with performance increasing roughly in proportion to the computational resources added. The key to this scalability is the ability to shuffle data between nodes, which is essential for high-cardinality aggregation and large-table JOINs.
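
The shuffle step can be sketched in a few lines of Python. This is a single-process illustration under simplifying assumptions, not how any particular engine implements it: the function names (shuffle, local_hash_join, mpp_join) are hypothetical, the "nodes" are just lists, and the per-node joins run in a loop rather than in parallel.

```python
from collections import defaultdict

def shuffle(rows, key_index, num_nodes):
    """Route each row to a node by hashing its join key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key_index]) % num_nodes].append(row)
    return partitions

def local_hash_join(left_rows, right_rows):
    """Join two co-located partitions on their first column."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[0]].append(row)
    return [l + r[1:] for l in left_rows for r in index.get(l[0], [])]

def mpp_join(left, right, num_nodes=4):
    """Shuffle both inputs by join key, then join each partition independently."""
    left_parts = shuffle(left, 0, num_nodes)
    right_parts = shuffle(right, 0, num_nodes)
    result = []
    for node in range(num_nodes):  # a real MPP engine runs these on separate nodes
        result.extend(local_hash_join(left_parts[node], right_parts[node]))
    return result

# Hypothetical sample data: (customer_id, order) and (customer_id, region).
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
customers = [(1, "EMEA"), (2, "APAC")]
joined = sorted(mpp_join(orders, customers))
```

Because rows with the same key always land on the same node, each node can join its partition without consulting the others, which is what allows the work to scale out as nodes are added.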

Figure 3: Example MPP architecture

Cost-Based Optimizer (CBO)

A cost-based optimizer (CBO) generates optimized execution plans for complex OLAP queries. It is especially useful for multi-table JOIN queries, where numerous potential execution paths exist. Drawing on statistics about the actual data, it estimates the cost of candidate plans and selects the cheapest one, much like finding the shortest route between two points.
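
As a rough illustration of the idea, the sketch below enumerates join orders for three tables and picks the one with the lowest estimated cost. Everything here is an assumption made for the example: the table names, row counts and selectivities are invented, and the cost model (the sum of estimated intermediate result sizes) is far simpler than what a production optimizer uses.

```python
from itertools import permutations

# Hypothetical statistics an optimizer might collect about the data.
row_counts = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
selectivity = {
    frozenset({"orders", "customers"}): 1 / 10_000,
    frozenset({"customers", "regions"}): 1 / 10,
}

def plan_cost(order):
    """Estimate a left-deep plan's cost as the sum of intermediate result sizes."""
    joined = [order[0]]
    rows = row_counts[order[0]]
    cost = 0
    for table in order[1:]:
        rows *= row_counts[table]
        for prev in joined:
            # Tables with no join predicate between them form a cross join (1.0).
            rows *= selectivity.get(frozenset({prev, table}), 1.0)
        joined.append(table)
        cost += rows
    return cost

# Compare every join order and keep the cheapest.
best_plan = min(permutations(row_counts), key=plan_cost)
worst_plan = max(permutations(row_counts), key=plan_cost)
```

With these made-up statistics, the cheapest plans join the two small tables first and bring in the large orders table last, while the most expensive ones start with a cross join between orders and regions.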

Global Runtime Filter

A global runtime filter plays an essential role in making JOIN operations more efficient. By dynamically pruning irrelevant data during query execution, it reduces the volume of data being transferred and processed, significantly boosting the performance of complex multi-table queries.
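
The pruning idea can be sketched as follows. This is a simplified, single-process illustration with hypothetical names and data. A plain Python set stands in for the filter here, whereas real engines typically use more compact structures, such as Bloom filters, and ship them to the nodes that scan the large table.

```python
def join_with_runtime_filter(fact_rows, dim_rows, dim_predicate):
    """Join fact rows to the dimension rows that pass dim_predicate."""
    # Build side: filter the small dimension table and collect surviving keys.
    build = {key: attrs for key, attrs in dim_rows if dim_predicate(attrs)}
    runtime_filter = set(build)  # stand-in for a Bloom or min/max filter

    # Probe side: apply the filter during the scan, so rows that cannot
    # match are dropped before they are transferred or joined.
    scanned = [row for row in fact_rows if row[0] in runtime_filter]
    joined = [(key, value, build[key]) for key, value in scanned]
    return joined, len(scanned)

# Hypothetical data: (customer_id, amount) facts and (customer_id, region) dimension.
facts = [(1, 20.0), (2, 7.5), (3, 1.0), (1, 5.0)]
dims = [(1, "EMEA"), (2, "APAC"), (3, "EMEA")]

result, rows_scanned = join_with_runtime_filter(facts, dims, lambda region: region == "APAC")
```

In this example only one of the four fact rows survives the scan, so the join itself has almost nothing left to do.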

System-Level Optimizations

System-level optimizations, such as single instruction, multiple data (SIMD), are just as vital for on-the-fly JOIN operations. SIMD optimizations enable the simultaneous processing of multiple data points using the same instruction. Paired with columnar storage, they increase the performance of complex OLAP queries, especially those with JOIN operations.

Partial Update: A ‘Backup’ Solution for More Demanding Scenarios

Even with the best solution, there are more demanding scenarios, such as highly concurrent user-facing analytics, where denormalization may still be necessary. That’s where the concept of partial updates comes into play.

Figure 4: An example of a partial update

Partial updates let you update individual columns of a row instead of joining multiple tables in an upstream processing job. By avoiding the need for upstream stream preprocessing, partial updates offer an efficient, agile approach to managing complex data workloads.
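
A minimal sketch of the idea, using an in-memory dictionary as a stand-in for a primary-key table. The partial_update helper, the wide_table name and the column names are all hypothetical, and real databases expose partial updates through their own load or SQL interfaces rather than a function like this.

```python
# Stand-in for a primary-key table: primary key -> row (a dict of columns).
wide_table = {}

def partial_update(table, primary_key, columns):
    """Upsert only the given columns of one row, leaving the others untouched."""
    row = table.setdefault(primary_key, {})
    row.update(columns)
    return row

# Two independent streams write their own columns of the same row,
# with no upstream job joining them first.
partial_update(wide_table, 101, {"amount": 20.0, "status": "created"})  # orders stream
partial_update(wide_table, 101, {"payment_method": "card"})             # payments stream
partial_update(wide_table, 101, {"status": "shipped"})                  # later order event
```

Each event touches only the columns it carries, so the wide row is assembled inside the table over time rather than in a stateful stream processing job.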

How Airbnb Does Pipeline-Free Real-Time Analytics

Airbnb’s Minerva platform, which handles over 30,000 metrics across 7,000 dimensions and 6+ petabytes of data, serves diverse teams for varied applications, including A/B testing and data exploration. Initially, Minerva used Apache Druid and Presto as its query layer. The need to denormalize data for multi-table query performance made the pipeline expensive and inflexible, with new metrics requiring days to add.

Figure 5: Airbnb’s initial architecture

To enhance flexibility and efficiency, Airbnb migrated Minerva from Druid and Presto to StarRocks, an MPP OLAP database. The new solution handles JOIN operations efficiently thanks to built-in features including a cost-based optimizer, global runtime filters and SIMD optimizations.

Figure 6: Airbnb’s new architecture

With the new approach, data could be kept in a snowflake schema and JOINs performed on the fly at query execution. This shift freed Minerva engineers from resource-intensive denormalization. Updates to metrics no longer require data backfill or table reconstruction, resulting in significant cost savings.

Experience Real-Time Analytics on Easy Mode

The traditional approach of denormalization, although once a standard in real-time analytics, comes with its own set of complications. However, we are at the forefront of a shift in these practices. Techniques such as MPP, CBO, global runtime filters and SIMD optimizations make on-the-fly JOIN operations practical. Industry leaders like Airbnb have set a precedent by adopting these strategies for their real-time analytics. By shifting from tedious denormalization jobs to a pipeline-free approach, they’ve significantly improved efficiency and flexibility. As more businesses follow this path, we expect sweeping transformations in the real-time analytics field.
