How to Get Data Warehouse Performance on the Data Lakehouse
Data lakehouse architectures continue to grow in popularity, and that should come as no surprise. Their potential for seamlessly integrating the best features of data lakes and data warehouses promises a transformative experience for data processing and analysis. Yet, there are shortcomings to this approach. This article examines these challenges, like query performance and high costs, and identifies new technologies that are helping data lakehouses tackle them.
The Status Quo of Analytics on the Data Lakehouse
Data lakehouses have enticed numerous enterprises with the promise of flexibility, scalability and cost effectiveness. The reality, however, is that current lakehouse query engines struggle to deliver the query performance that low-latency or high-concurrency analytics require at scale. Presently, the query engines that power these data lakehouses fall into two camps. On the one hand, there are engines optimized for extract, transform, and load (ETL) workflows, which execute queries stage by stage. On the other hand, there are engines that do not take advantage of modern optimizations such as single instruction, multiple data (SIMD) instruction sets, which are essential for harnessing the full power of modern CPUs.
This inherent performance limitation has pushed most users to copy their data from the lakehouse into proprietary data warehouses to achieve their desired query performance. But this is a costly workaround.
Cost #1: Data Ingestion Is Expensive
At the outset, ingesting data into a data warehouse seems like a straightforward procedure, but it’s far from it. This process necessitates converting data into the warehouse’s specific format, a task that demands considerable hardware resources. Moreover, this replication results in the redundancy of data storage — an expensive proposition in terms of cost and space.
It’s not just the physical resources either; the human effort demanded is equally significant. Tasks that seem mundane, such as aligning data types between the two systems, can drain resources. Furthermore, this data ingestion process inadvertently introduces latency, undermining the freshness of your data.
Cost #2: The Data Ingestion Pipeline Is Bad for Data Governance
The integrity and accuracy of data are paramount for any enterprise. Ironically, the very act of ingesting data into another warehouse, which should technically amplify its utility, poses serious challenges to data governance. How can you ensure all copies are consistently updated? How can you prevent discrepancies among different copies? And how can you do this while maintaining strong data governance? These are not just theoretical questions; they are serious technical challenges that require significant engineering effort and, when done incorrectly, have the potential to impact the veracity of your data-driven decisions.
A Modern Approach: The Pipeline-Free Data Lakehouse
The inherent challenges of data lakehouse query performance and the use of proprietary data warehouses as workarounds are pushing an increasing number of enterprises to seek out more efficient alternatives. One popular approach has been to adopt an ingestion-free lakehouse architecture. Here’s how this works.
An MPP Architecture with In-Memory Data Shuffling
Data lake query engines employ data shuffling for scalable performance, particularly with complex join operations and aggregations. However, many data lakehouse engines, originally designed for data lakes’ diverse and affordable storage, focus on data transformation and ad hoc queries, persisting intermediate results to disk. Although suitable for batch jobs, this method hampers the lakehouse’s evolving workloads, especially real-time, customer-facing queries. Additionally, disk-based shuffling introduces latency, impeding query performance and hindering immediate insights.
To navigate this challenge and run low-latency queries directly on the data lakehouse, embracing massively parallel processing (MPP) query engines equipped with in-memory data shuffling is a smart move. Unlike disk-based approaches, in-memory shuffling keeps intermediate results in memory and passes them straight to the next stage of the query instead of persisting them to disk first. This removes the disk I/O wait between stages, which is pivotal for achieving low query latency and getting timely insights directly from the data lakehouse.
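To make the idea concrete, here is a minimal sketch of a hash-partitioned shuffle that stays in memory, written in Python. It is not how any particular engine implements shuffling; the function names (`partition_for`, `shuffle_in_memory`, `distributed_sum`) and the sample data are hypothetical, and plain lists stand in for the network buffers a real MPP engine would use between workers.

```python
from collections import defaultdict
from zlib import crc32


def partition_for(key: str, num_workers: int) -> int:
    # A stable hash, so the same key always lands on the same worker.
    return crc32(key.encode("utf-8")) % num_workers


def shuffle_in_memory(fragments, num_workers):
    """Route (key, value) rows from upstream fragments to per-worker inboxes.

    The inboxes are ordinary in-memory lists: nothing is written to disk
    between the scan stage and the aggregation stage.
    """
    inboxes = [[] for _ in range(num_workers)]
    for fragment in fragments:
        for key, value in fragment:
            inboxes[partition_for(key, num_workers)].append((key, value))
    return inboxes


def aggregate(inbox):
    # Each worker sums only the keys that were routed to it.
    totals = defaultdict(int)
    for key, value in inbox:
        totals[key] += value
    return dict(totals)


def distributed_sum(fragments, num_workers):
    result = {}
    for inbox in shuffle_in_memory(fragments, num_workers):
        result.update(aggregate(inbox))
    return result


# Three upstream fragments, as if three workers had each scanned part of a table.
fragments = [
    [("us", 3), ("eu", 5)],
    [("us", 4), ("apac", 1)],
    [("eu", 2), ("apac", 6)],
]
totals = distributed_sum(fragments, num_workers=2)
```

Because every row for a given key is routed to exactly one worker, each worker can finish its part of the aggregation without coordinating with the others. A disk-based engine would write the inboxes to files at the stage boundary and read them back, which is the step that adds latency.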
A Well-Architected Caching Framework
One of the primary hurdles in optimizing data lakehouse queries lies in the expensive overhead of retrieving data from remote storage locations. The sheer volume and distributed nature of data in lakehouses make each scan a resource-intensive task. A well-designed built-in data caching system is necessary. The caching system should employ a hierarchical caching mechanism, leveraging not just disk-based caching but also in-memory caching, reducing data access from remote storage and thus reducing latency.
Furthermore, the efficacy of this caching framework hinges on its integration with the query engine. Instead of it being a standalone module that requires separate deployment — which can introduce complexity and potential performance bottlenecks — it should be embedded natively within the system. This cohesive architecture simplifies operations and ensures that the cache operates at peak efficiency, thereby delivering the best possible performance for data retrieval and query execution.
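As a rough illustration of hierarchical caching, the Python sketch below places a small in-memory tier in front of a larger local tier, and only falls back to remote storage when both miss. The class name `TieredCache`, its parameters and the `fetch_remote` callable are assumptions made for this example; a dictionary stands in for the local disk tier, and least-recently-used eviction is just one reasonable policy.

```python
from collections import OrderedDict


class TieredCache:
    """Memory tier first, then a stand-in disk tier, then remote storage."""

    def __init__(self, fetch_remote, memory_capacity, disk_capacity):
        self.fetch_remote = fetch_remote      # hypothetical remote read function
        self.memory = OrderedDict()           # hottest blocks
        self.disk = OrderedDict()             # stand-in for a local disk cache
        self.memory_capacity = memory_capacity
        self.disk_capacity = disk_capacity
        self.remote_reads = 0

    @staticmethod
    def _put(tier, capacity, key, block):
        # Insert as most recently used; return the evicted entry, if any.
        tier[key] = block
        tier.move_to_end(key)
        if len(tier) > capacity:
            return tier.popitem(last=False)
        return None

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.disk:
            block = self.disk.pop(key)        # promote from disk to memory
        else:
            block = self.fetch_remote(key)    # slowest path
            self.remote_reads += 1
        evicted = self._put(self.memory, self.memory_capacity, key, block)
        if evicted is not None:
            # Blocks pushed out of memory are demoted to the disk tier.
            self._put(self.disk, self.disk_capacity, *evicted)
        return block


cache = TieredCache(lambda key: f"block:{key}", memory_capacity=1, disk_capacity=2)
cache.get("a")   # remote read
cache.get("b")   # remote read; "a" is demoted to the disk tier
cache.get("a")   # served from the disk tier, no remote read
cache.get("a")   # served from memory
```

In this run only the first two lookups reach remote storage. In an engine with a natively embedded cache, the lookup above would sit inside the scan operator rather than in a separately deployed service.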
Further System-Level Optimizations
System-level optimizations like SIMD play an indispensable role in further improving lakehouse performance. SIMD lets the CPU apply a single instruction to several data points at once. When combined with columnar storage, typically found in open data lake file formats like Parquet or Optimized Row Columnar (ORC), it allows data to be processed in bigger batches and significantly improves the performance of online analytical processing (OLAP) queries, particularly those involving join operations.
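Python does not expose SIMD instructions directly, so the sketch below only illustrates the data layout and batch-at-a-time processing that make SIMD possible in a native engine. Each column is stored as a contiguous typed array and processed in fixed-size batches; in a vectorized engine written in C++ or similar, the inner loop over a batch is the part that would map onto SIMD instructions. The column names, the batch size and the `filtered_sum` function are hypothetical.

```python
from array import array

BATCH_SIZE = 4  # deliberately tiny; real engines use batches of thousands of values


def column_batches(column, batch_size=BATCH_SIZE):
    # Yield contiguous slices of a single column.
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def filtered_sum(prices, quantities, min_quantity):
    """Compute SUM(price * quantity) WHERE quantity >= min_quantity, batch by batch."""
    total = 0.0
    for price_batch, qty_batch in zip(column_batches(prices), column_batches(quantities)):
        # This per-batch loop touches only the two columns the query needs.
        total += sum(p * q for p, q in zip(price_batch, qty_batch) if q >= min_quantity)
    return total


# Columnar layout: one contiguous typed array per column, not one record per row.
prices = array("d", [10.0, 20.0, 5.0, 8.0, 2.5, 4.0])
quantities = array("q", [1, 3, 2, 5, 4, 1])
revenue = filtered_sum(prices, quantities, min_quantity=2)
```

The point of the layout is that values of the same type sit next to each other in memory, so a native engine can load several of them into one register and operate on them together.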
Consider Open Source Solutions
Lastly, prioritize open source solutions. Embracing open source is critical if you want to maximize the benefits of your data lakehouse architecture. The data lakehouse’s inherent open nature extends beyond just the formats it supports; one of its paramount advantages is the flexibility it offers. This modularity means that components, including query engines, can be interchanged with minimal effort, allowing you to remain agile and adapt to the evolving landscape of data analytics with ease.
Pipeline-Free Data Lakehouses in Action: Trip.com’s Artnova Platform
All of this may sound good in theory, but what about in practice? Trip.com’s unified internal reporting platform, Artnova, offers a great example.
Initially, Artnova used Apache Hive as the data lake and Trino as the query engine. However, given the vast volume of data, the need for low latency and the high number of concurrent requests, Trino could not meet the requirements of some use cases. Trip.com had to replicate and transfer the data into StarRocks, its high-performance data warehouse. While this strategy solved some performance issues, it also introduced more problems:
- Data freshness lagged despite the relatively fast ingestion, affecting the flexibility and timeliness of queries.
- The data pipeline grew more complex because of additional ingestion tasks and table schema and index design requirements.
Duplicating data to another data warehouse is complex and expensive. Trip.com chose to initially move only the most business-critical workloads to StarRocks, but ultimately decided an architectural overhaul was necessary and expanded its use of StarRocks, running it as the query engine directly on the data lake.
According to performance tests conducted by Trip.com, using StarRocks as the query engine is 7.4 times faster than Trino when querying the same data. With business-critical use cases further accelerated by StarRocks’ built-in materialized view, the performance gain is significant.
Go Pipeline-Free with Your Data Lakehouse
The evolution of the data lakehouse has reshaped data analytics, blending the advantages of data lakes and data warehouses. Despite its transformative potential, challenges like efficient query performance persist. Innovative solutions like MPP query execution, caching frameworks and system-level optimizations may bridge these gaps and enable enterprises to take advantage of all the benefits of the lakehouse with none of the drawbacks.