Data Warehouses Are Terrible Application Backends
The ever-increasing tide of data has become a paradox of plenty for today’s developers. According to a report from Seagate, by 2025 worldwide data will grow to a staggering 163 zettabytes, over 10 times the volume in 2016. More data should mean deeper insights and better user experiences, but it also leads to problems.
For data-oriented developers, this explosion is a double-edged sword. It presents an incredible opportunity to build user-facing features powered by data and harnessing real-time analytics. On the other hand, processing all that data with minimal latency and at high concurrency can be really challenging with a typical modern data stack.
Data warehouses, in particular, are the resting place of choice for big data at modern companies, and their online analytical processing (OLAP) approach is perfect for complex, long-running analytical queries over big data for things like business intelligence reports and dashboards.
However, they make terrible application backends.
This post explains why a combination of job pool management, concurrency constraints and latency concerns preclude data warehouses from effectively functioning as a storage layer for user-facing applications, and why you should consider alternative technology for your data app stack.
Understanding the Data Warehouse
Ten years ago, data warehouses were the hot, new thing in the data world. Capable of storing a whole mess of structured data and processing complex analytical queries, data warehouses set a new bar for how businesses run their internal business-intelligence processes.
Specifically, data warehouses do three things that have made analytics accessible and powerful:
- They separate storage and compute, reducing costs to scale.
- They leverage distributed compute and cloud networking to maximize query throughput.
- They democratize analytics with the well-known SQL.
If you want a good primer on why data warehouses exist and what they’ve enabled for modern data teams, I encourage you to read this.
Today, data warehouses like Snowflake, BigQuery, Redshift and Azure Synapse still occupy the head of the table in many companies’ data stacks, and because of their favored position within the organization, developers may be tempted to use them as a storage layer for user-facing analytics. They have the power to run complex analytical queries that these use cases demand; the data is already there, and you’re already paying for them. What’s not to love?
As it turns out, quite a bit. Here are the reasons why application developers can’t rely on data warehouses as a storage layer for their user-facing analytics.
The Unpredictable World of Job Pools and Nondeterministic Latency
Data warehouses process analytical queries in a job pool. Snowflake, for example, uses a shared pool approach to process queries concurrently, aiming to optimize available computing resources.
Here’s the problem: Job pools create nondeterministic latency with a set floor. A simple
SELECT 1 on Snowflake could potentially run in a few milliseconds, but more likely it will take a second or more simply because it must be processed in a queue with all the other queries.
Even the best query-optimization strategies can’t overcome this limitation.
Running a query on a data warehouse is like playing a game of “latency roulette.” You can spin the wheel the same way every time, but the final outcome (in this case, the latency of the query response) lands unpredictably.
Now, if you’re a backend developer building APIs over a storage layer, you’d never take a chance on nondeterministic latency like this. Users expect snappy APIs that respond within milliseconds. In fact, the database query should be one of the fastest things in the request path, even compared to network latency. If you’re building on top of a data warehouse, this won’t be the case, and your users will feel the pain.
The Illusion of Scalability
Latency is but one part of the equation for API builders. The second is concurrency. If you’re building an API that you intend to scale, solid fundamentals demand that you provide low-latency responses for a highly concurrent set of users.
When you dive deeper into the functionality of data warehouses, you’ll realize that to genuinely scale horizontally to accommodate increased query concurrency, you need to either spin up new virtual warehouses or increase their cluster limit, or both. For example, if you wanted to support just 100 concurrent queries per minute on Snowflake, you’d need 10 multicluster warehouses.
Concurrency constraints in data warehouses like Snowflake present one of the most significant challenges when it comes to developing real-time applications. With a large volume of queries knocking at your warehouse’s door, and a limited number of resources to serve them, you’re bound to experience some serious latency issues unless you scale up and out. And scaling up and out is often prohibitively expensive.
Building Cache Layers: A Recent Trend and Its Drawbacks
OK, so nobody really builds an application directly on top of a data warehouse, right? Obviously, you’d use a caching layer like Redis or some other real-time database to make sure your API requests are fast and balanced even with many concurrent users.
This is a common approach when the data you need to support your application resides in a data warehouse. In theory, the approach seems workable. In reality, it carries some serious drawbacks, the most significant of which is data freshness.
Simply put, using a cache layer works great for shrinking query latency, but it still won’t work for applications built over streaming data that must always serve the most recent events.
Think about a fraud-detection use case where a financial institution must determine if a transaction is fraudulent within the time it takes to complete the transaction (a few seconds). This usually involves a complex analytical process or online machine learning feature store based on just-created data. If that data hits the warehouse before your backend APIs, no cache layer will save you. The cache is great for enabling low-latency API requests by storing analytics recently run in batch ETL (extract, transform, load) processes, but it can’t access just-created data, because the warehouse is still processing it.
The Alternative: Real-Time Data Platforms
As we’ve discussed, the fundamental problem of building data-intensive applications over data warehouses boils down to a failure to maintain:
- low-latency queries
- from highly-concurrent users
- over fresh data
So, what’s the alternative?
What Is a Real-Time Data Platform?
A real-time data platform helps data and engineering teams create high-concurrency, low-latency data products over streaming data at scale.
A real-time data platform uses a columnar database under the hood so it can handle the complex analytical workloads previously relegated to data warehouses but at a much faster pace. Furthermore, a real-time data platform typically offers a low-latency publication layer, exposing low-latency APIs to data-intensive applications that may rely on both batch and streaming data sources. Building APIs at scale for streaming data platforms is often not considered but can be a massive pain to maintain and scale as your data scales.
Reference Architectures for Real-Time Data Platforms
When building on top of real-time data platforms, consider two incremental architectures for your data stack.
In the first approach, the data warehouse can still be the primary underpinning storage layer, where the real-time data platform effectively serves as a publication layer. In this architecture, data is synced between the data warehouse and the real-time data platform either on a schedule or on ingestion, and the real-time data platform handles additional transformations as well as providing a low-latency, high concurrency API.
In practice, this is similar to using a real-time data platform as a caching layer, with the added benefit of avoiding the need to write custom API code to connect the cache with your application and having the ability to perform additional enrichment or transformations with the power of full online analytical processing (OLAP).
The second approach bypasses the data warehouse entirely or operates in parallel. Assuming event data is placed on some kind of message queue or streaming platform, the real-time data platform subscribes to streaming topics and ingests data as it’s created, performing the necessary transformations and offering an API layer for the application to use.
This can be the preferred approach since it eliminates the data freshness issues that still exist when a caching layer is used over a data warehouse and, with the right real-time data platform, streaming ingestion can be quite trivial.
The Benefits of a Real-Time Data Platform
- Native data-source connectors: Real-time data platforms can integrate with various data sources and other tech stack components. This makes it quite easy to unify and join multiple data sources for real-world use cases. For example, you can combine data from Snowflake or BigQuery with streaming data from Confluent or Apache Kafka. Tinybird, for example, even offers a simple HTTP-streaming endpoint that makes it trivial to stream events directly within upstream application code.
- Real-time OLAP power: Like data warehouses, a real-time data platform gives developers the ability to run complex OLAP workloads.
- Cost-effective: Establishing a publication layer on Snowflake using traditional methods would necessitate additional virtual warehouses, thereby leading to increased costs. In contrast, the pricing model for real-time data platforms are often predicated on the volume of data processed via the publication layer, resulting in a significant reduction in cost when used as an application backend.
- Scalability: Many real-time data platforms are serverless, so infrastructure scales with you, handling big data with high levels of performance and availability. Rather than host your database on bare metal servers or tweak cluster settings with managed databases, you can focus on building and shipping use cases while the real-time data platform handles scale under the hood.
- Zero glue code: Even with a cache layer over data warehouses, you’d still have to write glue code: ETLs to get data from the warehouse to your cache, and ORM code to publish APIs from your cache. A real-time data platform, in contrast, handles the entire data flow, from ingestion to publication, with zero glue code. Data gets synced using native connectors, transformations get defined with SQL, and queries are instantly published as scalable APIs with built-in documentation, authentication token management and dynamic query parameters.
Where data warehouses fail as application backends, real-time data platforms like Tinybird shine. Like data warehouses, these platforms support heavy data loads and complex analytics, but they do so in a way that preserves data freshness, minimizes query latency and scales to support high concurrency.
Data warehouses aren’t bad technology, but they are bad application backends. Despite their power and usefulness for business intelligence, they simply can’t cost-effectively handle the freshness, latency and concurrency requirements that data-oriented applications must support.
Real-time data platforms, on the other hand, function exceptionally well as backends for data-intensive applications across a wide variety of use cases: real-time personalization, in-product analytics, operational intelligence, anomaly detection, usage-based pricing, sports betting and gaming, inventory management and more.
Ready to experience the industry-leading real-time data platform? Try Tinybird today for free.