Real-time Analytic Databases — Thing or Not a Thing?
While the term “real time” can be used as a marketing spin in some cases, there are genuine technical and functional differences between real-time analytic databases and conventional analytic databases. Real-time analytic databases (aka streaming databases) are a distinct category of analytic databases that are optimized for processing and analyzing high-volume, high-velocity data in near real time.
Conventional analytic databases are optimized for processing large volumes of historical data in batch mode. While these databases can provide valuable insights into past trends and patterns, they weren’t designed for real-time decision-making or infusing analytics into downstream applications at web scale.
Examples of conventional analytic databases include Snowflake, Greenplum, BigQuery, Redshift and Teradata, among others.
Real-time analytic databases, on the other hand, can process and analyze data as it arrives, allowing organizations to make informed decisions and take immediate actions based on the most up-to-date information.
They are designed to provide low-latency querying, fast ingestion and scalable processing of streaming data generated by modern devices, machines and sensors. Examples of databases in this class would include Kinetica, Pinot, Druid, Rockset, Materialize, ClickHouse, SingleStore and Aerospike, among others.
The “steelman” argument against the notion that real-time analytic databases are fundamentally different from conventional analytic databases is that real-time analytic databases are simply an extension of the traditional analytic database paradigm with added real-time capabilities. The difference is more a matter of degree than a fundamental shift in the underlying technology.
Proponents of this argument point out that both real-time analytic databases and conventional analytic databases are designed to store and analyze large volumes of data, and the underlying principles of data storage, indexing and querying are largely the same in both cases. Moreover, many conventional analytic databases now offer some level of real-time processing capability, such as micro-batch loading or query-acceleration techniques, blurring the distinction between the two categories.
To make the case that real-time analytic databases are a distinct category, consider the framework below, based on data latency and query latency. Data latency refers to the time delay between when data is generated and when it is available for processing and analysis. This delay can be caused by a variety of factors, but primarily network speed and ingest overhead.
Query latency refers to the time delay between when a query is submitted to a data processing system and when the results of that query are returned. Query latency is primarily a function of query complexity, the amount of data being queried, the type of storage and the level of sophistication of the query engine.
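These two latencies compose: the freshest answer a user can ever see lags the real-world event by roughly the data latency plus the query latency. A minimal sketch of that arithmetic (the timestamps and function names here are illustrative, not tied to any particular database):

```python
from datetime import datetime, timedelta

def data_latency(event_time: datetime, queryable_time: datetime) -> timedelta:
    """Delay between when data is generated and when it becomes queryable."""
    return queryable_time - event_time

def query_latency(submit_time: datetime, result_time: datetime) -> timedelta:
    """Delay between submitting a query and receiving its results."""
    return result_time - submit_time

# A sensor reading generated at 12:00:00 becomes queryable at 12:00:02
# (network + ingest overhead); a query submitted then returns at 12:00:03.
t0 = datetime(2024, 1, 1, 12, 0, 0)
dl = data_latency(t0, t0 + timedelta(seconds=2))
ql = query_latency(t0 + timedelta(seconds=2), t0 + timedelta(seconds=3))
staleness = dl + ql  # the answer reflects the world as of 3 seconds ago
print(staleness.total_seconds())  # 3.0
```

Real-time analytic databases attack both terms of that sum; the sections below take each in turn.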
Data Latency Differences
Ingesting streams of data involves processing data as it arrives, often as a continuous flow. A best-in-class real-time analytic database will have three essential features to radically reduce data latency.
- Native streaming connectors optimize data ingestion from a streaming data source (think Kafka, NiFi, Kinesis, etc.) into a database. Using a native streaming connector can offer several advantages over generic database connectors, including faster ingest speeds, greater reliability and simplified integration. Leading real-time analytic databases have native streaming connectors on both the source and sink sides.
- Headless ingest can be particularly beneficial in systems that need to support data ingestion from a large number of sources (think tens of thousands of different sensors), as it allows data to be ingested in a decentralized manner, without the need for a central coordination point that acts as a bottleneck.
- A lockless architecture is a design approach in which the database minimizes or eliminates the use of locks to control access to shared resources in a multi-user environment. Locks are often used to prevent two or more transactions from accessing the same data simultaneously, which could result in data inconsistency. However, locking contributes significantly to data latency: tables are locked during the loading process, and newly loaded data is not available to query until the load completes and the table is unlocked. In a lockless architecture, alternative methods such as optimistic concurrency control or multiversion concurrency control ensure consistency and concurrency. These methods allow multiple transactions to access the same data concurrently without locks; instead, the database uses timestamps or version numbers to track changes to the data and resolve conflicts. Lockless architectures can provide better scalability and performance in high-volume, distributed or real-time database systems, but they require sophisticated algorithms and careful design to ensure data consistency and correctness.
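The version-number approach from the last bullet can be sketched in a few lines. This is a toy multiversion store, not any vendor's implementation: writers append new versions guarded by a compare-and-set check on the version number, so readers never block and never see half-loaded data.

```python
import threading

class VersionedStore:
    """Toy multiversion store: writers never block readers.

    Each key maps to a list of (version, value) pairs. Readers always see
    the latest committed version; writers validate an expected version
    instead of holding a table-level lock for the duration of a load.
    (Illustrative sketch only; real engines add snapshots, GC, etc.)
    """

    def __init__(self):
        self._data = {}                  # key -> list of (version, value)
        self._guard = threading.Lock()   # protects only the brief version check

    def read(self, key):
        versions = self._data.get(key)
        if not versions:
            return None, 0
        version, value = versions[-1]
        return value, version

    def write(self, key, value, expected_version):
        """Optimistic write: succeeds only if no one wrote in between."""
        with self._guard:
            _, current = self.read(key)
            if current != expected_version:
                return False  # conflict: caller retries with a fresh version
            self._data.setdefault(key, []).append((current + 1, value))
            return True

store = VersionedStore()
store.write("sensor_42", 19.5, expected_version=0)
value, version = store.read("sensor_42")                  # readers see committed data
ok = store.write("sensor_42", 20.1, expected_version=0)   # stale version -> rejected
```

Note that the internal lock guards only the instantaneous version check, never the load itself; that is the key difference from locking a table for the duration of a batch load.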
Conventional analytic databases do not have native streaming connections to the source and sink, centralize ingestion through a coordination point and make extensive use of table locking, which together drive significant data latency.
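The headless ingest idea from the list above can also be sketched briefly. In this hypothetical example (shard count and function names are assumptions for illustration), every data source computes its destination shard locally by hashing its key, so no central coordinator sits in the ingest path:

```python
import hashlib

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]  # stand-ins for worker nodes

def shard_for(key: str) -> int:
    """Each producer routes its own records by hashing the key locally,
    so tens of thousands of sources can ingest in parallel without a
    central coordination point becoming a bottleneck."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def ingest(key: str, reading: float) -> None:
    shards[shard_for(key)].append((key, reading))

# Independent sensors call ingest() directly; records for the same sensor
# deterministically land on the same shard, with no coordinator in the path.
for sensor_id in ("sensor-001", "sensor-002", "sensor-001"):
    ingest(sensor_id, 21.7)
```

Because routing is a pure function of the key, any number of producers can make the same decision independently and consistently.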
Query Latency Differences
Once the data is available to query, the speed of the query matters. A best-in-class real-time analytic database will have three essential features to radically reduce query latency.
- Every so often there’s a step-function change in query performance because of a breakthrough database design innovation (think distributed, columnar, in-memory). The latest innovation is vectorization (aka data-level parallelism). Vectorized query engines store data in fixed-size blocks, known as vectors, and perform query operations on these vectors in parallel rather than on individual data elements. This approach enables the query engine to process multiple data elements simultaneously, leading to order-of-magnitude faster query execution. It also minimizes the compute and data engineering needed (less indexing and denormalization). TPC-DS benchmarks prove this point.
- Materialized views are an old trick to accelerate a query by caching results in advance, but they come at the expense of querying stale data. However, continuously updated materialized views produce aggregated results that remain fresh as a result of triggered refreshes upon a defined incremental change. This removes the tradeoff and makes the database both fast and fresh.
- In some cases, the end user of a real-time insight isn’t a data scientist or a team of business users, but tens of thousands of people using a data-infused app (think of a ride-sharing app or a financial trading desk). To support that volume of high-speed reads, the insights traditionally must be moved from the analytic database (such as columnar) to an operational database (like key-value), a hop that adds more latency and complexity. Modern real-time analytic databases address this by converging these design patterns.
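The block-at-a-time control flow behind vectorization can be modeled in plain code. This sketch only illustrates the shape of the technique: a real vectorized engine maps each per-block operation onto SIMD instructions, which pure Python cannot do, and the names here are invented for the example.

```python
from array import array

VECTOR_SIZE = 1024  # fixed-size block ("vector") of values

def vectorized_sum_gt(values: array, threshold: float) -> float:
    """Sum the values above a threshold, one fixed-size vector at a time.

    The engine loads a whole block, applies one operation across it, and
    moves on -- rather than fetching and testing one element per step.
    """
    total = 0.0
    for start in range(0, len(values), VECTOR_SIZE):
        block = values[start:start + VECTOR_SIZE]        # load one vector
        total += sum(v for v in block if v > threshold)  # one op over the block
    return total

data = array("d", [0.5, 1.5, 2.5, 3.5] * 1000)
print(vectorized_sum_gt(data, 1.0))  # 7500.0
```

The payoff in real engines comes from the per-block step compiling down to data-parallel hardware instructions, which is why the speedup can reach an order of magnitude.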
Conventional analytic databases are not fully vectorized; they carry too much technical debt to rearchitect and vectorize all of their operations. They still rely on the old model of materializing views and treat high-speed reads at scale as a job for a different database. Taken together, real-time analytic databases produce far and away the freshest possible insights for the next generation of data-infused apps.
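The continuously updated materialized view described above boils down to incremental maintenance: each arriving row triggers a constant-time refresh of the aggregate, so queries hit a precomputed result that is never stale. A toy sketch (class and group names are invented for illustration):

```python
class LiveAverageView:
    """Toy continuously maintained materialized view of a per-group average.

    Instead of recomputing AVG(value) on every query (slow) or serving a
    periodically cached copy (stale), each insert incrementally refreshes
    a running sum and count -- so reads are both fast and fresh.
    """

    def __init__(self):
        self._sums = {}    # group -> running sum
        self._counts = {}  # group -> row count

    def insert(self, group: str, value: float) -> None:
        # The "triggered refresh": O(1) work per arriving row.
        self._sums[group] = self._sums.get(group, 0.0) + value
        self._counts[group] = self._counts.get(group, 0) + 1

    def average(self, group: str) -> float:
        # The query reads the precomputed aggregate: no scan, no staleness.
        return self._sums[group] / self._counts[group]

view = LiveAverageView()
view.insert("region-east", 10.0)
view.insert("region-east", 30.0)
print(view.average("region-east"))  # 20.0
```

The moment a row lands, the next read reflects it, which is exactly the fast-and-fresh property the tradeoff-free materialized view promises.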
This is not to suggest that all real-time analytic databases are the same. Some provide better support for joins and ad hoc queries. Some are open source. Some focus on weblog data, while others have robust support for the time-series and spatial workloads needed for sensor and machine data. Whatever your real-time database needs, there’s probably a good fit for your use case.