ClickHouse: Optimizing Real-Time Data Analysis with Online Analytical Processing
There are some shared characteristics of databases specializing in real-time data analysis, particularly those designed for Online Analytical Processing (OLAP). Most are written in C++. Many utilize a columnar orientation for organizing the data. A number of them happen to be open source.
Nevertheless, when it comes to swiftly analyzing data for real-time metrics for observability use cases, fraud detection, and other applications with low latency, there’s simply no getting around the basics.
“It’s how you organize the data on disk,” explained Tanya Bragin, vice president of product at ClickHouse. “It’s how you access it: how exactly you prioritize which data gets into level one cache versus level two cache and gets processed faster, closer to the CPU.”
Within those parameters, there are several optimizations for retrieving, storing, and querying data that are responsible for the rapidity behind contemporary solutions for real-time data analysis. ClickHouse — an open source OLAP database management system — entails numerous optimizations pertaining to data compression, queries, and cache utilization, which reduce the amount of time required to access and analyze data.
When paired with some of its capabilities for data federation, materialized views, and fast lookups, these optimizations are critical for swiftly analyzing data for any array of contemporary use cases.
Vectorized Query Optimization
The capability to optimize queries on the underlying data is essential to the utility ClickHouse furnishes enterprise users. One such technique the database relies on for this purpose is known as vectorized query optimization, which enables “vectorized access to the data so you’re sending not just individual records, but vectors of data in a streaming fashion through the CPU,” Bragin revealed.
According to Bragin, this form of vectorization isn’t directly related to that for creating vectors or embeddings for language models or search purposes. However, it’s one of the many techniques ClickHouse invokes to provide the analytics speed necessary for real-time deployments. It also has certain advantages pertaining to cost and resource utilization.
“It’s about cache utilization: keeping the data very close to the CPU so you don’t have to pay the penalty of going to more expensive memory and RAM, and paying that access when you process the data,” Bragin divulged. This type of query optimization is partly possible because ClickHouse stores and processes data with a columnar orientation. For many applications of real-time analytics “none of these queries get processed in one go,” Bragin added. “You’ve got to pipeline their processing.”
The way data is stored is a key determinant for facilitating analytics fast enough for real-time applications. In addition to utilizing a columnar format, ClickHouse has a number of compression formats for different data types that reinforce the benefits of its columnar orientation. This fact contributes significantly to the vendor consistently having what Bragin termed a “compression ratio that often exceeds that of optimized systems of other analytical databases.” The capacity to effectively compress — and decompress, when necessary — data factors into the overall expedience ClickHouse delivers to users, particularly in relation to query speeds.
Compressing data helps reduce storage costs, while decompression occurs prior to querying data. ClickHouse’s compression formats enable “a combination of having very optimized compression but not a measurable impact to query speeds,” Bragin commented. According to Bragin, the database has compression formats for “dozens, if not hundreds,” of data types. “I’m talking about bit by bit compression,” Bragin specified. “Like, how you compress a floating point effectively or integers effectively. We’ve got specific compression formats that are really in different types of data to optimize that to the nth degree.”
The ability to support materialized views also reinforces the overall efficiency and speed ClickHouse sustains. Materialized views are often a preferable alternative to table scans which, when applied to the scale, frequency, and speed of real-time data analytics for enterprise use cases like adtech deployments, quickly become costly.
In addition to providing materialized views of tables, ClickHouse is able to update the information in them with low latency, which makes real-time analytics on them feasible. “It’s not a static view,” Bragin disclosed. “It’s a view that works as a trigger on an insert so that as your real-time data gets inserted, that view gets automatically updated with the data.”
It’s even possible for users to “chain” together materialized views, which creates materialized views of materialized views. Issuing queries on materialized views minimizes resource utilization while spurring query speed. When running queries without materialized views, queries oftentimes require reading copious amounts of data on disk.
“For these queries, you want to minimize that,” Bragin said. “If you know you’re going to have a lot of queries resulting in that kind of behavior, even if done fast, it will consume a lot of memory and CPUs. So, you can optimize your system resources by materializing a view that helps the queries run faster, and not scan as much data live.”
Lookups and Federation
ClickHouse supports joins, aggregation queries on top of them, and fast lookups that Bragin referred to as “dictionaries.” With this paradigm, there’s a smaller lookup table and a join with a much larger table. According to Bragin, “that kind of join is very much optimized through this concept of dictionaries. You end up just looking something up, like a dictionary.” The database is also endowed with what Bragin alternately referred to as a federation engine or an integration engine. This construct can query external sources in which the data are stored in an S3 bucket or some other system.
Such functionality is useful for querying data in data lakes. Additionally, users may opt to involve the engine in bringing the data from external sources back into ClickHouse. “You may need to reach out to other systems to grab the data,” Bragin said. “Maybe even in an ongoing workflow where you’re taking the data from other systems, maybe transforming it into an analytical format in ClickHouse, and then running queries on it.”
The federation capabilities ClickHouse supports are indicative of a growing movement to retrieve, access, and query data in external systems using OLAP technologies. With the consistent usage of data lakes for data science and other applications, this development certainly shows some promise.
In any case, the ability to accelerate the rate at which data is queried and accessed via techniques like vectorized query optimization, specialized compression formats for discrete data types, and materialized views can considerably help organizations with their data-driven investments.