Real-Time Materialized Views — Oxymoron or Killer Feature?
Materialized views evoke cached results — stale data, which is antithetical to “real time.” Let’s take a look at how recent advancements in vectorized computation have flipped this mode of thinking to implement continuously updated materialized views that always show up-to-date results in response to real-time data feeds, even with complex queries.
Materialized views (MVs) have been around for a long time. They are used to store precomputed results of a query in a table, which can then be accessed much faster than recomputing the query from scratch.
This is an excellent way to lower the strain on computational resources of a system, particularly when a large number of users or applications want to concurrently access the results of the same complex query. However, the downside is that we might end up with stale results in the view that don’t reflect changes to underlying input data.
Most databases deploy materialized views on relatively static data because of the computational cost of executing and refreshing complex queries on real-time data feeds. Vectorization removes this computational constraint by dramatically increasing the speed of query execution. This makes it possible to deploy materialized views that are continuously updated in response to streaming data feeds.
Broadly speaking, I will cover two things in this article:
- Vectorization: Vectorization can speed up query execution by 100 to 1000 times (depending on hardware used) when compared to traditional sequential mode. We will explore how this is possible.
- Continuously updated materialized views: Continuously updated MVs can simplify data engineering by allowing developers to easily deploy end-to-end analytical pipelines that take in real-time data feeds and cascade complex computations through a series of materialized views to deliver fresh insights and results to downstream applications. We will explore this with some SQL code.
I have used Kinetica’s developer edition to execute all the SQL queries in this article. You can find a link to the workbook that contains all the queries here. You can follow the instructions there to try this out on your own.
Vectorization Makes a Continuously Updated Materialized View Possible
The CPU has been the core of the computer for decades, and it follows a mode of execution where each instruction is handled one after the other in sequence. Database systems have evolved alongside this paradigm to use sequential processing to perform calculations. The animation below shows a simplified illustration of how a traditional database might add five to an array of numbers.
But what if you could perform multiple of these calculations in parallel, for instance apply a single instruction (“ADD 5” in this instance) to multiple pieces of data? This paradigm is called SIMD, which stands for single instruction, multiple data.
Vectorization is a technique used to implement SIMD instructions. It involves breaking down a program or algorithm into smaller pieces that can be executed in parallel using vector instructions.
GPUs have used vectorized processing of data since their inception to render graphics on a whole screen simultaneously instead of changing one pixel at a time. For a long time, this parallel processing capability was limited to delivering a glitch-free graphics experience. However, in 2007, NVIDIA released the CUDA API, which opened up the use of GPUs for general purpose computing (things other than just graphics). This was followed by Intel’s Advanced Vector Extensions (AVX), which opened up parallel processing on the CPU.
As a result, today we can write vectorized functions that are executed 100 to 1000 times faster on commonly available NVIDIA GPUs and Intel CPUs with AVX. This brute force computational power makes it possible to execute and maintain materialized views on streaming real-time data.
A Continuously Updated Materialized View Looks the Same But …
A continuously updated materialized view that handles real-time data feeds looks exactly the same as any other materialized view. The only difference is that it most likely will not execute on a traditional nonvectorized database. Broadly speaking, there are two parts to a materialized view: the refresh mode and a select statement that performs the query.
The code below shows a materialized view that computes the distance traveled by different ships in the Gulf of Mexico region. The refresh mode is set to
ON CHANGE. It listens for changes to the
ship_tracks table and recomputes whenever that table is changed.
CREATE OR REPLACE MATERIALIZED VIEW track_lengths
REFRESH ON CHANGE AS
ROUND(ST_TRACKLENGTH(Y, X,TIMESTAMP, 1) / 1000) AS track_length_km
GROUP BY TRACKID;
This query does not stand out in the way it is written. But it does in terms of execution.
The input table for this view —
ship_tracks — is a real-time feed of ship locations that receives about 40 new records per second. Because the materialized view is set to refresh on change, the query will have to recompute and update track lengths continuously as new data on ship locations enters the database.
Figure: Ship tracks in the Gulf of Mexico region
This is only possible because the
ST_TRACKLENGTH function used in the materialized view is vectorized and executes really fast. Older, nonvectorized implementations will struggle to maintain this view on real-time data, particularly at scale.
Compose Always-On Analytical Pipelines
One way to solve a complex problem is to break it into simpler sequential steps, where each step is easy to understand and explain in isolation.
Continuously updated materialized views extend this further by cascading updates down this sequence of simple steps or views. This can be used to create an always-on analytical pipeline that cascades changes from one view to another without any user input. Downstream applications can simply point to the last view in this chain of views to get the most up-to-date version of the final result.
For instance, let’s say we want to identify the five longest tracks and display them on a chart. The query we saw in the previous section calculates the total distance traveled by each ship. But since this is an aggregate statistic, it does not include information about the actual ship locations, since the table was summarized to a single row per ship. So we need another query that takes the track IDs of the five longest tracks from the
track_lengths view that we created earlier and filters the corresponding rows from the
ship_tracks table that contains all the relevant data about ship locations.
CREATE OR REPLACE MATERIALIZED VIEW single_track
REFRESH ON CHANGE AS
SELECT TRACKID, x, y, TIMESTAMP
WHERE TRACKID IN (SELECT TRACKID FROM track_lengths ORDER BY track_length_km DESC LIMIT 5)
The key thing to note here is that this view is also set to refresh on change. So it too will get automatically updated whenever there are changes to the inputs — the track_lengths view, which is also being continuously updated in response to new data.
Write Once, Do Forever
The key takeaway here is that, with vectorized processing, we can set up an analytical pipeline that hooks onto a streaming source, cascades updates down a series of materialized views to deliver an always-on and always-current set of results for downstream applications.
Once we write these queries and set everything up, we don’t have to go back to update anything based on changes to the data. This simplifies data engineering, enhances experiences for end users, reduces complexity and lowers maintenance costs.
Try This on Your Own
You can try all of this on your own for free by downloading the workbook for this article here and importing it into Kinetica Cloud or Kinetica’s Developer Edition. I preconfigured all the data streams so you don’t have to do any additional setup to run this workbook.
Both Kinetica Cloud and the Developer Edition are free to use. The cloud version takes only a few minutes to set up and it is a great option for quickly touring the capabilities of Kinetica. The Developer Edition is also easy to set up and takes about 10 to 15 minutes and an installation of Docker. The Developer Edition is a personal copy that’s free forever. It runs on your computer and can handle real-time computations on high-volume data feeds.