Change Data Capture for Real-Time Access to Backend Databases
In a recent post on The New Stack, I discussed the emergence and significance of real-time databases. These databases are designed to support real-time analytics as a part of event-driven architectures. They prioritize high write throughput, low query latency, even with complex analytical queries including filter aggregates and joins, and high levels of concurrent requests.
This highly-specialized class of database, which includes open source variants such as ClickHouse, Apache Pinot and Apache Druid, is often the first choice when you’re building a real-time data pipeline from scratch. But more often than not, real-time analytics is pursued as an add-on to an existing application or service, where a more traditional, relational database like PostgreSQL, SQL Server or MySQL has already been collecting data for years.
In the post I linked above, I also briefly touched on how these online transactional processing (OLTP) databases aren’t optimized for analytics at scale. When it comes to analytics, they simply cannot deliver the same query performance at the necessary levels of concurrency. If you want to understand why in more detail, read this.
But the Internet Is Built on These Databases!
Row-based databases may not work for real-time analytics, but we can’t get around the fact that they are tightly integrated with backend data systems around the world and across the internet. They’re everywhere, and they host critical data sets that are integral to and provide context for many of the real-time systems and use cases we want to build. They store facts and dimensions about customers, products, locations and more that we want to use to enrich streaming data and build more powerful user experiences.
So, what are we to do? How do you bring this row-oriented, relational data into the high-speed world of real-time analytics? And how do you do it without overwhelming your relational database server?
Here’s How Not to Do It
Right now, the prevailing pattern to get data out of a relational database and into an analytical system is using a batch extract, transform, load (ETL) process scheduled with an orchestrator to pull data from the database, transform it as needed and dump it into a data warehouse so the analysts can query it for the dashboards and reports. Or, if you’re feeling fancy, you go for an extract, load, transform (ELT) approach and let the analytics engineers build 500 dbt models on the Postgres table you’ve replicated in Snowflake.
This may as well be an anti-pattern in real-time analytics. It doesn’t work. Data warehouses make terrible application backends, especially when you’re dealing with real-time data.
ETL/ELT is simply not designed for serving high volumes of concurrent data requests in real-time. By nature, it introduces untenable latency between data updates and their availability to downstream consumers. With these batch approaches, latencies of more than an hour are common, with five-minute latencies about as fast as can be expected.
And finally, ETLs put your application or service at risk. If you’re querying a source system (often inefficiently) on a schedule, that puts a strain on your database server, which puts a strain on your application and degrades your user experience. Sure, you can create a read replica, but now you’re doubling your storage costs, and you’re still stuck with the same latency and concurrency constraints.
Change Data Capture (CDC) to the Real-Time Rescue
Hope is not lost, however, thanks to real-time change data capture (CDC). CDC is a method of tracking changes made to a database such as inserts, updates and deletes, and sending those changes to a downstream system in real time.
Change data capture works by monitoring a transaction log of the database. CDC tools read the transaction log and extract the changes that have been made. These changes are then sent to the downstream system.
The transaction log, such as PostgreSQL’s Write Ahead Log (WAL) or MySQL’s “bin log,” chronologically records database changes and related data. This log-based CDC minimizes the additional load on the source system, making it superior to other methods executing queries directly on source tables.
CDC tools monitor these logs for new entries and append them to a topic on an event-streaming platform like Apache Kafka or some other message queue, where they can be consumed and processed by downstream systems such as data warehouses, data lakes or real-time data platforms.
Real-Time Analytics with Change Data Capture Data
If your service or product uses a microservices architecture, it’s highly likely that you have several (perhaps dozens!) of relational databases that are continually being updated with new information about your customers, your products and even how your internal systems are running. Wouldn’t it be nice to be able to run analytics on that data in real time so you can implement features like real-time recommendation engines or real-time visualizations in your products or internal tools like anomaly detection, systems automation or operational intelligence dashboards?
For example, let’s say you run an e-commerce business. Your website runs over a relational database that keeps track of customers, products and transactions. Every customer action, such as viewing products, adding to a cart and making a purchase, triggers a change in a database.
Using change data capture, you can keep these data sources in sync with real-time analytics systems to provide the up-to-the-second details needed for managing inventory, logistics and positive customer experiences.
Now, when you want to place a personalized offer in front of a shopper during checkout to improve conversion rates and increase average order value, you can rely on your real-time data pipelines, fed by the most up-to-date change data to do so.
How Do You Build a Real-Time CDC Pipeline?
OK, that all sounds great. But how do you build a CDC event pipeline? How do you stream changes from your relational database into a system that can run real-time analytics and then expose them back as APIs that you can incorporate into the products you’re building?
Let’s start with the components you’ll need:
- Source data system: This is the database that contains the data being tracked by CDC. It could be Postgres, MongoDB, MySQL or any other such database. Note that the database server’s configuration may need to be updated to support CDC.
- CDC connector: This is an agent that monitors the data source and captures changes to the data. It connects to a database server, monitors transaction logs and publishes events to a message queue. These components are built to navigate database schema and support tracking specific tables. The most common tool here is Debezium, an open source change data capture framework on which many data stack companies have built change data tooling.
- Event streaming platform: This is the transport mechanism for your change data. Change data streams get packaged as messages, which are placed onto topics, where they can be read and consumed by many downstream consumers. Apache Kafka is the go-to open source tool here, with Confluent and Redpanda , among others, providing some flexibility and performance extensions on Kafka APIs.
- Real-time database or platform: For batch analytics workflows like business intelligence and machine learning, this is usually a data warehouse or data lake. But we’re here for real-time analytics, so in this case, we’d go with a real-time database like those I mentioned above or a real-time data platform like Tinybird. This system subscribes to change data topics on the event streaming platform and writes them to a database optimized for low-latency, high-concurrency analytics queries.
- Real-time API layer: If your goal, like many others, is to build user-facing features on top of change data streams, then you’ll need an API layer to expose your queries and scale them to support your new service or feature. This is where real-time data platforms like Tinybird provide advantages over managed databases, as they offer API creation out of the box. Otherwise, you can turn to tried-and-tested ORMs (object-relational mappings) and build the API layer yourself.
Put all these components together, and you’ve got a real-time analytics pipeline built on fresh data from your source data systems. From there, what you build is limited only by your imagination (and your SQL skills).
Change Data Capture: Making Your Relational Databases Real Time
Change data capture (CDC) bridges the gap between traditional backend databases and modern real-time streaming data architectures. By capturing and instantly propagating data changes, CDC gives you the power to create new event streams and enrich others with up-to-the-second information from existing applications and services.
So what are you waiting for? It’s time to tap into that 20-year-old Postgres instance and mine it for all its worth. Get out there, research the right CDC solution for your database, and start building. If you’re working with Postgres, MongoDB or MySQL, here are some links to get you started:
- A practical guide to real-time change data capture with Postgres
- A practical guide to real-time change data capture with MongoDB
- A practical guide to real-time change data capture with MySQL