Making Real-Time Data Real: Change Data Capture for Astra DB

Mar 24th, 2022 8:04am by Chris Latimer

Chris Latimer
Chris Latimer is vice president of product management at DataStax, where he leads the company's product strategy on event streaming and cloud messaging. Prior to DataStax, Chris spent over 20 years working in technology as a software engineer, architect and product manager at companies such as Google, NetJets and Apigee.

DataStax Astra DB’s addition of change data capture could simplify event-driven architectures.

As a developer, if you’re thinking about your database, there’s a good chance it’s not by choice. Queries are running slow, the data you’re retrieving doesn’t look quite as you expected, there’s an outage and you’re scrambling to get things running again. Ideally, the database just works, and you can focus on your application.

That’s what we strove for at DataStax when we developed Astra DB, our database-as-a-service built on Apache Cassandra. However, no application is an island. Just as organizations need a developer-optimized way of managing operational data, they also need a way to connect that data throughout their ecosystem in real time, ideally without pushing those responsibilities down to the application tier.

Today, we are releasing change data capture (CDC) for DataStax Astra DB, our serverless database system for cloud environments.

Built on top of Astra Streaming – DataStax’s Apache Pulsar-based cloud service – the CDC service enables the capture of data changes on individual database tables as they occur and lets organizations stream them anywhere needed: Snowflake, other SQL or NoSQL datastores, Google Cloud Pub/Sub, Kafka, Kinesis, and more.

These capabilities could make it easier for businesses to use real-time data for immediate decision-making and intelligence.

Moving Toward a Unified Event-Driven Architecture

CDC delegates responsibility for publishing events to the database. Since many events that get published in an event-driven architecture coincide with changes that are being committed to the database, this simplifies the role of the application. At the same time, because CDC is configuration-driven, new event types can be implemented simply by enabling CDC on new tables at the database tier.

This eliminates the development, testing, and release overhead otherwise incurred each time a new event is needed. It also eliminates scenarios where applications attempt to future-proof themselves by publishing a large number of events that no downstream systems care about, which wastes middleware resources and leaves behind large amounts of code that is prone to rot.
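To make that division of labor concrete, here is a minimal sketch in Python. It is a toy, in-memory stand-in and not the Astra DB API: the `ToyDatabase` class, its `enable_cdc` and `upsert` methods, and the event shape are all hypothetical, chosen only to illustrate the idea that the application simply writes and the database tier decides which tables emit events.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

ChangeEvent = Dict[str, object]


@dataclass
class ToyDatabase:
    """In-memory stand-in for a database with per-table CDC (hypothetical)."""

    tables: Dict[str, Dict[str, dict]] = field(default_factory=dict)
    cdc_enabled: Set[str] = field(default_factory=set)
    subscribers: List[Callable[[ChangeEvent], None]] = field(default_factory=list)

    def enable_cdc(self, table: str) -> None:
        # Configuration change only: no application code is touched.
        self.cdc_enabled.add(table)

    def upsert(self, table: str, key: str, row: dict) -> None:
        # The application only writes; publishing is the database's job.
        self.tables.setdefault(table, {})[key] = row
        if table in self.cdc_enabled:
            event = {"table": table, "op": "upsert", "key": key, "row": dict(row)}
            for notify in self.subscribers:
                notify(event)


db = ToyDatabase()
received: List[ChangeEvent] = []
db.subscribers.append(received.append)

db.enable_cdc("payments")
db.upsert("payments", "p-1", {"amount": 42})   # CDC enabled: one event emitted
db.upsert("sessions", "s-1", {"user": "ana"})  # CDC not enabled: no event
```

In this sketch, adding a new event type is a single `enable_cdc` call on another table; the code that performs the writes does not change.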

CDC for Astra DB was built to capture events from an Astra DB database and publish them to downstream consumers. But because it publishes those event streams to Astra Streaming, they can sit alongside events ingested from other sources, including CDC from other databases, events published directly by applications, or even messaging systems such as Kafka, Kinesis or older MQ platforms.

And because Astra Streaming supports messaging semantics for queuing, pub/sub, event streaming and lightweight stream processing, it provides an alternative to the fragmented, disparate state of data in motion that many enterprises find themselves struggling with today.
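As a rough illustration of how one topic can serve both pub/sub and queuing, the toy model below mimics Pulsar-style subscriptions: every subscription sees every message, while consumers that share a subscription split the messages between them round-robin. The `ToyTopic` class is a hypothetical teaching aid, not the Pulsar client API.

```python
from collections import defaultdict
from typing import Dict, List


class ToyTopic:
    """Toy model of a topic with named subscriptions (hypothetical)."""

    def __init__(self) -> None:
        # subscription name -> list of consumer inboxes
        self._subscriptions: Dict[str, List[List[str]]] = defaultdict(list)
        # subscription name -> number of messages dispatched so far
        self._dispatched: Dict[str, int] = defaultdict(int)

    def subscribe(self, subscription: str) -> List[str]:
        """Attach a consumer to a subscription and return its inbox."""
        inbox: List[str] = []
        self._subscriptions[subscription].append(inbox)
        return inbox

    def publish(self, message: str) -> None:
        # Pub/sub: each subscription gets its own copy of the message.
        for name, inboxes in self._subscriptions.items():
            # Queuing: within a subscription, consumers take turns.
            index = self._dispatched[name] % len(inboxes)
            inboxes[index].append(message)
            self._dispatched[name] += 1


topic = ToyTopic()
audit = topic.subscribe("audit")       # sole consumer: sees everything
worker_a = topic.subscribe("workers")  # two consumers share this subscription
worker_b = topic.subscribe("workers")

for message in ["m1", "m2", "m3", "m4"]:
    topic.publish(message)
```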

How Does CDC Enable Real-Time Data Pipelines?

Beyond event-driven architectures, CDC for Astra DB can also help accelerate data engineering. The most broadly applicable use case is real-time data pipelines. A prime example is delivering real-time analytics and reporting by streaming data from Astra DB into a data warehouse, such as Snowflake.

Analytics-Ready Data

Traditional business intelligence has long relied on batch processing to move data from operational data stores into data warehouses for reporting and analytical purposes. But the lag caused by this batch processing is increasingly at odds with the need for up-to-the-moment information.

With CDC for Astra DB, delayed batch jobs can be replaced by immediate updates that automatically stream data changes from Astra DB into a data warehouse solution, providing an accurate picture that always reflects the current state of the data.

Here’s how it works: Applications generate data (ad clicks, payments, and location data, for example) and write that data into their database, in this case Astra DB. CDC for Astra DB automatically detects the changes and pushes them into Astra Streaming, DataStax’s cloud native data streaming and event-stream processing service, for processing.

Using built-in connectors to platforms like Snowflake, or to any platform with a JDBC (Java Database Connectivity) interface, the changes are pushed into downstream data warehouses. Once the data lands, the warehouse can use views and its data transformation capabilities to structure the data in the desired format for reporting. This delivers real-time visibility into key aspects of an organization’s data via reporting tools that are built into, or integrate with, the data warehouse.

How CDC for Astra DB populates a data warehouse for real-time reporting (Source: DataStax)
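The flow described above can be sketched end to end with Python's built-in `sqlite3` module standing in for the data warehouse. This is an illustration under stated assumptions, not the actual connector: the change-event shape (`op`, `key`, `row`), the `ad_clicks` table, and the `spend_by_campaign` view are all hypothetical.

```python
import sqlite3
from typing import Iterable


def apply_changes(conn: sqlite3.Connection, events: Iterable[dict]) -> None:
    """Apply a stream of change events to the warehouse table, in order."""
    for event in events:
        if event["op"] == "delete":
            conn.execute("DELETE FROM ad_clicks WHERE click_id = ?", (event["key"],))
        else:
            row = event["row"]
            # INSERT OR REPLACE makes inserts and updates share one code path.
            conn.execute(
                "INSERT OR REPLACE INTO ad_clicks (click_id, campaign, cost_cents) "
                "VALUES (?, ?, ?)",
                (event["key"], row["campaign"], row["cost_cents"]),
            )
    conn.commit()


# In-memory SQLite database standing in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE ad_clicks (click_id TEXT PRIMARY KEY, campaign TEXT, cost_cents INTEGER)"
)
# A reporting view defined once in the warehouse, on top of the raw table.
warehouse.execute(
    "CREATE VIEW spend_by_campaign AS "
    "SELECT campaign, SUM(cost_cents) AS total_cents FROM ad_clicks GROUP BY campaign"
)

changes = [
    {"op": "upsert", "key": "c1", "row": {"campaign": "spring", "cost_cents": 120}},
    {"op": "upsert", "key": "c2", "row": {"campaign": "spring", "cost_cents": 80}},
    {"op": "upsert", "key": "c3", "row": {"campaign": "summer", "cost_cents": 50}},
    {"op": "upsert", "key": "c1", "row": {"campaign": "spring", "cost_cents": 100}},
    {"op": "delete", "key": "c3"},
]
apply_changes(warehouse, changes)

report = warehouse.execute(
    "SELECT campaign, total_cents FROM spend_by_campaign ORDER BY campaign"
).fetchall()
```

Because each change is applied as it arrives, the view reflects the corrected cost for `c1` and the removal of `c3` without waiting for a batch job.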

There are several other important use cases where CDC for Astra DB shines, including:

Search Integration

Cassandra is a highly scalable database known for its impressive read and write performance. But sometimes data needs to be moved from Cassandra into a purpose-built search solution like Elasticsearch. In these situations, CDC for Astra DB simplifies the process and automatically updates your search indexes in real time.
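As a rough picture of what keeping a search index in sync from change events involves, the sketch below maintains a tiny in-memory inverted index. It stands in for a real search engine; the `ToySearchIndex` class, the event shape, and the `description` field are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class ToySearchIndex:
    """Tiny inverted index kept current by change events (hypothetical)."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)  # token -> row keys
        self._docs: Dict[str, str] = {}                         # row key -> indexed text

    @staticmethod
    def _tokens(text: str) -> Set[str]:
        return set(text.lower().split())

    def apply(self, event: dict) -> None:
        key = event["key"]
        # Remove the previous version of the row, if any, from the index.
        old_text = self._docs.pop(key, None)
        if old_text is not None:
            for token in self._tokens(old_text):
                self._postings[token].discard(key)
        # Index the new version unless the row was deleted.
        if event["op"] != "delete":
            text = event["row"]["description"]
            self._docs[key] = text
            for token in self._tokens(text):
                self._postings[token].add(key)

    def search(self, term: str) -> Set[str]:
        return set(self._postings.get(term.lower(), set()))


index = ToySearchIndex()
index.apply({"op": "upsert", "key": "p1", "row": {"description": "Red running shoes"}})
index.apply({"op": "upsert", "key": "p2", "row": {"description": "Blue running jacket"}})
```

Updates and deletes flow through the same `apply` method, so the index never needs a full rebuild to stay consistent with the table.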

Operational ML

Data science often involves the analysis of time-series data, which isn’t always easy to capture. With CDC for Astra DB, data scientists can more easily access an event stream of time-series data that represents the changes that have happened on a table-by-table basis.

These time series play a critical role in training ML models, which can be used to extract greater insights and predictive capabilities. While these models are valuable on their own, you can also operationalize them as part of your data-in-motion strategy, for example by calling them from Pulsar Functions to enrich data in real time as it moves through your streaming data pipelines.
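A minimal sketch of that enrichment step follows, written as a plain `process(input)` function of the kind Pulsar Functions can run through their language-native Python interface. The JSON event shape, the `amount_cents` field, and the `score_payment` "model" are assumptions made for illustration; a real deployment would load a trained model instead.

```python
import json


def score_payment(amount_cents: int) -> float:
    """Hypothetical stand-in for a trained model: larger payments score riskier."""
    return min(1.0, amount_cents / 100_000)


def process(input):
    """Enrich one change event (a JSON string) with a model score."""
    event = json.loads(input)
    event["risk_score"] = score_payment(event["row"]["amount_cents"])
    # The returned value is what the function would publish downstream.
    return json.dumps(event)


sample = json.dumps({"op": "upsert", "key": "p-1", "row": {"amount_cents": 25000}})
enriched = json.loads(process(sample))
```

Keeping the scoring logic in its own function makes it easy to swap the placeholder for a real model without touching the message handling.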

CDC for Astra DB also enables organizations to:

  • Build applications that respond to CDC change events, driving business logic from changes detected in an Astra DB database.
  • Integrate with platforms like Twilio or Firebase to send SMS or push notifications when changes occur in an Astra DB database.
  • Gain visibility into anomalous behavior that may indicate a security breach by monitoring CDC’s consumable stream of event data.
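The three patterns in this list share one shape: a consumer reads change events and routes them to handlers. The sketch below shows that shape under stated assumptions. The event fields (`table`, `op`, `key`, `row`, `ts`), the `on_change` decorator, and the thresholds are hypothetical, and the `outbox` list stands in for a call to an SMS or push provider.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Handler = Callable[[dict], None]
handlers: Dict[str, List[Handler]] = {}


def on_change(table: str) -> Callable[[Handler], Handler]:
    """Register a handler for change events on one table."""
    def register(handler: Handler) -> Handler:
        handlers.setdefault(table, []).append(handler)
        return handler
    return register


def dispatch(event: dict) -> None:
    """Route one change event to every handler registered for its table."""
    for handler in handlers.get(event["table"], []):
        handler(event)


outbox: List[str] = []  # stand-in for an SMS or push notification provider
alerts: List[str] = []  # stand-in for a security alerting channel


@on_change("orders")
def notify_when_shipped(event: dict) -> None:
    if event["op"] == "upsert" and event["row"].get("status") == "shipped":
        outbox.append(f"Order {event['key']} has shipped")


WINDOW_SECONDS = 60.0
MAX_DELETES = 3
recent_deletes: Deque[float] = deque()


@on_change("accounts")
def flag_delete_bursts(event: dict) -> None:
    """Raise an alert when deletes exceed MAX_DELETES within the time window."""
    if event["op"] != "delete":
        return
    now = event["ts"]
    recent_deletes.append(now)
    # Drop deletes that have fallen out of the sliding window.
    while recent_deletes and now - recent_deletes[0] > WINDOW_SECONDS:
        recent_deletes.popleft()
    if len(recent_deletes) > MAX_DELETES:
        alerts.append(f"{len(recent_deletes)} account deletes within {WINDOW_SECONDS:.0f}s")
```

A consumer loop would call `dispatch` once per event; new reactions are added by registering another handler rather than by changing the application that writes the data.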

In short, CDC for Astra DB aims to simplify the lives of developers by reducing complexity and delegating cross-cutting capabilities around data change events to the source of truth: the database.
