Confluent: Exploiting Data Currency with Streaming Data Processing
Real-time data analysis takes on new possibilities when augmented by modern streaming data capabilities. Established streaming data platforms run extremely fast queries on continually generated data. As a result, organizations derive insights from data the moment it's produced, a degree of currency that's invaluable for deployments requiring sub-second latency.
According to Jean-Sébastien Brunner, director of product management at Confluent, such functionality is fundamental for real-time reactions to data-driven events, particularly when the application demands “the latest data, and to make the query as fast as possible. Confluent is really focusing on making sure that we can process that data and run very complex queries on that data as it’s arriving.”
Confluent’s streaming data platform has four principal components for enabling these advantages. It utilizes a tight integration between Apache Kafka and Apache Flink to store and query data, respectively. There’s also a schema registry for managing data and reinforcing aspects of data governance, and hundreds of connectors (many of which are native) to a plethora of sources.
This artful combination supports numerous querying capabilities for real-time data analysis on the most recently generated data. It enables organizations to perform pattern matching and windowing, and to issue queries that run indefinitely, all based on the events contained in data streams.
“Queries can run one time or it can run forever,” Brunner mentioned. “That’s the key to stream processing. That query will be kind of loaded, in-memory, and wait for the data. Whenever you have the event that triggers your query, it will run.”
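The notion of a query that is "loaded, in-memory" and waits for data can be sketched in plain Python. This is an illustrative sketch only, not Confluent's or Flink's API: a continuous query stays resident and re-evaluates each event as it arrives, in contrast to a one-shot query over stored data. All names here (`ContinuousQuery`, the order fields) are hypothetical.

```python
# Illustrative sketch (not Confluent/Flink API): a "continuous query"
# stays resident and fires whenever an arriving event matches its condition.

class ContinuousQuery:
    def __init__(self, predicate, action):
        self.predicate = predicate  # condition that triggers the query
        self.action = action        # what to run when the condition matches
        self.results = []

    def on_event(self, event):
        # Called for every arriving event; the query is "loaded and waiting".
        if self.predicate(event):
            self.results.append(self.action(event))

# Hypothetical usage: alert on high-value orders as they stream in.
query = ContinuousQuery(
    predicate=lambda e: e["amount"] > 1000,
    action=lambda e: f"ALERT order {e['order_id']}",
)
for event in [{"order_id": 1, "amount": 250}, {"order_id": 2, "amount": 4200}]:
    query.on_event(event)
# query.results -> ["ALERT order 2"]
```

The key contrast with a batch query is the inversion of control: the data drives execution, rather than a user invoking the query against data at rest.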
Confluent’s integration between Kafka and Apache Flink is critical to the platform’s performance. In addition to facilitating data ingestion and data storage, Kafka is the messaging service for subscribing to, and publishing, different data events or data types to consuming applications. “In Kafka you can set how long you want the data there, but technically you can store the data forever and that becomes your source of truth,” Brunner revealed. “And, you can either look at the latest data with Confluent or you can also combine it with data you had before, from batch, or even when you join data together.”
According to Brunner, Flink is responsible for querying the data in Kafka. Confluent’s tight integration of these two open source tools is underscored by its metadata management features. “We actually have some sharing of the metadata [between these systems],” Brunner said. “Whenever you create some object in Kafka it’s visible in Flink, and the converse. So if you create a table in Flink, you have the table in Kafka.” Confluent also provides over 100 pre-built connectors that natively integrate with Kafka, including traditional databases, SaaS applications, and streaming data sources. The tandem enables users to “keep the data that’s already streaming, but you can also bring data in non-streaming form, like databases, and keep the state with CDC [Change Data Capture],” Brunner explained. “With CDC you can get updates in real-time in Kafka and process them.”
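The CDC idea Brunner describes, keeping state in sync by consuming a stream of change events, can be illustrated with a minimal sketch. This is not Kafka's or any connector's actual change-event format; the `op`/`key`/`row` fields are hypothetical, chosen only to show how inserts, updates, and deletes fold into a current-state view.

```python
# Illustrative sketch (hypothetical event shape, not a real CDC format):
# applying a stream of change events to maintain the current state of a
# table, the way CDC keeps a downstream view in sync with a database.

def apply_change(state, change):
    """Fold one change event into the materialized state (a dict keyed by row key)."""
    op, key, row = change["op"], change["key"], change.get("row")
    if op in ("insert", "update"):
        state[key] = row     # upsert the latest version of the row
    elif op == "delete":
        state.pop(key, None)  # remove the row if present
    return state

changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "city": "London"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "city": "Paris"}},
    {"op": "delete", "key": 1},
]
state = {}
for c in changes:
    state = apply_change(state, c)
# state -> {} after insert, update, then delete of the same key
```

Replaying only a prefix of the change stream yields the table's state as of that point, which is what lets streaming systems combine "the latest data" with data from before.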
Already widely used by some of the most prominent organizations born in the cloud (including Netflix and Uber), Flink’s streaming dataflow engine supports complex queries at tremendous scale. Confluent recently unveiled an open preview of Flink on Confluent Cloud, its fully managed service. In Confluent, Flink helps support what Brunner described as “a full ANSI SQL compatible end point where you can query the data at rest in Kafka or in streaming queries.” Flink’s querying capabilities apply to both batch and stream processing, including constructs such as windowing and aggregations. In a fraud detection use case, for example, Flink’s real-time querying can account for numerous data sources while enabling action at the time of the transaction.
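Windowing, one of the constructs mentioned above, can be sketched without any streaming framework. The following is plain Python, not Flink: a tumbling-window aggregation assigns each event to a fixed, non-overlapping time window and aggregates per window. The function name and event shape are hypothetical.

```python
# Illustrative sketch (plain Python, not Flink): a tumbling-window
# aggregation -- sum event values per fixed, non-overlapping window
# of `window_size` seconds, keyed by the window's start time.

from collections import defaultdict

def tumbling_window_sum(events, window_size):
    """events: (timestamp_seconds, value) pairs; returns {window_start: sum}."""
    windows = defaultdict(float)
    for ts, value in events:
        window_start = (ts // window_size) * window_size  # floor to window boundary
        windows[window_start] += value
    return dict(windows)

events = [(1, 10.0), (4, 5.0), (11, 7.0), (14, 3.0)]
# With 10-second windows: window starting at 0 sums 15.0, window at 10 sums 10.0
```

In a real streaming engine the same logic runs incrementally as events arrive, emitting each window's result when the window closes rather than after scanning a finished list.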
“Every single swipe of a credit card can trigger an alerting process, and you take that from the swipe, plus all the history of that customer, for example,” Brunner specified. “It could be a very complex query that looks at customer history, and [the system] will answer if it is something expected or fraud. And, you can set the transaction for less than a second.” Users can also implement complex pattern matching with Flink to perform Complex Event Processing, an especially effective form of real-time streaming data analysis. Flink allows users to define patterns programmatically, declaratively, or by specifying the shape of the pattern. “If you want to look at a double bottom financial pattern, you can say, ‘Flink, go down, up, down, up,’” Brunner indicated. “Flink will do the magic, translate this pattern, and look at [everything] in the stock market and make sure it triggers an alert or does what the customer wants.”
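The "down, up, down, up" double-bottom example can be made concrete with a simplified sketch. This is not Flink CEP or its `MATCH_RECOGNIZE` SQL syntax; it is a hypothetical, much-reduced matcher that turns consecutive prices into direction symbols and scans for the pattern as a contiguous run.

```python
# Illustrative sketch (not Flink CEP): detect a "double bottom" shape,
# i.e. the move sequence down, up, down, up, in a series of prices.

def directions(prices):
    """Map consecutive price pairs to 'down'/'up' moves (flat moves dropped)."""
    moves = []
    for prev, cur in zip(prices, prices[1:]):
        if cur < prev:
            moves.append("down")
        elif cur > prev:
            moves.append("up")
    return moves

def has_pattern(prices, pattern=("down", "up", "down", "up")):
    """True if the pattern occurs as a contiguous run of moves."""
    moves = directions(prices)
    n = len(pattern)
    return any(tuple(moves[i:i + n]) == pattern for i in range(len(moves) - n + 1))

# 100 -> 90 (down) -> 95 (up) -> 88 (down) -> 99 (up): a double bottom
assert has_pattern([100, 90, 95, 88, 99])
```

A CEP engine generalizes this idea considerably, matching patterns with quantifiers, time constraints, and per-key state across unbounded streams rather than a finished list.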
Confluent’s schema registry is influential for both data governance and computational purposes. It allows users to specify the shape of a particular schema, which is helpful for establishing standards for data quality. Brunner referenced an online-order use case in which the registry could formalize that each record should appear as “customer name, order ID, and product ID.” These capabilities enable organizations to effectively create structure for the data within the platform and its various components. They also have implications for access controls, data privacy, and regulatory compliance. “With the schema registry, you can ensure all the data in your topic follows your own schema,” Brunner pointed out. “You can target, find PII data, request encryption, and put some security rules to make sure, when you’re connected, who can read and write different data.”
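The enforcement idea can be sketched in miniature. This is not Confluent's Schema Registry API (which works with serialization formats such as Avro and a central registry service); it is a hypothetical validator showing the core contract: records that don't match the registered shape are rejected. The field names follow the online-order example above.

```python
# Illustrative sketch (not the Schema Registry API): reject records that
# don't match a registered schema -- here, a hypothetical online-order
# schema with customer name, order ID, and product ID.

ORDER_SCHEMA = {"customer_name": str, "order_id": int, "product_id": int}

def validate(record, schema):
    """True only if the record has exactly the schema's fields, with matching types."""
    if set(record) != set(schema):
        return False  # missing or extra fields
    return all(isinstance(record[field], typ) for field, typ in schema.items())

good = {"customer_name": "Ada", "order_id": 42, "product_id": 7}
bad = {"customer_name": "Ada", "order_id": "42"}  # wrong type, missing field
# validate(good, ORDER_SCHEMA) -> True; validate(bad, ORDER_SCHEMA) -> False
```

In practice this check happens at produce/consume time via the registry, so malformed records never reach a topic in the first place.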
There are also downstream implications that advantageously impact computations, stream processing, and querying, since the registry standardizes aspects of how the data is represented and, to a lesser extent, what it means. “When you’re doing the compute you know this is the schema of this data, you know where it is, you know the target and, if you’re authorized to see the data, you can start to do a query,” Brunner said. “You don’t need to request additional access for compute.” A new feature in Confluent, Data Portal, provides additional data access capabilities alongside functionality for discovering and exploring metadata, tags, and topics.
A Credible Foundation
Confluent’s appeal to the enterprise lies in its seamless integration of two of the most effective tools for streaming data messaging, storage, and processing: Kafka and Flink. Its myriad native source connectors are critical for bringing organizations into this ecosystem for real-time data analysis, as is its schema registry for providing enterprise-class data governance, access controls, and data privacy. This foundation is credible not only for performing real-time data analysis, but for doing so on the most recently generated data available.