Confluent Wants to Make Batch Processing a Thing of the Past
A decade ago, as a staff engineer at LinkedIn, Jay Kreps saw, in his words, “a fundamental disconnect” between how businesses operated and the computing systems that supported them.
At the social media service, data was being generated continuously, and this was true of pretty much every business Kreps knew of. “Business is fundamentally an activity happening. It’s continuous and happens all day, throughout the day,” he recalled.
Yet most enterprise data processing still happens in batches: routines set to execute once a day, or at some other scheduled time (mostly at night, when there was an excess of computing power). Dashboards showed last week’s data (and still do); bank customers had to wait three days for a money transfer to come through. Databases captured data, which could then be queried at some point later.
“Batch processing logic, I think, is probably on the decline…”
— Jay Kreps
In effect, the Kafka project, created at LinkedIn, was about bypassing batch-based systems: it provides a high-throughput, low-latency platform for handling real-time data feeds.
Today, Kafka serves as the operational basis for many web-scale companies, including Uber, Lyft, PayPal, Twitter and Netflix, all of which interact with their customer bases in real time. Kreps went on to co-found Confluent, which offers an enterprise-supported version of the software.
But data streaming today is still largely seen as a niche, albeit an important one. Kreps wants to see it everywhere, rendering batch processing a relic of the past. And this year, at the company’s annual Current user conference, held in San Jose last week, Kreps introduced a new tool that he feels will win over many more converts to data stream processing: Apache Flink.
What Is Wrong with Data Streaming?
In his Current keynote, Kreps went over the reasons most people consider data streaming a niche technology: stream processing is seen as too difficult for developers to work with, not as expressive, unable to scale well, prone to losing data and not as efficient as batch processing.
Kreps shared his vision of how ubiquitous data streaming would take care of all these issues.
To date, most organizations keep two sets of data processing systems: one for handling events that have already taken place, whether last week, last month or whenever, and another for acting on data as it arrives. Most real-time data systems do not also work on historical data, hence the need for at least two separate analysis and processing systems.
But this does not have to be the case, Kreps argued. A data streaming system can be built such that it works both on historical data, through batch queries, and on new data as it comes in. As an aside, he argued that even most batch processing systems, such as data warehouses, process data through limited stream processes.
There has been a lot of work around building fault-tolerant models for parallel processing, which makes stream processing systems such as Kafka scalable and transactional.
With Kafka as the streaming hub, everything else can be plugged into it to query and process the data. In effect, it can serve as the database layer for the streaming environment, much as databases have served as the basis for almost all enterprise applications.
This is where Apache Flink comes in. It provides a unified interface for developers to write against, initially using a language everyone is familiar with: SQL. It can be used for event-driven applications, streaming analytics or streaming data pipelines, scaling to whatever size the job requires.
In the long run, Flink will be every bit as important to Confluent as Kafka itself, Kreps said in a roundtable interview. What Flink does is give developers tools to query streaming data the same way they would interrogate a database or data store: through SQL and, next year, Python or Java.
What attracted Confluent to Flink was that it is not a separate system for working only with streaming data. In fact, you can use Flink to build business logic from either streaming or batch data, using the same tools. Unlike other stream processing tools, Flink treats batch, or bounded, data the same way it treats streaming, or unbounded, data.
“Whether you need to do batch processing or stream processing, you can write your code once and run it with both execution models,” further explained Confluent software practice lead David Anderson, in a technical session at the conference.
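The idea Anderson describes can be sketched in plain Python: the same pipeline definition runs over a bounded collection (batch) or an unbounded generator (stream). This is only an illustration of the unified model, not Flink’s actual API; the `enrich` logic and event fields are hypothetical.

```python
from typing import Iterable, Iterator

def enrich(order: dict) -> dict:
    """The business logic: add a total to each order event (illustrative)."""
    return {**order, "total": order["qty"] * order["price"]}

def process(events: Iterable[dict]) -> Iterator[dict]:
    """One pipeline definition, regardless of whether input is bounded."""
    for event in events:
        yield enrich(event)

# Batch execution: a bounded, finite collection.
batch = [{"qty": 2, "price": 5.0}, {"qty": 1, "price": 3.0}]
print([e["total"] for e in process(batch)])  # [10.0, 3.0]

# Stream execution: a generator standing in for an unbounded feed
# (truncated here so the demo terminates).
def stream() -> Iterator[dict]:
    yield {"qty": 4, "price": 2.5}

for event in process(stream()):
    print(event["total"])  # 10.0
```

In Flink proper, the same job graph compiles to either a batch or a streaming execution plan; here the single `process` function plays that role.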
Not only does this lower the barrier to entry for those who want to try data streaming, but it would ultimately streamline business processes by reducing the number of tools needed to a single set.
“If you look at reality, companies are quite federated across different technologies. How do you put that together and make it all feel like one experience to the customers?” Kreps asked.
Another advantage of Flink is its easy scalability, Anderson pointed out. The way its APIs are organized, the developer does not need to worry about managing multiple concurrent threads.
“Flink has the runtime architecture needed to achieve really high scale and really high performance. It scales all the way from simple applications ingesting thousands of events per day up to really huge applications ingesting billions of events per second,” Anderson said. “So it’s fault-tolerant, and provides high availability.”
It can handle both stateless and stateful applications, with state saved either in local storage or in a fast embedded key-value store such as RocksDB.
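What “stateful” means here can be shown with a minimal sketch: a running count kept per key as events arrive. A local dict stands in for the keyed state that Flink would manage for you (in memory or in a store like RocksDB); the event names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit an updated per-key count for every incoming event."""
    state: Dict[str, int] = defaultdict(int)  # stands in for Flink keyed state
    for key in events:
        state[key] += 1
        yield key, state[key]  # send the updated count downstream

clicks = ["home", "cart", "home", "home"]
print(list(running_counts(clicks)))
# [('home', 1), ('cart', 1), ('home', 2), ('home', 3)]
```

A stateless job, by contrast, would transform each event independently, with nothing carried over between them.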
Flink can be used in cases where data is being ingested into another system but requires some enhancement. This could be for dashboards, or even for real-time observability and metering. It is also used as the basis of event-driven programming, where one action can trigger another.
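The event-driven pattern, where one action triggers another, can be sketched with a toy publish/subscribe bus. The class, topic names and handlers below are all hypothetical stand-ins for what Kafka topics and Flink jobs would do in a real deployment.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class EventBus:
    """A toy in-process pub/sub bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers[topic]:
            handler(event)

bus = EventBus()
log: List[str] = []

# One action triggers another: creating a site kicks off DNS setup,
# whose handler runs before the remaining site.created handlers.
bus.subscribe("site.created", lambda e: bus.publish("dns.setup", e))
bus.subscribe("site.created", lambda e: log.append(f"provision {e['site']}"))
bus.subscribe("dns.setup", lambda e: log.append(f"dns {e['site']}"))

bus.publish("site.created", {"site": "example"})
print(log)  # ['dns example', 'provision example']
```

In production, the bus would be a durable Kafka topic rather than an in-memory dict, so downstream consumers can replay or lag behind without losing events.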
In a customer panel with the press, two engineers from website-building giant Wix (Avi Perez, head of backend engineering, and Natan Silnitsky, backend infrastructure tech lead) shared why it was essential for the company to run its event-driven architecture on a streaming platform. Wix hosts at least 200 million active websites.
Every time a user hits a button on a site, perhaps to spin up a new service, it triggers hundreds if not thousands of events to deliver the result. There is simply no way such an architecture could run, at least not responsively, if it were built on a standard database alone, Silnitsky noted.