Apache Flink for Unbounded Data Streams
Unless you’re a serious data developer, chances are you haven’t heard of Apache Flink. But I can guarantee that you’ve interacted with it. Booking.com, Pinterest, Stripe, Alibaba and Goldman Sachs are just a few of the companies that rely on Flink.
Why? Because as Alibaba recently explained at the annual Apache Flink Forward conference, every day its immense eCommerce website sees peaks of over 500,000 orders per second. Those are the kinds of transaction numbers you see during the busiest times of America's Black Friday. And Flink handles it with no fuss. Running Flink jobs on two million CPU cores, Alibaba processes 6.9 billion transactions per second. Name another transactional system that can do that reliably day after day.
There was a time when talking about this kind of scale would seem silly. Billions of transactions? Who does that? Welcome to the 2020s.
Sure, we all still do batch jobs. But when you must track moving cars, nonstop financial transactions, the constant dance of signals between cell phone towers and smartphones, Internet of Things (IoT) devices from industrial sensors to your toaster, and on and on, you need another kind of fast data analysis. That's where Flink comes in.
This open source framework and distributed processing engine handles a continuous flow of events. Reports from production have Flink applications processing trillions of events per day.
Technically, Flink is used for stateful computations over unbounded and bounded data streams. And what are those, you ask?
You already know about bounded data streams. They're the data used in a batch job: data with a defined start and end. For example, the old-school overnight report covering all the sales made between 9 a.m. and 5 p.m. yesterday is a bounded data stream. Typically, all the data is ingested before any computation begins.
Unbounded data streams, though, are a horse of an entirely different color, and increasingly they're how companies want their data ingested. These real-time streams have a start but no defined end, so they must be processed continuously. There's no waiting for all the data to arrive, because the stream never stops coming, and events within it can arrive out of order.
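To make the contrast concrete, here is a minimal plain-Python sketch (not Flink's API; the function names are illustrative) of the same sales total computed both ways:

```python
from typing import Iterable, Iterator

def batch_total(sales: Iterable[float]) -> float:
    """Bounded: ingest everything first, then compute once."""
    data = list(sales)           # the stream has an end, so we can collect it all
    return sum(data)

def streaming_totals(sales: Iterable[float]) -> Iterator[float]:
    """Unbounded: emit an updated result after every event."""
    running = 0.0
    for amount in sales:         # in production, this loop never terminates
        running += amount
        yield running            # an up-to-date answer is always available

print(batch_total([10.0, 20.0, 5.0]))             # one answer at the end: 35.0
print(list(streaming_totals([10.0, 20.0, 5.0])))  # a result per event: [10.0, 30.0, 35.0]
```

The batch version can only answer after the last record arrives; the streaming version has an answer after every record, which is the whole point when the last record never comes.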
To cope, Flink provides tools such as watermarks for handling events that arrive out of order. It runs in all common cluster environments, performs its computations at in-memory speed and executes stateful streaming applications at any scale.
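Conceptually, a watermark is the stream's clock: it asserts that no event older than a given timestamp is still expected, so windowed results can be finalized. A minimal plain-Python sketch of the idea (not Flink's actual API; MAX_DELAY is an assumed bound on out-of-orderness):

```python
MAX_DELAY = 5  # assumed maximum lateness of events, in seconds

def watermarks(event_timestamps):
    """After each event, emit a watermark: 'no event older than this is expected.'"""
    max_seen = float("-inf")
    for ts in event_timestamps:
        max_seen = max(max_seen, ts)  # a late event never moves the clock backward
        yield max_seen - MAX_DELAY    # the watermark trails the newest event seen

# Events arrive out of order (101 after 103), but the watermark only advances.
print(list(watermarks([100, 103, 101, 110])))  # [95, 98, 98, 105]
```

Once the watermark passes the end of a time window, Flink knows it can safely fire that window, even though stragglers were still trickling in moments before.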
I repeat, “any scale.”
Flink pulls off this trick by parallelizing applications into thousands of tasks that are distributed across a cluster and executed concurrently. Flink applications can leverage all the CPU processing power, main memory, disk and network I/O you can give them, and they can maintain very large application state. Flink's asynchronous, incremental checkpointing algorithm ensures minimal impact on processing latencies while guaranteeing exactly-once state consistency.
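To illustrate the incremental part of that idea (a toy sketch, not Flink's RocksDB-backed implementation; all names here are hypothetical), a checkpoint can persist only the state entries that changed since the last one, rather than the whole state:

```python
class IncrementalCheckpointer:
    """Toy sketch: snapshot only state entries modified since the last checkpoint."""

    def __init__(self):
        self.state = {}       # the live application state
        self.dirty = set()    # keys changed since the last checkpoint
        self.snapshots = []   # stands in for durable storage (e.g., object store)

    def update(self, key, value):
        self.state[key] = value
        self.dirty.add(key)   # processing continues; nothing blocks here

    def checkpoint(self):
        delta = {k: self.state[k] for k in self.dirty}
        self.snapshots.append(delta)  # only the delta is written out
        self.dirty.clear()
        return delta

cp = IncrementalCheckpointer()
cp.update("a", 1)
cp.update("b", 2)
print(cp.checkpoint())  # {'a': 1, 'b': 2} -- first checkpoint writes everything touched
cp.update("a", 5)
print(cp.checkpoint())  # {'a': 5} -- only the changed key is persisted
```

Writing deltas instead of full snapshots is why checkpointing terabytes of state need not stall event processing.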
Much of a stateful Flink application's speed comes from local state access. Task state is kept in memory whenever possible; when it outgrows the available RAM, Flink stores it in access-efficient on-disk data structures. Either way, you get very low processing latencies. With Flink, the name of the game is always speed and more speed, achieved by making the most efficient use of available memory and storage resources.
By analyzing data while it's in motion between systems, before it finds a home in a database, information is gleaned instantly, while the data is still "in memory." The result is real-time processing with high throughput and low latency.
Couldn’t you do this with older, slower technologies? Sort of, and we’ve been doing that for decades. But have you noticed where that gets us?
For example, while IT mishaps are far from the only reason for the miserable summer 2022 travel season, they certainly played a significant role. As Ellen Friedman and Kostas Tzoumas point out in their book "Introduction to Apache Flink," airlines handle huge amounts of real-time data from many sources that must be quickly and accurately coordinated. "For example, as passengers check in, data must be checked against reservation information, luggage handling and flight status, as well as billing. At this scale, it's not easy to keep up unless you have robust technology to handle streaming data. The recent major service outages with three of the top four airlines can be directly attributed to problems handling real-time data at scale."
Exactly so. As the global economy moves more and more of its business online, the endless flow of data never stops and grows ever larger. We need Flink.
With Flink, we're able to create event-driven applications, such as fraud detection and payment processing, that are quicker than their predecessors. And as anyone who's ever waited for a website to confirm a purchase knows, every microsecond shaved off the process counts.
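For a flavor of such an event-driven application (a toy sketch, not Flink's DataStream API; the thresholds and names are made up), consider the classic rule of flagging a large charge that immediately follows a tiny "test" charge on the same account:

```python
SMALL, LARGE = 1.0, 500.0  # assumed thresholds, in dollars

def detect_fraud(transactions):
    """Flag accounts where a large charge immediately follows a tiny 'test'
    charge — per-account state, as a Flink keyed operator would keep it."""
    last_was_small = {}                  # state partitioned by account key
    alerts = []
    for account, amount in transactions:
        if last_was_small.get(account) and amount > LARGE:
            alerts.append(account)       # in Flink, this would emit an alert event
        last_was_small[account] = amount < SMALL
    return alerts

# Account 'a' shows the test-charge-then-big-charge pattern; 'b' does not.
print(detect_fraud([("a", 0.5), ("a", 900.0), ("b", 900.0)]))  # ['a']
```

Because the state is keyed by account, Flink can spread millions of accounts across a cluster and still give each event microsecond-fast access to exactly the state it needs.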
In addition, IoT devices are growing ever more event-driven. For instance, at-home real-time health monitoring will soon be a part of every watch. With true situational awareness, users won't just get useful information; they'll get life-saving information.
As for the applications themselves, it helps that the most recent Flink updates enable it to act as a streaming data warehouse by unifying the stream and batch top-level application programming interfaces. Flink's change data capture (CDC) abilities also enable companies that rely on traditional data stores such as MySQL, MariaDB, Oracle, PostgreSQL and MongoDB to generate streams that feed the real-time world of Kafka, Pulsar and ClickHouse.
In addition, thanks to Flink SQL advances, developers can use familiar SQL rather than learn new ways to code real-time data programs. Developers, data engineers, data scientists and analysts already speak the SQL "lingua franca." Recent Flink SQL improvements also make it easier to migrate between releases, and they add a table store for local queries as well as a REST API, which in turn unlocks JDBC access to Flink SQL.
Startups and established vendors are building solutions with Flink. My company, Decodable, is one of them.
This means that Flink isn't just another useful open source program. It represents a sea change in how we work with data. The IT future, from traditional database applications to real-time streaming programs we're still wrapping our minds around, belongs to Flink.