4 Reasons Why Developers Should Use Apache Flink
Apache Kafka has become the go-to platform for streaming data across an enterprise, but streaming is even more valuable when the data can be cleaned, enriched and made available for additional use cases downstream. That’s where stream processing comes in.
Stream processing allows you to continually consume data streams, process them with additional business logic and turn them into new streams that others can repurpose for their own applications. Uses cover a wide variety of applications, including real-time dashboards, machine learning models, materialized views, and event-driven apps and microservices.
The complexity of processing logic differs based on the application at hand and can range from straightforward tasks, like filters and aggregators, to more involved operations, like multiway temporal joins and arbitrary event-driven logic. As a result, the benefits of stream processing over other options (such as periodic batch jobs, ELT, classical two-tiered architecture) differ depending on the use case.
Despite this variation, the key drivers of adoption for stream processing typically fall into one or more of these categories:
- Latency: Stream processing greatly cuts the time between an event happening and the event showing up in your product or user experience, whether that’s a dashboard, machine learning model or other type of application.
- Innovation and reusability: Stream processing turns data products into shareable assets that can be consumed and built on by downstream applications and systems. The data streams become reusable building blocks with well-defined and consistent data access, making it simple for other teams to employ them for new products and applications.
- Cost and resource efficiency: Processing data continuously improves resource utilization by distributing the work over time. In addition, processing data upstream (such as pre-aggregation, sessionization, etc.) greatly reduces costs in downstream systems (such as data warehouse, real-time analytics database, etc.) while also speeding up queries in these systems.
- Expressiveness: Life doesn’t happen in batches. In contrast to periodically scheduled batch jobs, stream processing does not introduce artificial boundaries in your data that overlap into your processing logic.
Four Reasons to Consider Apache Flink
Flink is one of the most active Apache projects, providing a unified framework for stream and batch processing. Digital-first companies like Uber, Netflix and LinkedIn use Flink, as well as more traditional enterprises like Goldman Sachs and Comcast.
Flink also has a large and vibrant contributor community, supported by companies like Apple and Alibaba, that helps to ensure continual innovation. As a result, Flink has enjoyed rapid adoption comparable to that of Kafka in its early days.
Here are four common reasons that companies choose Flink over other stream-processing technologies:
No. 1: It’s a Powerful Execution Engine
Flink boasts a powerful runtime with exceptional resource optimization, high throughput with low latency and robust state handling. Specifically, the runtime can:
- Achieve sustained throughput of tens of millions of records per second.
- Maintain subsecond latency at scale.
- Ensure end-to-end, exactly-once processing across system boundaries.
- Compute correct results even in the case of failures and out-of-order events.
- Manage and, in the event of errors, restore states as large as tens of terabytes.
Flink can be configured for a wide range of workloads depending on the use case, including streaming, batch or a hybrid of the two.
No. 2: Compatibility with Multiple APIs and Languages
Flink offers four distinct APIs that can each cater to different users and applications. Flink also extends support for multiple programming languages, including Python, Java and SQL.
The DataStream API, available in both Java and Python, allows you to create data flow graphs by linking transformation functions like FlatMap, Filter and Process. Within these user-defined functions, you gain access to the fundamental components of a stateful stream processor, such as state, time and events. This provides you with fine-grained control over how records flow through the system and how they read, write and update the state of your application. If you’re familiar with the Kafka Streams DSL and Kafka Processor API (↔ ProcessFunction), the experience will be familiar.
The Table API is Flink’s more modern, declarative API. It enables you to write programs using relational operations like joins, filters, aggregations and projections, in addition to various types of user-defined functions. Similar to the DataStream API, the Table API is supported in Java and Python. Programs developed using this API undergo optimization similar to Flink SQL queries, sharing several features with SQL, such as the type system, built-in functions and the validation layer. This API has parallels to Spark Structured Streaming, Spark’s DataFrame API and Snowpark DataFrame API, although those APIs are geared more toward micro-batch and batch processing than streaming.
Built on the same underlying architecture as the Table API is Flink SQL, a SQL engine that adheres to ANSI standards and can process both live and historical data. Flink SQL uses Apache Calcite for query planning and optimization. It supports arbitrarily nested subqueries, has broad language support including various streaming joins and pattern matching, and comes with an extensive ecosystem including JDBC Driver, catalogs and an interactive SQL shell.
Finally we have “Stateful Functions,” which eases the creation of stateful, distributed event-driven applications. This is a separate subproject under the Flink umbrella and quite different from Flink’s other APIs. The simplest way to think about Stateful Functions is as a stateful, fault-tolerant distributed Actor system based on the Flink runtime.
The broad choice of APIs makes Flink the ideal option for stream processing, and you can mix different APIs over time as your requirements and use cases evolve.
No. 3: Convergence of Stream and Batch Processing
Apache Flink unifies stream and batch processing, because its main APIs (SQL, Table API and DataStream API) support both bounded data sets and unbounded data streams. Specifically, you can run the same program in either batch- or stream-processing mode depending on the nature of the data that is being processed. You can even let the system choose the processing mode for you.
- Only bounded data sources → Batch-Processing Mode
- At least one unbounded data source → Stream-Processing Mode
This unification of stream and batch processing offers tangible benefits for developers:
- Consistent semantics across real-time and historical data-processing use cases.
- Reuse code, logic and infrastructure between real-time and historical data-processing applications.
- Combine historical and real-time data processing in a single application.
No. 4: It’s Ready for Prime Time
Flink is a mature platform that has been battle-tested in the most demanding production use cases. Features that demonstrate this include:
- A metrics system that works out of the box with tools like Datadog and Prometheus, but which can also be integrated with custom solutions.
- Comprehensive support for observability, troubleshooting and debugging via Flink’s Web UI. This includes support for back pressure monitoring, flame graphs and thread dumps.
- Savepoints, which allow you to statefully scale (while preserving exactly-once semantics), upgrade, fork, backup and migrate your application over time.
Flink and Kafka: A Powerful Combination
Flink and Kafka are frequently used together; in fact, Kafka is Flink’s most popular connector. The two are highly compatible, and Kafka in many ways has driven Flink’s widespread adoption.
Note that Flink itself does not store any data; it operates on data stored elsewhere. You can think of Flink as a computation layer for Kafka, powering real-time applications and pipelines, while Kafka serves as the foundational storage layer for streaming data.
Flink has become even more adept at supporting Kafka applications over time. It can employ Kafka as both a data source and a data sink, capitalizing on Kafka’s broad ecosystem and tools. Flink also supports popular data formats natively, including Avro, JSON and Protobuf.
Kafka is an equally good fit for Flink. Compared to other messaging systems like ActiveMQ, RabbitMQ or PubSub, Kafka enables persistent and indefinite data storage for Flink. Furthermore, Kafka allows multiple consumers to read streams concurrently and rewind them as necessary. The first attribute complements Flink’s distributed processing paradigm, while the second is crucial for Flink’s fault-tolerance mechanisms.