What news from AWS re:Invent last week will have the most impact on you?
Amazon Q, an AI chatbot for explaining how AWS works.
Super-fast S3 Express storage.
New Graviton 4 processor instances.
Emily Freeman leaving AWS.
I don't use AWS, so none of this will affect me.
Data / Open Source

4 Reasons Why Developers Should Use Apache Flink

Flink is one of the most active Apache projects, providing a unified framework for stream and batch processing.
Nov 1st, 2023 8:08am by
Featued image for: 4 Reasons Why Developers Should Use Apache Flink
Image from Pixabay.

Apache Kafka has become the go-to platform for streaming data across an enterprise, but streaming is even more valuable when the data can be cleaned, enriched and made available for additional use cases downstream. That’s where stream processing comes in.

Stream processing allows you to continually consume data streams, process them with additional business logic and turn them into new streams that others can repurpose for their own applications. Uses cover a wide variety of applications, including real-time dashboards, machine learning models, materialized views, and event-driven apps and microservices.

Stream processing augments data streams with additional business logic, turning them into new data streams that can be reused in other applications and pipelines.

The complexity of processing logic differs based on the application at hand and can range from straightforward tasks, like filters and aggregators, to more involved operations, like multiway temporal joins and arbitrary event-driven logic. As a result, the benefits of stream processing over other options (such as periodic batch jobs, ELT, classical two-tiered architecture) differ depending on the use case.

Despite this variation, the key drivers of adoption for stream processing typically fall into one or more of these categories:

  • Latency: Stream processing greatly cuts the time between an event happening and the event showing up in your product or user experience, whether that’s a dashboard, machine learning model or other type of application.
  • Innovation and reusability: Stream processing turns data products into shareable assets that can be consumed and built on by downstream applications and systems. The data streams become reusable building blocks with well-defined and consistent data access, making it simple for other teams to employ them for new products and applications.
  • Cost and resource efficiency: Processing data continuously improves resource utilization by distributing the work over time. In addition, processing data upstream (such as pre-aggregation, sessionization, etc.) greatly reduces costs in downstream systems (such as data warehouse, real-time analytics database, etc.) while also speeding up queries in these systems.
  • Expressiveness: Life doesn’t happen in batches. In contrast to periodically scheduled batch jobs, stream processing does not introduce artificial boundaries in your data that overlap into your processing logic.

Four Reasons to Consider Apache Flink

Flink is one of the most active Apache projects, providing a unified framework for stream and batch processing. Digital-first companies like Uber, Netflix and LinkedIn use Flink, as well as more traditional enterprises like Goldman Sachs and Comcast.

Flink also has a large and vibrant contributor community, supported by companies like Apple and Alibaba, that helps to ensure continual innovation. As a result, Flink has enjoyed rapid adoption comparable to that of Kafka in its early days.

Flink’s growth has approximately matched Kafka’s at the same stage in its life cycle.

Here are four common reasons that companies choose Flink over other stream-processing technologies:

No. 1: It’s a Powerful Execution Engine

Flink boasts a powerful runtime with exceptional resource optimization, high throughput with low latency and robust state handling. Specifically, the runtime can:

  • Achieve sustained throughput of tens of millions of records per second.
  • Maintain subsecond latency at scale.
  • Ensure end-to-end, exactly-once processing across system boundaries.
  • Compute correct results even in the case of failures and out-of-order events.
  • Manage and, in the event of errors, restore states as large as tens of terabytes.

Flink can be configured for a wide range of workloads depending on the use case, including streaming, batch or a hybrid of the two.

No. 2: Compatibility with Multiple APIs and Languages

Flink offers four distinct APIs that can each cater to different users and applications. Flink also extends support for multiple programming languages, including Python, Java and SQL.

Flink offers several layered APIs with varying levels of abstraction, allowing it to handle both common and more unusual use cases.

The DataStream API, available in both Java and Python, allows you to create data flow graphs by linking transformation functions like FlatMap, Filter and Process. Within these user-defined functions, you gain access to the fundamental components of a stateful stream processor, such as state, time and events. This provides you with fine-grained control over how records flow through the system and how they read, write and update the state of your application. If you’re familiar with the Kafka Streams DSL and Kafka Processor API (↔ ProcessFunction), the experience will be familiar.

The Table API is Flink’s more modern, declarative API. It enables you to write programs using relational operations like joins, filters, aggregations and projections, in addition to various types of user-defined functions. Similar to the DataStream API, the Table API is supported in Java and Python. Programs developed using this API undergo optimization similar to Flink SQL queries, sharing several features with SQL, such as the type system, built-in functions and the validation layer. This API has parallels to Spark Structured Streaming, Spark’s DataFrame API and Snowpark DataFrame API, although those APIs are geared more toward micro-batch and batch processing than streaming.

Built on the same underlying architecture as the Table API is Flink SQL, a SQL engine that adheres to ANSI standards and can process both live and historical data. Flink SQL uses Apache Calcite for query planning and optimization. It supports arbitrarily nested subqueries, has broad language support including various streaming joins and pattern matching, and comes with an extensive ecosystem including JDBC Driver, catalogs and an interactive SQL shell.

Finally we have “Stateful Functions,” which eases the creation of stateful, distributed event-driven applications. This is a separate subproject under the Flink umbrella and quite different from Flink’s other APIs. The simplest way to think about Stateful Functions is as a stateful, fault-tolerant distributed Actor system based on the Flink runtime.

The broad choice of APIs makes Flink the ideal option for stream processing, and you can mix different APIs over time as your requirements and use cases evolve.

No. 3: Convergence of Stream and Batch Processing 

Apache Flink unifies stream and batch processing, because its main APIs (SQL, Table API and DataStream API) support both bounded data sets and unbounded data streams. Specifically, you can run the same program in either batch- or stream-processing mode depending on the nature of the data that is being processed. You can even let the system choose the processing mode for you.

  • Only bounded data sources → Batch-Processing Mode
  • At least one unbounded data source → Stream-Processing Mode

Flink can bring together stream and batch processing within the same platform.

This unification of stream and batch processing offers tangible benefits for developers:

  • Consistent semantics across real-time and historical data-processing use cases.
  • Reuse code, logic and infrastructure between real-time and historical data-processing applications.
  • Combine historical and real-time data processing in a single application.

No. 4: It’s Ready for Prime Time

Flink is a mature platform that has been battle-tested in the most demanding production use cases. Features that demonstrate this include:

  • A metrics system that works out of the box with tools like Datadog and Prometheus, but which can also be integrated with custom solutions.
  • Comprehensive support for observability, troubleshooting and debugging via Flink’s Web UI. This includes support for back pressure monitoring, flame graphs and thread dumps.
  • Savepoints, which allow you to statefully scale (while preserving exactly-once semantics), upgrade, fork, backup and migrate your application over time.

Flink and Kafka: A Powerful Combination

Flink and Kafka are frequently used together; in fact, Kafka is Flink’s most popular connector. The two are highly compatible, and Kafka in many ways has driven Flink’s widespread adoption.

Note that Flink itself does not store any data; it operates on data stored elsewhere. You can think of Flink as a computation layer for Kafka, powering real-time applications and pipelines, while Kafka serves as the foundational storage layer for streaming data.

Within the data-streaming stack, Flink can manage computational needs while Kafka provides the storage layer.

Flink has become even more adept at supporting Kafka applications over time. It can employ Kafka as both a data source and a data sink, capitalizing on Kafka’s broad ecosystem and tools. Flink also supports popular data formats natively, including Avro, JSON and Protobuf.

Kafka is an equally good fit for Flink. Compared to other messaging systems like ActiveMQ, RabbitMQ or PubSub, Kafka enables persistent and indefinite data storage for Flink. Furthermore, Kafka allows multiple consumers to read streams concurrently and rewind them as necessary. The first attribute complements Flink’s distributed processing paradigm, while the second is crucial for Flink’s fault-tolerance mechanisms.

Eager to Learn More about Flink?

For a deeper dive, try your hand at the Flink 101 course at the Confluent Developer site or this Apache Flink training.

Group Created with Sketch.
TNS owner Insight Partners is an investor in: Pragma.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.