3 Reasons Why You Need Apache Flink for Stream Processing
Streaming data has become a must-have as organizations seek real-time insights to better serve customers, optimize operational efficiency and combat fraud. Confluent’s recent data-streaming report found that 72% of IT leaders surveyed were using data streaming for mission-critical applications, while 89% considered it an important investment.
Platforms like Apache Kafka solve some of the problems associated with streaming data from many origin points to one or more destinations. Kafka is used to build high-performance streaming-data pipelines and ensure that data gets where it needs to go, even amid network or system failures that can obstruct or delay delivery.
But Kafka alone is insufficient, especially when businesses need to alter or analyze data while it’s in motion. Imagine you’re looking for patterns in a stream of stock trade prices that show whether it’s time to buy or sell that stock, or conducting predictive maintenance on data streams from IoT devices. In cases like these, your stack needs both data streaming and stream processing, and that’s where Apache Flink is invaluable.
Apache Flink became an Apache top-level project in 2015, and it is widely used for mission-critical applications. For example, Uber uses Flink to match drivers with riders and calculate accurate estimated arrival times, while Netflix uses it to deliver personalized content recommendations to users. In each case, Flink allows these companies to quickly analyze streaming data at massive scale.
Flink’s rich capabilities help manage technical issues that can adversely affect the data stream, such as when data arrives at its destination out of order, arrives more than once, arrives late or surges from a big data spike. For all those situations, Flink is a valuable complement to Kafka that can save you from writing additional code to address those issues.
My goal with this article is to get you up to speed on the basics of Flink so that you can assess if Flink is right for your projects. For a deeper dive into the technology, check out this newly released Apache Flink 101 training course.
Here are three key reasons why Flink is such a valuable complement to Kafka for applications that require stream processing.
1. Apache Flink Tames Real-Time Data
Flink has a number of capabilities for dealing with issues that can arise in certain streaming situations. Here are some of the key technical benefits Flink provides:
- Support for unbounded and bounded data streams: Flink supports the notion of both unbounded (never-ending) and bounded (finite, or batch) data streams. Flink treats bounded streams as a special case of streaming: results are computed incrementally as the data is processed, and a number of optimizations make the pre- and post-processing of bounded workloads much more efficient. For unbounded data streams, Apache Flink provides state and context, such as time windows, making nonstop processing of those streams in real time possible.
- Event-time and processing-time semantics: Time is an important notion in stream-processing applications. Flink supports event-time processing of data based on timestamps in the data. In the event-time mode, data can be time-ordered and late-data handling can be defined. By contrast, processing time is based on the time that the data arrives at the Flink server, which can be useful in low-latency situations.
- Watermarks for late-data handling: For unbounded streaming data, Flink uses watermarks, markers that flow with the stream and declare that no events with earlier timestamps are expected. Watermarks let Flink decide when a time window is complete and its results can be emitted, and developers can define how to handle data that arrives late, after the watermark has already passed.
- Exactly-once guarantees: Flink manages the state of stream-processing applications and uses checkpointing to ensure that each event affects the managed state exactly once, even after recovering from system failures.
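The event-time and watermark ideas above can be sketched in a few lines of plain Python. This is not Flink code; the `tumbling_window_with_watermark` helper is a hypothetical illustration in which a watermark trails the largest timestamp seen, a window fires once the watermark passes its end, and anything arriving after its window has fired is collected as late data.

```python
from collections import defaultdict

def tumbling_window_with_watermark(events, window_size, max_out_of_orderness):
    """Assign (timestamp, value) events to tumbling event-time windows.

    The watermark trails the largest timestamp seen so far by
    max_out_of_orderness. A window fires (here: sums its values) once the
    watermark passes its end; events whose window has already fired are
    collected separately as late data.
    """
    windows = defaultdict(list)   # window start -> buffered values
    fired = {}                    # window start -> aggregated result
    late = []
    watermark = float("-inf")

    for ts, value in events:
        watermark = max(watermark, ts - max_out_of_orderness)
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            late.append((ts, value))          # window already fired
            continue
        windows[start].append(value)
        # fire every buffered window whose end the watermark has passed
        for s in sorted(windows):
            if s + window_size <= watermark:
                fired[s] = sum(windows.pop(s))
    # the input here is bounded, so fire any remaining windows at the end
    for s, vals in windows.items():
        fired[s] = sum(vals)
    return fired, late
```

Note how the out-of-order event with timestamp 2 is still counted in window [0, 10) as long as the watermark has not passed that window's end, which is exactly the trade-off `max_out_of_orderness` controls.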
2. Flink Enables Stream Processing at Massive Scale
Flink is known for being the stream processor of choice behind many of the world’s largest real-time systems because it can handle massive amounts of streaming data in real time. In addition to Uber and Netflix, Stripe uses Flink to process its payments, and Reddit employs Flink to keep its users safe from spam and harassment.
Some of the Flink features that enable stream processing at scale include:
- Parallel, distributed processing: Thousands of Flink workers (or TaskManagers) can run in parallel, coordinated by the Flink cluster’s JobManager, and be scaled up or down to handle data spikes or user-base growth in a cost-effective way.
- Fault tolerance: In a large-scale system, server and software issues can disrupt real-time processing. Flink ensures nonstop processing at scale via automatically (and asynchronously) generated checkpoints, which are snapshots of the current state of an application and the positions in its input stream(s). Upon restart after a system failure, Flink restores application state from the most recent checkpoint and rolls back to the correct position in the input stream before processing resumes. Flink continues processing as if the failure never occurred.
- System maintenance without data loss: Savepoints are checkpoints that system administrators create manually, usually before a system upgrade or a scheduled maintenance activity. Savepoints allow Flink to resume processing and catch up from the point where the savepoint was created.
- System monitoring: To help users operate real-time systems nonstop, at any scale, Apache Flink provides a complete set of metrics needed for monitoring and troubleshooting. It ships with metrics reporters that practitioners can use to integrate with an enterprise metrics solution like Prometheus.
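To make the checkpoint-and-restore idea concrete, here is a toy sketch in plain Python (the class and its simulated failure are illustrative, not Flink's actual API): a stateful word counter snapshots its state together with its position in the input stream, and after a failure it restores the snapshot and replays from that offset, so each record affects the state exactly once.

```python
import copy

class CheckpointedCounter:
    """Toy word counter that snapshots (state, input offset) and restores
    from the latest checkpoint after a simulated failure, mimicking how
    Flink rolls back to a consistent position in its input stream."""

    def __init__(self):
        self.state = {}            # word -> count
        self.offset = 0            # position in the input stream
        self.checkpoint = ({}, 0)  # (state snapshot, offset snapshot)

    def take_checkpoint(self):
        self.checkpoint = (copy.deepcopy(self.state), self.offset)

    def restore(self):
        state, offset = self.checkpoint
        self.state = copy.deepcopy(state)
        self.offset = offset

    def process(self, stream, checkpoint_every=3, fail_at=None):
        # resume from the (possibly restored) offset, so no record
        # is applied to the state twice
        while self.offset < len(stream):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated failure")
            word = stream[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.take_checkpoint()
```

A typical run would process a stream, hit the simulated failure, call `restore()`, and reprocess; the final counts match a failure-free run, which is the essence of the exactly-once guarantee.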
3. Versatility: From Streaming to Batch, and Simple to Complex
Flink has a large, growing collection of developer APIs that enable developers with different skill sets — Java, Python, Scala and SQL — to process streaming data in ways that support a vast variety of use cases. These include personalization, alerting, fraud prevention, process automation, anomaly detection and much more. Some of the developer APIs and libraries that can be accessed include:
- DataStream API: Used to create programs that consume unbounded or bounded (batch) data streams from sources such as Kafka, apply transformations to that data (filtering, updating state, defining windows, aggregating) and send the output to sinks such as a database, a file or another Kafka topic.
- Table API and Flink SQL: Used to treat data streams as tables so that relational (SQL) operations such as querying, joining and filtering can be applied to them. Flink SQL is arguably the most complete SQL implementation for streaming data, and it enables developers and even SQL-savvy analysts to transform, analyze and search for patterns in data streams.
- Complex event-processing (CEP) library: Used by Java and Scala developers to detect and act upon patterns in an endless stream of events. CEP patterns might include stock price trends in trading automation, system metrics in predictive maintenance or anomalies in fraud detection.
- Flink machine learning (ML) library: Used to build real-time ML models. With a comprehensive library of ML APIs and infrastructure, it simplifies how engineers build ML pipelines for both training and inference jobs with millisecond-level latency.
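As a rough illustration of the DataStream style, the source → filter → keyBy → aggregate → sink shape can be mimicked in plain Python (this is a conceptual sketch, not PyFlink; in real Flink these operators would be distributed across TaskManagers and run continuously):

```python
def run_pipeline(records):
    """Sketch of a DataStream-style pipeline over (user, amount) records:
    source -> filter -> key by user -> running sum -> sink."""
    sink = []
    running_totals = {}                       # keyed state: user -> total
    for user, amount in records:              # source: read each record
        if amount <= 0:                       # filter: drop invalid amounts
            continue
        total = running_totals.get(user, 0) + amount   # keyed aggregate
        running_totals[user] = total
        sink.append((user, total))            # sink: emit the updated total
    return sink
```

The same shape in actual Flink code would swap the list and dict for a streaming source, managed keyed state and a real sink, with Flink handling distribution, fault tolerance and backpressure.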
Not for the Faint of Heart
While Flink has many benefits, there are also a lot of elements to manage, including parallel and distributed processing, state management, host instance selection and provisioning, containerization (Kubernetes, etc.), configuration, tuning, storage, high availability, monitoring, integration with Kafka and more. Historically, companies that rely heavily on Flink often hire a team dedicated to Flink operations and management.
Using a managed Flink service provider is another way to remedy these challenges. Confluent recently announced an early-access program for our fully managed Flink service. We’re working with a select group of customers to determine the product roadmap and targeting a general release for the future. In the meantime, if you’d like to learn more about Apache Flink before getting started, check out its project website.