A Developer’s Guide to Getting Started with Apache Flink
Demand for Apache Flink is on the rise. Learning Flink could be the stepping stone to career advancement that you’re looking for, but are you ready?
Flink’s APIs are easier to understand if you have an appreciation for the problems its designers were trying to solve, which revolve around managing state for streaming applications in a way that is highly performant, fault-tolerant and scalable. If you are a seasoned developer with wide-ranging experience, perhaps you can just jump in and figure things out as you go. However, the success of a “jump in the deep end and sink or swim” strategy will vary depending on your background.
What Is Flink?
According to the Apache Flink project website:
“Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.”
Stream processing is a natural way to build software that reacts to the flow of events that underlie modern businesses, such as events providing information about orders, shipments or payments. When Flink operates in its streaming mode, each event is processed immediately instead of being added to a batch for later processing. This enables businesses to react immediately to incoming events.
However, Flink isn’t limited to doing stream processing. It treats batch processing as a special case of streaming where the streams are “bounded.” Flink’s ability to cover batch and streaming use cases with the same framework can be very useful.
Flink cannot control the arrival rate and order of incoming streaming data and must be prepared to handle whatever the data sources provide. This is in sharp contrast to batch processing, where the bounded streams to be processed are fully available before processing begins. This one-event-at-a-time execution model is highly advantageous for some use cases, but it does introduce challenges. Many of these challenges stem from the fact that the Flink framework and your Flink applications should always be running. Outages must be minimized to make the services powered by Flink available.
Rapid recovery from failures would be more straightforward if Flink applications were always stateless, but that would be far too restrictive. While some simple applications don’t need to store anything related to past events, state is needed for many use cases, such as applications that compute summary statistics (e.g., counting events or computing rates such as events per second) or that look for trends in the data. Fraud detection and anomaly detection are good examples of applications that need to process incoming events in a context established by earlier events.
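To make this concrete, here is a toy illustration in plain Python (not Flink code) of the kind of per-key state a streaming counter has to maintain. The event shape and field names are made up; the point is that the running counts are exactly the state Flink would need to checkpoint and restore for fault tolerance.

```python
from collections import defaultdict

# Per-key state: how many events each user has produced so far.
# In a real Flink job this state is managed, checkpointed and
# restored by the framework rather than held in a local dict.
counts = defaultdict(int)

def process_event(event):
    """Update per-key state for one incoming event."""
    counts[event["user"]] += 1
    return event["user"], counts[event["user"]]

events = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
results = [process_event(e) for e in events]
# counts is now {"alice": 2, "bob": 1}
```

Losing this dict on a crash would mean losing every count accumulated so far, which is why Flink invests so heavily in fault-tolerant state management.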
Flink’s role is to process data streams, not store them. A typical Flink application consumes streams of data from a scalable stream storage service such as Apache Kafka and publishes the results wherever they are needed. Flink applications run in a distributed cluster, with a Job Manager supervising the work and one or more Task Managers running the application code.
Flink’s cluster architecture is designed to let each machine operate as independently as possible, because whenever a distributed system requires coordination or synchronization, it introduces bottlenecks that hurt performance.
Adapting to Flink’s strong decoupling can be difficult, and it may be tempting to circumvent it by creating a backdoor that shares state between components. Don’t do this. You are likely to introduce subtle bugs that show up only occasionally and are very difficult to diagnose. Instead, learn how to compose solutions that stick to Flink’s design philosophy.
For a deeper look at the ideas behind Flink’s design, these books are worth reading:
- Designing Data-Intensive Applications by Martin Kleppmann
- Stream Processing with Apache Flink: Fundamentals, Implementation and Operation of Streaming Applications by Fabian Hueske and Vasiliki Kalavri
Writing Flink Applications
When you write a Flink application, you are implementing a series of steps in a data processing pipeline. Events stream through this pipeline and are transformed, filtered, combined and enriched at each step. Typical deployments will have many instances of the processing pipeline operating in parallel, each handling some subset of the data. You may be wondering, “How many parallel instances?” Anywhere from a handful to a few hundred instances is common. Some large Flink deployments span thousands of compute nodes and can handle workloads measured in billions of events per second. But don’t worry, everyone has to start somewhere, and it’s okay to start small.
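The shape of such a pipeline can be sketched in a few lines of plain Python (a single-process stand-in, not Flink code; the event fields are invented for illustration). Flink runs many copies of a chain like this in parallel, each on its own slice of the stream.

```python
# A toy pipeline: events flow through a chain of steps where they
# are filtered and then transformed/enriched, one event at a time.
events = [
    {"type": "order", "amount": 12},
    {"type": "click"},
    {"type": "order", "amount": 30},
]

filtered = (e for e in events if e["type"] == "order")                    # filter step
enriched = ({**e, "amount_cents": e["amount"] * 100} for e in filtered)  # enrich step
out = list(enriched)
# out contains the two order events, each with an added amount_cents field
```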
Flink has a few different APIs. In most cases, the best place to get started is with Flink SQL or the Table API. These APIs are more or less equally powerful; the difference comes down to whether you will be expressing your stream-processing logic as SQL statements or as programs written in Java or Python. Either way, you’ll need to understand the concepts from SQL’s relational algebra — projections, filters, joins and aggregations — as they form the foundation of both APIs.
Flink supports SQL, Java and Python. It’s not necessary to have deep expertise in any of these languages before getting started, but it’s helpful to keep these considerations in mind:
- SQL: At a minimum, you should understand SELECT, WHERE, JOIN and GROUP BY.
- Java: You’ll want to have a good grasp of the Java language and its ecosystem. It’s also possible to use another JVM language such as Scala or Kotlin to develop Flink applications using Flink’s Java APIs.
- Python: The PyFlink Table API makes it easy to get started with Flink using Python.
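If your SQL is rusty, the constructs listed above can be exercised against an in-memory SQLite database from Python’s standard library. The tables and data below are made up for illustration; Flink SQL applies the same relational ideas to streams.

```python
import sqlite3

# Build a tiny relational dataset in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 5.0);
""")

# SELECT + WHERE + JOIN + GROUP BY in one query:
# total order value per customer, keeping totals above 10.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.amount > 0
    GROUP BY c.name
    HAVING SUM(o.amount) > 10
""").fetchall()
# rows == [('alice', 65.0)]
```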
Some resources for getting up to speed with these languages:
- Learn SQL at Codecademy
- Effective Java by Joshua Bloch
- Head First Java: A Brain-Friendly Guide by Sierra, Bates and Gee
Java Development Tools
When working in Java, you’ll need to use a build tool such as Maven or Gradle to configure and manage your Flink projects and their dependencies. Both of these build tools are popular in the Flink community, and you’ll find quickstarts in the Flink documentation. If you are wondering which to choose, Flink is built using Maven, and you’ll find plenty of examples that use it. Gradle feels like a more modern and powerful approach, but the learning curve is steeper.
If you haven’t worked with a strongly typed programming language (like Java) before, you may not appreciate how much value an IDE like IntelliJ or Visual Studio Code can provide. If you’re not sure where to start, the Community Edition of IntelliJ IDEA is free, has good documentation and is widely used in the Flink community, so you’ll have an easy time getting answers to any questions you may have.
Streaming Data: Connectors, Formats, Schemas and Governance
To get data in and out of Flink, you need to connect your Flink applications to the systems and services your organization uses to produce and consume data. Some systems, such as Apache Kafka, naturally produce and consume streams of data and are ideally suited for use as a stream storage solution for Flink applications.
The data Flink processes will be encoded in a format such as CSV, JSON, Avro or Protobuf. Ideally, the data fields and their types will be described with schemas that can verify that all of the data is valid. As your applications evolve, you’ll need to make changes to these data schemas by adding and removing fields and changing their types, so you’ll want to choose a format that supports schema evolution.
Avro and Protobuf are good choices, as Flink’s SQL and Table APIs have built-in support for these formats and both support schema evolution. By contrast, while CSV and JSON may be convenient for some use cases, Flink does not currently provide support for defining or enforcing schemas for these formats.
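The core idea behind schema evolution can be sketched in plain Python (this is not Avro; the field names and the added `currency` field are invented for illustration). A reader with a newer schema fills in defaults for fields that older records do not carry, which is essentially what Avro formalizes with writer and reader schemas and per-field defaults.

```python
import json

# Defaults for fields added in a newer version of the schema.
READER_SCHEMA_DEFAULTS = {"currency": "USD"}  # field added in schema v2

def decode(raw: bytes) -> dict:
    """Decode a record, applying defaults for fields older writers omit."""
    record = json.loads(raw)
    for field, default in READER_SCHEMA_DEFAULTS.items():
        record.setdefault(field, default)
    return record

old_record = b'{"order_id": 7, "amount": 12.5}'                    # written with schema v1
new_record = b'{"order_id": 8, "amount": 3.0, "currency": "EUR"}'  # written with schema v2

assert decode(old_record)["currency"] == "USD"  # default filled in
assert decode(new_record)["currency"] == "EUR"  # value preserved
```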
Related courses and resources:
- Kafka 101
- Data Governance for Real-Time Data Streams
- Mastering Production Data Streaming Systems with Apache Kafka
Databases and Streaming
Many organizations have data in transactional databases that they want to process with Flink. This might involve shifting data into a data lake, storing features in a feature store for machine learning or enriching streaming data with reference data stored in the database.
One way to stream data from a database into Flink is to use Kafka Connect with the JDBC Source connector to first get the data into Kafka and then stream it into Flink. Another powerful technique is to connect Flink to your database’s Change Data Capture stream using Debezium.
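To give a feel for what CDC data looks like, here is a simplified Debezium-style change event and a sketch of how a consumer might apply it. Real Debezium events carry additional fields (such as source metadata and timestamps); the payload below is trimmed for illustration.

```python
import json

# A Debezium change event wraps each row change in an envelope with the
# row state before and after the change, plus an operation code:
# "c" = create, "u" = update, "d" = delete.
event = json.loads("""
{
  "op": "u",
  "before": {"id": 42, "status": "pending"},
  "after":  {"id": 42, "status": "shipped"}
}
""")

def apply_change(table: dict, evt: dict) -> None:
    """Maintain a local copy of the table from a stream of change events."""
    if evt["op"] in ("c", "u"):
        row = evt["after"]
        table[row["id"]] = row
    elif evt["op"] == "d":
        table.pop(evt["before"]["id"], None)

table = {42: {"id": 42, "status": "pending"}}
apply_change(table, event)
# table[42]["status"] is now "shipped"
```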
Before putting your first Flink application into production, be sure to read the Production Readiness Checklist in the documentation. This will help you avoid some of the most common mistakes.
The engineering team at Lyft has published a series of articles on the gotchas of stream processing. The first article, covering data skewness, is a good place to start. Data skew happens when messages about a few entities — e.g., users, products or devices — dominate the event traffic. This is a problem affecting the efficiency and throughput of many Flink applications.
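A toy simulation makes the problem easy to see: events are routed to parallel instances by hashing their key, so when one key dominates the traffic, one instance ends up doing nearly all the work. The key names, partitioner and traffic mix below are invented for illustration.

```python
from collections import Counter

PARALLELISM = 4

def partition(key: str) -> int:
    # Stable stand-in for a hash partitioner (Python's built-in hash()
    # is salted per process, so we avoid it here).
    return sum(key.encode()) % PARALLELISM

# 90 events from one hot key, 10 spread across five quiet keys.
events = ["hot_user"] * 90 + ["u1", "u2", "u3", "u4", "u5"] * 2

load = Counter(partition(k) for k in events)
# One partition receives over 90% of the events; the others sit nearly idle.
```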
Deploying and Troubleshooting Flink
Flink is complex to operate at scale. For most organizations, I recommend using a cloud-based managed service rather than trying to do this yourself. However, if you prefer to do it yourself, take a look at the Flink Kubernetes Operator. For an introduction to metrics, monitoring and alerting, Monitoring Apache Flink Applications 101 is a good starting point.
Flink relies on watermarks to track the progress of event time, so it knows when it has seen enough of the stream to safely trigger time-based actions such as closing a window. If watermarking is not set up correctly, streaming applications can either produce no results at all or silently ignore some of their input. A good resource for learning about watermarks is Event Time and Watermarks, which will help you avoid a frustrating debugging experience.
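The core idea can be simulated in a few lines of plain Python. This sketches bounded-out-of-orderness watermarking, where the watermark trails the largest event time seen so far by a fixed delay; the timestamps and the two-unit bound are made up for illustration.

```python
MAX_OUT_OF_ORDERNESS = 2  # assumed bound on how late events can be

def watermarks(event_times):
    """Yield (event_time, watermark) pairs as events arrive."""
    max_seen = float("-inf")
    for t in event_times:
        max_seen = max(max_seen, t)
        yield t, max_seen - MAX_OUT_OF_ORDERNESS

WINDOW_END = 10
fired_at = None
for t, wm in watermarks([1, 4, 3, 9, 8, 12]):
    if fired_at is None and wm >= WINDOW_END:
        # The window [0, 10) can now fire: no event with a timestamp
        # below 10 is expected to arrive anymore.
        fired_at = t
# The watermark only reaches 10 once the event with timestamp 12 arrives.
```

Note how events 3 and 8 arrive out of order but are still accounted for, because the watermark deliberately lags behind the newest timestamp.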
Flink can’t always prevent you from writing applications that will misbehave and perform badly, but with experience, it’s easier to anticipate and prevent these problems. Start by learning about which operations require keeping state and roughly how much state is involved. See Stateful Stream Processing with Flink SQL for an introduction to this topic.
Back pressure is a common problem in streaming applications. This occurs when one stage of your processing pipeline produces data faster than the downstream stages can consume it. Flink’s documentation describes how to go about Monitoring Back Pressure, which will help you identify the cause.
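A toy model of the mechanism, using a bounded queue between a producer and a consumer thread (plain Python, not how Flink implements it): when the buffer is full, the producer blocks, which slows it down to the consumer's rate instead of buffering without limit.

```python
import queue
import threading

buffer = queue.Queue(maxsize=5)  # bounded buffer between pipeline stages
consumed = []

def producer():
    for i in range(20):
        buffer.put(i)  # blocks whenever the buffer is full: back pressure
    buffer.put(None)   # sentinel: no more data

def consumer():
    while (item := buffer.get()) is not None:
        consumed.append(item)  # imagine this step being slow

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
# All 20 items arrive in order, but at most 5 were ever in flight at once.
```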
When Flink applications perform poorly, inefficient serialization is a common culprit. If you stick to Flink’s SQL and Table APIs, this shouldn’t be an issue. However, if you do run into performance problems, you can use Flink’s built-in Flame Graphs to diagnose them.
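A rough illustration of why the serialization format matters (plain Python, not Flink's serializers; the record is made up): the same record encoded as self-describing JSON versus a fixed binary layout. Flink's type-specific serializers exploit known field types in a similar spirit, while falling back to generic serialization costs both CPU and space.

```python
import json
import struct

record = {"id": 123456, "amount": 42.5}

# Self-describing text encoding: field names repeated in every record.
json_bytes = json.dumps(record).encode("utf-8")

# Fixed binary layout: one int64 and one float64, 16 bytes total.
binary = struct.pack("<qd", record["id"], record["amount"])

# The fixed layout is a fraction of the size and far cheaper to decode.
```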
Learning Apache Flink can be very rewarding. The framework is powerful and can be used for a wide variety of batch and streaming use cases. Understanding how Flink works will also introduce you to a new way of thinking about how to build computing systems, which will feel like you’ve gained a new superpower.
Once you feel you’re ready to dive into Flink, Apache Flink 101 is a great place to start.