Apache Flink for Real-Time Data Analysis

In this latest podcast from The New Stack, we explore Apache Flink, a platform for running both batch and real-time streaming data analysis jobs. This recording is the first in a three-part series about a new managed service, built on Flink, that Amazon Web Services is unveiling. Subsequent episodes will focus on the managed service itself and on the customer experience.
Joining the podcast are Danny Cranmer, a principal engineer at Amazon Web Services and an Apache Flink PMC member and committer, and Hong Teoh, a software development engineer at AWS.
Flink is a high-level framework that can be used to define data analytics jobs. It supports bounded (“batch”) and unbounded (“streaming”) data sets. It provides a set of APIs against which you can build analysis jobs using Java, Python, SQL and other languages. In addition to the framework, there is an engine to run the jobs. Jobs run in a distributed manner with fault tolerance and horizontal scaling capabilities.
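To give a feel for what that looks like, here is a minimal sketch of a job written against the Java DataStream API (the class name, job name and sample data are purely illustrative). The same pipeline definition can run over bounded or unbounded data; only the runtime execution mode changes.

```java
// A minimal sketch of a Flink job using the Java DataStream API.
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordLengthJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING is the default for unbounded sources; switch to
        // RuntimeExecutionMode.BATCH to run the same pipeline over a bounded data set.
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

        env.fromElements("flink", "runs", "batch", "and", "streaming")
           .map(word -> word + ":" + word.length())   // transform each record as it arrives
           .print();                                  // real jobs write to a sink instead

        env.execute("word-length-example");
    }
}
```

When submitted to a Flink cluster, a job like this is split into parallel tasks that the engine distributes across nodes, scales out and restarts on failure.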
Extract-Transform-Load (ETL) is one use case, in which raw data is gathered and formatted for a particular workload. Flink suits this job when the results are needed immediately: it is built for unbounded data streams and can apply transforms with very low latency.
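As a rough illustration of such a low-latency transform, the sketch below (the source, field layout and output format are hypothetical) validates and reshapes each raw record the moment it arrives rather than waiting for a batch run.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingEtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; a real pipeline would read from Kafka, Kinesis, etc.
        DataStream<String> rawEvents = env.socketTextStream("localhost", 9999);

        rawEvents
            .filter(line -> line.split(",").length == 3)    // drop malformed rows
            .map(line -> {
                String[] f = line.split(",");               // e.g. "alice, EUR, 42.50"
                return "user=" + f[0].trim() + " amount=" + f[2].trim();
            })
            .print();                                       // real jobs write to a sink instead

        env.execute("streaming-etl-sketch");
    }
}
```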
“So the moment your data is generated, you will process it and output it,” Teoh said. “Let’s say you wanted to take in your data in real-time, you want to kind of analyze what’s happened, generate some insights, maybe it’s business data, or sports data, right? And you want to display that on a dashboard. Flink is very good at that because you can analyze things immediately on the fly.”
Event-driven applications rely on Flink as well. In this scenario, an event, such as a user looking up the weather, triggers an immediate reaction, such as the weather data being served up.
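A minimal sketch of that pattern might look like the following, where lookupWeather is a hypothetical stand-in for a call to a real weather service.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WeatherLookupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in event stream of city names; a production job would consume
        // request events from a message queue or similar source.
        DataStream<String> requests = env.socketTextStream("localhost", 9999);

        // Each event triggers an immediate reaction: look up and emit the weather.
        requests
            .map(city -> city + " -> " + lookupWeather(city))
            .print();

        env.execute("weather-lookup-sketch");
    }

    // Hypothetical stand-in for a call to a real weather service or state store.
    private static String lookupWeather(String city) {
        return "20°C, clear";
    }
}
```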
Streaming data, unlike batch data, is constantly being updated with new values. As a result, it needs to be processed incrementally, and the results are delivered incrementally. Flink can run both batch processing and streaming with the same SQL commands.
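Flink's Table API makes this concrete: the same SQL statement can be executed against bounded or unbounded data, with only the environment's mode changing. The sketch below uses the built-in datagen connector and a hypothetical orders table purely for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SameSqlBothModes {
    public static void main(String[] args) {
        // Flip between inStreamingMode() and inBatchMode(); the SQL below stays the same.
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 'orders' is a hypothetical table; a real job would declare it with
        // CREATE TABLE and a connector such as Kafka or a filesystem.
        tEnv.executeSql(
            "CREATE TEMPORARY TABLE orders (customer STRING, amount DOUBLE) " +
            "WITH ('connector' = 'datagen', 'number-of-rows' = '10')");

        // Identical query for bounded (batch) and unbounded (streaming) input.
        tEnv.executeSql(
            "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer")
            .print();
    }
}
```

In streaming mode the aggregation is emitted incrementally as updates arrive; in batch mode the same query produces a single final result set.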
Like any good real-time data processing system, Flink guarantees exactly-once processing, or the ability to avoid duplicates, Cranmer explained. As with any distributed system, a transaction may be processed separately by two different nodes; in fields such as banking, this fault could lead to duplicate transactions, which is clearly not a good idea. Flink also periodically checkpoints a job so that if any node fails, it automatically returns to the last known good state.
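In a job definition, that behavior is typically enabled with a single configuration call; the checkpoint interval below is illustrative.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 60 seconds (interval chosen for illustration)
        // with exactly-once guarantees; on failure the job rewinds to the last
        // successful checkpoint instead of dropping or double-counting records.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3).map(n -> n * 2).print();
        env.execute("exactly-once-example");
    }
}
```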
In addition to the Flink architecture, we also discuss AWS's role in maintaining the open source project and the future of Flink.