The most popular technology isn’t necessarily the best tool for the job at hand. That, at least, is the message Slim Baltagi, director of big data engineering at Capital One, has been bringing to conference-goers as he touts the benefits of Apache Flink.
Capital One developed more than 100 criteria to assess stream-processing tools. In a proof of concept, it found it could replace its proprietary system with Apache Kafka, Flink, Elasticsearch and Elasticsearch’s Kibana data visualization tool, a package that boosted performance and reduced resource consumption.
Though he concedes that neither Flink nor the more popular Spark will be the answer to every big data problem, he puts forth Flink as a technology worth a look for data analysis needs.
Apache Flink focuses on stream processing in real time with high throughput and low latency. Its creators cite ease of use, performance, real-time results and better memory management among its strengths.
Contrast this approach with that of Apache Spark Streaming, which, rather than being a pure stream-processing engine, uses a process it calls “micro-batching”: fast batch operations on small slices of incoming data collected over a set interval, which can add latency. Spark has since been adding more capabilities around real-time stream processing with Project Tungsten.
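The latency difference between the two models can be sketched in plain Java (this is a conceptual illustration, not the Flink or Spark API): a pure streaming engine handles each event on arrival, while a micro-batching engine holds events until the enclosing batch interval closes, so every event waits up to one full interval.

```java
/** Conceptual sketch, not Flink/Spark code: contrast per-event handling
 *  with micro-batching, where events wait for the batch interval to end. */
public class MicroBatchSketch {
    static final long BATCH_MS = 500; // hypothetical micro-batch interval

    public static void main(String[] args) {
        long[] arrivals = {10, 120, 480, 510, 990}; // event arrival times (ms)

        // Pure streaming: each event is processed as soon as it arrives.
        for (long t : arrivals) {
            System.out.printf("stream: event@%dms processed@%dms%n", t, t);
        }

        // Micro-batching: events are buffered and processed only when the
        // enclosing interval ends, adding up to BATCH_MS of latency.
        for (long t : arrivals) {
            long batchEnd = ((t / BATCH_MS) + 1) * BATCH_MS;
            System.out.printf("batch:  event@%dms processed@%dms (waited %dms)%n",
                              t, batchEnd, batchEnd - t);
        }
    }
}
```

An event arriving just after a batch boundary (here, at 510 ms) waits nearly the full 500 ms interval before it is processed, which is the latency cost the micro-batching approach trades for batch-style throughput.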
“Flink is basically a combination of features available in the open source world. It can handle streams of very high volume — in a 10-node cluster it was more than 15 million events per second — where [Apache Storm] stops at half a million,” said Kostas Tzoumas, CEO of data Artisans, the company behind Flink. Tzoumas said Flink streams offer full fault tolerance, no data loss or duplicates, and fully repeatable results.
Using the stream processor as the primary analytics engine, rather than stacking multiple layers, means you can simplify your stack, get real-time responses and reduce total cost of ownership, he said.
“If you look at traditional architecture, people have a continuous flow of data, and they ingest into a database or into Hadoop for analytics. It becomes increasingly hard to keep up with the volume,” Tzoumas said.
The company is building out the capability to query aggregates inside Flink before the data even lands in a database, so in many cases a database wouldn’t be required at all.
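The idea of serving queries from live aggregates can be sketched in a few lines of plain Java (a stand-in for illustration only, not Flink’s actual API): the processing loop keeps running totals in its own state, and reads are answered from that state rather than from a downstream database.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch, not Flink code: running aggregates are kept inside
 *  the stream processor's state, so queries read live state instead of a
 *  database populated after the fact. */
public class InStreamAggregates {
    private final Map<String, Long> countsByUser = new HashMap<>();

    /** Called once per incoming event; updates the running aggregate. */
    void onEvent(String user) {
        countsByUser.merge(user, 1L, Long::sum);
    }

    /** Served directly from live state; no database round trip. */
    long query(String user) {
        return countsByUser.getOrDefault(user, 0L);
    }

    public static void main(String[] args) {
        InStreamAggregates agg = new InStreamAggregates();
        for (String u : new String[]{"alice", "bob", "alice"}) agg.onEvent(u);
        System.out.println(agg.query("alice")); // prints 2
    }
}
```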
“What this means for data management is this transition from a batch-based approach to this streaming approach. There is very clearly a need for a compute platform in this space, for a stream processor that can handle these loads with low latency with a good match for this architecture,” he said.
It’s a good match for any high-volume data stream, such as those generated by telcos, the financial sector and mobile gaming, and especially for Internet of Things applications. One use for streaming data is keeping track of a set of microservices: instead of feeding all state back to a database, capture all the events in a single processing stream.
“Typical workflows built with Hadoop and its tools have a continuous ingestion pipeline that feeds the file system, then schedule Hive jobs or Spark jobs to access those files, then pretend you have continuous output,” he said.
And with Storm, for instance, you’re gluing batch jobs together with a stream processor for real-time results, though those results have to be corrected down the line by a batch processor, he said.
Flink’s focus is on being a 24/7 stream processor with a simple-to-use API for building applications that require accuracy from the stream processor itself.
Based on Database Tech
Flink, German for “quick” or “nimble,” grew out of Berlin’s Technical University in 2009 under the name Stratosphere. It originated as a research project aimed at combining the best technologies of MapReduce-based systems and parallel database systems, then turned its focus to data streaming.
Stephan Ewen and Kostas Tzoumas founded the commercial company data Artisans around the open source project in 2014, the same year the project joined the Apache Incubator and was renamed Flink.
Flink uses a JVM-based data processing system that operates on serialized binary data, as do other Apache projects including Drill, Ignite and Geode. It boasts highly efficient garbage collection as a means to boost performance; a highly efficient data serialization/deserialization stack that facilitates operations on binary data and lets more data fit into memory; and DBMS-style operators that work natively on binary data, yielding high performance in memory, with overflow spilled to disk as necessary.
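The benefit of operating natively on binary data can be illustrated with a small plain-Java sketch (this is not Flink’s internal serializer, just the general technique): records are serialized into fixed-width byte arrays, and an operator such as sort compares the raw bytes, so it never has to deserialize objects onto the heap or create garbage.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Conceptual sketch of binary-data operators, not Flink internals:
 *  records are fixed-width byte arrays, sorted by comparing raw bytes. */
public class BinarySortSketch {
    /** Serialize a (key, value) record as two big-endian ints. */
    static byte[] serialize(int key, int value) {
        return ByteBuffer.allocate(8).putInt(key).putInt(value).array();
    }

    public static void main(String[] args) {
        byte[][] records = { serialize(42, 7), serialize(3, 1), serialize(17, 9) };

        // For non-negative keys, big-endian encoding means unsigned
        // lexicographic byte comparison orders records by key without
        // deserializing anything.
        Arrays.sort(records, Arrays::compareUnsigned);

        for (byte[] r : records) {
            ByteBuffer b = ByteBuffer.wrap(r);
            System.out.println(b.getInt() + " -> " + b.getInt());
        }
    }
}
```

Because the sort touches only byte arrays, there are no per-record objects for the garbage collector to track, which is the kind of GC pressure these binary-operator designs are meant to avoid.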
Flink can process only the parts of the data that have actually changed, significantly speeding up a job. Flink users also can run Storm topologies on it, easing the transition between the two.
Version 1.0 has APIs in Java and Scala, plus a beta API for Python. Flink runs on YARN and HDFS and has a Hadoop compatibility package. It uses RocksDB, an open source key-value store developed by Facebook, for any state data that needs to be stored.
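The role RocksDB plays can be sketched with a tiny key-value interface in plain Java (the `StateStore` interface and `InMemoryStore` class here are hypothetical stand-ins, not Flink or RocksDB APIs): per-key state lives behind a get/put store rather than in plain JVM objects, which is what lets Flink keep state that outgrows the heap.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a RocksDB-style state backend. */
interface StateStore {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
}

/** Heap-backed implementation; a real backend would write to disk. */
class InMemoryStore implements StateStore {
    private final Map<String, byte[]> map = new HashMap<>();
    public byte[] get(byte[] k) { return map.get(new String(k)); }
    public void put(byte[] k, byte[] v) { map.put(new String(k), v); }
}

public class KeyedStateSketch {
    public static void main(String[] args) {
        StateStore store = new InMemoryStore();
        // Maintain a per-user event count through the key-value store.
        for (String user : new String[]{"alice", "bob", "alice"}) {
            byte[] key = user.getBytes();
            byte[] old = store.get(key);
            long count = (old == null) ? 0 : Long.parseLong(new String(old));
            store.put(key, Long.toString(count + 1).getBytes());
        }
        System.out.println(new String(store.get("alice".getBytes()))); // prints 2
    }
}
```

Because all state access goes through the store interface, the heap-backed implementation can be swapped for a disk-backed one without touching the processing logic, which is the design choice a pluggable state backend like RocksDB enables.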
Going forward, the company is building out SQL capabilities, writing more libraries, improving connectors for Kafka, HDFS and databases such as Cassandra, and building out support for Mesos.
Data Artisans, an 11-person company based in Berlin with one person in San Francisco, has been focused on expanding the project’s community rather than monetizing it, Tzoumas said. Flink has more than 160 contributors.
Flink became a top-level Apache project in January 2015, and version 1.0 was released last month, a milestone that should help ease doubts about its production-readiness. The company also recently raised €5.5 million (US $6 million) in a Series A round led by Intel Capital.
Though it trails more established projects such as Spark and Storm in adoption, Intel’s Ron Kasabian, vice president in the Data Center Group and general manager of Big Data Solutions, called it “one of the most advanced and transformative stream processing systems available in the open source community.”
Capital One and Intel are sponsors of The New Stack.
Feature Image: “200+ mph jet stream screams over Pacific Northwest” by Peter Stevens, licensed under CC BY-SA 2.0.