
Apache Gets Yet Another Stream Processing Engine with Apex

4 May 2016 9:52am, by

The recent promotion of DataTorrent’s Apex to an Apache Software Foundation top-level project gives the foundation yet another open source engine for real-time data stream processing, alongside the likes of Storm, Spark, Samza and Flink.

These projects have grown out of the heavy real-time data analysis needs of companies such as Twitter (Storm) and LinkedIn (Samza). Each has its own approach, and Apex is no different. The technology came from the needs of DataTorrent — a company founded in 2012 by Yahoo alums — to build a stream-processing engine to run natively on Hadoop, utilizing the then recently-released Hadoop resource management platform YARN.

DataTorrent’s focus was on building a platform that was easy to deploy, requiring nothing beyond everyday IT expertise, and that could easily integrate with an existing IT infrastructure and ensure ease of migration. DataTorrent donated the technology to the Apache incubator in August 2015.

Real-time data processing is a hot technology area. IBM’s big bet on Spark announced last summer certainly added to its momentum, and Storm remains better known as well, but Apex is making a name for itself with its speed and ease of use.

GE’s Predix IoT cloud platform uses Apex for industrial data and analytics, as does Capital One for real-time decisions and fraud detection.

Apex and Flink are conceptually similar, according to Thomas Weise, DataTorrent’s vice president of Apex (not to be confused with the serverless computing framework of the same name). Both Apex and Flink can do batch processing, but both are more focused on streaming. And though he concedes it might make sense to merge the two projects, he doesn’t see that happening, primarily because of their different roots.

He says Apex is a bit more advanced than Flink because it’s been used more in production. “It’s been hardened a bit more,” he said.


Though better known, Spark and Storm are considered difficult to implement and operate.

Storm offers a streaming architecture, but it is not stateful and falls short on processing semantics and efficiency. Spark Streaming (Spark itself is much broader) is built on a batch architecture and cannot serve low-latency use cases; the user also has to write extra code for state management. Apex and Flink can do low-latency processing, and don’t pay a latency overhead for having to schedule batches repeatedly.

Storm has limited fault tolerance, Weise charged, and he called Spark and Storm less reliable and less efficient for streaming. At the same time, he touted Apex’s streaming capabilities, fault tolerance, processing semantics, partitioning operators and more.

“We have partitioning schemes that other projects don’t offer, such as building parallel pipes that can do redundant processing. Then if you have a failure, they can reset independently without resetting the topology. So you can do low-latency processing with an SLA on that,” he said.

Apex’s application development is based on a Directed Acyclic Graph (DAG) model, made up of streams (data tuples) and operators.
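To make the model concrete, here is a minimal, self-contained sketch of the streams-and-operators idea in plain Java. The tiny `Operator` interface and the wiring below are illustrative stand-ins, not the real Apex API (which builds a `DAG` of operators connected through ports): tuples flow from a source through a filter operator into a transform operator, then into a sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only: this Operator interface is a stand-in for the
// DAG-of-operators idea, not the actual Apache Apex API.
public class DagSketch {

    // An operator consumes an input tuple and may emit tuples downstream.
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> emit);
    }

    public static List<Integer> run(List<Integer> input) {
        // Two operators expressing plain-Java business logic.
        Operator<Integer, Integer> evensOnly = (t, emit) -> {
            if (t % 2 == 0) emit.accept(t);   // drop odd tuples
        };
        Operator<Integer, Integer> doubler = (t, emit) -> emit.accept(t * 2);

        List<Integer> sink = new ArrayList<>();
        // Wire the graph: source tuples -> evensOnly -> doubler -> sink.
        for (Integer tuple : input) {
            evensOnly.process(tuple, filtered ->
                doubler.process(filtered, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4))); // prints [4, 8]
    }
}
```

In the real engine each operator would run partitioned across YARN containers, with the platform handling checkpointing and recovery; the sketch only shows the dataflow shape.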

Users can also change things at runtime without taking the application down. “Ease of use is something we’ve paid attention to from the beginning,” Weise said. The company “put a lot of thought not only in the building of applications but what it takes to put it into production.”

Also easing the deployment is how Apex enables developers to write or re-use generic Java code to express any business logic, minimizing the specialized expertise needed to write Big Data applications.

“We feel it’s good to be Java-based, because there’s a huge pool of talent out there, much more than for Scala or other languages. You also have a huge ecosystem of libraries in the Java space,” he said.

Apex’s Malhar library includes reusable operators (functional building blocks) for popular file transfer protocols, messaging systems and databases, including FTP, NFS, JMS, Kafka, RabbitMQ, and NoSQL databases such as Cassandra, HBase and MongoDB.

In production, Apex can rapidly redistribute work from nodes that malfunction, while automatically recognizing new ones.

“When it’s running, it’s just crunching data; there’s no scheduling, no interruption, just performing computation,” Weise said. “The only time something changes is if there’s a failure: there will be recovery, we get replacement resources and restore the state. The user can also write code to change partitions at runtime, using metrics, say when the latency has become too high or when you want to add resources or reduce resources.”

Going forward, the community plans to add support for (anti-)affinity functions. “That’s where you can say, ‘These things cannot run on the same machine, or they absolutely must run on the same machine,’” Weise explained.

It also plans to add iterative processing, where a result of a computation can be channeled back in later as an input, forming a loop in the processing graph. Integration with Apache Samoa, a distributed machine learning library, is in the works, as is simplified configuration for users on things like supporting encryption natively.
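The iterative-processing idea can be sketched in a few lines of plain Java (again a stand-in, not the planned Apex API): an operator's output is fed back in as its next input, forming a loop edge in the graph, until some termination condition is met.

```java
import java.util.function.UnaryOperator;

// Illustrative sketch only: models a loop in a processing graph, where the
// result of a computation is channeled back in as the next input.
public class IterativeSketch {

    // One pass through the loop: a simple halving step as placeholder logic.
    static final UnaryOperator<Integer> step = x -> x / 2;

    // Feed each result back as input until it falls below the threshold.
    public static int iterate(int input, int threshold) {
        int value = input;
        while (value >= threshold) {
            value = step.apply(value); // loop edge: output becomes input
        }
        return value;
    }

    public static void main(String[] args) {
        // 100 -> 50 -> 25 -> 12 -> 6, then 6 < 10 terminates the loop.
        System.out.println(iterate(100, 10)); // prints 6
    }
}
```

In a streaming engine the interesting part is doing this without stopping the pipeline, which is why loop support requires explicit engine-level work rather than just user code.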

“We will also look at supporting Mesos,” Weise said. “We haven’t done that yet due to lack of demand, but people have started asking for it.”
