
Apache Bahir Gives Spark Extensions a New Home

12 Jul 2016 8:33am, by

An offshoot from Apache Spark, Apache Bahir is a new top-level project from the Apache Software Foundation designed to curate extensions and plugins related to distributed analytic platforms.

“The Apache Spark project is more focused on the runtime and making sure the platform is very solid, very robust. I think growing a bunch of extensions and having to maintain those extensions might become a burden on the members,” said Luciano Resende, vice president of Apache Bahir and an architect at IBM.

“Making it a separate project provides more flexibility and gives each one an opportunity to focus on what its members consider most important,” Resende said.

So far the nascent project includes four extensions:

  • streaming-akka, a connector for Akka, an open source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the Java Virtual Machine. Akka, which emphasizes actor-based concurrency, has been used internally by both Spark and the stream processor Flink.
  • streaming-mqtt, a connector for MQTT, a lightweight messaging protocol for small sensors and mobile devices, optimized for high-latency or unreliable networks. MQTT is used for remote monitoring and control, largely in the IoT market, as a higher-performance alternative to the WebSocket protocol.
  • streaming-twitter, which enables the processing of social data from Twitter. An example displays the most positive hashtags by joining the streaming Twitter data with a static RDD (Resilient Distributed Dataset) of the AFINN word list for sentiment analysis.
  • streaming-zeromq, a connector for ZeroMQ, a high-performance asynchronous messaging library aimed at distributed or concurrent applications. ZeroMQ provides a socket interface that lets users build scalable messaging systems quickly.
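
The hashtag example mentioned for streaming-twitter can be sketched roughly as follows. This is a plain-Python stand-in for the idea, not the Bahir or Spark API: the word list is a made-up excerpt in the style of AFINN, the tweets are invented, and the function names `hashtag_scores` and `most_positive` are hypothetical. Each tweet's words are looked up in the static word list (the "join"), and the resulting sentiment is credited to every hashtag in that tweet.

```python
from collections import defaultdict

# Hypothetical excerpt in the style of the AFINN list: word -> sentiment score.
AFINN_SAMPLE = {"love": 3, "great": 3, "happy": 3, "bad": -3, "hate": -3, "awful": -3}

# Invented tweets standing in for one micro-batch of the stream.
TWEETS = [
    "I love the new release #spark",
    "great talk, happy crowd #spark",
    "awful traffic, I hate mondays #commute",
    "bad coffee #commute",
    "#bahir looks great",
]


def hashtag_scores(tweets, word_scores):
    """Sum the sentiment of each tweet's words and credit it to the tweet's hashtags."""
    totals = defaultdict(int)
    for text in tweets:
        tokens = text.lower().split()
        hashtags = [t for t in tokens if t.startswith("#") and len(t) > 1]
        # Strip trailing punctuation so "talk," matches "talk" in the word list.
        words = [t.strip(".,!?") for t in tokens if not t.startswith("#")]
        # The "join": look each word up in the static word list, defaulting to 0.
        sentiment = sum(word_scores.get(w, 0) for w in words)
        for tag in hashtags:
            totals[tag] += sentiment
    return dict(totals)


def most_positive(tweets, word_scores, n=2):
    """Return the n hashtags with the highest total sentiment, ties broken by name."""
    scores = hashtag_scores(tweets, word_scores)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    for tag, score in most_positive(TWEETS, AFINN_SAMPLE):
        print(tag, score)
```

In a real Spark job the word list would live in a static RDD and the tweets would arrive as a stream of batches; the per-batch logic would stay much the same.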

These sub-projects, which integrate with different external streaming sources, were moved out of the main Spark project in March.

Apache Bahir also has a strong relationship with different storage layers and intends to reach out to other ASF projects with an invitation to collaborate, Resende said.

“We’re starting with Spark, and a lot of people are coming from Spark, but we’re open to other extensions from distributed platforms. If people from Apache Beam or Apache Flink want to use us as a place to collaborate on their extensions, we’ll welcome them as well,” he said.

The project has just issued its first release, Bahir 2.0 preview, which is based on the Apache Spark 2.0 preview and gives users a way to get started. An updated release will follow closely on the launch of Spark 2.0, which is due shortly.

“We are very interested in streaming-mqtt for remote sensing applications and control/monitoring. We have a lot of Big Data needs in earth science, especially in remote and difficult to access environments, and plugins such as streaming-mqtt from Bahir provide a readily accessible and Apache-based solution to that,” said Chris Mattmann, a member of the Apache Bahir Project Management Committee, and chief architect for the Instrument and Science Data Systems section at NASA Jet Propulsion Laboratory.

There are benefits to making extensions a separate project, Resende said.

“Let’s say you have an extension on an Apache HBase project that’s also Spark,” Resende said. “The problem of being co-located with that is that the release cycle never aligns. You might have a newer version coming out for HBase or a newer version coming out of Apache Spark. And if things change on each side, you’re going to be a little bit behind. By having an extension on a third-party project like Apache Bahir, you have a vehicle to have the release and update your extension when both of the projects come out. If HBase comes out and you have to quickly provide a new functionality for your extension, you can do that outside the release cycle of the other project, which you wouldn’t be able to do if you’re co-located with one or the other project.”

In addition to the kickoff of Bahir, the ASF also announced the 1.0 release of Big Data middleware metadata framework Apache OODT (Object Oriented Data Technology), used for science data processing, information integration, and retrieval.

OODT was created at NASA Jet Propulsion Laboratory in 1998 as a way to build a national framework for data sharing. It entered the Apache Incubator in January 2010 and became a top-level project in November 2010.

Known as “middleware for metadata,” OODT in version 1.0 offers data ingestion and processing, automatic data discovery and metadata extraction, and metadata and resource management.

Feature Image: “Home Sweet Home,” by Eddy Van 3000, licensed under CC BY-SA 2.0.
