Yes, it’s been a bit like herding cats, but the ODPi (Open Data Platform initiative) is releasing its runtime specification and test suite to ensure applications will work across multiple Apache Hadoop distributions.
The spec was designed to standardize the growing diversity of Hadoop interfaces, a diversity that is proving to be problematic for developers.
“If you’re building and testing an application, it’s just not clear what you should be running it against. For a developer, it can be kind of like a bet you have to place without getting all the data points up front,” said Roman Shaposhnik, director of open source strategy at Pivotal. “Are you testing your app against Cloudera, Hortonworks or Pivotal or IBM? That’s pretty impossible to decipher. It’s not like you can do the Java approach where you can test once and run anywhere. Most of the major Hadoop distributions have been rather incompatible with each other.”
Pivotal unveiled the initiative, then called the Open Data Platform (ODP), in February 2015 to standardize around core Hadoop technologies. The effort has more than 25 vendor members including IBM, EMC, Teradata and VMware, though Cloudera, which has not joined, has called the effort a marketing ploy.
The effort was rebranded as ODPi last September when it became a Linux Foundation Collaborative Project.
“Everyone agrees where we need to go, and they’re willing to come together to figure out how we get there. That’s half of the problem there,” said John Mertic, senior program manager for ODPi, who concedes, “There are some that agree there’s a problem, but they disagree with the way we’re going about trying to solve it.”
The group plans to offer two specifications: a runtime specification and an operations specification. The runtime specification includes guidelines for four Hadoop core components: Hadoop Common, HDFS (Hadoop Distributed File System), YARN and MapReduce.
“The runtime is all about how Hadoop as a platform needs to behave if you’re an application developer, an ISV or system integrator. What are the APIs you can expect from Hadoop?” Shaposhnik said.
Given how large the surface area of the Hadoop APIs is, it’s nearly impossible to produce a formal standards document, he said.
“So we started with this idea that we would get going by taking a version of Hadoop that the vendors in ODPi were comfortable with, and that was mainly an exercise for the vendors, but we obviously had a lot of input from ISVs, application developers and integrators. We ended up standardizing on Hadoop 2.7. It’s kind of common ground between latest, cutting edge and stable enough,” he said. “When you say that any ODPi release will standardize on Hadoop 2.7, it doesn’t sound like such a big deal, but it’s actually a huge deal.”
Three Major Building Blocks
The specs allow a user to expect a certain set of APIs without trying to figure out how to work around different incompatibilities in the upstream Hadoop releases. That was the first big building block.
“We also kind of went through everything historically that’s been important to application developers and tried to standardize on things explicitly. We didn’t standardize on everything (some of it will default to Hadoop 2.7), but we went out of our way to call out the features that we feel would be important for ODPi members to keep an eye on to make sure the next release doesn’t break compatibility in these key areas. So the second building block is that we’ve identified a number of key compatibilities where we’ve made explicit statements about how Hadoop needs to behave,” he said, adding that it’s similar to documents about Internet standards.
In addition to standardizing APIs around Hadoop, ODPi also created a reference implementation and a validation test suite.
“We basically developed tests [that ensure that any Hadoop implementation conforms to this specification], but those tests have to be executed against something, and that’s the reference implementation,” he explained.
The ODPi test framework and self-certification also align closely with the Apache Software Foundation by incorporating Apache Bigtop for comprehensive packaging, testing, and configuration. More than half the code in the latest Bigtop release originated in ODPi.
Relieving Developer Pain
The effort has been highly focused on developer struggles, Mertic said.
“If you are building an application that sits on top of Hadoop, there are a number of contention points, points that are just not helpful,” said Shaposhnik.
The effect has been a constraint on the Hadoop ecosystem, he said, “And when you think about it, the platform is only as good as the applications it can support.”
He explained some of the things the group made explicit statements about:
“Hadoop is extremely pluggable. It allows you to do a lot of configuration at runtime that will severely impact what kind of APIs you’re getting,” he said.
“We went out of our way to hit the high-impact ones, such as what kind of compression you can use on HDFS. That’s very important to application developers because a lot of times they have to know what kind of compression can be depended upon because that will be critical to the performance of their application. Things like Snappy compression can either be turned on or turned off. If you run it on a Hadoop cluster that doesn’t support it and your application expects it, you’re basically hosed because the application cannot even read data from HDFS. So we stipulated that Snappy compression has to be enabled.”
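To make the Snappy point concrete, here is a minimal Python sketch of the kind of pre-flight check an application team might run before depending on the codec. It is not part of the ODPi specification or its test suite. It assumes the cluster’s core-site.xml is readable and looks for Hadoop’s SnappyCodec class under the io.compression.codecs property. The function names and the sample configuration are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Class name of Hadoop's Snappy codec.
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def configured_codecs(core_site_xml: str) -> List[str]:
    """Return the codec class names listed under io.compression.codecs."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        if name == "io.compression.codecs":
            value = prop.findtext("value", default="")
            # The value is a comma-separated list; tolerate stray whitespace.
            return [c.strip() for c in value.split(",") if c.strip()]
    return []


def snappy_listed(core_site_xml: str) -> bool:
    """True if the Snappy codec appears in the configured codec list."""
    return SNAPPY_CODEC in configured_codecs(core_site_xml)


# Hypothetical sample configuration, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
           org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print("Snappy listed:", snappy_listed(SAMPLE))
```

A check like this is only a hint. Hadoop can pick up codecs by other means, and a codec being listed does not prove the native Snappy library is installed on every node, which is part of why a platform-level guarantee matters to developers.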
The group also made some stipulations around Windows compatibility, an area that had been largely overlooked, Shaposhnik said.
The runtime does not deal with Ambari, the Apache software for managing Hadoop environments at scale; that will be part of the operations spec, which the group hopes to release this fall. The next release of Ambari is due this summer.
“We’re really trying to innovate with the management spec by asking ourselves a few questions that haven’t explicitly been asked by the industry. One is: How does it look to run and operate Hadoop in the cloud?” Shaposhnik said.
“Depending on who you talk to, you get different answers, and [while] some companies like Amazon give you services, they don’t necessarily even tell you how they’re managing it.
“But I think we’re at the point where the cloud future is heterogeneous for everybody. You can’t just say, ‘I’m on Amazon and to hell with the rest of the clouds.’ As a Hadoop vendor, you have to have an answer, a heterogeneous cloud story for managing Hadoop,” Shaposhnik said.
IBM, Pivotal, and VMware are sponsors of The New Stack.