Pachyderm Challenges Hadoop with Containerized Data Lakes
The folks at Pachyderm believe there’s an elephant in the room when it comes to data analytics, namely the weaknesses of Hadoop. So they set out to build a Hadoop alternative based on containers, version one of which has just been released.
The Pachyderm stack uses Docker containers as well as CoreOS and Kubernetes for cluster management. It replaces HDFS with its Pachyderm File System and MapReduce with Pachyderm Pipelines.
The company has focused on fundamental principles including reproducibility, data provenance and, most importantly, collaboration – a feature they say has been sorely missing from the Big Data world and one that has generated the most excitement from potential users, according to CEO and co-founder Joe Doliner.
To that end, the developers looked to the Git model to create a repository of well-documented, reusable analysis pipelines so that teams no longer have to build everything from scratch.
“We think it’s inevitable that containers will precipitate the end of Hadoop because they cause you to completely rethink all the assumptions that [were the basis] of Hadoop in the early days,” said co-founder Joey Zwicker.
As an example, he notes that in Hadoop, people write their jobs in Java, and it all runs on the JVM.
“This was a great assumption at the time because Java was the leading platform, there were tons of people on it. But fast-forward to today, and now we have containers. They’re a much better tool for the job because you can use literally any tool in the whole world. We’ve built a system that allows you to put anything in a container and use it for big data,” he said.
Rather than being required to use Hadoop-specific tools, such as the Hadoop Image Processing Library, you can use any existing image-processing library. You can use any open-source computer vision tool, such as OpenCV, ccv, or VXL.
“We believe people are going to want to use the tools that are best in class. Containers allow them to do that,” Zwicker said.
Though Pachyderm itself is written in Go, data scientists can use any languages or libraries that best fit their needs, they say.
The two components Pachyderm developed for the stack are the file system and the pipeline system.
Pachyderm Pipelines is a system for stringing containers together and doing data analysis with them. You create a containerized program, with the tools of your choice, that reads and writes to the local filesystem. Pachyderm uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each copy a different chunk of data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require using Java at all: if it fits in a container, you can use it for data analysis.
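The key idea is that a pipeline step is just an ordinary program reading and writing local files, with no framework API required. A minimal sketch of such a step in Python might look like the following; the mount points `/pfs/input` and `/pfs/out` are illustrative assumptions about where injected data would appear, not a guaranteed part of Pachyderm’s interface.

```python
import os

# Illustrative mount points only: in a real pipeline, the platform would
# decide where input data is injected and where output is collected.
INPUT_DIR = "/pfs/input"
OUTPUT_DIR = "/pfs/out"


def count_lines(input_dir: str, output_dir: str) -> None:
    """Count the lines in every file in input_dir and write one
    result file per input file into output_dir."""
    for name in os.listdir(input_dir):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            n = sum(1 for _ in f)
        with open(os.path.join(output_dir, name + ".count"), "w") as out:
            out.write(str(n) + "\n")


# Only run against the real mount points when they actually exist
# (i.e., when executing inside a container with data injected).
if __name__ == "__main__" and os.path.isdir(INPUT_DIR):
    count_lines(INPUT_DIR, OUTPUT_DIR)
```

Because the step only touches the local filesystem, the same container can be replicated across many machines, each replica seeing a different chunk of the data, which is how the parallelism described above falls out for free.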
Pachyderm File System is a distributed file system that draws inspiration from Git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage, such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how your data looked at different points in time.
“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.”
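The fork-and-merge workflow Zwicker describes can be pictured with a toy, in-memory model: one data scientist commits to a branch, another forks it, adds work, and merges back. This is purely a conceptual sketch of the Git-style collaboration idea, not Pachyderm’s actual API; the class and method names are invented for illustration.

```python
import copy


class DataRepo:
    """Toy versioned data store: branch name -> {filename: contents}."""

    def __init__(self):
        self.branches = {"master": {}}

    def commit(self, branch, filename, contents):
        self.branches[branch][filename] = contents

    def fork(self, src, dst):
        # A fork starts as a snapshot of the source branch.
        self.branches[dst] = copy.deepcopy(self.branches[src])

    def merge(self, src, dst):
        # Naive merge: files from src are added to (or overwrite) dst.
        self.branches[dst].update(self.branches[src])


# One scientist builds a data set on master...
repo = DataRepo()
repo.commit("master", "readings.csv", "sensor,value\nA,1\n")

# ...another forks it, derives a cleaned version...
repo.fork("master", "experiment")
repo.commit("experiment", "cleaned.csv", "sensor,value\nA,1.0\n")

# ...and merges the results back into the original.
repo.merge("experiment", "master")
```

Real systems need far more (conflict handling, content-addressed storage, snapshots), but the sketch captures the workflow the quote describes: data, like code, gets branches, forks, and merges.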
There’s no shortage of technologies, Spark, Pig and Hive among them, that are considered alternatives to MapReduce, the processing layer in Hadoop.
“We think the existence of all those tools is an indication that MapReduce was the wrong idea to begin with. It was an overly constraining way of analyzing,” Zwicker said.
“What Hadoop found was that MapReduce could do a bunch of stuff, but they needed to invent other things on top of it, like Pig and Hive and those things,” Zwicker said. “Hadoop has something kind of like what we do, which is called Hadoop Streaming, but it’s a very second-class citizen that’s added afterward rather than us having our containerized workload be the core layer that everybody uses.”
Doliner adds that Spark, Hive and other tools are all still built on top of core pieces of the Hadoop infrastructure, such as ZooKeeper, YARN and HDFS, which are among Hadoop’s weaknesses.
Docker Was ‘Aha’ Moment
Doliner and Zwicker founded the San Francisco-based company in 2014 and participated in Y Combinator in early 2015. It has raised $2 million from Data Collective, Blumberg Capital, Foundation Capital, and others.
It might appear nakedly ambitious to state one’s plans to replace Hadoop outright, but the founders contend they have the only company building something totally new.
“If you look at what [the others] are building, all of it is still the same Hadoop primitives repackaged in some way. We’ve believed from very early on that the problem isn’t that Hadoop isn’t packaged in the right way, but that Hadoop has inherent flaws,” Doliner said.
The company started out before Docker was released. The founders knew they wanted to build a replacement for Hadoop, and they saw an early demo of Docker at their former employer, RethinkDB.
“That was the ‘aha’ moment,” Zwicker said. “We knew Hadoop was going to be replaced and saw that containers were the perfect tool to do it. We knew they were going to create this whole ecosystem we could use to replace it. When we put all of that together, that is when things really started working for us.”
Adds Doliner: “We’re not just saying, ‘Hey, containers are a hot new technology. Let’s take everything and shove it in a container,’ and all of a sudden that’s a new product.”
Because the company was early to the container movement, its product and the container ecosystem have been evolving together, he said.
One of the key benefits of Pachyderm, they say, is that being productive doesn’t require the large team with specialized expertise that Hadoop demands. That was an attractive feature for its customer Fogger, according to CEO Kamil Kozak.
Fogger makes a software platform for processing sensor data on industrial machinery such as solar farms and wind turbines. Its Fog Computing platform allows data processing on small Linux boxes close to the machines and pushes it over a peer-to-peer network to a central cloud hub. It uses Pachyderm for local data processing on its way to the cloud.
“At Fogger, we believe that containers are redefining infrastructure and that they will be used in all types of deployments,” Kozak said.
“Pachyderm has a very well-designed technological stack. We love the idea of map/reduce pipelines built with containers and a simple Git-like triggering system.
“We were evaluating having to build our own solution in-house or using something like Hadoop/Spark when I stumbled across Pachyderm. We chose Pachyderm because the learning curve and infrastructure overhead for Hadoop/Spark was significantly harder than Pachyderm; it just fit seamlessly into our containerized stack,” he said.
Containers allow Fogger to build data-processing algorithms in any programming language, “which simplifies our lives drastically as we don’t have to learn any new technology other than the Pachyderm CLI itself,” Kozak said.