
Docker Will Change Hadoop, Making it Easier and Faster

Jun 8th, 2015 8:23am
Feature image via Flickr is licensed under CC BY-SA 2.0.

Hadoop Summit starts this week, and with it comes more discussion about how platforms such as Docker are reshaping how data analytics is done in a Hadoop context.

BlueData, which offers an infrastructure software platform for big data, has added support for Docker containers with a free version of its EPIC platform called EPIC Lite. It allows users to spin up virtual Hadoop or Spark clusters in Docker containers on a laptop.
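BlueData has not published EPIC Lite's internals, but the general pattern it describes, a multi-node Hadoop cluster running as linked containers on a single machine, can be sketched with Docker Compose. The image and service names below are hypothetical placeholders, not BlueData artifacts:

```yaml
# Hypothetical sketch of a two-node Hadoop cluster on a laptop.
# "example/hadoop" is a placeholder image name, not part of EPIC Lite.
hadoop-master:
  image: example/hadoop
  hostname: master
  ports:
    - "8088:8088"    # YARN ResourceManager web UI
    - "50070:50070"  # HDFS NameNode web UI
hadoop-worker:
  image: example/hadoop
  links:
    - hadoop-master:master
```

Running `docker-compose up -d` against a file like this would start both containers; each behaves like a separate cluster node while sharing the laptop's kernel, which is what makes this lighter than running a VM per node.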

The company also announced its “summer release,” version 1.5 of EPIC, which provides support for newer Hadoop and Spark versions, integration with Apache Ambari and Cloudera Manager, support for common big data analytics applications and “bring your own app” capabilities.

VMware veterans Kumar Sreekanti and Tom Phelan founded BlueData in 2012 with the hope of creating an Amazon EMR-like (Amazon Elastic MapReduce) self-service model for big data infrastructure in private data centers. The company has raised $19 million and emerged from stealth mode last September.

Its EPIC software solution — not to be confused with the healthcare EHR giant — runs on any hardware, any server and any storage environment. It’s aimed at the bare-metal deployments of Hadoop that enterprises run on-premises.

It aims to simplify Hadoop for its customers with technology that addresses I/O performance issues, allows the separation of compute and storage, and offers tools to manage multi-tenant environments in a virtualized infrastructure.

The company professes it’s embracing Docker because it wants to bring the benefits of virtualization to big data applications while delivering the simplicity of containers and the performance of bare-metal servers.

At the same time, it admits to another motive.

“We wanted to make sure developers and data scientists can create their own cluster, which is very hard today … To be able to point to any data and do analysis,” said Anant Chintamaneni, the company’s vice president of product. Those folks “can put it on their laptops and quickly gain the functionality of Cloudera or Hortonworks.”

He said the company wanted to give users access to the software from a laptop, then if they like it, they’ll be asking the IT bosses for it.

“Docker is the best show in town right now in terms of maturity …” Chintamaneni said. “Docker was a means to an end for us to able to deliver the user experience on a laptop or a VM where you can [create] a cluster with more than one node. Data scientists want more than one node to check algorithms. You want to know how your application will behave in a real-world cluster.”

The enterprise version was designed for multiple tenants and multiple users. Lite contains fewer images because the company wanted to keep it lightweight.

“As more and more people embrace Docker containers, it makes sense for companies like BlueData to add Docker to their set of supported hypervisors,” says Tomer Shiran, VP of product management at MapR and a member of the Apache Drill Project Management Committee.

“Docker containers provide better I/O performance than traditional VMs, so I expect these Hadoop clusters to run faster when deployed on Docker.”

BlueData doesn’t see virtualization going away, but like VMware, which is embracing container technology and integrating it into its own portfolio, sees the momentum behind Docker and is trying to stay ahead of it.

“We see containers as another way to achieve the benefits of virtualization,” said Jason Schroedl, vice president of marketing.

“Our plan is to develop a big data platform that can run in any virtualized environment, and we believe there are benefits to containers we can bring to our customers.”

He said the company so far is not seeing a lot of demand from enterprise customers for the software to run in Docker, but expects to see more in the future.

Docker is maturing inside enterprises, and more Docker-native private and public cloud platforms are emerging in which Hadoop is becoming a key service to offer, according to Tim Hall, vice president of product management at Hortonworks.

There will be three sessions involving Docker in conjunction with Hadoop at Hadoop Summit, which opens tomorrow in San Jose, Calif. For example, Sidharta Seethana of Hortonworks and Abin Shahab of Altiscale will discuss Apache YARN and the Docker ecosystem.

Altiscale, which provides a Hadoop-as-a-Service offering, decided to run Hadoop in Docker containers, though it would mean its systems would need to provision and manage Docker containers directly without the benefit of YARN, the data processing framework introduced in Hadoop 2.0. However, the company found this approach to be both repeatable and automatable.

Pachyderm aims to make big data analysis simpler by providing a MapReduce alternative to the Hadoop stack using Docker. It builds on CoreOS’s Fleet and etcd, rather than Apache tools such as YARN and ZooKeeper.
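Fleet schedules work by distributing systemd unit files across a CoreOS cluster, coordinating through etcd. A minimal sketch of that pattern, with a hypothetical unit and image name rather than Pachyderm's actual units, looks like this:

```ini
# worker@.service — hypothetical fleet unit that runs one containerized
# worker per machine; fleet stores the unit in etcd and schedules it.
[Unit]
Description=Example containerized worker %i
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker rm -f worker-%i
ExecStart=/usr/bin/docker run --name worker-%i example/worker
ExecStop=/usr/bin/docker stop worker-%i

[X-Fleet]
Conflicts=worker@*.service
```

Submitting it with `fleetctl start worker@1.service` records the unit in etcd, and a node's fleet agent hands it to the local systemd to run; the `Conflicts` directive keeps instances on separate machines.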

There are two ways to tackle this and Hortonworks is doing both, according to Hall.

The first is about using Docker to host Hadoop. It’s doing that with Cloudbreak, from its recent acquisition of SequenceIQ, and the Hortonworks Data Platform (HDP). Cloudbreak uses Docker images to launch HDP on any major cloud platform, including Microsoft Azure, Amazon Web Services and Google Cloud Platform.

The second is about using Docker containers for app deployment via YARN. This has been a technical preview feature of HDP 2.2, and customers have been exploring how to take advantage of it.
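In Hadoop 2.6, the version underlying HDP 2.2, the technical-preview route to this is the DockerContainerExecutor, enabled in yarn-site.xml on each NodeManager. A minimal sketch, with the caveat that paths and the executor's exact status vary by distribution:

```xml
<!-- yarn-site.xml: run YARN containers inside Docker
     (technical preview in Apache Hadoop 2.6) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
</property>
<property>
  <!-- Path to the Docker client binary on each NodeManager host -->
  <name>yarn.nodemanager.docker-container-executor.exec-name</name>
  <value>/usr/bin/docker</value>
</property>
```

With this in place, individual jobs name the image their tasks should run in through their job configuration, so the same cluster can launch different applications in different container images.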

“Essentially, Docker provides a wonderful way to isolate and package apps for Hadoop. We are also looking at how the Slider framework and Docker can work together to ease these kinds of deployments,” Hall said.

“One of our customers is considering a similar approach for their data platform, using HDP support for Docker — below and above. They are using Cloudbreak for deploying Hadoop on Docker in the cloud and plan to build their data applications as Docker images and run them on top of YARN. There are many other customers and partners adopting our Hadoop-on-Docker approach for environment-agnostic deployment. The main driving force behind this architecture choice is agility, speed of innovation and consistency.”

The other more traditional approach is using Hadoop with virtual machines or OpenStack.

He says the benefits of running Hadoop on Docker include:

  • Quick installation (pre-pulled RPMs).
  • Same process/images for development/QA/production.
  • Same process for single/multi-node.

Benefits of running Dockerized apps on YARN include:

  • Better software isolation.
  • Same process/images for development/QA/production.
  • Better distribution and versioning of apps.

Big data app developers are increasingly leaning toward containerization of their apps using Docker, he says. There is also considerable interest in running Docker on top of bare metal, rather than VMs, to improve the cost-performance ratio for Hadoop-like applications. Meanwhile, YARN is becoming an application deployment platform beyond big data apps, he says, which is driving interest in native container deployment support in YARN and the need for an app management framework on top of it.

TNS owner Insight Partners is an investor in: Docker.