
Pivotal Open Sources HAWQ to Build SQL in Hadoop

29 Sep 2015 6:00am

In keeping with its plans to open source its big data portfolio, Pivotal announced Tuesday it’s contributing its HAWQ SQL-on-Hadoop analytics engine and MADlib machine learning technologies to the Apache Software Foundation (ASF).

Pivotal contributed its distributed in-memory database GemFire to the ASF for incubation in March as a project called Geode. It plans to open-source its Greenplum Database in about five weeks as a project it will manage itself, according to Gavin Sherry, head of engineering for data products.

While Pivotal opened up just the core of GemFire, it will release every line of code for HAWQ and for the Query Optimizer it introduced in May.

“Pivotal HAWQ will become Apache HAWQ as of [Tuesday],” he said. “This is the first real step toward building SQL in Hadoop – not SQL on Hadoop. It will be Hadoop-native SQL.”

HAWQ will be governed by Apache, deeply integrated into the Hadoop ecosystem and completely open source “in a way that is bigger than Pivotal,” he said.

The difference between an “on-Hadoop” solution and an “in-Hadoop” one lies in solving problems related to the following, he said:

  • Connective technology that diminishes responsiveness or scalability.
  • Proprietary technology.
  • Runtime technology not based on core modules of Hadoop, such as YARN.
  • Operational experience that is counterintuitive to users.
  • Questions about governance of the project or influence it has outside the scope of a single vendor.

Solving these problems will be the work of Apache HAWQ, he said. With the project turned over to the ASF, it will draw contributors from many companies, be open source, and be built around the fundamental building blocks of Hadoop, including HDFS, YARN and Ambari.

Collaborative Project

When it announced in February plans to open-source its big data portfolio, Pivotal also unveiled an initiative called the Open Data Platform (ODP). It’s a move with other vendors to standardize around core technologies including Apache Hadoop 2.6, inclusive of HDFS, YARN and MapReduce as well as Apache Ambari software for managing Hadoop environments at scale.

That effort has its detractors, including Cloudera and MapR, which call the effort redundant and “vendor-biased.”

However, in a separate move Monday, the Linux Foundation announced it will manage the effort, now known as ODPi, as a collaborative project. The number of members has doubled and now includes the likes of SAS Institute, Splunk, Squid Solutions, SyncSort, Telstra, Teradata and others, for more than 25 companies in all. It has released an initial ODPi core specification and plans a certification program.

Two of those companies, Hortonworks and Altiscale, will be partnering with Pivotal on commercial HAWQ offerings — Altiscale will be providing HAWQ as a cloud service.

Standalone analytics platforms face a challenge because they lack key context for the data. Their success will depend on marrying the technology with the applications that are the sources of the data, according to analyst Tim Crawford, a CIO strategic advisor with AVOA, who added that the hurdle for them is pretty high.

The main reason he sees to open-source a big data platform is to create a greater community around the technology.

“We have made commitments to the Apache Software Foundation that we will continue to contribute [to HAWQ] and plan on having a thriving community around this,” Sherry said. Pivotal booked more than $100 million in subscription-based revenue from the big data portfolio in 2014, he said, and will continue to provide services around HAWQ.

In addition, Pivotal is contributing the MADlib machine-learning library, a set of powerful scale-out, parallel machine learning algorithms developed by Pivotal as well as researchers from the University of California, Berkeley; Stanford University; the University of Florida and Pivotal’s customers.

Because the algorithms are accessible via SQL, they can be integrated with visualization and other higher-level tools. They can also be used by analysts who are familiar with SQL but not with the numeric programming and statistics needed to implement such algorithms, or with distributed systems approaches, he said.

The algorithms have been used broadly in the finance, automotive, media, telecommunications and transport industries. The MADlib library supports HAWQ, Pivotal Greenplum and PostgreSQL.
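To make "accessible via SQL" concrete, here is a minimal sketch, in Python, of how an analyst's tool might compose a MADlib call as an ordinary SQL statement. It assumes MADlib's `madlib.linregr_train(source_table, out_table, dependent_varname, independent_varname)` function. The helper name `linregr_train_sql` and the table and column names are hypothetical, and the sketch only builds the statement text without connecting to a database.

```python
import re

# Accept only plain SQL identifiers so the generated statement stays well-formed.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def linregr_train_sql(source_table, out_table, dependent, independents):
    """Build the SQL text for a MADlib linear-regression training call.

    Hypothetical helper for illustration; it returns a string and runs nothing.
    """
    for name in (source_table, out_table, dependent, *independents):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # The leading 1 adds an intercept term to the feature array.
    features = "ARRAY[1, " + ", ".join(independents) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{out_table}', '{dependent}', '{features}');"
    )


# Hypothetical table and column names, for illustration only.
print(linregr_train_sql("houses", "houses_model", "price", ["tax", "bath", "size"]))
```

The resulting statement could be submitted through any SQL client connected to a database with MADlib installed, which is the point Sherry makes: the analyst writes SQL, and the parallel implementation stays behind the function call.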

“The Crown Jewels”

HAWQ grew out of a multi-year effort by the EMC data warehousing and Hadoop unit to take the scalability and replication benefits of the Hadoop Distributed File System (HDFS) and, in effect, get it to speak SQL.

Greenplum co-founder Scott Yara called HAWQ the “crown jewels” of this effort at its launch, a term Sherry still uses.

With HAWQ, you can write any SQL query and have it work on top of Hadoop rather than trying to replace HDFS with a NoSQL data store. So far it has been built to scale on top of, rather than inside of, Hadoop, on hundreds to thousands of server nodes.

HAWQ and the other Greenplum assets became part of Pivotal when it was spun off in 2013.

In May, Pivotal announced an upgrade to its Big Data Suite that it said provided up to 100-fold performance improvements through the Pivotal Query Optimizer, which was added to Greenplum Database and HAWQ. The Optimizer, which Sherry said represented five years of work, was designed to determine the cost of processing a query across the machines and processors in a cluster.
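As a rough illustration of what estimating the cost of a query across a cluster can mean, the Python sketch below compares two common ways to co-locate rows for a distributed join and picks the one that moves fewer rows. This is a toy model under stated assumptions, not the Pivotal Query Optimizer's actual cost model: the `Plan` class, the function names and the rows-moved cost measure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rows_moved: int  # rows sent over the network; the only cost in this toy model


def candidate_plans(small_rows, large_rows, segments):
    """Enumerate two ways to co-locate rows for a distributed join."""
    return [
        # Copy the small table to every segment; the large table stays put.
        Plan("broadcast small table", small_rows * segments),
        # Rehash both tables on the join key; each row moves at most once.
        Plan("redistribute both tables", small_rows + large_rows),
    ]


def cheapest_plan(small_rows, large_rows, segments):
    """Return the candidate plan with the lowest estimated cost."""
    plans = candidate_plans(small_rows, large_rows, segments)
    return min(plans, key=lambda plan: plan.rows_moved)


# A tiny dimension table joined to a large fact table on 16 segments.
print(cheapest_plan(1_000, 10_000_000, 16).name)
# Two tables of similar size on 64 segments.
print(cheapest_plan(1_000_000, 2_000_000, 64).name)
```

In this model, broadcasting wins when one table is small relative to the other, and redistribution wins as the smaller table or the cluster grows. A production optimizer weighs many more factors, such as statistics, CPU and I/O.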

In the meantime, it made HAWQ available on the Hortonworks Data Platform (HDP), its first venture outside the Pivotal ecosystem.

Pivotal is a sponsor of The New Stack.

