Airbnb’s AirPal Reflects New Ways to Query and Get Answers from Hive and Hadoop

9 Mar 2015 4:06pm, by

Airbnb’s data stores are approaching 1.5 petabytes in accumulated size — a mere drop in the bucket compared to Facebook’s 300 petabytes, but a colossus, nonetheless. When Airbnb needed a tool for querying and visualizing that less-than-infinite pool of data, it built one for itself called AirPal. The visualization tool, which it has open sourced, examines records from clusters numbering in the tens of petabytes.

The story of AirPal’s development says a lot about how closely Internet-scale companies work together to build out their own data infrastructures. These companies have had to build tools far better suited to their needs than anything that would come from Oracle, Microsoft or any of the other database and data warehouse giants. AirPal demonstrates that these tools are now past their first generation and mature enough that thousands of engineers at Airbnb, Facebook, Dropbox and a host of others can store, manage, and utilize their data.

Awakening the Sleepy Giant

Both inside Airbnb’s data centers and outside, “bnb” stands for “bed and breakfast.” It’s the service mobile device users rely upon to find lodging from good people who offer it. It’s also a place where a former data visualization specialist from NASA can feel right at home.

Airbnb is not just an app but a payment system, facilitating monetary transactions between hosts and lodgers. It’s a platform for what’s being called a sharing economy. In order for Airbnb’s employees to be able to track what happens within that economy, the company’s developers — including software engineer Andy Kramolisch, formerly with NASA’s Langley Research Center; and product manager James Mayfield, formerly with Facebook’s database team — built their own front end for the distributed data query engine that oversees their Hadoop clusters.

“I’ve always been a person who tries to make sure that decisions happen at the right level of the organization,” says Mayfield in a conversation with The New Stack. “Part of that is making sure people are informed by data to make the best decision possible.” Throughout his career, including at Facebook, he says his goal has been to promote a distributed model of data.

Part of that promotion, he tells us, involves building a kind of nerve center “where people feel that they can go and write queries and answer questions for themselves, rather than an environment where there is some select, anointed set of people who have to answer the questions for everyone.”

Employees tend to see data as a place, rather than a device, an engine, a shipping warehouse, an airport runway, or any of the other metaphors applied to data stores in just the past decade. Software engineers then working at Yahoo first constructed a system where all the data everywhere could be treated as one place.

But it was a key contribution by Facebook that made this place somewhere you could visit.

The Old, New Hive

Introduced way, way back in 2009, Hive was the first Hadoop data warehouse that could be queried like any other data warehouse. It demonstrated that there was very little truly qualitative advantage to a giant relational database over a collective data store.

Facebook may be in a unique position among heavy data users. While it often brags about the size of its Hive clusters, Mayfield (a former Facebook engineer) noted it actually has the luxury of being able to store all that data forever. A smaller institution that deals with a mere one or two petabytes has to have a better defined strategy.

So Airbnb stores its newest data in Hadoop Distributed File System (HDFS) clusters on EC2 instances, where latency is lower. Then, at a set interval, it “retires” aging data from the “hot” zone to “cold” S3 storage. Mayfield says this strategy helps reduce Airbnb’s total data set size.
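To make the hot/cold split concrete, here is a minimal Python sketch of an age-based retirement policy. It is not Airbnb’s code: the 30-day window, the function name and the sample partition dates are all assumptions for illustration, since the article says only that data is retired “at a set interval.”

```python
from datetime import date, timedelta

# Hypothetical retention window; the article only mentions "a set interval".
HOT_RETENTION_DAYS = 30


def split_hot_cold(partition_dates, today, retention_days=HOT_RETENTION_DAYS):
    """Split date partitions into those kept "hot" (HDFS) and those
    retired to "cold" storage (S3), based purely on age."""
    cutoff = today - timedelta(days=retention_days)
    hot = [d for d in partition_dates if d >= cutoff]
    cold = [d for d in partition_dates if d < cutoff]
    return hot, cold


# Illustrative data: partitions that are 0, 10, 45 and 400 days old.
today = date(2015, 3, 9)
partitions = [today - timedelta(days=n) for n in (0, 10, 45, 400)]
hot, cold = split_hot_cold(partitions, today)
```

In this sketch the two most recent partitions stay in the hot zone and the two older ones are marked for S3. A real job would also have to copy the files and update the table metadata, which is left out here.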

“Like the big Amazon users, like Dropbox and Netflix, we all talk regularly and we all have the same approach,” says Kramolisch. “Some folks have smaller HDFS clusters and some have larger ones, but the basic idea of keeping the bulk of your historical data in S3 is like a tried-and-true practice.”

Hive was Airbnb’s original single source of truth. But due to speed and latency issues, it later switched to Amazon Redshift, which consumes data from S3 buckets.

“That had its own problems,” he relates, “such as a limited number of concurrent queries, a painful ETL [extract/transform/load] process to basically duplicate all of our data from Hive to Redshift every night, which more or less doubled the amount of work our systems had to do.”

Bed and Breakfast to the Rescue

It was Facebook that perceived this problem before most of Hive’s other users. As the social giant’s software engineer Martin Traverso explained at a company conference in 2013, “The problem with Hive is that it’s designed for batch processing.” His presentation cited an unnamed Facebook data scientist as saying, “A good day is when I can run six Hive queries.” Other data scientists in the room chuckled in agreement.

Facebook would be first to create a solution to the problem caused by Facebook’s first solution just a few years earlier. It developed PrestoDB, a SQL query engine that aggregates data from multiple sources, including Hive and Cassandra. According to Mayfield, the company then approached Hive’s first-tier customers, including Dropbox, Netflix and Airbnb, with an offer to become PrestoDB early adopters, even before it released the query engine to the open source community.
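What querying multiple sources looks like in practice: Presto addresses tables as catalog.schema.table, so a single SQL statement can join a Hive table to a Cassandra table. The Python sketch below only builds such a query string. The catalog names depend on how a cluster is configured, and the schemas, tables and join column are hypothetical.

```python
def qualified(catalog, schema, table):
    """Build a Presto-style fully qualified table name."""
    return f"{catalog}.{schema}.{table}"


def build_cross_source_query(left_table, right_table, join_key):
    """Return SQL that joins two tables, possibly from different catalogs,
    and counts matching rows per key."""
    return (
        f"SELECT l.{join_key}, COUNT(*) AS matches\n"
        f"FROM {left_table} AS l\n"
        f"JOIN {right_table} AS r ON l.{join_key} = r.{join_key}\n"
        f"GROUP BY l.{join_key}"
    )


# Hypothetical names: a bookings table in Hive, a users table in Cassandra.
bookings = qualified("hive", "default", "bookings")
users = qualified("cassandra", "app", "users")
sql = build_cross_source_query(bookings, users, "user_id")
```

The resulting string could be handed to any Presto client. The point is that the join spans two storage systems without a nightly copy from one to the other.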

The move to PrestoDB worked out very well for Airbnb, he says, the key reason being that, unlike Redshift, PrestoDB is open source. Having access to PrestoDB’s source code enabled Airbnb to debug issues early on and send its patches back upstream.

But there was something missing. Hive did have some front-end query UI tools available, such as Hue, but PrestoDB did not.

While at Facebook, Mayfield used a front-end for Hive called HiPal. It’s an internal part of Facebook’s infrastructure stack, and will not be open sourced. Still, Mayfield says, Airbnb paid homage to HiPal by dubbing its visualization tool AirPal.

It gives users a simple Web interface to search not just through data tables, but also metadata and schemas. It’s capable of creating new tables in Hive based on results filtered from retrieved data, or it can stream the results of queries into a CSV document that can be opened in Excel.
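The CSV export described above can be pictured as writing rows out as they arrive rather than holding the full result set in memory. The Python sketch below illustrates that idea only. It is not AirPal’s implementation, and the generator standing in for query results, the column names and the values are all made up.

```python
import csv
import io


def stream_rows_to_csv(header, rows, out):
    """Write a header and then each row to `out` as it arrives.
    `rows` can be any iterable, including a lazy generator."""
    writer = csv.writer(out)
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


def fake_query_results():
    """Stand-in for rows arriving from a query engine (made-up data)."""
    yield ("Paris", 120)
    yield ("Tokyo", 95)


# An in-memory buffer stands in for a file or an HTTP response body.
buffer = io.StringIO()
written = stream_rows_to_csv(["city", "bookings"], fake_query_results(), buffer)
csv_text = buffer.getvalue()
```

Because the writer consumes the rows one at a time, the same function works whether the query returns two rows or two million, and the output opens directly in Excel.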

“When Andy and I were building this at Airbnb, we actually went to Facebook and met with the Presto team and the data tools team there to show them our work and talk to them about the possibility of open sourcing AirPal. And they were actually really supportive of the idea. Part of the reason was that they wanted a tool similar to HiPal to exist out in the world, but it likely wasn’t part of their roadmap to put in the work to make that happen.”

As Mayfield wrote for Airbnb’s corporate blog, “We stood on the shoulders of giants to make this tool and we appreciate the influence and input that the data infrastructure and data tools teams at Facebook were able to provide.”

Financial publications telling the story of Airbnb’s rapid rise to prominence refer to it as a “Web site.” That has the ring of calling the Chicago Cubs an “athletic club.” More accurately, Airbnb has suddenly become the world’s data visualization specialist. Incidentally, you can use it to book a room.

Featured image via Flickr Creative Commons.