It’s a needle-in-a-haystack problem, except you’re looking for multiple needles that match a specific type. How do you approach the haystack? In your mind, you could subdivide the barn into blocks, and each block into cubes. Then you could search each cube in a flat, and each flat in a block, left-to-right, top-to-bottom, in sequence. And you’d be slower to reach a conclusion than Congress on net neutrality.
If all you need is a reliable estimate of how many needles there actually are, then you could employ a sampling strategy. The simplest strategies assume that the distribution of needles is always even. But in practice, applying a simple sampling strategy may not help you estimate the number of needles of the specific type you’re looking for. Still, you might not prefer to evaluate the types of needles until you had isolated all the needles from the hay.
At least, this is what logic would tell you. Predicate logic, built upon the foundation first laid by Dr. Edgar F. Codd in the 1960s, does not exactly work that way. It may prefer that you isolate the exact needle type from both the hay and the other needles, and it may apply that context to the entire barn before it lets you start subdividing by block.
So as you can imagine, applying the kinds of queries created for relational database systems to huge, semi-structured datasets can yield either poor results or slow results — take your pick.
PrestoDB, as we discussed in a previous post, is Facebook’s homegrown answer to the problem of applying rapid queries to very huge datasets stored in Hadoop’s HDFS. Facebook’s engineers found themselves building PrestoDB (or just “Presto”) for the simple reason that Hive, the query engine for Hadoop, was way too slow at handling huge datasets.
What made the development of Presto even faster for Facebook was the fact that its developers shared their experiences with the greater open source community. It’s still actively being developed. And as Facebook announced recently, it has replaced a key component of Presto in the interest of attaining greater speed.
Such an announcement sounds like automobile engineers trumpeting the replacement of a carburetor to achieve greater displacement. But now that a growing number of the world’s huge workloads rely upon Facebook technology, we could use the new carburetor, please, thank you.
Specifically, it’s Presto’s Optimized Row/Columnar reader, which extracts records from Hadoop’s HDFS file format. In a blog post, Facebook software engineer Dain Sundstrom explains that his team had been using “a fork of ORC” (he might have written the whole post just so he could say that) called DWRF. But Presto’s analytics tasks required columnar reads, which you can liken to the ability to stack all the needles up without having to extract each one individually from the hay. DWRF extracted rows, or records, not columns.
DWRF Pushdown: This Time, It’s Personal
What’s more, DWRF did not support a key feature Facebook desperately needed, called predicate pushdown.
We asked Arshak Navruzyan, a vice president of product management at Argyle Data, to explain what this is. Argyle is one of PrestoDB’s principal third-party customers, and a contributor to the Presto open source project. It uses Presto in the course of providing real-time fraud detection services for major telecommunications providers. You may find our story about Argyle’s use of Presto here.
Navruzyan told us that Argyle’s connector between Presto and Apache Accumulo (a distributed key/value store) does use predicate pushdown. “So the WHERE clause of the query actually gets pushed down into the index store, into Accumulo,” he says. “Then Accumulo is able to do a range query.”
The pushing down is a kind of delay, but of exactly the right thing to delay. It translates to: Don’t start looking for the precise thing until the general things have been isolated first.
One of Presto’s key strengths, says Navruzyan, is the ability to conduct approximate queries, building on the work of researchers at UC Berkeley with their BlinkDB project [PDF]. A few years ago, it was this team which first tried to balance the needs for efficiency and practicality — to come up with a reasonable compromise.
“None of the previous solutions are a good fit for today’s big data analytics workloads,” the Berkeley team wrote in 2013. “OLA [online aggregation] provides relatively poor performance for queries on rare tuples, while sampling and sketches make strong assumptions about the predictability of workloads or substantially limit the types of queries they can execute. … BlinkDB allows users to pose SQL-based aggregation queries over stored data, along with response time or error bound constraints. As a result, queries over multiple terabytes of data can be answered in seconds, accompanied by meaningful error bounds relative to the answer that would be obtained if the query ran on the full data.”
It’s part of the inspiration behind Facebook’s work, which has evolved at least a few generations since Berkeley. With Presto’s new ORC reader grafted in place, end-to-end, single-threaded queries against a ZLIB-compressed dataset in memory now run, by Facebook’s calculations, from 3.5 to 4 times faster than with DWRF. (Your results, Facebook’s Sundstrom warns, may vary.)
“We are always pushing the envelope in terms of scale and performance,” wrote Sundstrom. Actually, he may be understating his case. The envelope, such as it was, was left behind several miles back. In the rearview mirror, you can still see the dust, along with some flecks of hay and more than a few bent and broken needles.
Featured image via Flickr Creative Commons.