Apache Fluo Speeds Small Updates to Big Data
Google originally struggled to keep its index of the web up to date as new documents continually arrived. Updating the search index used to involve a series of MapReduce jobs that took three days, until the company created Percolator, an incremental update system built on top of its Bigtable data store.
An open source version of Percolator, Apache Fluo, similarly allows concurrent small updates without the latency involved in reprocessing the whole dataset.
Fluo was recently named an Apache Software Foundation top-level project.
Fluo works with Apache Accumulo, a distributed key/value store based on the design of Bigtable for managing massive amounts of structured data. Accumulo stores data in Apache Hadoop’s HDFS and uses Apache ZooKeeper for consensus.
“Percolator and Fluo allow you to do things you could do with MapReduce, but do them with lower latency, so you have a quicker turnaround time when data is updated,” explained Keith Turner, vice president of Apache Fluo.
Percolator gives users two main components, according to Turner. The first is cross-node transactions: the ability to change data residing on multiple machines in a cluster so that either all of the changes go through or none of them do.
In Bigtable, you can only get transactional behavior on data stored on a single machine, he said.
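To make the all-or-nothing idea concrete, here is a minimal sketch in Python. It is a toy, not Percolator's or Fluo's actual protocol or API: the names `ToyCluster` and `ToyTransaction` are hypothetical, each "node" is just an in-memory dictionary, and conflict detection is reduced to a simple version check.

```python
class ToyCluster:
    """Hypothetical stand-in for a cluster; each node is a plain dict."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}  # node -> {key: value}
        self.versions = {}  # (node, key) -> commit number of the last write
        self.clock = 0      # counts successful commits


class ToyTransaction:
    """Buffers writes to several nodes and applies all of them or none."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.start = cluster.clock  # commits after this point are conflicts
        self.writes = {}            # (node, key) -> value, not yet visible

    def set(self, node, key, value):
        self.writes[(node, key)] = value

    def commit(self):
        c = self.cluster
        # Check every cell first; if any was changed since we started,
        # abort without applying a single write.
        if any(c.versions.get(cell, 0) > self.start for cell in self.writes):
            return False
        c.clock += 1
        for (node, key), value in self.writes.items():
            c.nodes[node][key] = value
            c.versions[(node, key)] = c.clock
        return True


cluster = ToyCluster(["node-a", "node-b"])
t1 = ToyTransaction(cluster)
t2 = ToyTransaction(cluster)

t1.set("node-a", "customer:42", "new address")
t1.set("node-b", "store:7", "customer 42 moved")

t2.set("node-a", "customer:99", "new phone")
t2.set("node-b", "store:7", "conflicting update")

ok1 = t1.commit()  # succeeds and writes to both nodes
ok2 = t2.commit()  # conflicts on store:7, so customer:99 is not written either
```

The second transaction touches one conflicting cell and one harmless cell, yet neither write lands. That is the behavior Turner describes; a real system has to achieve it across machines that can fail independently, which this sketch ignores.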
The second is Observers, which let you plug in user code that runs when data changes.
“You can kind of register, ‘Oh, something happened with this bit of data, and I want some user code to run a transaction against that row-column in the database,’ and that transaction can read other row-columns that may be on other nodes, write to other row-columns stored on other nodes and it can set other notifications that will trigger other Observers later,” he explained.
“So the incremental workflow comes from these two basic premises… You put together the transaction and the Observers and you have these changes that percolate through the system.”
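The percolating behavior can be illustrated with a small Python sketch. This is an assumption-laden toy rather than Fluo's API: `ToyStore`, `observe`, and `run_observers` are hypothetical names, everything lives in one process, and the transactional guarantees are left out so that only the notification chain is visible.

```python
from collections import deque


class ToyStore:
    """Hypothetical row-column store with observer notifications."""

    def __init__(self):
        self.cells = {}               # (row, column) -> value
        self.observers = {}           # column -> callback(store, row, column)
        self.notifications = deque()  # (row, column) pairs awaiting an observer

    def observe(self, column, callback):
        self.observers[column] = callback

    def set(self, row, column, value):
        self.cells[(row, column)] = value
        # Writing to an observed column queues a notification.
        if column in self.observers:
            self.notifications.append((row, column))

    def run_observers(self):
        """Run observers until no notifications remain; return how many ran."""
        ran = 0
        while self.notifications:
            row, column = self.notifications.popleft()
            self.observers[column](self, row, column)
            ran += 1
        return ran


def on_customer_change(store, row, column):
    # Reads the changed cell and writes to a different row, which in a
    # real cluster could live on another node.
    store.set("store:7", "last_customer", store.cells[(row, column)])


def on_store_change(store, row, column):
    count = store.cells.get(("report", "stores_touched"), 0)
    store.set("report", "stores_touched", count + 1)


store = ToyStore()
store.observe("name", on_customer_change)
store.observe("last_customer", on_store_change)

store.set("customer:42", "name", "Ada")
ran = store.run_observers()
```

A single write to the customer row triggers the first observer, whose own write triggers the second. The change percolates without any job rescanning the whole dataset.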
In Bigtable, transactions are limited to data on one node.
“If I’m a company storing customer data by customer ID in Bigtable, but I want to add new information to the customer ID, you’ve got a join there to add new information to existing information. What you can’t do in a sane, transactional way is, say, when a customer changes, I want it to turn around and update products that customer has bought in the past and update stores they’ve shopped at in the past,” he said.
“That information would be stored on other computers. I want to do that in a transactional way so that all of it goes through or none of it goes through. You can store all the information about a customer together, all the information about products together, and all the information about stores together, but when you want changes in one to affect changes in other ones, you have to do something more than what Bigtable offers.”
In late June, the project released version 1.1.0, which included a new Observer API that supports lambdas. The API previously required configuring an Observer class for each observed column, which was cumbersome and made using lambdas impossible. With the new API, you configure a single class that provides all Observers, each of which can be assigned to a column. The release also improved integration with Spark.
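The shape of that change, one configured class handing out lambda observers per column, might look roughly like the following Python sketch. Fluo itself is a Java project, and `ToyRegistry`, `for_column`, and `AppObservers` are hypothetical names chosen for illustration, not the Fluo API.

```python
class ToyRegistry:
    """Hypothetical registry that maps each column to one observer."""

    def __init__(self):
        self.by_column = {}

    def for_column(self, column, observer):
        if column in self.by_column:
            raise ValueError(f"column {column!r} already has an observer")
        self.by_column[column] = observer


class AppObservers:
    """The single configured class: it provides every observer."""

    def provide(self, registry):
        # Lambdas work because observers are ordinary callables that are
        # registered in code rather than named one class at a time in config.
        registry.for_column("name", lambda row, column: f"reindex {row}")
        registry.for_column("price", lambda row, column: f"reprice {row}")


registry = ToyRegistry()
AppObservers().provide(registry)
result = registry.by_column["price"]("product:9", "price")
```

Only `AppObservers` would need to appear in configuration; adding an observer for another column is a one-line change inside `provide`.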
“Apache Fluo is a very clever piece of software, elegantly supplementing Apache Accumulo’s ability to store and maintain very large indexes,” said Christopher Tubbs, a committer on both the Accumulo and Fluo projects. “Its support of transactions enables Accumulo to solve a whole new set of big data problems, and its observer framework makes designing ingest workflows fun.”
The biggest issue the Fluo project is working on is decoupling it from YARN, Turner said.
“It comes out of the box with support for launching it with YARN. We’re trying to rip that out of the core project and make it its own subproject. But while we’re doing that, we want to make it easier to launch in like Kubernetes and Mesos. Right now we’re very tightly coupled to YARN. We want to move away from YARN as the only option to making YARN one way to run Fluo,” he said.
Turner, who is also involved in the Accumulo project, said he’d like to work on increasing throughput by adding support for asynchronous read/write operations to Accumulo.
You can learn more about the Apache Fluo project at the Accumulo Summit on Oct. 16 in Columbia, Maryland.
Google is a sponsor of The New Stack.