For companies trying to manage an array of storage options, Alluxio provides a way to herd all these repositories into the same corral for their big data projects.
Alluxio has become one of the fastest-growing open source big data projects, with more than 500 contributors after four years, CEO Haoyuan Li told an audience at the Vault big data conference in March.
Traditionally, the big data ecosystem has been MapReduce for compute and HDFS for storage, but now there are many different choices for storage, all with different properties. As a result, enterprises end up with silos across the different storage systems, and those silos become hard to manage.
Many of these storage systems were not built for big data workloads, so performance becomes a big issue as well, he said. Alluxio aims to unify data from all the different storage systems, present a single global namespace to the upper layer of applications and, at the same time, let operators access the data quickly.
“We put Alluxio between the compute and storage systems. We unify the data from different storage and present this global namespace to the upper-level applications to enable them to interact with the data at memory speed,” he said.
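To make the global namespace idea concrete, here is a minimal sketch in Python. It is an illustration only, not Alluxio's API: the UnifiedNamespace class, its method names, and the bucket and host names are all hypothetical. It shows how a single path tree can be mapped onto several underlying storage systems by longest-prefix matching.

```python
class UnifiedNamespace:
    """Toy mount table: one logical path tree over several storage systems.

    Hypothetical illustration; not Alluxio's actual interface.
    """

    def __init__(self):
        self.mounts = {}  # logical mount point -> under-storage URI prefix

    def mount(self, mount_point, storage_uri):
        # Normalize trailing slashes so prefixes join cleanly.
        self.mounts[mount_point.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, path):
        """Translate a logical path into the URI of the backing store."""
        # Try the longest mount point first so nested mounts win.
        for mount_point in sorted(self.mounts, key=len, reverse=True):
            if path == mount_point or path.startswith(mount_point + "/"):
                return self.mounts[mount_point] + path[len(mount_point):]
        raise FileNotFoundError(path)


ns = UnifiedNamespace()
ns.mount("/sales", "s3://example-bucket/sales")    # hypothetical bucket
ns.mount("/logs", "hdfs://namenode:9000/var/logs")  # hypothetical cluster
print(ns.resolve("/logs/2017/03.log"))
```

An application in this sketch only ever sees paths such as /sales and /logs; which system actually holds the bytes is a detail of the mount table.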
In essence, it’s a programmable interface — distributed node-based memory — between compute frameworks like Spark and MapReduce and the underlying storage systems. It then uses a tiered storage architecture that caches the most often-used data in memory, with less-often-used data on SSDs and traditional hard drives.
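The tiering described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Alluxio's implementation: the TieredCache class is hypothetical, capacities are counted in blocks rather than bytes, and least-recently-used demotion is assumed as the policy for deciding what counts as "less often used."

```python
from collections import OrderedDict


class TieredCache:
    """Toy tiered block cache: hot blocks in memory, colder ones on SSD, then HDD."""

    def __init__(self, capacities):
        # capacities: tier name -> max number of blocks, fastest tier first,
        # e.g. {"MEM": 2, "SSD": 2, "HDD": 4}
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def tier_of(self, block_id):
        """Return the name of the tier holding the block, or None."""
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None

    def access(self, block_id, load_from_under_storage):
        """Return the block's data and promote it to the fastest tier."""
        current = self.tier_of(block_id)
        if current is None:
            data = load_from_under_storage(block_id)  # cache miss
        else:
            data = self.tiers[current].pop(block_id)
        self._put(0, block_id, data)
        return data

    def _put(self, level, block_id, data):
        if level >= len(self.order):
            return  # dropped from the slowest tier; under storage still has it
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data  # most recently used sits at the end
        if len(tier) > self.capacities[name]:
            # Demote the least recently used block to the next, slower tier.
            victim_id, victim_data = tier.popitem(last=False)
            self._put(level + 1, victim_id, victim_data)
```

In this sketch every access promotes a block to memory, and overflow cascades downward one tier at a time, so a block that has not been read recently drifts toward disk and is eventually dropped from the cache altogether.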
As Jowanza Joseph, senior software engineer at One Click Retail, put it:
“Ideally, we’d have some way of specifying which data we’d want to keep on the cache, when to release it and to be able to plan around that. Alluxio is exactly this, with a sophisticated API and support for many data stores out of the box.”
Alluxio supports a range of storage systems, including Amazon S3, Google Cloud Storage, Gluster, Ceph, HDFS, NFS, and OpenStack Swift.
The cache functionality helped Barclays to reduce its workflow iteration time from hours to seconds.
“Even though Spark provides a cache functionality, every time we restart the context, update the dependency jars or re-submit the job, the loaded data is dropped from the memory and the only way to restore it is to reload it from the central warehouse,” the bank states in a white paper.
At Vault, Li highlighted how Chinese travel site Qunar uses Alluxio to manage data across disparate storage systems, and how at Chinese search firm Baidu, batch queries that previously took 15 minutes now take less than 30 seconds.
Baidu manages an Alluxio cluster that scales to 1,000 nodes and more than 2PB, including 50TB of memory storage and the balance on disk.
From Tachyon to Alluxio
Alluxio began around 2012 as a project at the University of California, Berkeley's AMPLab, where it was originally called Tachyon. It was open sourced in 2013 and renamed in 2016. Version 1.5 is due out this quarter, Li said.
The commercial enterprise edition was unveiled in January; a free community edition can be downloaded from the Alluxio website.
Among Alluxio’s differentiators, according to an Evaluator Group report, is that it uses re-computation of log data to provide fault tolerance rather than creating three distributed copies at ingest, as distributed file systems typically do. That improves performance and means it can rebuild data sets to a point in time before a failure.
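The trade-off between replication and re-computation can be illustrated with a small Python sketch. It is a toy model, not Alluxio's mechanism: the LineageStore class and its method names are hypothetical, and it assumes source data is already persisted somewhere durable. Instead of keeping three copies of every derived data set, the store remembers how each one was produced (often called its lineage) and replays that recipe after a loss.

```python
class LineageStore:
    """Toy store that recovers lost in-memory data by recomputing it."""

    def __init__(self, checkpointed):
        self.checkpointed = dict(checkpointed)  # durable inputs; survive failures
        self.memory = {}    # derived data held in memory only (a single copy)
        self.lineage = {}   # data set name -> (function, input names)

    def derive(self, name, func, inputs):
        """Compute a data set and record how it was produced."""
        self.lineage[name] = (func, tuple(inputs))
        self.memory[name] = func(*(self.get(i) for i in inputs))

    def get(self, name):
        if name in self.memory:
            return self.memory[name]
        if name in self.checkpointed:
            return self.checkpointed[name]
        # Lost from memory: replay the recorded computation, recursing
        # through any inputs that were lost as well.
        func, inputs = self.lineage[name]
        value = func(*(self.get(i) for i in inputs))
        self.memory[name] = value
        return value

    def simulate_failure(self):
        """Drop everything that lived only in memory."""
        self.memory.clear()


store = LineageStore({"logs": [3, 1, 2]})
store.derive("sorted", sorted, ["logs"])
store.derive("largest", lambda xs: xs[-1], ["sorted"])
store.simulate_failure()
print(store.get("largest"))  # rebuilt from "logs" via the recorded steps
```

Writes in this sketch touch only one in-memory copy, which is where the performance gain comes from; the cost is paid later, as compute time, only if something is actually lost.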
Gartner compared it to other Hadoop operations providers, including Attunity, BlueData Software, DriveScale, GridGain Systems and Pepperdata. It’s a market struggling with a skills gap and technology immaturity, the analyst firm noted.