Apache Arrow’s Columnar Layouts of Data Could Accelerate Hadoop, Spark
Last week, the Apache Software Foundation made Arrow, an in-memory columnar layer, a top-layer project. The open source software promises to accelerate analytical processing and interchange by more than 100 times in some cases.
Initially seeded by code from Apache Drill, Apache Arrow is now a collaborative project across 13 other open source projects, including Drill, Cassandra, Hadoop and Spark, are involved in the Arrow project, a factor in convincing ASF that it should bypass the incubator period and be named a top-level project straightaway.
In introducing Arrow, Jacques Nadeau, the project’s vice president, makes clear that it is not a storage or execution engine. It is a technology to be embedded within the other projects, he said.
“You can think of [Arrow as] kind of like an embeddable accelerator and a way of moving data between systems, so it’s designed to be a foundational component for execution engines and storage systems. Right now, they’re all building their own things to try to solve this, but their own things are row-focused. We’re trying to come together and have a single way of doing this as in-memory and columnar,” he said.
Several of the projects have talked about this was one of the next things they wanted to do and the project leaders agree that in-memory columnar was the way to achieve performance gains, according to Nadeau, CTO at Dremio, the stealth startup also behind Apache Drill.
“We’ve seen segmentation and multiple technologies trying to compete with each other that’s confusing users,” said Nadeau.
“Arrow is not replacing anything that exists today. It’s also not a technology that users have to directly adopt. It’s a technology designed to enhance the technologies they already use.
“What they should get first and foremost at the end of the day is improved performance, which everybody cares about,” Nadeau said. “Vendors say how great they are at in-memory and that kind of thing. Making these engines in-memory columnar engines makes them [vendors] extremely interested. But people have all these choices, and what happens is that it’s very expensive if you want to use multiple versions of these things. If you want to use Pandas and Spark and Drill all in the same ecosystem, you have two challenges: it’s hard to move data between those systems, and this solves that, and the second is that this is designed specifically so you can avoid multiple copies of data in memory,” he explained.
Single Representation of Data
“One frequent situation is you might have your business analysts using Tableau against Drill and you might have your data scientists using something like Pandas – what happens is you have all these applications that are trying to access the same data. They all have to load it into memory, then you have to have memory for each of those systems. But with Arrow, you can hold one representation of that data in memory, which means you can have either less hardware or you can have a larger active data set in memory,” Nadeau said.
“Typically, no one uses one tool. There are different tools that are the best fit for different use cases. By being able to use a single memory layer underneath tools, you don’t have to pay the memory cost multiple times,” he said.
In many workloads, 70 of 80 percent of CPU cycles are spent serializing and deserializing data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialization, deserialization or memory copies.
By providing a canonical columnar in-memory representation of data, each programming language can interact directly with the raw data.
Drill, Impala, Kudu, Ibis, and Spark will become Arrow-enabled this year, with other projects expected to follow.
All the open source big data projects have to deal with the effective use of modern multi-core processes including minimizing the L1 and L2 cache misses that plague applications; Arrow is an attempt to bring all these minds together, according to Monte Zweben, co-founder and CEO of Splice Machine, makers of a relational database management system powered by Hadoop and Spark.
“The next generation processors will have even broader support for SIMD (Single instruction, multiple data), which will increase parallelism and soon we will have the ability to have all data stored in non-volatile, byte-addressable RAM. This means no disks. So optimizing the representation of data to get the most throughput out of the processors will bring fantastic performance improvements on many workloads,” he said.
In the business computing space, the low-level data engineering tier of “big data” is dominated by projects like Hadoop and the Apache ecosystem of Java-based projects mostly concerned with traditional data warehousing and stream processing needs, and support analytics by simply providing SQL query interfaces, according to Peter Wang, co-founder and CTO of Continuum Analytics, a Python-based analytics platform.
“Arrow definitely constitutes a new capability for this ecosystem, where advanced simulation, optimization and ‘hard math’ have actually not been very performant,” he said. “And make no mistake: Those needs are not a small or niche minority of use cases, but rather sit at the heart of every aspect of machine learning, deep learning, graph analytics and the like.”
Most of what passes for business analytics is really just “counting things,” he said
“But when the physicists at LIGO (the Laser Interferometer Gravitational-Wave Observatory), need to analyze the data from colliding black holes to see if Einstein was right, they use Python,” Wang said.
“The reason for this is because Python wraps pleasant and efficient interfaces over high-performance low-level algorithms in C, C++ and FORTRAN. These non-Java languages interoperate well because they can seamlessly pass control between each other, without having to copy or reformat gigabytes or terabytes of data,” Wang said.
“For decades, the mantra in this realm of advanced analytics has been ‘minimize data movement,’ and its practitioners look upon the sheer amount of data copying back-and-forth in Hadoop with horror. That’s one of the reasons Java and its entire ecosystem is an outsider — or second-class citizen, at best — in the numerical computing space,” he said.
“Arrow is exactly the right thing!” he says. “It addresses this core weakness in the Java and Hadoop ecosystem by building an efficient, highly-compatible representation at the lowest level of the data stack. Even newer ‘efficient’ representations in Hadoop like Parquet and Avro are not sufficient for this because the most important attribute is not mere efficiency within the Hadoop stack, but rather interoperability with non-Hadoop software ecosystems.”
Arrow should open the doors for data scientists to efficiently run advanced algorithms from Python and C on native Hadoop data, according to Wang.
“I’m personally very excited about working [together in the industry] to form a more perfect union between these ecosystems that should have always worked better together from the outset,” he said.