Apache Drill Eliminates ETL, Data Transformation for MapR Database
Hadoop distribution provider MapR is using the recently released Apache Drill query engine version 1.6 as its “unified SQL layer” for its converged data platform, to provide a tighter integration with the MapR-DB document database.
With the MapR-DB document database format plugin in Drill 1.6, a user can query JSON tables in MapR-DB directly, potentially eliminating the need for additional ETL (extract, transform and load) operations.
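As a rough sketch of what such a direct query might look like from a client, the snippet below builds a request for Drill's REST endpoint (`/query.json`). The host and port, the table path `/apps/products`, and the field names are hypothetical placeholders. Addressing the MapR-DB table through the `dfs` storage plugin is an assumption about how the plugin is configured on a given cluster.

```python
import json
import urllib.request

# Assumed address of a Drillbit's web interface; adjust for a real cluster.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST carrying a SQL query for Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical MapR-DB JSON table path and field names, for illustration only.
EXAMPLE_SQL = (
    "SELECT t.name, t.price "
    "FROM dfs.`/apps/products` t "
    "WHERE t.category = 'accessories' LIMIT 10"
)
```

Calling `run_query(EXAMPLE_SQL)` against a running Drillbit would send the statement as-is; no export or load step sits between the table and the SQL.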
“You can have files, database tables, streams that are contained and managed through that converged platform and Apache Drill can be used to query across all the data regardless of where it’s located,” said Jack Norris, MapR's senior vice president of data and applications.
“Users can access those files directly with Drill, they can query them in the database tables; they can look at it through Hive. Regardless of how the data arrived or where the data is located, Drill is the SQL interface that allows access and queries directly on that data,” Norris said.
Six months ago, MapR released a developer preview of its JSON-based document database for use inside Hadoop. It announced MapR-DB document database capabilities as part of the MapR 5.1 release in March.
The Apache Software Foundation elevated Drill to a top-level project in December 2014. Version 1.0 followed last May.
Drill was designed as a schema-free SQL query engine for multiple data sources, including JSON, Parquet, and HBase. It not only allows rapid application development on Apache Hadoop but also lets enterprise BI analysts explore the data themselves, freeing IT staff from having to structure it for them.
Drill lets you analyze Hadoop data without running ETL or defining schemas first. It generates schemas on the fly and leaves files in their original formats, rather than requiring that they be converted into tables or predefined formats before being loaded into a database system.
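To make the idea of generating a schema on the fly more concrete, here is a toy illustration in Python. It is not Drill's implementation, only a sketch of the general schema-on-read idea: field names and types are discovered while scanning the records rather than declared beforehand. The sample records are invented.

```python
import json

# Invented sample data: JSON records whose fields differ from line to line.
RAW_LINES = [
    '{"id": 1, "name": "helmet", "price": 49.5}',
    '{"id": 2, "name": "pump", "tags": ["tools"]}',
    '{"id": 3, "price": 12}',
]


def discover_schema(lines):
    """Scan JSON records and collect the value types observed for each field."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def select(lines, fields):
    """Project the requested fields; fields absent from a record become None."""
    return [
        {field: record.get(field) for field in fields}
        for record in map(json.loads, lines)
    ]


schema = discover_schema(RAW_LINES)
rows = select(RAW_LINES, ["name", "price"])
```

Here the records stay in their original form, and a field missing from one record simply comes back as `None` instead of causing a load-time error.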
“The unique position that Apache Drill occupies is really in data exploration — to be able to support directly some of the most common formats out there that are also fairly difficult to query directly, things like JSON documents,” Norris said.
A web provider of bicycle equipment could, for instance, offer a single search service that both covers in-depth information, such as documentation for the bikes, and returns results from simple product searches, such as from a catalog of accessories.
The information can be stored in a relational database, a NoSQL system such as HBase, a document database such as MapR-DB, or even a flat file.
The Drill 1.6 release includes the following enhancements:
- Query planning speedups via early application of partition pruning.
- Enhanced stability and scale with an improved memory allocator.
- Faster query planning on Hive table queries.
- Optimized reading of Parquet metadata cache.
- Security through “client impersonation,” which Norris described as role-based views of the data without multiple different copies of it.
“Apache Drill is a game changer for us,” said Edmon Begoli, chief technology officer of PYA Analytics, a Tennessee-based advanced analytics company serving healthcare, defense and other industries.
“We’ve been able to query, in under 60 seconds, two years worth of flat PSV files of claims, billing, and clinical data from commercial and government entities, such as the Centers for Medicaid and Medicare Services,” Begoli said. “Drill has allowed us to bypass the traditional approach of ETL and data warehousing, convert flat files into efficient formats such as Parquet for improved performance, and use plain SQL against very large volumes of files.”
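One way the flat-file-to-Parquet conversion Begoli mentions could be expressed in Drill is a CREATE TABLE AS (CTAS) statement. The sketch below only assembles such statements as strings and does not reflect PYA Analytics' actual pipeline. The source path, target table name, and column positions are hypothetical, and `dfs.tmp` is assumed to be a writable workspace.

```python
def parquet_ctas_statements(source_path, target_table, columns):
    """Return Drill SQL statements that copy a delimited file into Parquet.

    `columns` maps output column names to zero-based positions in the
    source file, which Drill exposes for delimited text as `columns[n]`.
    """
    select_list = ", ".join(
        f"columns[{position}] AS `{name}`" for name, position in columns.items()
    )
    return [
        # Parquet is Drill's default CTAS format; set explicitly for clarity.
        "ALTER SESSION SET `store.format` = 'parquet'",
        f"CREATE TABLE dfs.tmp.`{target_table}` AS "
        f"SELECT {select_list} FROM dfs.`{source_path}`",
    ]


# Hypothetical pipe-separated claims file and column layout.
statements = parquet_ctas_statements(
    "/data/claims/2015.psv",
    "claims_parquet",
    {"claim_id": 0, "amount": 3},
)
```

Subsequent queries could then target the Parquet table with plain SQL while the original PSV files remain in place.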
Feature Image: “Day 232 – Photo365 – Construction” by Makia Minich, licensed under CC BY-SA 2.0.