Dremio Wants to Be the Splunk of Big Data
Mountain View, Calif.-based Dremio emerged from stealth on Wednesday with the aim of making data analytics self-service. Qubole shares a similar goal, though the two startups are taking different approaches.
Essentially, Dremio aims to eliminate the middle layers and the work involved between the user and the data stores, including traditional ETL, data warehouses, cubes and aggregation tables.
Two-year-old Dremio’s founders are Tomer Shiran, former vice president of product at MapR, and Jacques Nadeau, who ran the distributed systems team at MapR. Both have been active in open source: Shiran founded the Apache Drill project, and Nadeau is the creator and project management committee chair of Apache Arrow.
“We created the company because we believe there’s a massive opportunity for disruption here,” Shiran explained. “Think about what Amazon was able to do for application developers… Ten years ago, if you were an application developer, you were really reliant on IT to go buy and set up resources for you. Amazon created a solution that put developers in the driver’s seat. It gave developers the ability to get their own resources and their own hardware, and they can do it almost instantaneously, in a minute.”
That’s what Dremio aims to do for business analysts and data scientists.
The company has raised more than $15 million from Lightspeed Venture Partners and Redpoint. Its management team includes big data and open source leaders from Hortonworks, Mesosphere and MongoDB.
Dremio connects to all of an organization’s data sources, including data lakes and databases, and takes care of everything in the middle. Its Arrow-based execution engine uses columnar in-memory processing to execute queries against a single source or across several sources.
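As a rough illustration of why the columnar layout matters (a toy sketch in plain Python, not Arrow’s actual buffers): an aggregate over a columnar layout scans one contiguous, typed array instead of touching every field of every row.

```python
from array import array

# Row layout: one dict per record; summing a field must touch every row object.
rows = [{"id": i, "amount": i * 2} for i in range(5)]
row_sum = sum(r["amount"] for r in rows)

# Columnar layout (the idea behind Arrow's in-memory format): each column is
# one contiguous typed buffer, so an aggregate scans a single array.
amounts = array("q", (i * 2 for i in range(5)))
col_sum = sum(amounts)

print(row_sum, col_sum)  # 20 20
```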
It also optimizes the data itself, similar to the way Google optimizes data in various data structures so that search queries can be very fast, Shiran said. It calls these data structures “Reflections.”
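The idea behind Reflections can be sketched with a toy group-by aggregate (illustrative Python only, not Dremio’s implementation): a query that matches a precomputed structure is answered from that structure instead of rescanning the raw data.

```python
from collections import defaultdict

# Hypothetical raw rows, as an engine would otherwise scan them per query.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]

# Build a reflection-like aggregate once: totals pre-grouped by region.
reflection = defaultdict(int)
for row in rows:
    reflection[row["region"]] += row["amount"]

def total_for(region):
    # A matching query now hits the small aggregate, not the raw rows.
    return reflection[region]

print(total_for("east"))  # 17
```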
And it has a user interface much like Google Docs, except for data sets rather than documents. The users themselves can see the data and explore it. They can create new data sets by doing live data curation. They can interact with the data visually or through SQL.
“Everything under the hood is standard SQL, and more technical users can do anything in the power of SQL. You can create new data sets, share them with colleagues. There’s an entire data catalog in there.
“Then with a click of a button, you can launch any of these BI tools, connect it to the Dremio cluster already and start playing with the data inside Tableau without extracting any data. There are no copies of data. All the data sets and curation inside Dremio are all virtual. It’s all done at the logical layer. All the current solutions are based on data copies, and Dremio is the opposite of that,” he said.
Because the major BI tools speak SQL, Dremio forms a bridge between them and NoSQL databases such as MongoDB, automatically learning the implicit schema of each system even when no explicit schema exists.
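The implicit-schema idea can be sketched in a few lines (a hypothetical toy; Dremio’s actual schema learning is far more sophisticated): walk the documents, record each field’s type, and widen the type when documents disagree.

```python
def infer_schema(docs):
    """Toy schema inference over schemaless JSON-style documents."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            t = type(value).__name__
            # Widen to 'mixed' when the same field carries different types.
            if field in schema and schema[field] != t:
                schema[field] = "mixed"
            else:
                schema.setdefault(field, t)
    return schema

docs = [
    {"name": "ada", "age": 36},
    {"name": "alan", "age": "unknown", "city": "London"},
]
print(infer_schema(docs))  # {'name': 'str', 'age': 'mixed', 'city': 'str'}
```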
“It’s kind of what Splunk did with logs,” Shiran explained. “It wasn’t that people weren’t analyzing logs before, but they were using a lot of command-line tools and loading logs into relational databases — it was just a lot of manual work. Splunk designed a solution specifically for log analytics and made it so you don’t have to glue together all these tools in order to analyze your logs.”
Dremio is designed to scale from one server to thousands of servers in a single cluster. It can be deployed on Hadoop or on dedicated hardware. With Hadoop, the company recommends deploying Dremio on the Hadoop cluster itself so that cached raw data stays local.
There are two roles in the Dremio cluster:
- Coordinators, which plan and coordinate query execution, manage metadata and serve the UI.
- Executors, which execute queries.
Deploying coordinators on edge nodes lets external applications such as BI tools connect to them. Coordinators use YARN to provision compute capacity for the cluster, eliminating the need for manual deployment. The company recommends one executor on each Hadoop node in the cluster.
Dremio, in effect, is an extension of its founders’ open source work. Drill is a single SQL engine that can query and join data from myriad systems. Dremio uses Apache Arrow (columnar in memory) and Apache Parquet (columnar on disk) for high-performance columnar execution and storage.
Dremio looks like a single, high-performance relational database to any tool. You just send standard SQL queries. Meanwhile, Dremio automatically optimizes the physical organization of your data for different workloads in a cache, or it queries your data sources directly when you need access to live datasets.
It uses a persistent cache that can live on HDFS, MapR-FS, cloud storage such as S3, or direct-attached storage (DAS). The cache size can exceed that of physical memory, an architecture that enables Dremio to cache more data at a lower cost, producing a higher cache hit ratio compared to traditional memory-only architectures, according to the company.
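A minimal sketch of the disk-backed idea, assuming pickle-able values and a local temp directory (illustrative only; Dremio’s cache formats and eviction are its own): entries live on storage rather than in RAM, so capacity is bounded by disk, not memory.

```python
import os
import pickle
import tempfile

class DiskCache:
    """Toy persistent cache: each entry is a file on disk."""

    def __init__(self):
        self.dir = tempfile.mkdtemp()

    def put(self, key, value):
        with open(os.path.join(self.dir, key), "wb") as f:
            pickle.dump(value, f)

    def get(self, key):
        path = os.path.join(self.dir, key)
        if not os.path.exists(path):
            return None  # cache miss: fall back to the source
        with open(path, "rb") as f:
            return pickle.load(f)

cache = DiskCache()
cache.put("daily_totals", {"east": 17})
print(cache.get("daily_totals"))  # {'east': 17}
```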
It also offers native query pushdowns. Instead of performing full table scans for every query, Dremio pushes processing down into the underlying data sources, rewriting SQL in the native query language of each source, such as Elasticsearch, MongoDB and HBase, and optimizing processing for file systems such as Amazon S3 and HDFS.
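The rewrite step can be illustrated with a toy translator for a single equality predicate (hypothetical names; a real planner handles full SQL): the predicate becomes a MongoDB-style filter document the source evaluates itself, so only matching rows cross the wire.

```python
import re

def sql_filter_to_mongo(predicate):
    """Translate a simple "col = 'value'" SQL predicate into a
    MongoDB-style filter document (toy sketch, equality only)."""
    m = re.fullmatch(r"\s*(\w+)\s*=\s*'([^']*)'\s*", predicate)
    if not m:
        raise ValueError("only simple equality predicates supported")
    column, value = m.groups()
    return {column: {"$eq": value}}

# The engine would hand this filter to the source, e.g.
# collection.find(sql_filter_to_mongo("status = 'active'")),
# instead of scanning every document locally.
print(sql_filter_to_mongo("status = 'active'"))  # {'status': {'$eq': 'active'}}
```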
Its Data Graph preserves a complete view of the flow of data. Companies have full visibility into how data is accessed, transformed, joined, and shared across all sources and all analytical environments.
Open Source Model
Dremio comes in an open source Community edition and an Enterprise edition. The Enterprise edition includes connectivity to enterprise data sources such as IBM DB2, as well as security and governance capabilities.
It can run on-premises or in the cloud. One advantage of running Dremio in the cloud is that Reflections, the optimized data structures, can be stored directly on S3, Shiran said.
“It’s a fully managed cache and you can scale your compute capacity independent of that. Say after a Black Friday, you need more analytics capacity, you spin up a few more Dremio instances, and you spin it down when you don’t need it,” he said.