Data delayed is data denied.
That was a slogan from the team that built out a self-service data platform at Facebook. Employees’ frustration at having to go through a data team to get the information they needed has been channeled into Qubole, a Santa Clara, Calif.-based startup that was generating buzz at ApacheCon recently for its focus on automation.
Facebook put data at the heart of everything it did, according to Ashish Thusoo, a member of that team, and now CEO and co-founder of Qubole.
Yet “it got to where if it was too painful to get the data, they went ahead without it; they wanted to move very, very fast,” he said. “It was a bad architecture for Facebook, and really, it’s bad for any company. … We took a step back and said, ‘This kind of architecture is very important for any organization that wants to become data-driven.’”
Thusoo and Joydeep Sen Sarma, who also created what is now Apache Hive, launched Qubole in 2011 and released their first product in 2013. Qubole aims to manage infrastructure, allowing data teams to focus on analyzing and using the data.
Thusoo and Sarma advocate a DevOps-like shift among those who operate and use data technology, and have written a book on the subject, “Creating a Data-Driven Enterprise with DataOps,” published by O’Reilly Media.
In May, the company announced what it calls its “autonomous data platform”: its Qubole Data Service (QDS), available in community and enterprise editions.
The platform, the company claims, self-manages, self-optimizes and learns from your usage to run in the most efficient and economical way. It runs on Amazon Web Services, Microsoft Azure and Oracle Bare Metal Cloud. In addition, the company added Spark on Google Cloud Platform in January.
With Qubole, a data scientist can spin up hundreds of clusters on their chosen public cloud, have the system autoscale to the optimal compute levels as needed and begin creating ad hoc and/or batch queries in less than five minutes, according to the company.
The cloud enables the decoupling of compute and storage, which is key to Qubole's architecture.
The platform automates on three levels: infrastructure management, data management and workload management.
It uses application-aware autoscaling algorithms that look at the workloads coming in and create the infrastructure, including heterogeneous clusters, clusters with different machine profiles and more.
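To make the idea of workload-driven sizing concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Qubole's algorithm: the `Workload` fields, the `target_nodes` function and the deadline-based sizing rule are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of the queued work on a cluster."""
    pending_tasks: int        # tasks waiting to run
    avg_task_seconds: float   # estimated run time per task
    deadline_seconds: float   # time budget for draining the queue


def target_nodes(w: Workload, slots_per_node: int,
                 min_nodes: int = 1, max_nodes: int = 100) -> int:
    """Return the node count needed to finish pending work within the deadline.

    Assumption: every task occupies one slot and tasks parallelize perfectly.
    """
    if w.pending_tasks == 0:
        return min_nodes
    total_work = w.pending_tasks * w.avg_task_seconds      # slot-seconds of work
    slots_needed = math.ceil(total_work / w.deadline_seconds)
    nodes = math.ceil(slots_needed / slots_per_node)
    # Clamp to the cluster's configured bounds.
    return max(min_nodes, min(max_nodes, nodes))
```

Sizing from the queued work rather than from CPU utilization alone is what makes a policy like this "workload-aware": the cluster can grow before machines saturate and shrink as soon as the queue drains.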
It advises users on how best to structure their datasets to reduce the time it takes to get answers and to increase infrastructure efficiency. And it guides users in managing workloads, such as by reusing data from existing workloads or joining data from datasets.
It also offers notebooks-as-a-service and SQL workbench-as-a-service as well as API connectors to other data tools like Tableau.
The company also issued a new set of agents, including:
- Workload-Aware Auto-Scaling Agent, which optimizes cluster size precisely to workload requirements and dynamically scales based on actual processing load.
- Spot Shopper Agent (AWS Only), which shops across the AWS cloud to assemble compute instances in the optimal combination of performance and cost.
- Data Caching Agent, which optimizes the location of your data for fast, interactive access speeds. Data accessed less frequently is intelligently moved in the background for the best performance.
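The kind of trade-off a spot-shopping agent has to make can be sketched as follows. This is a simplified, hypothetical illustration and not Qubole's implementation: the `Offer` record, the `shop` function and the greedy price-per-vCPU rule are assumptions, and a greedy pass is not guaranteed to find the cheapest possible mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Hypothetical spot-market offer for one instance type."""
    instance_type: str
    vcpus: int
    hourly_price: float
    available: int       # how many instances can be acquired right now


def shop(offers, vcpus_needed):
    """Greedily fill the vCPU requirement from the cheapest price per vCPU.

    Returns a dict mapping instance type to instance count.
    Raises ValueError if the offers cannot cover the requirement.
    """
    plan = {}
    remaining = vcpus_needed
    for offer in sorted(offers, key=lambda o: o.hourly_price / o.vcpus):
        if remaining <= 0:
            break
        wanted = -(-remaining // offer.vcpus)     # ceiling division
        count = min(offer.available, wanted)
        if count > 0:
            plan[offer.instance_type] = count
            remaining -= count * offer.vcpus
    if remaining > 0:
        raise ValueError("not enough spot capacity for the requested vCPUs")
    return plan
```

Because availability is capped per instance type, the result is naturally a heterogeneous cluster: the cheapest type is used until it runs out, and the remainder is filled from the next-cheapest.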
Elastic and MapReduce often are considered Qubole competitors, but Thusoo says it differentiates itself in several ways:
- Choice: The same platform runs on Azure, on AWS, on Oracle Cloud and on Google. If customers want to build out a workload on multiple clouds or if they want different workloads in different clouds, workloads don’t have to be retooled.
- Analysts, data scientists, and developers all can use the same platform. Competitors don’t offer the same level of automation or self-service as Qubole, he said.
- It provides this access at a fraction of the cost by using hardware more efficiently.
Users on a single account use several of its as-a-service offerings for different use cases, such as Spark for machine learning, Hive for ETL, and Presto and Hive for log analysis. It’s processing 750PB of data a month, Thusoo said.
“Our value proposition is that from a single platform, they can subsume all these use cases,” he said.