“The complexity that modern data infrastructure imposes on developers we thought was insane,” Venkataramani said. “Going from useful data to useful apps requires too many hurdles. Not only was the cost of building the complex infrastructure high, but the personal cost of upkeep was very, very high. We were just thinking about how do we make it really easy for developers and data scientists to build apps, especially in the cloud.”
Venkataramani, the new company’s CEO, had been managing online data infrastructure at Facebook. Fellow co-founder and chief technology officer Dhruba Borthakur was one of the founding engineers of the Hadoop Distributed File System. Chief architect Tudor Bosman co-created the Facebook graph search backend. He also worked at Google on Gmail’s storage and indexing backend, and at Oracle on database servers.
They maintain that too much time is being spent on data preparation and ETL (extract, transform and load).
“Much of the infrastructure for building data apps was optimized for running on-prem or dedicated Linux servers,” Venkataramani explained. “While workloads were moving to the cloud, we were thinking about how to eliminate a lot of the steps. We were thinking about what is the simplest and most powerful product we can build?
“We thought, ‘What if we just index all this data and make it easy for people to do fast SQL on raw data directly, so we eliminate a lot of steps in data preparation, pipelines, schema modeling, performance engineering and all these things, in a serverless fashion in the cloud.’ [That way,] we could eliminate a lot of the things that current technology imposes on them.”
SQL Power without Pain
SQL is the query language of choice for a majority of Big Data applications, but querying unstructured data with SQL remains painful, Peter Bailis, assistant professor of computer science at Stanford University, writes in a blog post.
“Querying an unstructured data source using SQL for use in analytics, data science, and application development requires a sequence of tedious steps: figure out how the data is currently formatted, determine a desired schema, input this schema into a SQL engine, and finally load the data and issue queries,” he wrote. “This setup is a major overhead, and this isn’t a one-time tax: users must repeat these steps as data sources and formats evolve.”
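The sequence of steps Bailis lists can be sketched with Python's built-in sqlite3 (the data, table and column names here are invented for illustration):

```python
import json
import sqlite3

# Step 1: figure out how the raw data is currently formatted.
raw = ['{"user": "ada", "clicks": 3}', '{"user": "bob", "clicks": 7}']
records = [json.loads(line) for line in raw]

# Step 2: determine a desired schema and input it into the SQL engine by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")

# Step 3: load the data into the engine.
conn.executemany(
    "INSERT INTO events (user, clicks) VALUES (?, ?)",
    [(r["user"], r["clicks"]) for r in records],
)

# Step 4: finally, issue queries.
total = conn.execute("SELECT SUM(clicks) FROM events").fetchone()[0]
print(total)  # -> 10
```

If a new field shows up in the source, steps 1 through 3 must be repeated, which is exactly the recurring tax Bailis describes.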
The Rockset answer to this is to develop its own storage and indexing technology built atop RocksDB. Founded in 2016, the San Mateo, Calif.-based company recently came out of stealth and announced an $18.5 million Series A, led by Sequoia and Greylock, on top of $3 million in seed money raised earlier.
Offered as a SaaS product, Rockset is a serverless search and analytics engine that combines the power of search engines with columnar databases to provide fast SQL on diverse data. It relies on strong dynamic typing and indexing to make that happen. And it takes advantage of cloud auto-scaling to provide cost efficiency.
Rockset does not require upfront schema definition or data denormalization, since it handles semi-structured data formats such as JSON, Parquet, XML, CSV and TSV by indexing and storing them in a way that can support relational queries using SQL, according to a white paper outlining its architecture.
Data from Anywhere
It can ingest data from real-time streams, data lakes, databases, and data warehouses without building pipelines. Rockset continuously syncs new data as it comes in without the need for a fixed schema.
It is optimized for key-value, time-series, document, search, aggregation and graph type queries. The Rockset query optimizer uses a hybrid of rule-based and cost-based optimizations employing machine learning to learn a customer’s query patterns and make them more efficient.
“We store the data in our own proprietary format, our own way of sorting the data,” Venkataramani said. “We take a complex data set and shred it into a whole bunch of little pieces and organize that in our back end in a way that we can power very fast SQL processing on top of that. … Right now, a lot of the processing is happening at write time … where you need to handle all these edge cases of data preparation before the data is loaded into a database. We move that to the query processing without sacrificing performance or scale.”
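The “shredding” Venkataramani describes can be pictured, in a drastically simplified sketch with invented names (Rockset’s actual on-disk format is proprietary), as flattening each nested document into (field-path, value) pairs that individual indexes can then serve:

```python
# Toy sketch of "shredding" a nested document into (field_path, value)
# pairs; the real converged indexing scheme is far more involved.
def shred(doc, prefix=""):
    pairs = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.extend(shred(value, path))  # recurse into nested objects
        else:
            pairs.append((path, value))
    return pairs

doc = {"user": {"name": "ada", "plan": "pro"}, "clicks": 3}
pairs = shred(doc)
# Each pair could feed both a columnar store and an inverted index, so a
# predicate like user.plan = 'pro' can be answered from an index directly.
print(pairs)  # -> [('user.name', 'ada'), ('user.plan', 'pro'), ('clicks', 3)]
```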
Elasticsearch and other search-based processing systems use similar approaches, he said.
“You can feed single, semi-structured data streams into Elasticsearch and build applications on top of that. But at Rockset, we take it to a whole ‘nother level. We are built for the cloud so there’s a lot of elasticity, not just in indexing. We give you full-feature SQL so you can build complex applications that need joins and aggregations and the much more sophisticated processing that SQL systems can do.”
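The join-plus-aggregation workload he contrasts with search engines might look like the following (collection and field names are invented; sqlite3 stands in for the SQL engine here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, clicks INTEGER);
    CREATE TABLE users  (id INTEGER, country TEXT);
    INSERT INTO events VALUES (1, 3), (1, 4), (2, 5);
    INSERT INTO users  VALUES (1, 'US'), (2, 'DE');
""")

# A join plus an aggregation: a one-liner in SQL, but awkward to express
# in a pure search engine's query DSL.
rows = conn.execute("""
    SELECT u.country, SUM(e.clicks) AS total_clicks
    FROM events e JOIN users u ON e.user_id = u.id
    GROUP BY u.country
    ORDER BY u.country
""").fetchall()
print(rows)  # -> [('DE', 5), ('US', 7)]
```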
Rockset runs a microservices architecture on containers and Kubernetes, with a cloud-agnostic approach. It employs RocksDB-Cloud as an embedded storage engine, along with a custom resource scheduler and a custom C++ query processing engine. Ingestion and querying are auto-scaled separately based on limits set by the user.
Though designed to be cloud-agnostic and able to run on any cloud, so far all of Rockset’s services run and are hosted on AWS, and follow AWS security practices.
Venkataramani sees uses for Rockset in personalization engines, IoT, security analytics and other real-time applications.
“You could easily point Rockset at a Kafka topic, and you would get a very fast SQL table on the other end to query and build applications on top of,” he said. “Data scientists really like this because they can run a lot of experiments, test a lot of hypotheses and then go into production with it because the SQL processing part of Rockset is at production speed. You don’t need to stand up more downstream serving engines to build your application on top of Rockset.”
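A toy simulation of that Kafka-to-SQL workflow, with a Python list standing in for the topic and sqlite3 (whose JSON1 extension provides `json_extract`) standing in for Rockset's SQL layer; all message contents are invented:

```python
import sqlite3

# Stand-ins: a list simulates the Kafka topic, sqlite3 the SQL layer.
topic = [
    '{"user": "ada", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "country": "DE"}',  # note the new field
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")  # schemaless: one JSON doc per row

for message in topic:  # continuously "tailing" the topic
    conn.execute("INSERT INTO events VALUES (?)", (message,))

# json_extract lets us query fields that were never declared up front,
# so the late-arriving "country" field is immediately queryable.
rows = conn.execute(
    "SELECT json_extract(doc, '$.user'), json_extract(doc, '$.country') FROM events"
).fetchall()
print(rows)  # -> [('ada', None), ('bob', 'DE')]
```

The point of the sketch is the absence of a fixed schema: new fields in the stream show up in query results without a pipeline or migration step.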
Rockset does not support OLTP and the company has no plans to address transaction processing anytime soon, Venkataramani said.
“[We] want to focus on scenarios where the data is being produced in one application but being consumed by somebody else,” he said. “That’s where OLTP applications fall short. They’re very good at serving the data stored in them, but they’re not optimized to serve data generated elsewhere. That’s where we shine. We can build operational applications on any data set, and it does not have to be fully managed.”