StarRocks Launches Beta of Cloud Service for Its Analytics Engine
StarRocks, makers of a specialized analytical database of the same name, announced on Thursday the beta release of StarRocks Cloud, a fully managed service built around the database. StarRocks Cloud will initially be available on Amazon Web Services; support for Microsoft Azure and Google Cloud will follow. General availability for StarRocks Cloud is slated for Q3 of 2022.
The New Stack spoke with StarRocks vice president of strategy, Li Kang, who broke down what StarRocks is all about. Kang explained that StarRocks is a database focused on real-time analytics and online analytical processing (OLAP) queries. It’s a relational database that’s fully MySQL compatible, in terms of both query language and client protocol, but it can also work as a data lake query engine, as well as with streaming data and change data capture (CDC) sources. Kang also stated that StarRocks can handle tens of thousands of concurrent users, though he did not cite any specific documentation or benchmark to substantiate that figure.
StarRocks was founded in 2020 by several members of the Apache Doris team who spun off to form the company. Although the StarRocks product was initially based on Doris, Kang says that 80% of the codebase is now new. The company says its most recent funding round was a $40 million Series B, and total funding to date is $60 million.
To Denormalize or Not to Denormalize?
StarRocks is a database built for OLAP/analytical queries. It features a cost-based optimizer, a fully vectorized query engine, pipeline execution and what the company calls intelligent materialized views. StarRocks competes with the likes of Rockset, Apache Pinot, Apache Druid and ClickHouse. Like those products, it is a relational database with specific design points to mitigate the cost and complexity of the kind of multi-table joins needed with a star schema, the common structure for databases built for aggregation and drill-down queries. However, StarRocks differentiates itself in the way it mitigates these costs and complexities.
Several of its competitors — Druid for example — get around the costs of star schema joins by building a huge denormalized table. That helps query performance but prevents agility around updating the database with new data or dealing with changes to its structure. StarRocks, on the other hand, can query the root tables to avoid processing latency, and can *also* maintain and query a denormalized table, but will use it only when necessary.
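To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than StarRocks itself. The table and column names (fact_sales, dim_product, sales_flat) are hypothetical and chosen only for illustration. It shows that a star schema join and a prebuilt denormalized table return the same aggregates, but that the denormalized copy goes stale when a dimension changes and must be rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: one fact table plus one dimension table (hypothetical names).
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0);
""")

# Aggregate by joining the normalized tables at query time.
JOIN_QUERY = """
SELECT d.category, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS d ON d.product_id = f.product_id
GROUP BY d.category
ORDER BY d.category
"""

# The same aggregate against a prebuilt denormalized table (no join needed).
FLAT_QUERY = """
SELECT category, SUM(amount)
FROM sales_flat
GROUP BY category
ORDER BY category
"""


def rebuild_flat_table(cursor):
    """Pay the join cost up front by materializing one wide table."""
    cursor.execute("DROP TABLE IF EXISTS sales_flat")
    cursor.execute("""
        CREATE TABLE sales_flat AS
        SELECT f.sale_id, f.amount, d.category
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
    """)


rebuild_flat_table(cur)
joined = cur.execute(JOIN_QUERY).fetchall()
flat = cur.execute(FLAT_QUERY).fetchall()

# A dimension change: product 2 moves to a different category.
cur.execute("UPDATE dim_product SET category = 'toys' WHERE product_id = 2")
joined_after = cur.execute(JOIN_QUERY).fetchall()  # sees the change immediately
flat_stale = cur.execute(FLAT_QUERY).fetchall()    # still reflects the old category

rebuild_flat_table(cur)                            # reprocessing step
flat_after = cur.execute(FLAT_QUERY).fetchall()
```

In this toy case the rebuild is trivial, but on a large fact table that reprocessing step is the agility cost described above.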
StarRocks says its database is in use at over 500 companies. Kang told The New Stack that Airbnb replaced a combination of Druid, Presto, Apache Hive and Apache Spark with just StarRocks to implement the backend data layer for its in-house Minerva metrics store. He also explained that a major social media app company, for its real-time advertising data platform, went through a progression of using Amazon Redshift, Hive/Presto and then ClickHouse, before standardizing on StarRocks.
Of OLAP Old and New
As someone who has worked with business intelligence technology since the late 90s, I find it hard not to draw a parallel between the denormalized/normalized dichotomy among Druid, ClickHouse and StarRocks and the older-vintage split between MOLAP (multidimensional OLAP) and ROLAP (relational OLAP). In the MOLAP case, you had a structure that was optimized for fast queries, but that carried a large processing burden when new data was added and was less than agile with respect to structural changes. In the ROLAP case, processing was minimal, with the model built for semantic purposes only, and not physically materialized. Some platforms could even mix MOLAP and ROLAP models, an approach that seems similar to StarRocks’ hybridization of denormalized and normalized tables.
In both the old OLAP case and that of the newer platforms, everything ultimately comes down to an architectural choice of “pay me now or pay me later.” Which approach is best will depend on the use case. For operationalized analysis, where the same queries are executed repeatedly and routinely on historical data, the MOLAP/denormalized table models often work best, despite their up-front costs. For exploratory cases, and for working with real-time data, the ROLAP/normalized table structures are often best. When analyzing “known unknowns,” doing the work in advance is a good investment. When looking at “unknown unknowns,” it’s better not to do too much advance optimization.
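The “pay me now or pay me later” choice can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how any of the products mentioned here work internally; the class names PrecomputedTotals and RawRows are hypothetical. One store does its aggregation at write time and can answer only the question it was built for, while the other stores raw rows and does the work at query time for any question.

```python
class PrecomputedTotals:
    """'Pay me now': aggregate per region on every insert (MOLAP-like)."""

    def __init__(self):
        self.totals = {}

    def insert(self, region, amount):
        # Work happens at write time, for one question decided in advance.
        self.totals[region] = self.totals.get(region, 0.0) + amount

    def total_for_region(self, region):
        # A single lookup at query time; no scan of the raw data.
        return self.totals.get(region, 0.0)


class RawRows:
    """'Pay me later': keep raw rows, aggregate at query time (ROLAP-like)."""

    def __init__(self):
        self.rows = []

    def insert(self, region, amount):
        # Writes are cheap: just append.
        self.rows.append((region, amount))

    def total_where(self, predicate):
        # Any question can be asked, at the cost of scanning every row.
        return sum(amount for region, amount in self.rows if predicate(region, amount))


sales = [("east", 10.0), ("west", 25.0), ("east", 5.0)]
precomputed, raw = PrecomputedTotals(), RawRows()
for region, amount in sales:
    precomputed.insert(region, amount)
    raw.insert(region, amount)

# The "known unknown": both stores can answer it.
east_fast = precomputed.total_for_region("east")
east_scan = raw.total_where(lambda region, amount: region == "east")

# An "unknown unknown": only the raw store can answer a question
# nobody planned for, such as the total of sales of 10.0 or more.
large_sales = raw.total_where(lambda region, amount: amount >= 10.0)
```

The precomputed store has discarded the row-level detail needed for the last query, which is why up-front optimization pays off mainly when the questions are known ahead of time.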
StarRocks’ balanced approach, therefore, seems ideal even as its new cloud service will join a very crowded field indeed.