Starburst’s Varada Move Consolidates the Lakehouse Race

5 Jul 2022 10:20am, by

Starburst Data, a key data lake/data virtualization player based in Boston, announced last month its acquisition of Tel Aviv-based startup Varada, a provider focused on query acceleration. Both companies are in the analytics subgroup focused on Trino, an open source data query engine derived from the Presto project, originally incubated at Facebook. The two companies’ technologies are highly complementary, yielding a very logical consolidation with an excellent prognosis for successful execution.

Starburst is really the charter member of companies focused on Presto/Trino technology. It employs Presto’s original creators and birthed the Trino project. Varada also based its platform on Trino, but improved upon the raw technology by developing a sophisticated caching, optimization and indexing layer that rides on top of it. The combination of Trino’s massively parallel processing (MPP) query engine and Varada’s query optimization layer means that the Starburst platform will deliver not just on analysis of large volumes of data and connectivity to a variety of back-end systems, but also on high-end query performance, bringing it more squarely into competition with other players in the data lake space.

Merger of Common Sense

The New Stack spoke with Starburst CEO and co-founder Justin Borgman, who said he sees the acquisition of Varada as “arguably the most obvious acquisition that [it] could have made” and is optimistic about the integration. Of Varada, he told The New Stack “we knew them, we knew their talents, we were big fans of their work and they had basically built… a feature for our technology. And, and as a result, [it’s] probably going to be one of the fastest integrations post-acquisition that you’ve ever seen.” Borgman added that “within 90 days, this will be generally available on all three clouds and ready to roll.”

Borgman says the addition of the Varada indexing technology will speed queries by as much as 7x without requiring customers themselves to choose what data to index. Borgman also says that Varada’s machine learning-based use of SSD-oriented data caching, based on observed query patterns, reduces actual query effort, and can thus allow customers to run on smaller Trino clusters than they otherwise might. That, in turn, can lower cloud compute expenses; in fact, Starburst claims savings can be as high as 40%.

The combination of performance improvement and cost-cutting leads Starburst to pitch its soon-to-be Varada-imbued platform as one especially well-suited to an economically sensitive time in the market.

Birds of a Feather

Starburst and Trino already had MPP, a standard feature on most data warehouse platforms, and the ability to connect to data in standard formats like Apache Parquet. Add in the query optimizations that the Varada technology will bring on board, and Starburst is putting itself more squarely in a position to take on so-called data lakehouse workloads.

Think of it: Databricks, which pioneered the lakehouse paradigm, recently released its optimized Photon engine to general availability and has open sourced its Delta Lake file format that layers transactional consistency and time travel capabilities on top of Apache Parquet. Dremio, which uses its Parquet and Apache Arrow-based Reflections technology for fast querying, also supports Delta Lake but was an early adopter of Apache Iceberg, a competing open format with similar capabilities.

Cloudera, whose data lake technology is based on Apache Impala, itself long focused on pairing data warehouse query engine technology with data in open formats, just announced support for Iceberg last week. Snowflake’s platform has always been based on relational data warehouse technology yet, at its Summit event three weeks ago, Snowflake announced its intention to support Iceberg as an alternative to its own data storage format. Starburst’s Trino engine, meanwhile, already supported both the Delta Lake and Iceberg formats. So too did the original Presto engine, which puts Ahana into the competitive fray as well.

Gather Round

It seems the more today’s data lake and warehouse platforms differentiate themselves, the more closely they hew to a common set of standards and use cases. That’s a normal tendency in a tech space that’s seen lots of investment and is now in high pursuit of enterprise customers. It also lowers risk for those customers, as it creates a set of conventions for using various open source technologies together, strengthening the efficacy of investment in them.

The broader question is whether companies like Starburst, Dremio and Ahana, which are focused solely on the data lake, will now need to take on streaming, machine learning and data governance workloads too, in order to compete with the likes of Databricks and Cloudera (or, for that matter, the cloud providers). For now, it’s probably better to focus on and excel at a specialty. But the time may come when having a full, cross-workload data platform becomes a competitive necessity and not just a “nice to have.”

Cloudera is a client of Brust’s strategy and advisory firm, Blue Badge Insights.

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Dremio.