Starburst’s CEO Decries Big Data Lies; Touts Data Truths
At Starburst Data’s annual Datanova event last week, CEO Justin Borgman presented a keynote focused on what he called “5 Data Lies” and the corresponding “5 Data Truths.” Borgman used the lies, truths and a few interesting customer case studies as context to announce key new features of the Starburst platform, both Starburst Enterprise on-premises and Starburst Galaxy in the cloud. In this post, I’ll review and comment on all of the above, and finish with some observations about Borgman’s career, Starburst’s heritage, and how they both contribute to the company’s point of view and strategy.
The first data lie, in Borgman’s estimation, is the notion that organizations need to centralize their data. Borgman says that doing so is slow, expensive, provides a limited view of data, and carries with it a significant risk of vendor lock-in. He adds that, even if none of those points were at issue, full centralization of data is, in any case, impossible. I’ve argued for several years now that the industry needs to stop treating the distributed and dispersed nature of data as a flaw, and embrace it as a feature. Given that data needs to be near the systems that generate it, and that there are so many of those systems, we can never really catch up.
Borgman advises that instead of centralizing, organizations “need to optimize for decentralized data.” Of course, the Trino technology that Starburst’s platform is based on is all about creating a virtual, federated data layer that integrates numerous data sources. Even if some data will need to be moved or copied closer to the systems that analyze it, being able to incorporate a mix of local and remote data is key. Why? Because it’s not feasible to create pipelines that move all the data, nor to exclude from analysis any data that has to remain in place, and doing either works against a truly data-driven culture.
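Trino expresses this federation in SQL itself: tables are addressed as catalog.schema.table, so a single statement can join data living in entirely different systems. As a rough, runnable stand-in for that idea (using Python’s built-in sqlite3 with ATTACH in place of real Trino catalogs; all table and column names here are hypothetical), a cross-source join might look like:

```python
import sqlite3

# Two separate SQLite databases stand in for two Trino catalogs
# (say, an on-prem warehouse and a cloud data lake).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")  # second "catalog"

conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO lake.orders VALUES (?, ?)",
                 [(1, 250.0), (1, 100.0), (2, 75.0)])

# One query spans both "catalogs" -- in actual Trino this might be
# oracle.crm.customers joined to hive.sales.orders, no pipeline required.
rows = conn.execute("""
    SELECT c.name, SUM(o.total) AS spend
    FROM customers c
    JOIN lake.orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 350.0), ('Globex', 75.0)]
```

The point of the sketch is the query shape, not the toy engine: the data stays where it is, and the join happens in one federated statement.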
Starburst Galaxy’s Data Products Solution
Beyond its physical location, another question of centralization versus decentralization involves whether control and stewardship of data should be concentrated within one group, or distributed among the business units that work with that data. The popular “data mesh” methodology endorses a decentralized approach, where each business domain is responsible for creation of its own “data products.”
Data products, essentially, are data sets; in some cases, they can also be APIs or applications. But regardless of what form they take, data products are accompanied by metadata and documentation, along with robust support and promotion, provided by the entities who work with that data most closely. Distributing such cross-functional responsibility encourages data sharing and should increase trust in the data since its curators have contextual knowledge of it.
To that end, Starburst is announcing a private preview for Starburst Galaxy of a low-code solution to build data products, and an automated data catalog, with search and discovery capabilities designed to create a marketplace experience for data. According to Starburst’s press release, the catalog features will help “to make data products easy to find and consume.” The solution builds upon data/schema discovery and data privilege capabilities that had already been announced, and an earlier data products solution that had been launched as part of the on-premises Starburst Enterprise.
Modern Data Family
The next lie: the idea that the modern data stack is… modern. Borgman says it’s not. In fact, he asserts that today’s data stack is the same stack from decades ago, simply moved to the cloud. He counters that modernization has to result from process and be manifested in how technologies are implemented, rather than in the technologies themselves. Borgman has a point here, given that today’s cloud data warehouses are not especially different from the appliance-based on-premises products that became popular nearly two decades ago and had been introduced as early as the 1980s.
Borgman told me about one Starburst customer, Priceline, which has been moving towards adopting the data mesh methodology as part of its initiative to leverage large volumes of streaming and historical data stored across different systems. Priceline needed its analytics solution to span its existing Oracle on-premises data warehouses as well as a number of Google Cloud properties, including a Google Cloud Storage (GCS) data lake, BigQuery and various Cloud SQL instances. Clearly a mix of old and new, this customer use case presents a nuanced view of what a modern data stack really is. And, because Starburst’s underlying Trino engine knows how to connect to and query across multiple data sources, Starburst has apparently been a very good fit for Priceline.
The third lie, Borgman asserts, is that organizations are ready for the “AI and ML deep end,” i.e. more advanced or operationalized implementation of artificial intelligence and machine learning. Borgman says that when companies start building their ML models first and tending to the underlying data second, they’re going about it the wrong way. Instead, Borgman argues customers “need to set the proper foundation before benefiting from expensive AI + ML tech & talent.” In fact, newly published findings from research conducted by Boston Consulting Group, and sponsored by Starburst and Red Hat, reveal that “only 54% of managers believe that their company’s AI initiatives create tangible business value.”
Fair enough, but when companies are ready to onboard that talent, Starburst will be ready too. A new capability will allow data scientists to work with Starburst Enterprise and Starburst Galaxy using the popular Python programming language. In fact, Starburst says it’s made it possible to migrate PySpark workloads to Starburst & Trino, without rewriting the code. Starburst recognizes that, even when building their ML models on Apache Spark, customers want some of the data wrangling and feature engineering code they might author there to run directly on Trino for efficiency. Apparently, that will now be possible.
Even if companies are ready to hire data analysts and data scientists, there are many other roles and skill sets that they may feel they lack. Data modelers, data engineers, machine learning engineers and other high-end talents may be hard to find, attract and afford.
That brings us to Borgman’s fourth lie, that organizations “need to hire to close the skills gap.” Instead, Borgman feels that platforms should reduce the need for specialized skills. Starburst customers Genus (an animal genetic company), and Glovo (a food delivery provider serving Europe and Africa) both needed to provide self-service access through a BI tool to their cloud data lakes. Glovo needed to do so in the context of a data mesh and Genus needed to span data across Amazon Web Services and Microsoft Azure. Both companies used Starburst to meet their requirements and make the data lakes accessible to business users.
Start Your Engines
The fifth and final of Borgman’s data lies is that vendor benchmarks measure real-world performance. Instead, Borgman says, performance is multidimensional and organizations need to measure time-to-insight holistically, using their own workloads.
Still, some extra juice never hurt. To that end, Starburst is announcing a feature called Warp Speed, which it describes as “a smart indexing and caching solution that accelerates queries up to 7X.” It’s available as a private preview for Starburst Galaxy, and will be generally available for Starburst Enterprise by the end of this month.
Starburst explains that its customer Doordash (the well-known restaurant and commerce delivery service) had several workloads that it felt it needed to move off its data warehouse and into its Amazon S3-based data lake, for cost reasons. Meanwhile, it needed to increase performance on those same workloads, something people don’t always assume can be done with a data lake. Nevertheless, Starburst says Doordash moved the workloads to the lake, using Starburst as the query engine, and was able to get a 10-15X performance improvement.
So even if a data warehouse vendor — Snowflake in this case — may publish benchmarks claiming to be faster than a data lake solution, in this particular case, Starburst says it was able to deliver performance benefits on the lake. Starburst says Warp Speed “autonomously identifies and caches the most used or most relevant data based on usage pattern analysis,” something that would seem to address Borgman’s lie/truth arguments around both performance and the skills gaps.
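Starburst hasn’t published Warp Speed’s internals, but the general shape of usage-pattern-based caching is straightforward. Here is a minimal sketch, entirely my own illustration and not Starburst’s implementation: count accesses per data segment, admit only the hottest segments into a bounded cache, and evict the coldest when full.

```python
from collections import Counter

class UsageCache:
    """Toy usage-pattern cache: keep the k most-accessed segments.

    Purely illustrative -- Warp Speed's actual indexing and caching
    logic is not public.
    """
    def __init__(self, fetch, capacity=2):
        self.fetch = fetch          # slow path, e.g. a scan of object storage
        self.capacity = capacity
        self.hits = Counter()       # usage-pattern statistics
        self.cache = {}

    def get(self, segment):
        self.hits[segment] += 1
        if segment in self.cache:
            return self.cache[segment]
        value = self.fetch(segment)
        # Admit the segment only if it is among the hottest seen so far.
        hottest = {s for s, _ in self.hits.most_common(self.capacity)}
        if segment in hottest:
            if len(self.cache) >= self.capacity:
                # Evict the coldest cached segment.
                coldest = min(self.cache, key=lambda s: self.hits[s])
                del self.cache[coldest]
            self.cache[segment] = value
        return value

reads = Counter()
def slow_fetch(segment):
    reads[segment] += 1       # count trips to the slow storage layer
    return f"data:{segment}"

cache = UsageCache(slow_fetch, capacity=2)
for seg in ["a", "b", "a", "a", "c", "a", "b"]:
    cache.get(seg)
# "a" hits slow storage only once; its repeat reads are served from cache.
```

A scheme like this also speaks to the skills-gap point: the acceleration is driven by observed usage rather than by a DBA hand-tuning indexes.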
A Veteran Benefits
The combination of myth-busting, case studies and feature announcements that Borgman and Starburst are presenting provides a lot to process, but bundling things this way is also a nice change from, and more interesting to write about than, a typical new-release reveal. Instead of just laundry-listing new features, Starburst is presenting them in the context of their value and of how they address some problematic rhetoric in the industry.
Given the source, I’m impressed but not surprised. Borgman founded a company called Hadapt that offered one of the first commercial SQL-on-Hadoop solutions on the market. In 2014, Hadapt was acquired by data warehouse pioneer and juggernaut Teradata, where Borgman spent several years, watching what that company did right, and did wrong, in response to changing technology and changing customer demands.
After Teradata, Borgman founded Starburst and hired the team that created the Presto query engine at Facebook. That team engineered the Trino engine from this Presto heritage, then Starburst built an enterprise distribution and a cloud service around the tech. Borgman has learned from his own adventures, successes and errors, and those of others that he observed. If Starburst’s platform was developed in such a context, then it makes sense to present its new features in context, too. Whether or not Starburst is the right platform for your organization, the company’s strategy and its reasoning provide valuable lessons in how to manage and query data and leverage it as an operational and competitive asset.