Databricks Sees and Raises Snowflake, with Gen AI, LLMOps, More
At its annual Data and AI Summit (DAIS) in San Francisco today, Databricks is making a variety of announcements about its data lakehouse platform, in areas ranging from generative AI to data governance, lakehouse monitoring and query federation, along with a significant announcement around open table formats.
Databricks’ event and announcements come the very same week that rival Snowflake is holding its Snowflake Summit event in Las Vegas, where numerous announcements were made yesterday, which I also covered. Beyond their simultaneous events, there are numerous symmetries between the two companies, their ambitions and their announcements. As I review the details of Databricks’ reveals, I’ll highlight the parallels, and the contrasts, between the companies and their latest innovations.
As a set of parallels within the parallels, both companies’ announcements cover services and features in varying states of release, including general availability (GA), public preview, private preview and even features still under development that are being shared in advance of any availability to customers. As with my coverage of Snowflake’s announcements, I’ll work to specify the release state of each Databricks service or feature I cover here.
LakehouseIQ: NLP Assistance for Business Users and Developers
In a briefing, Joel Minnick, Databricks VP of product and partner marketing, commented that the theme behind the DAIS announcements is the very ethos of the lakehouse itself: that the physical data, governance, management and operations of both analytics and AI should be unified, to keep the two from getting mutually siloed. And while that is being established and maintained, both analytics and AI should also become more discoverable, easier to use and thus more integrated into every organization’s day-to-day activities.
With that established, Minnick stated his belief that the introduction of generative AI/large language model (LLM) technology has the potential to open data and analytics beyond the tech user/developer constituency, fluent in Python and/or SQL, that Databricks has always served. Minnick says generative AI can provide the discoverability, ease of use and integration necessary to open analytics and AI up to everyone in the enterprise.
Leading in that democratization effort is a collection of capabilities that Databricks is branding “LakehouseIQ,” which it is releasing as a public preview. LakehouseIQ has both end-user and developer facets to it. On the end-user side, LakehouseIQ delivers an LLM-powered natural language interface to searching and querying data, which a number of analytics vendors have added recently. But Databricks insists LakehouseIQ goes beyond this, by transcending the limitations generic LLMs have in understanding a particular customer’s data, internal jargon and usage patterns. Databricks says its technology uses the customer’s own “schemas, documents, queries, popularity, lineage, notebooks, and BI dashboards to gain intelligence as it answers more queries.” LakehouseIQ also integrates with Unity Catalog, so that the natural language searches and queries respect access controls in the catalog and respond only with the data the questioner is authorized to see.
On the developer side, there’s LakehouseIQ in Notebooks. In this context, the LLM facility will aid developers with code completion, generation and explanation as well as code fix, debugging, and report generation. Databricks says all of this functionality will be implemented contextually with customers’ data because of its tie-ins with Unity Catalog.
LakehouseIQ bears some resemblance to Snowflake’s just-announced Document AI feature. The latter, which comes to Snowflake’s platform through the acquisition of Warsaw-based Applica, is limited to unstructured data in documents, though. LakehouseIQ, as we’ve seen, applies to structured lakehouse data and code. While the two new features are not perfect analogs, they clearly are intended to entice the same customers, around the same generative AI technology trend.
Don’t Just Use LLMs. Build Them!
Now let’s switch over from LakehouseIQ to Lakehouse AI, which is essentially a rebranding of Databricks Machine Learning, ushered in by the addition of LLM-specific capabilities to the platform. There are a number of facets to this as well, starting with access to a curated collection of open source models, including the LLMs MPT-7B and Falcon-7B and the Stable Diffusion image generation model. These models are available within the Databricks Marketplace, which we’ll have more to say about shortly.
Lakehouse AI also entails a number of “LLMOps” (LLM operations) capabilities, including:
- AI Gateway, which can manage credentials for SaaS LLM services and model APIs, and which provides access-controlled routes for LLMs (allowing backend models and providers to be swapped on the fly), prediction caching and rate limiting on service usage.
- Prompt Tools, which offers a no-code interface to compare various models’ output based on a set of prompts, with those comparisons automatically tracked within Unity Catalog.
- LLM optimizations to Databricks Model Serving, including LLM inference at up to 10x lower latency; GPU-based inference support; auto-logging and monitoring of all requests and responses to Delta Tables, and end-to-end lineage tracking through Unity Catalog.
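To make the AI Gateway idea in the list above more concrete, here is a minimal Python sketch of a route that hides the backend model, caches predictions and rate-limits usage. It is not Databricks’ API: `GatewayRoute`, `swap_backend`, `backend_a` and `backend_b` are hypothetical names invented for illustration, and the backends are stub functions standing in for real model services.

```python
import time
from typing import Callable, Dict, List, Optional


# Hypothetical stand-ins for real model backends (assumption, not a real API).
def backend_a(prompt: str) -> str:
    return f"[model-a] {prompt}"


def backend_b(prompt: str) -> str:
    return f"[model-b] {prompt}"


class GatewayRoute:
    """Toy route: callers use the route, never a specific backend model."""

    def __init__(self, backend: Callable[[str], str],
                 max_calls: int, per_seconds: float) -> None:
        self.backend = backend
        self.max_calls = max_calls        # backend calls allowed per window
        self.per_seconds = per_seconds    # length of the sliding window
        self._cache: Dict[str, str] = {}  # prompt -> cached prediction
        self._calls: List[float] = []     # timestamps of recent backend calls
        self.backend_hits = 0             # how often the backend was really called

    def swap_backend(self, backend: Callable[[str], str]) -> None:
        """Change the model behind the route; cached answers are discarded."""
        self.backend = backend
        self._cache.clear()

    def query(self, prompt: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # Prediction caching: a repeated prompt never reaches the backend.
        if prompt in self._cache:
            return self._cache[prompt]
        # Rate limiting: keep only calls that are still inside the window.
        self._calls = [t for t in self._calls if now - t < self.per_seconds]
        if len(self._calls) >= self.max_calls:
            raise RuntimeError("rate limit exceeded")
        self._calls.append(now)
        self.backend_hits += 1
        result = self.backend(prompt)
        self._cache[prompt] = result
        return result


if __name__ == "__main__":
    route = GatewayRoute(backend_a, max_calls=2, per_seconds=60.0)
    print(route.query("summarize Q2 sales"))
    print(route.query("summarize Q2 sales"))  # served from the cache
    route.swap_backend(backend_b)             # callers keep using the same route
    print(route.query("summarize Q2 sales"))
```

The point of the indirection is that client code depends only on the route, so the model or provider behind it can change without touching callers.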
One could compare Lakehouse AI with the capabilities supplied to the Snowflake platform by Nvidia NeMo and AI Enterprise, Dataiku and John Snow Labs. Snowflake has always had an aggressive strategy around partnering, while Databricks has always sought to add capabilities as native features in its core platform.
MLflow 2.5 and Vector Indexes
Another chunk of Lakehouse AI includes enhanced capabilities in the open source MLflow project, including side-by-side LLM comparison, with no-code and SQL interfaces. This capability will be part of the preview of MLflow 2.5, which will be released in July. Databricks has integrated MLflow 2.5 with Unity Catalog and Model Serving, allowing experiment auto-logging and tracking in Unity Catalog, and deployment to production of the most relevant model. The company has also enhanced its Automated Machine Learning (AutoML) feature to include training assistance for fine-tuning LLMs.
Want more? Databricks is “pre-announcing” (i.e., discussing before the availability of a preview) a new vector index capability, which will fully manage and automatically create vector embeddings from files in Unity Catalog. Vector embeddings are numeric encodings of text, such as documents and prompts, that capture its context and semantics, allowing for more efficient and accurate responses from LLMs. Unlike Snowflake, which is partnering with vendors of specialized vector databases like Pinecone, Databricks says it is choosing to implement vector embeddings itself so that they can be integrated with the ML/AI governance and operations already in the platform.
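For readers new to the concept, a toy Python sketch may help show what a vector index does: turn each file into a vector, then find the file whose vector is closest to the vector of a query. Everything here is an assumption for illustration. Real embeddings come from learned models rather than the word counts used below, and `ToyVectorIndex`, `embed` and the sample file names are hypothetical, not part of any Databricks product.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str, vocab: List[str]) -> List[float]:
    """Toy embedding: one word-count dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Builds embeddings for a set of files and answers similarity queries."""

    def __init__(self, docs: Dict[str, str]) -> None:
        # The vocabulary is every distinct word seen across the files.
        self.vocab = sorted({w for text in docs.values()
                             for w in text.lower().split()})
        self.vectors = {name: embed(text, self.vocab)
                        for name, text in docs.items()}

    def search(self, query: str, k: int = 1) -> List[Tuple[str, float]]:
        """Return the k files most similar to the query, best first."""
        q = embed(query, self.vocab)
        scored = sorted(((cosine(q, vec), name)
                         for name, vec in self.vectors.items()), reverse=True)
        return [(name, score) for score, name in scored[:k]]


if __name__ == "__main__":
    index = ToyVectorIndex({
        "billing.txt": "invoice payment refund policy",
        "oncall.txt": "pager alert incident response runbook",
    })
    print(index.search("refund policy for a payment"))
```

In an LLM application, the best-matching files would typically be passed to the model as context alongside the user’s question.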
On the Market
The Snowflake comparisons continue as we examine the announcement that Databricks Marketplace is now GA. The marketplace allows discovery and acquisition of datasets, data assets and, with GA, AI models as well, including the curated set of LLMs I mentioned earlier. Databricks is also pre-announcing what it calls Lakehouse Apps, which will be available in the Marketplace, and which will deploy in the customer’s own Databricks tenant to eliminate data movement and security/access issues.
Lakehouse Apps very closely parallel Snowflake Native Apps, available in the Snowflake Marketplace, which deploy to a customer’s Snowflake environment, providing the associated security, access and streamlined procurement benefits. The capabilities and nomenclature of the two companies’ offerings here are so similar that one can’t help but recognize the two companies work hard to achieve parity, even as they strive to differentiate.
Not Just AI
Beyond all the LLM hoopla, Databricks is adding some good old-fashioned operations features to its platform with the preview of Lakehouse Monitoring in Unity Catalog. The feature helps customers understand the performance of all pipelines and AI models, provides automatic alerting of problems and, through Unity Catalog’s lineage capabilities, automatic root cause analysis of those problems.
Databricks is also launching a private preview of Lakehouse Federation. While Databricks’ underlying Apache Spark platform has long had the capability to query data stored in nearly any database or repository for which a native or JDBC driver exists, it seems Lakehouse Federation elevates that capability significantly. For one thing, it would appear that the Photon engine can now handle querying of data outside the Databricks platform itself, and still bring to bear caching and query acceleration enhancements. Moreover, this external data can be tracked and governed in Unity Catalog. Supported external data sources include (wait for it) Snowflake, as well as MySQL, PostgreSQL, Amazon Redshift, Microsoft’s Azure SQL Database and Azure Synapse, and Google BigQuery.
Databricks says that, in the future, Unity will be able not only to catalog the data in other systems but to push policy there in order to unify governance of internal and external data. And since Databricks announced previously that Unity will also feature a Hive Metastore (HMS) interface, with which many other analytics tools are compatible, the unified governance will apply not only to data sources but to client applications as well. Databricks says Lakehouse Federation and the HMS interface will be in public preview soon.
Lakehouse Federation has some similarities with Snowflake’s external tables and its Iceberg tables too, with both companies acknowledging that they can’t directly manage all data, and must provide a bridge to it, for comprehensive analytics and AI.
Peace in the Format Wars?
Access to the data catalog by external clients is one thing, but what about the data itself? Databricks has standardized on the open source Linux Foundation Delta Lake project (to which it is the chief contributor) to bring fast updates, ACID consistency and time travel to data in the lake. And while a number of other companies have adopted Delta Lake — including Microsoft with its recent announcements around Fabric and OneLake — two competing open source table formats, Apache Iceberg and Apache Hudi, are out there, building ecosystems of their own. This three-way format war is bad for the industry, especially since all three are based on the underlying Apache Parquet columnar data format (itself the victor in a format war with Apache ORC).
The solution? Version 3.0 of Delta Lake, now in preview, will support Hudi and Iceberg clients as well, through the new Delta Universal Format (UniForm) technology. UniForm embeds the format-specific metadata not only for Delta Lake, but for Iceberg and for Hudi as well, thus making Delta Lake files client-agnostic, across the three formats. According to Databricks’ Minnick, UniForm can also read metadata for all three formats, making the compatibility bidirectional, and rendering Delta Lake 3.0 client applications and tools able to access Hudi and Iceberg data. Among other combinations, this would provide compatibility between Databricks and Snowflake’s announced Iceberg tables (once that feature ships), providing yet another symmetry.
The best part is that, according to Minnick, none of these capabilities is proprietary to Databricks. Instead, it’s all part of the open source Delta Lake project, which should mean platforms using it, ostensibly including Microsoft Fabric/OneLake, will gain the same interoperability, once they adopt the new 3.0 version.
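The core idea, one copy of the data files described by several sets of metadata, can be sketched in a few lines of Python. This is a conceptual toy under stated assumptions, not the Delta Lake implementation: `ToyUniformTable`, its fields and the file names are hypothetical, and the real metadata layouts of Delta Lake, Iceberg and Hudi are far richer than the dictionaries used here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The three table formats named in the article.
FORMATS = ("delta", "iceberg", "hudi")


@dataclass
class ToyUniformTable:
    """One set of Parquet data files, plus per-format metadata describing it."""

    data_files: List[str] = field(default_factory=list)
    version: int = 0
    metadata: Dict[str, dict] = field(default_factory=dict)

    def commit(self, new_files: List[str]) -> None:
        """Write data once, then refresh the metadata for every format."""
        self.data_files.extend(new_files)
        self.version += 1
        for fmt in FORMATS:
            # Each format gets its own metadata entry (hypothetical shape),
            # but every entry points at the same underlying files.
            self.metadata[fmt] = {
                "format": fmt,
                "version": self.version,
                "files": list(self.data_files),
            }

    def files_for_client(self, fmt: str) -> List[str]:
        """What a client speaking the given format would be told to read."""
        return list(self.metadata[fmt]["files"])


if __name__ == "__main__":
    table = ToyUniformTable()
    table.commit(["part-0001.parquet"])
    table.commit(["part-0002.parquet"])
    for fmt in FORMATS:
        print(fmt, table.files_for_client(fmt))
```

Because all three formats sit on top of Parquet, only the metadata needs to be produced per format; the data itself is never duplicated.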
Breadth, Acquisitions, and Loyal Opposition
In his briefing, Minnick commented that “the surface area for Databricks is getting pretty large these days.” And, indeed, Databricks’ numerous new capabilities indicate that it wants to be a comprehensive platform for AI and machine learning; analytics, data engineering, management and governance; and trusted applications. Databricks also now wants to serve business users, and not just techies writing Python code in Notebooks. The “surface area” breadth strategy parallels Snowflake’s, while the play for business users aims to even things up for Databricks, which has historically targeted more technical users.
Also consistent with Snowflake, Databricks is actively acquiring new companies to beef up its portfolio. On Monday, it announced the $1.3B acquisition of startup MosaicML, which focuses on tools for training and deploying LLMs. That builds upon its planned acquisition of data governance provider Okera, and closed acquisitions of marketing analytics provider DataJoy, ML model serving concern Cortex Labs, low code/no code provider 8080 Labs, and data visualization and SQL query tool-focused Redash. While Snowflake’s acquisitions of Streamlit and Applica only somewhat mirror Databricks’ deals with 8080 Labs and MosaicML, respectively, they still drive home the point that the two rivals leverage acquisitions to keep their competitiveness robust.
Many in the industry recognize the rivalry between Databricks and Snowflake and enjoy watching them innovate and compete. Few in the industry were happy to have the two companies run their respective events concurrently, but the confluence certainly brings the competition into focus. Even if the rivalry between Databricks and Snowflake is especially intense, it’s also an exemplar for all their other competitive battles, with cloud providers, enterprise software incumbents and pure-play startups. The data world is highly competitive, and while that can be a challenge for customers, ultimately, it’s a great benefit that keeps the innovation pipeline impressively full.