Snowflake Platform Gets Generative AI, ML, Data Lakehouse Features
At its Snowday event last Wednesday, leading data cloud contender Snowflake rolled out a number of new capabilities on its platform. And while I could have used that very same sentence to cover Snowday in years past, this is really the year that the pieces — including organically developed and acquired technologies — seem to be coming together.
This year’s announcements started off with the long-awaited public preview of Snowflake’s support for Iceberg tables, and segued into a collection of developments in the data governance, developer and AI realms, with the last of these including a model registry, a feature store, vector search and LLM inferencing.
As in past years, Snowflake’s new capabilities and technologies run the gamut of shipping readiness: a couple of them are still in development, some are or will “soon” be in private preview, others are or will be in public preview, and a few are now generally available.
Breaking the Ice
Let’s start with the Apache Iceberg table capability. Iceberg is an open table format that, most commonly, layers table metadata (schema, snapshots and partition information) on top of data files stored in Apache Parquet format. That metadata is what allows Snowflake to use Iceberg tables as if they were standard, native ones.
Once that capability is in GA, it will allow Snowflake to take on data lakehouse workloads, in addition to warehouse workloads, making it a key pillar of Snowflake’s claim to be a true data platform, rather than a cloud data warehouse.
Snowflake debuted its support for Iceberg two full years ago. It’s still not GA, but Snowflake has assured me that the public preview will kick off this month. That’s good news.
For now, though, there is a distinction to be made between Iceberg tables managed by Snowflake and those managed by another engine/platform. For example, tables in Cloudera Data Platform or Dremio can be implemented in Iceberg format, which puts the tables’ metadata in one of those platforms’ catalogs, rather than Snowflake’s. In such a case, Snowflake will, for now, treat these tables as read-only.
Users on the other platform’s side could still carry out their updates, and Snowflake would see these and process the data with performance comparable to working with data in its original, native format, but it would not be able to update the tables itself. Snowflake’s catalog integration, one step toward addressing this, will also be in public preview soon. Work on an Iceberg catalog REST API, which will fully clear the hurdle of non-Snowflake-managed Iceberg tables being used in a native, read/write manner, is in development, with no announced timeline for private or public preview.
On the Horizon
Iceberg support now falls under Snowflake “Horizon,” a new umbrella brand for all of the company’s features pertaining to compliance, security, privacy, interoperability (including Iceberg) and access. Beyond Iceberg support, there were a number of Horizon features announced at Snowday, including a new Trust Center interface for managing security; a data quality metrics monitoring and alerting facility; a data lineage UI; custom data classification; and a universal search facility that I’ll detail later.
Each of these features is entering private preview. In addition, new automatic data classification and differential privacy features were announced as being in development.
One other management feature, though not technically part of Horizon, is a new cost management interface, providing “visibility, control, and optimization of Snowflake spend” according to the company.
The interface provides account-level spend and usage metrics, including lists of the most expensive queries executed, rarely used objects carrying costs that could possibly be eliminated, and top warehouses by cost. The cost management interface is entering private preview.
SELECT Generative AI
As a result of Snowflake’s acquisition of search company Neeva in May of this year, Snowflake announced a really broad selection of AI and ML features. What’s most impressive about these is how easy they are to use. A new component of Snowflake, called Cortex, provides backend integration with, and elastic compute around, a number of large language model services and traditional machine learning modeling libraries.
The capabilities around Cortex are available through simple function calls, usable from both SQL and Python, making them highly accessible to technologists without a data science background. This includes full vector embedding generation, storage, indexing, search and even support of a native embedding data type.
Cortex LLM-based functions for SQL and Python are going into private preview. They can handle language translation, text summarization, sentiment detection, vector embedding generation, vector search, and, of course, functions for sending prompts and contextual data to an LLM to solicit verbose or short answers. ML-based functions for forecasting and anomaly detection will be in GA soon. A “top insights” function is in public preview and a classification function will be in private preview soon.
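To give a sense of what "simple function calls" means in practice, here is a minimal Python sketch that only builds the SQL text for two such calls and does not connect to anything. The function names `SNOWFLAKE.CORTEX.TRANSLATE` and `SNOWFLAKE.CORTEX.SUMMARIZE` follow Snowflake's later public documentation, so the private preview syntax may differ; the table name `chat_transcripts` and column name `transcript` are hypothetical.

```python
# Sketch only: composes SQL strings and never connects to Snowflake.
# Table and column names used in the example calls are hypothetical.


def translate_sql(table: str, text_col: str, src: str = "de", dst: str = "en") -> str:
    """Return a query that translates each row's text from src to dst."""
    return (
        f"SELECT SNOWFLAKE.CORTEX.TRANSLATE({text_col}, '{src}', '{dst}') "
        f"AS translated FROM {table}"
    )


def summarize_sql(table: str, text_col: str) -> str:
    """Return a query that produces a generic summary of each row's text."""
    return f"SELECT SNOWFLAKE.CORTEX.SUMMARIZE({text_col}) AS summary FROM {table}"


if __name__ == "__main__":
    print(translate_sql("chat_transcripts", "transcript"))
    print(summarize_sql("chat_transcripts", "transcript"))
```

The point of the sketch is that the LLM call sits inside an ordinary SELECT, so anyone who can write a query can apply it to a whole column of text.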
‘Traditional’ ML and LLM-Based Experiences
A full Snowpark ML modeling API is being provided for data preprocessing and model training, and that meshes well with some new impressive MLOps capabilities, namely an ML model registry and a feature store. The model registry can accommodate models built in Snowflake, as well as models built externally, and can deploy them to service inferencing requests. Similarly, the feature store can be used to generate training datasets as well as to serve features in production for inference. The Snowpark ML modeling API is GA now; the model registry is in public preview and the feature store is in private preview.
New LLM-powered tools — or “experiences,” as Snowflake likes to call them — that were built on Cortex, are also on offer. They include Document AI (for making unstructured documents like invoices fully queryable with natural language), Copilot (for natural language-to-SQL query translation) and Universal Search (allowing searches for tables, views, databases, schemas and Snowflake Marketplace listings). All three are in private preview.
Developer and DevOps features are important too, and they have in no way been left out of the Snowday announcements. To start with, given all the AI-related work you can do with Snowflake, and the compatibility of those features with both SQL and Python programming, Snowflake has seen fit to add a new notebook coding interface right inside the Snowsight UI, in private preview.
These are not full-fledged Jupyter notebooks. They’re Snowflake-specific ones, whose cells can contain SQL, Python or Markdown only; there are no kernels for Scala, R or other languages. That said, developers familiar with Jupyter notebooks should be right at home and will be able to use Cortex functions and the Snowpark ML modeling API right out of the gate. Streamlit chart elements are built in as well, to handle data visualization needs.
There’s also integration with Git-based CI/CD and version control in private preview, a new command line interface (CLI) in public preview, database change management capabilities coming to private preview soon and a new Event Tables feature that is GA. Beyond that, the Native App Framework, which allows bundling of data, code and containers, will be GA on Amazon Web Services soon and will be entering public preview soon on Microsoft Azure.
Putting It All Together: Generative AI Demo
After taking journalists and analysts through a full inventory of these new capabilities, Snowflake’s Director of Product Management Jeff Hollan made it real by taking us through an end-to-end coding demo in Snowflake’s new notebook environment. He demonstrated a reasonable scenario: mining customer service chat transcripts, in both English and German, along with wiki articles, to get a summary of issues facing a fictitious ski equipment retailer. In the demo, Hollan was able to:
- Translate several German-language transcripts into English
- Summarize an English-language transcript generically
- Summarize it again with a custom prompt, asking for specific pieces of information
- Generate embeddings for a series of wiki articles in a Snowflake table and store those embeddings in a new table
- Do a vector search in that new table for articles relevant to a specific prompt
- Create an embedding for the prompt, and send it, along with the embeddings for the relevant wiki articles, to the Llama 2 LLM for a retrieval augmented generation (RAG) response to the prompt
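The last three steps in that list (embed the articles, search them by vector similarity, then assemble context for the LLM) can be sketched in plain Python. This is not the code from the demo and uses no Snowflake API: the "embedding" is a toy bag-of-words vector standing in for a real embedding model, the articles are made up, and the final prompt passes the retrieved article text as context, which is an assumption about how the last step is wired.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A stand-in for a real model."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical wiki articles for a fictitious ski equipment retailer.
ARTICLES = {
    "binding-care": "Ski binding release settings should be checked every season",
    "returns": "Refund and return policy for online orders",
    "wax-guide": "How to wax skis for cold snow conditions",
}


def build_index(articles: dict) -> dict:
    """Embed every article once and keep the vectors keyed by article id."""
    return {article_id: embed(text) for article_id, text in articles.items()}


def top_k(question: str, index: dict, k: int = 1) -> list:
    """Return the ids of the k articles most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda aid: cosine(query_vec, index[aid]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, articles: dict, index: dict, k: int = 1) -> str:
    """Assemble the text that would be sent to the LLM: context, then question."""
    context = "\n".join(articles[aid] for aid in top_k(question, index, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    index = build_index(ARTICLES)
    print(build_prompt("Why does my ski binding release too early?", ARTICLES, index))
```

In a real deployment the `embed` function and the final LLM call would be replaced by hosted model calls, and the index would live in a table rather than a dictionary; the shape of the pipeline stays the same.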
By my count, the entire demo involved six SQL queries (a couple of which were elaborate, I’ll admit) and six lines of Python code, and was easy to follow. By the time Hollan demoed a full Streamlit app with similar functionality, I was tempted to say “You had me at embedded SQL functions,” but I kept that to myself. While most controlled demos have some amount of “smoke and mirrors” behind them, and this one relied on a number of technologies that are still in private preview, it seemed reasonably well grounded in the Cortex features that the Snowflake folks had detailed.
With that in mind, I’m willing to suspend most of my usual demo skepticism.
The Power of Integration
The ability to do what Hollan showed us has been out there in the market for a while, but it has required background knowledge, connecting the dots, and threading the needle, to mix a few metaphors. It’s been up to the developer to extract data, get embeddings done somewhere, do the vector search in a database, then extract the needed context to send to an LLM on some other platform.
While it’s not impenetrably hard to do that, it’s not terribly convenient either. It’s also a bit “Rube Goldberg” in nature and requires extracting data from its native platform and sending it somewhere else.
Snowflake has made a lot of this, I dare say, rather turnkey. Everything’s set up. It’s accessible through simple SQL or Python interfaces, which can be used, individually or in combination, from the new Snowflake notebook interface. The data started in Snowflake, and stayed there, and, because of all that, the power of generative AI seemed much more concrete than I’ve seen in other contexts. And for customers who want to use machine learning models to do more straight-up predictive analytics, the experience would be similarly straightforward.
Snowflake has more work to do to get all this stuff to GA. And in the interim, competitors may catch up to a certain degree. But right now, the company is providing an impressive value proposition to its enterprise customers and very solidly showing that the key to AI success is a solid analytics platform.