Snowflake Builds out Its Data Cloud
At its annual Summit event in Las Vegas Tuesday, cloud data rockstar Snowflake made a slew of announcements, showing its intention to move beyond its customary data warehouse specialization and take on numerous other categories in the data and analytics space. These include data lake, operational database, streaming data, and data science use cases. The company is even targeting provision of an application platform. Clearly, Snowflake is taking steps to make its Data Cloud vision more than rhetorical.
I’ll cover most of the announcements, starting with the three I think are the most transformational: external tables for on-premises storage (now in private preview), support for Apache Iceberg-based tables (under development) and row-oriented storage (also in private preview). I’ll also cover a number of developer-oriented reveals and provide a summary of some partner announcements. Here we go…
The Data Cloud Makes House Calls
The first of the announcements to cover — that of external tables for on-premises data — is fairly straightforward to describe. It allows Snowflake to see and query data that is stored in object storage physically outside its purview. Any Amazon S3-compatible object store can be federated with Snowflake and this includes on-premises storage. While S3 may make you think of Amazon Web Services and the public cloud, the reality is that the S3 API has become a standard for enterprise object storage platforms that are typically installed in the corporate data center — and any data that is stored there is now accessible by Snowflake. This apparently includes the ability not just to query that data but also to govern it.
Such capabilities are not unique to Snowflake. Most operational databases already have external table facilities which provide similar functionality. Microsoft is even adding S3 API compatibility to its external table facility in SQL Server 2022, the next version of its flagship database, which is now in public preview. Many of the platforms that do have external tables tout them as their bridge to querying customers’ data lakes. In the Snowflake case, you could use external tables that way, but the company is taking this capability to the next level with another new feature: Apache Iceberg tables.
Tip of the Iceberg
In a special briefing for analysts last week, Christian Kleinerman, Snowflake’s Senior Vice President of Product, said “Probably one of the most significant announcements at the conference is going to be the introduction of Iceberg tables.” While it’s still in development, Snowflake says it will be in private preview soon and, frankly, even the announcement of Apache Iceberg support is intriguing. Iceberg is an open source technology for adding transactional/ACID (Atomicity, Consistency, Isolation and Durability) logic to data stored in open file formats like Apache Parquet. It also provides time travel capabilities, so you can see previous states of your data, before certain updates, inserts or deletes were performed. Iceberg is one of three major technologies to provide such functionality, with Delta Lake and Apache Hudi being the other two.
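To make the time travel idea concrete, here is a toy Python sketch of a table that keeps every committed state as an immutable snapshot. It only illustrates the concept: the `SnapshotTable` class, its method names and the sample rows are all hypothetical, and nothing here reflects Iceberg’s actual metadata format or API.

```python
# Toy model only: this is NOT Iceberg's metadata format or API.
# It shows the general idea of snapshot-based time travel.

class SnapshotTable:
    """Keeps every committed table state as an immutable snapshot."""

    def __init__(self):
        # Snapshot 0 is the empty table.
        self._snapshots = [()]

    def _commit(self, rows):
        # Publish a new immutable snapshot and return its id.
        self._snapshots.append(tuple(rows))
        return len(self._snapshots) - 1

    def read(self, snapshot_id=None):
        # With no id, read the latest state; with an id, read a past state.
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return self._snapshots[snapshot_id]

    def insert(self, row):
        return self._commit(self.read() + (row,))

    def delete(self, predicate):
        return self._commit(r for r in self.read() if not predicate(r))


table = SnapshotTable()
s1 = table.insert({"id": 1, "city": "Las Vegas"})
s2 = table.insert({"id": 2, "city": "Reno"})
s3 = table.delete(lambda r: r["id"] == 1)

latest = table.read()          # state after the delete
before_delete = table.read(s2) # state as of the second insert
```

Because a write never modifies an earlier snapshot, a reader can ask for the table as it stood before the delete simply by naming an older snapshot id.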
As it turns out, Snowflake will make Iceberg a first-class storage option for tables in a database, effectively making it an additional storage format natively supported by Snowflake’s query engine. Customers will ostensibly be able to choose Iceberg as the persistence format on a table-by-table basis, with support for virtually all operations that a conventional table would support, including standard DML (data manipulation language) and CRUD (create, read, update and delete) operations as well as encryption, replication, governance, compaction, marketplace compatibility and clustering. Snowflake says performance will be comparable to (within a single-digit percentage of) that of conventional tables.
Meanwhile, because Iceberg is an open format, data in it can also be queried by a number of open source data lake technologies, including Apache Spark, Hive and Flink as well as Presto and Trino. And, just as important, it can be used from data science environments too. Iceberg support is essentially Snowflake’s manifestation of the data lakehouse paradigm on its own platform, and it ties into a number of developer-oriented enhancements to Snowflake that we’ll cover in a bit.
Before we get to the new dev features, let’s take a look at Snowflake’s other major announcement: Unistore and Hybrid Tables. Together, these technologies implement the option of row-oriented storage that will allow Snowflake, for the first time, to take on operational database workloads.
Storing all the values in a given row together is very efficient for adding new rows of data, or looking up particular rows, then updating or deleting them. But if you’re doing analytics, where you typically aggregate the values in one or two columns over a huge array of rows, it’s better to store those column values together, so they can be counted quickly.
That’s why operational databases use row store technology natively, and why most data warehouses use column store technology natively. That’s also why, in recent years, many operational databases have added column store technology to their row store wheelhouse, in order to take on so-called operational analytics workloads.
In the case of Snowflake’s Unistore and Hybrid Tables, we have the converse: a columnar data warehouse platform adding row store technology in order to take on operational workloads. A Hybrid Table, as the name would suggest, allows a table in the database to be stored in both formats, so that both operational and analytical workloads can run against it. Hybrid Tables can be joined with conventional tables in queries, too, and Snowflake will take care of the mechanics necessary to make that work.
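The row store versus column store trade-off can be sketched in a few lines of Python. This is a toy illustration with made-up order data and hypothetical function names. Real engines add compression, indexes and on-disk pages, and none of this reflects how Snowflake implements Hybrid Tables.

```python
# Toy illustration only: the same three orders in two layouts.

# Row-oriented layout: all the values of one record sit together.
row_store = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 75.5},
    {"order_id": 3, "customer": "acme", "amount": 42.0},
]

# Column-oriented layout: all the values of one column sit together.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "globex", "acme"],
    "amount": [120.0, 75.5, 42.0],
}


def insert_into_rows(rows, record):
    # Operational write: a single append stores the whole record.
    rows.append(dict(record))


def insert_into_columns(columns, record):
    # The same write must touch every column's list.
    for name, value in record.items():
        columns[name].append(value)


def total_from_rows(rows, column):
    # Analytical read: visits every full record to pull out one field.
    return sum(record[column] for record in rows)


def total_from_columns(columns, column):
    # Analytical read: scans one contiguous list and ignores the rest.
    return sum(columns[column])


row_total = total_from_rows(row_store, "amount")
column_total = total_from_columns(column_store, "amount")
```

Both layouts return the same total. The difference is how much unrelated data each one has to touch for a single-record write versus a whole-column aggregate.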
One of the biggest reasons Snowflake is taking on operational workloads is that most business applications work in operational contexts. And Snowflake wants developers to run those apps in Snowflake’s Data Cloud.
Several announcements are being made on the dev front, many of them especially interesting to the Python community. The biggest of these concerns something Snowflake is calling the Native Application Framework, now in private preview. Admittedly, there is a gap between the full vision for native applications and what’s in the preview today, but the idea is compelling. The Native Application Framework will eventually facilitate the development of fully functional apps that developers can build and sell, and that Snowflake customers can buy and run within their own Snowflake environments. This will keep the data these apps create and maintain local to the customer’s environment, making for good performance. It will also relieve developers of the task of deploying their apps into customer environments themselves, while ensuring that developers do not have direct access to a customer’s data. So, in one fell swoop, the Native Application Framework will address performance, security and compliance for customers, and will allow even small developers to deliver those guarantees.
For now, native apps will really just provide a mechanism to package and deploy Snowflake assets like stored procedures, user-defined functions and external functions. Whether or not you should consider a collection of these assets to constitute a true application may be up for debate. But the plot will thicken. That’s because of Snowflake’s acquisition earlier this year of Streamlit, a Web application development platform. Snowflake says it’s working diligently to integrate Streamlit into its platform, and thus into the Native Application Framework. This will enable the development of application front-ends using Python, in addition to the Snowflake server-side assets mentioned above. And, again, the availability of Hybrid Tables and Unistore will mean these apps can be operational in nature, rather than just delivering analytics functionality (although that will be possible as well).
Snowflake says it is doing all this because it has taken note of just how much data its customers have in individual SaaS applications and the pain points involved in bringing that data into the warehouse, be it through data pipelines or data virtualization. Snowflake says the popularity of cloud-based applications has essentially “re-siloed” data. The Native Application Framework initiative is therefore designed to de-silo this data, by placing applications and their data directly into customers’ Snowflake environments.
Fortified with Streaming and Security
Snowflake is also adding support for serverless ingestion of streaming data, through a feature called Snowpipe Streaming, which is currently in private preview. And the company has an additional feature, still under development, called Materialized Tables, which will provide a mechanism for declarative transformation of the data being streamed in.
In a somewhat related announcement, Snowflake says it is adding support for a cybersecurity workload, which will allow for intense analysis of high volumes of structured, semi-structured and unstructured log data.
Staying on the dev front, Snowflake is going all-in on Python technology. In addition to the aforementioned integration of the Streamlit platform, which uses Python, directly into Snowflake, the company is announcing, as the result of its partnership with Anaconda, the public preview of Python as a supported language on its Snowpark developer framework. Snowflake will also support the use of a curated range of open source Python packages and libraries right from the Snowpark for Python environment.
And there’s more. A new feature called Snowflake Worksheets for Python (now in private preview) allows pipelines, ML models and applications to be written in Python and developed directly in Snowflake’s user interface, Snowsight. Large Memory Warehouses, currently under development, will facilitate a number of scenarios for memory-intensive tasks. For example, data science workloads — like feature engineering or training models — will be able to execute directly on huge datasets in Snowflake, with appropriate memory resources to back them up and without needing to extract the data out to a different environment — both of which will contribute to better performance.
And once those models are trained, non-data scientists will be able to run predictions against them without having to use Python at all. That’s because Snowflake will add a feature called SQL Machine Learning (currently in private preview) that lets developers with SQL skill sets embed predictions against ML models directly in their queries.
Partners in Data
If all of those announcements from Snowflake itself were not enough, many partner announcements are also being made at Snowflake Summit. Here’s a summary of several of them:
- Informatica announced a new Enterprise Data Integrator that will integrate data from a range of sources natively in the Snowflake Data Cloud; the company is also releasing a new no-cost data loader for Snowflake and the private preview of new Snowflake-specific data governance features in Informatica’s Cloud Data Governance and Catalog service.
- Fivetran announced a new high-volume connector as well as new capabilities in the scope of its Snowflake integration.
- ALTR announced a new policy automation engine for managing data access controls in Snowflake and other platforms.
- StreamSets announced the general availability of its newest engine, StreamSets Transformer for Snowflake, which it says is built on Snowpark.
- Master Data Management vendor Semarchy last week announced xDI for Snowflake, which adds Snowflake-specific capabilities.
- Acceldata, a data observability/data quality provider, last week announced its own integration with Snowflake for “insights, control, and spend intelligence of Snowflake environments,” specifically around data, processing and pipeline orchestration.
- Rockset announced its own integration with Snowflake to provide developers access to data from sources including Apache Kafka, AWS DynamoDB and MongoDB, and combine it with data in Snowflake.
There is a lot to absorb here, and it’s important to keep in mind that most of the technologies being announced are in preview or still under development. But they’re coming. And once they’re here, they will vastly increase the number of fronts on which Snowflake is trying to compete. Although the company has always seen itself as more than a data warehouse, the fact remains that it had previously stayed very focused on the analytics space and competed very strongly there.
The moves announced at Snowflake Summit change the game significantly and put the company in some very crowded spaces — like operational databases and application development — where sometimes the competition is in a race to the bottom. How will such a hot company deal with the tough slog these new areas of competition may entail? We’ll see. But those who might bet against the company should be forewarned: the collective industry experience of Snowflake’s management team is formidable.