MinIO’s Object Storage Supports External Tables for Snowflake
MinIO provides cloud-agnostic object storage that’s equally at home in on-premises, co-location, and edge environments for workloads involving advanced machine learning, streaming datasets, unstructured data, semi-structured data, and structured data.
Its impact on these data types is of more than academic interest to Snowflake users. MinIO’s capability to supply object storage almost anywhere data exists nicely complements Snowflake’s notion of external tables, which minimize data movement, decrease costs, and allow organizations to apply more of their data to any given use case. The company had a major presence at the Snowflake Summit being held this week and spoke with The New Stack about its relationship with Snowflake.
According to Jonathan Symonds, MinIO CMO, Snowflake “wants access to more data and not less and so, therefore, they basically created this concept called external tables. That allows you to query in place wherever the data may exist.”
When storing data with MinIO, there are few limits to where that data might actually be.
With this paradigm, Snowflake users can query data wherever they have external tables set up, which, when working with MinIO’s object storage, might be in adjacent clouds, in on-premises data centers, or on edge devices. From an end-user perspective, the data may as well be in Snowflake, minus all the data preparation and data pipeline work otherwise required to get it there. “The only thing that needs to happen is the administrator has to set up MinIO as an external table and give permissions to the user to be able to use it,” explained MinIO executive Satish Ramakrishnan. “So once they see this as an external table, then they can just run their regular queries. To them, it just looks like rows and columns in a database.”
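The administrator setup Ramakrishnan describes can be sketched roughly as follows. This is an illustration rather than MinIO’s or Snowflake’s own procedure: the helper, the bucket, endpoint, stage, table, and role names are all hypothetical, and the SQL is modeled on Snowflake’s documented syntax for S3-compatible storage, which should be checked against the current documentation before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinioSource:
    """Hypothetical description of a MinIO bucket to expose to Snowflake."""
    bucket: str
    path: str
    endpoint: str  # host name of the MinIO server, without a scheme


def external_table_statements(src, stage, table, role):
    """Build the SQL an administrator might run, in order.

    Assumptions: Parquet files, placeholder credentials, and manual
    metadata refresh. None of these choices come from the article.
    """
    prefix = src.path.strip("/")
    location = f"s3compat://{src.bucket}/" + (f"{prefix}/" if prefix else "")
    return [
        # 1. A stage that points Snowflake at the MinIO endpoint.
        f"CREATE STAGE {stage} URL = '{location}' "
        f"ENDPOINT = '{src.endpoint}' "
        "CREDENTIALS = (AWS_KEY_ID = '<access-key>' "
        "AWS_SECRET_KEY = '<secret-key>');",
        # 2. An external table defined over the files in that stage.
        f"CREATE EXTERNAL TABLE {table} WITH LOCATION = @{stage}/ "
        "FILE_FORMAT = (TYPE = PARQUET) AUTO_REFRESH = FALSE;",
        # 3. Permission for the users who will query it.
        f"GRANT SELECT ON EXTERNAL TABLE {table} TO ROLE {role};",
    ]


if __name__ == "__main__":
    source = MinioSource("sales", "orders/2023", "minio.example.com")
    for sql in external_table_statements(
        source, "minio_stage", "orders_ext", "analyst"
    ):
        print(sql)
```

Once statements along these lines have been run, a user holding the granted role would query the hypothetical `orders_ext` table with ordinary SELECT statements, just as with any internal table.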
Snowflake is responsible for querying the external data as though it were located internally. Ramakrishnan noted that for external tables, the cloud warehouse “does the same thing it does for its own internal systems, like caching queries and creating materialized views. It does that all automatically.” Any performance penalty appears to be negligible, which is attributed in part to those caching techniques. Ramakrishnan referenced a use case in which external tables were queried from Snowflake and “the first time when it does the fetch it took a few seconds and from then on everything else was in milliseconds…So, we know that there’s a lot of that caching, which they’re already undertaking.”
The in-place querying capabilities that Snowflake’s external tables enable in MinIO’s object storage create numerous advantages for the enterprise. The most noteworthy may be that data in distributed environments no longer has to move. Data movement has traditionally been considered a bottleneck and is often both costly and cumbersome.
“You’re able to actually run this without any of that data movement, either the cost [of it] or having to clean it up,” Ramakrishnan commented about this in-place querying approach. “You can do it on all of your data. And most importantly, it’s current. It doesn’t have to go from a pipeline from your data lake all the way into Snowflake.” Depending on the use case and the velocity of the data, when data pipelines are involved it’s not uncommon for new data to have already been generated by the time data is transported to Snowflake.
The prohibitive costs of such traditional approaches frequently force users to choose which data they move, preventing them from querying or accessing all of it. Another advantage of the external table approach is that data is accessible from multiple instances of Snowflake, which is beneficial for organizations with decentralized teams in different geographic locations.
“You can have a Snowflake instance in AWS and a Snowflake instance on GCP and still access that same table,” Ramakrishnan remarked. “There’s no data movement required.” There are also fewer copies of data, which helps security, access control, and data governance efforts. Plus, users get a uniform version of their data to support the proverbial single version of the truth. “You don’t have to move data around and you can actually run all your regular Snowflake jobs; queries and applications will all work as is,” Ramakrishnan added.
The overarching significance of object storage may very well be its ability to provide highly detailed metadata descriptions of unstructured and semi-structured data, which is swiftly retrievable at scale. However, Snowflake’s in-place querying via external tables roundly expands those benefits by eliminating the data movement, costs, and latency of data pipelines. The cloud data warehouse’s broad user base may very well avail itself of this benefit as much as it does of other applications of object storage.