Cost-Efficient Data Ingestion for Snowflake
Snowflake is a data platform built for the cloud. It provides sheer infinite scalability using a consumption-based pricing scheme. Your cost of operating Snowflake is proportional to its use. Fundamentally, this should not be a concern, because Snowflake’s superb analytical capabilities likely deliver a positive return on your investment.
But can you improve your margins? Are there ways to limit spending without a negative effect on the outcomes?
This post focuses on data ingestion. More specifically, this post focuses on replicating — often high value — transactional relational data into Snowflake. Examples include industry-agnostic applications such as ERP, CRM and SCM systems, industry-specific applications such as EPIC Clarity, and home-grown relational database applications. With data replication, following an initial load, you keep a copy of data in Snowflake up-to-date using Change Data Capture (CDC) on the source(s).
Data replication enables near real-time access to source data in Snowflake. We have found at HVR that, like all other operations in Snowflake, you pay for use.
Snowflake Consumption-based Pricing
At its core, three components make up the Snowflake Data Cloud:
- Compute, in the form of virtual warehouses
- Cloud services, referred to as the “brain,” to make it all work
These core architectural components enabled Snowflake to become a reliable and scalable data storage and processing service.
The three core architectural components are also fundamentally the basis for Snowflake’s consumption-based pricing:
- Customers pay for use of storage in Snowflake. By default, Snowflake provides Time Travel and Fail-safe features on the data. As data changes, Snowflake retains previous versions of the data for a configurable amount of time. Data is always compressed.
- Data is processed using a virtual warehouse. T-shirt size virtual warehouses determine the pay rate, with virtual warehouses automatically spinning down when not in use. Running queries and DML (insert, update, delete) statements both require compute resources.
- On a daily basis, cloud service charges up to 10% of compute consumption use are included. Beyond this, cloud services are charged and visible in the metering_daily_history view.
In addition to these three core components, Snowflake introduced a number of serverless features that are optimized for short bursts of resource consumption. Examples of serverless features are automatic table clustering and the stream data capability Snowpipe. Serverless features are charged for use by the second.
Customers buy credits to pay for compute and cloud service charges, and for the use of serverless features. The cost per credit varies depending on the required service level. The storage cost is a simple cross-charge from the cloud provider for identical storage use in their scalable block storage (AWS S3, Azure Data Lake Store (ADLS), Google Cloud Storage (GCS)).
Replicating Data into Snowflake — Fundamentals
Snowflake was designed with an analytical workload in mind. Processing online transactions with thousands of transactions per second was never the goal. However, a typical Snowflake implementation for any organization consolidates data from multiple high-volume transactional systems. Streaming (inserts only) or micro batching (inserts, updates and deletes) this data into Snowflake is required to keep up with the aggregate change volume.
Consider the importance of a transactional representation of your source data in your analytical environment. Do you require a transactionally consistent view of the data that fulfills all the source’s data integrity rules, or is a (temporarily) inconsistent view acceptable? For example, your ERP system may track shipments and inventory in separate tables. Does your analytical system require that shipments and inventory align based on a consistent state of your ERP system?
Cost-Efficient Data Ingestion
Latency is a key metric for data replication. How can you meet business requirements in terms of latency, while limiting Snowflake costs?
With charges proportional to the amount of processing that takes place you must balance ingestion frequency/rate with latency requirements:
- Applying changes to existing data more frequently leads to multiple older copies of the data at different points in time to allow for Snowflake’s Time Travel and Fail-Safe features. Can you adjust (lower) ingestion rates during certain time windows (e.g. at night or during weekends)?
- Frequent DML as a result of (close to) real-time replication makes more use of cloud services relative to less frequent (micro) batches. Depending on the daily total compute consumption continuous replication may result in separate cloud services charges.
- Snowflake uses a READ COMMITTED isolation model. A statement only sees the data that was committed before the statement began, or changes made by previously completed statements in the same transaction. Take advantage of Snowflake’s transaction model to ensure required consistency, and to allow for recovery in case of failures. Data integration tools will leverage transactions for this purpose.
The tools/techniques you use to apply incremental changes can also affect charges:
- Efficient SQL results in lower requirements for compute resources. Efficiency can be the result of smart data filters, but also taking advantage of a single merge statement instead of multiple separate insert, update and delete statements.
- Use transient objects for any data staging to avoid unnecessary charges for Time Travel and Fail-Safe.
Cloud services provide scalability for powerful technology using a pay-as-you-go model, with management provided by the vendor. Snowflake’s Data Cloud is no exception.
Do you still have (some) control over the cost you pay? It turns out you do on the data ingestion side. This post discussed strategies and technology selection criteria you may deploy to fine-tune replicating — high-value — relational database data into Snowflake.