Delta Lake: A Layer to Ensure Data Quality
One of the Linux Foundation’s newest projects, Delta Lake, aims to ensure the reliability of data across data lakes at massive scale. These big data systems are most commonly used for machine learning and data science, but also for business intelligence, visualization and reporting.
With multiple people working with data in a data lake at the same time, it’s easy for problems like incomplete transactions or multiple simultaneous updates to bring the quality of the data into question.
“Delta Lake enables you to add a transactional layer on top of your existing data lake. Now that you have transactions on top of it, you can make sure you have reliable, high-quality data, and you can do all kinds of computations on it. You can, in fact, mix batch and streaming. … Because the data is reliable, it’s OK to have someone streaming in data while someone else is in batch reading it,” Ali Ghodsi, co-founder and CEO of Databricks, explained at Spark+AI Summit Europe.
Delta Lake provides ACID transactions, snapshot isolation, data versioning and rollback, as well as schema enforcement to better handle schema changes and data type changes.
Databricks open sourced the technology in April under the Apache 2.0 license.
Companies using it in production include Viacom, Edmunds, Riot Games and McGraw Hill. Alibaba, Booz Allen Hamilton, Intel and Starburst Data are collaborating with Databricks to also support Apache Hive, Apache NiFi and Presto.
There are other ways to add transactional support to data lakes. Cloudera’s Project Ozone takes a similar tack, and Apache Hive offers ACID tables for HDFS-based storage.
Delta Lake is not a storage system per se; it sits atop existing storage, such as HDFS, Amazon S3 or Azure Blob Storage, providing a bridge between on-premises and cloud storage systems.
It can read from any storage system that supports Apache Spark’s data sources and can write to Delta Lake, which stores data in Apache Parquet format. All transactions made on Delta Lake tables are recorded directly to disk.
Central to Delta Lake is the transaction log, an ordered record of every change users make to a table. Each change is recorded as a JSON file, in the order it was made. If someone makes a change and then deletes it, a record of both actions remains, which simplifies auditing.
It provides atomicity, recording only transactions that execute fully and completely, to ensure the trustworthiness of the data.
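To make the idea concrete, here is a minimal, hypothetical sketch of an ordered JSON commit log in the spirit of Delta Lake’s transaction log. The file layout and function names are illustrative assumptions, not Delta Lake’s actual implementation; the write-then-rename step stands in for the atomicity guarantee described above.

```python
# Illustrative sketch only -- not Delta Lake's real code or file format.
import json
from pathlib import Path

def commit(log_dir: Path, actions: list[dict]) -> int:
    """Append one commit as a numbered JSON file; the write is all-or-nothing."""
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))  # next commit number
    path = log_dir / f"{version:020d}.json"
    # Write to a temp file first, then rename: the commit either appears
    # fully in the log or not at all (atomicity).
    tmp = path.with_suffix(".tmp")
    tmp.write_text("\n".join(json.dumps(a) for a in actions))
    tmp.rename(path)
    return version

def history(log_dir: Path) -> list[list[dict]]:
    """Replay every commit in order -- adds and later removes are both kept."""
    commits = []
    for f in sorted(log_dir.glob("*.json")):
        commits.append([json.loads(line) for line in f.read_text().splitlines()])
    return commits
```

Because removes are recorded as new commits rather than erasing old ones, the full change history stays available for auditing.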
Just as multiple people can work on a jigsaw puzzle by tackling different areas of it, Delta Lake is designed to let multiple people work on the data at once without stepping on each other’s toes.
When dealing with petabytes of data, those users will most likely be working on different parts of the data. If two changes do happen simultaneously, Delta Lake settles the matter with optimistic concurrency control, a protocol in which the data remains unlocked and conflicts are detected only when a change is committed.
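The optimistic approach can be sketched roughly as follows. This is a deliberately coarse, hypothetical model: each writer remembers the table version it read and, at commit time, fails if anyone else committed first. (Delta Lake itself is finer-grained and can automatically reconcile changes that touch different data.)

```python
# Coarse illustrative sketch of optimistic concurrency control,
# not Delta Lake's actual conflict-resolution logic.
class ConflictError(Exception):
    pass

class OptimisticTable:
    def __init__(self):
        self.version = 0    # latest committed version
        self.files = set()  # data files in the current snapshot

    def try_commit(self, read_version: int, adds: set, removes: set) -> int:
        # The data was never locked; only now do we check for interference.
        if read_version != self.version:
            raise ConflictError("table changed since read; retry on a fresh snapshot")
        self.files = (self.files - removes) | adds
        self.version += 1
        return self.version
```

A writer that hits `ConflictError` simply re-reads the table and retries, which is cheap when, as the article notes, most users are working on different parts of the data.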
It also offers a “time travel,” or data-versioning, feature, enabling users to query the table as it existed at a specific point in time. Every 10 commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format. Those files let Spark skip ahead to the most recent checkpoint, which captures the state of the table at that point, rather than replaying the log from the beginning.
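The replay shortcut can be sketched as below. This is an illustrative assumption-laden model, not Delta Lake’s format: commits are simplified to sets of added files, and checkpoints map a version number to the full table state at that version.

```python
# Illustrative sketch of checkpoint-accelerated log replay ("time travel").
CHECKPOINT_INTERVAL = 10  # the article's "after 10 commits" cadence

def load_state(commits: list[set], checkpoints: dict[int, set], as_of: int) -> set:
    """Reconstruct the table's file set at version `as_of`."""
    # Start from the newest checkpoint at or before the requested version...
    start = max((v for v in checkpoints if v <= as_of), default=-1)
    state = set(checkpoints[start]) if start >= 0 else set()
    # ...then apply only the commits made after that checkpoint, in order.
    for version in range(start + 1, as_of + 1):
        state |= commits[version]
    return state
```

Reading version 22 of a 25-commit log thus replays only commits 20 through 22 on top of the version-19 checkpoint, instead of all 23 commits.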
Delta Lake supports two isolation levels: Serializable and WriteSerializable. WriteSerializable, which is stronger than snapshot isolation, offers the best combination of availability and performance and is the default. Serializable, the strongest level, ensures that the serial sequence of operations matches exactly the one shown in the table’s history.
The Linux Foundation is a sponsor of The New Stack.