Data / Open Source

Datafold Quashes Data Mismatches with Open Source Data Diff

23 Jun 2022 5:15am, by

Data platform software provider Datafold recently launched an open source data package that compares information in tables in different databases, and highlights the difference between them. The new release is part of the data reliability company’s ongoing efforts to deliver granular testing capabilities throughout the ELT (Extract, Load and Transform) and ETL (Extract, Transform and Load) processes for data pipelines.

Detecting even slight differences in datasets across databases is critical for data integration, data migration, and ensuring the same data used to develop applications is used in production. The open source aspect of the package means this detailed comparison of tables can be embedded in a growing number of platforms and use cases.

A data diff is the data landscape equivalent of a software diff, which compares the code of one file to another to pinpoint any differences. Data diffs provide this advantage for entire datasets — between locations — so that “we’re showing how two datasets differ that aren’t even on the same database,” remarked Matt David, Datafold director of growth.

The data diff process involves algorithms that can rapidly analyze datasets, pinpoint differences between them, and even show users where they are. This capability is beneficial for validating data, reinforcing its integrity, and ensuring its reliability when replicated across locations.

The Data Diff Need

The reasons to diff data are as varied as those for replicating data. Common use cases include high availability, disaster recovery, loading a cloud warehouse for integration, or migrating data from one repository to another. In these applications and others — particularly those involving real-time syncing between source and target systems — it’s not unusual to find changes in the datasets.

“Potentially, there’s several reasons why those datasets could no longer be matching,” David acknowledged. “One reason is a row was added to the original and hasn’t been copied over yet. Another is a row was deleted in the original and we haven’t deleted it in the new one yet, or some field was updated and we didn’t move it over.”

A data diff is a rapid way to assess whether or not there are distinctions in the two datasets and correct them. In addition to providing statistical information about these differences, Datafold’s solution identifies “exactly where the mismatches are, which allows you to spot fix those instead of having to reload everything, which is what most people do,” David said.

Checksums

Evaluating checksums is the specific algorithmic approach that provides these benefits between tables at scale. David described the checksum process as “a hashing function. It takes in whatever data is in that field, and it turns it into a number that we can then sum and then compare.”

With this approach, even words or letters are given numerical values that are swiftly added up between databases. Any difference in those totals indicates a point of distinction in the datasets. Next, the algorithms are able to locate where the difference is by repeating the checksum process on smaller and smaller segments of the respective tables until they find which rows are dissimilar.

According to David, the algorithms will denote “hey, there’s a mismatch here; cut it in half. It checks each on both sides and then you kind of binary your way down to where the exact mismatches are. It then prints those back to your [command line interface] or materializes them to your database.”

The main advantage of this method is the speed at which it takes place, which certainly is faster than replicating the entire dataset or manually checking for differences in the two locations. This approach is also fairly extensive and works for any properties in a table.

The Bigger Picture

By offering these capabilities to the open source community, Datafold appears to be democratizing the means of validating data while spurring innovation for it, too. “The open source offering is more for migration and replication purposes,” David commented. “You can embed it in a script or in [Apache] Airflow.”

Moreover, when this functionality is combined with some of the paid solutions Datafold delivers, it effectively covers the majority of the data pipelining process—specifically that pertaining to loading sources.

The versatility of this approach encompasses both ETL and ELT, despite the overall shift to the latter with the surplus of cloud data warehouses, data lakes, and data lakehouses. “Open source is the ‘EL’, the app itself is the ‘T’,” David summarized. “Our paid offering…does the transformation stuff, and then our open source offering…is really just focused on the question of replication validation or auditing replication.”