Data / Data Science / Machine Learning

Databricks Bolsters Unity Catalog with Data Lineage

13 Jun 2022 7:44am, by
Red and blue tendrils on a black background

Data lakehouse pioneer Databricks has announced support for data lineage in its Unity Catalog. Data lineage has significant consequences for organizations employing the Databricks platform for machine learning, statistical artificial intelligence, and data governance in general.

The ability to flawlessly trace data’s journey throughout the building and deployment of machine learning models impacts everything from successful operations to regulatory compliance.

“You should be able to see how a model works,” maintained Joel Minnick, Databricks VP of Marketing. “Whether that model is generated by developing it manually, or you’re using AutoML to take you part of the way or all of the way there, it should never be opaque how that model works.”

Data lineage considerably helps data scientists understand why machine learning models generate their specific outputs. It enables users to trace everything about the data used for a specific model, from what data was involved in its training to how that data was manipulated or transformed. Moreover, data lineage provides these same benefits for the operational data models encountered during deployment.

The data provenance capabilities in the Unity Catalog facilitate these advantages in real-time so, as Minnick indicated, there’s effectively “no difference between what I’m seeing in the catalog as the lineage and history of what is happening to this data, and what is true at the moment.”

Both the upstream and downstream facets of this metadata play a pivotal role in comprehending exactly why machine learning models are producing their specific results. Traditionally, gleaning such information has never been easy.

Downstream Effects

The detailed knowledge of data’s journey throughout the enterprise is applicable to a variety of users including data stewards, data analysts, data engineers, and others. However, such traceability is particularly useful for data scientists looking to train and deploy machine learning models. The Unity Catalog is what Minnick termed “API based.” Thus, it’s “capturing all those API calls and operations happening to data through the Databricks platform, to provide all the real-time insight of what’s happening to that data,” Minnick said.

This information is invaluable for data scientists, who need to determine what data was used for a model, where it came from, and how exactly it was configured. Moreover, one of the primary distinctions of the Unity catalog is it furnishes this visibility for some of the most popular data science platforms today. “One of the things we thought about for the catalog, and data lineage as a result, is it isn’t just limited to SQL, which has been true for a lot of catalog products,” Minnick explained. “But when we start talking about notebooks, that’s folks more and more using Python, R, and Scala.”

Upstream Implications

The data lineage facilitated by Databricks’ Unity Catalog not only extends to data science notebooks, but also to dashboards and other sources like data lake houses, data warehouses, and data lakes. As helpful as traceability is for data scientists, it’s equally so for dedicated data governance personnel like data stewards, particularly when considering what Minnick called “upstream implications.” Scrutinizing such metadata enables data stewards to do impact analysis throughout the enterprise at a highly detailed level.

For example, they can see the repercussions of changing data types, dropping columns, or changing the names of rows. For specific data, stewards can “understand not only what tables has that been used in, but now also what notebooks has that been used in; what dashboards has that been used in,” Minnick revealed. “So now, I really have a sense of what business continuity issues might come out of this if I make this change in data.”

Regulatory Compliance and More

There’s no end to the utility of data provenance, especially for data lakehouse solutions such as Databricks’ and for deployments of machine learning. Although the detailed lineage that Unity Catalog now supports is not a panacea for explainability, it certainly gives data scientists a fair amount of visibility into how models create their results. That knowledge subsequently offers a blueprint for how to recalibrate models and their training data inputs to produce fair, transparent, and more accurate outputs.

“There’s been some examples in the media recently of folks who’ve had to answer that question: where did the data come from that this model was trained on?” Minnick said. “If you can’t answer that question, or you can answer that question but have a very hard time separating data that was problematic from data that wasn’t, then you have to stop using the model.”