Databricks Brings Data Pipeline Service to GA
Databricks, the cloud data platform company that coined the phrase “data lakehouse” and was founded by the creators of Apache Spark, is today announcing general availability (GA) of Delta Live Tables (DLT). DLT is a data transformation and data pipeline service that Databricks launched in preview form in May 2021.
The New Stack was fortunate enough to be briefed on DLT’s GA by Databricks Distinguished Software Engineer Michael Armbrust, who created Delta Live Tables, and Databricks CEO Ali Ghodsi. In the briefing, the two explained some of the finer points of DLT that help it avoid being “just another” ETL (extract-transform-load) solution on the market.
Hot Mess, Cool Cleanup
Let’s start with the problem Delta Live Tables seeks to address. As Ghodsi describes it: “… people are… stitching together so many different things. They have the data, they use these tools to get [it] in, but then they have to use Airflow, or maybe they’re using Oozie, they’re writing a bunch of custom ETL scripts, they’re moving it into data warehouses, they’re moving it into data lakes… they have to do their own monitoring to make sure that this stuff doesn’t break… there’s just behind-the-scenes hell, that everybody has to do.”
Now contrast this with Databricks’ view of how things should be: data engineers should only have to provide a declarative specification of the data transformations they wish to perform in a data pipeline, and do it in a language they already know. Moreover, data engineers shouldn’t have to concern themselves with the logistics behind, or special performance considerations around, executing their pipelines. Instead, they should only have to define a spec; the system should then take over, managing execution on an on-demand, continuous or scheduled basis.
In a nutshell, that’s what Delta Live Tables seeks to do.
Sweet Syntactic Sugar
Since Databricks thinks data engineers should be able to build data pipelines by leveraging skills they already have, DLT’s bread and butter are SQL and Python code snippets in a notebook.
On the SQL side, the output of a pipeline is defined by a query whose result set indicates an output table’s schema and content. Extensions to the SQL syntax allow specification of “expectations” — data quality rules and actions to be taken when rows of data don’t comply.
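To make the idea of an “expectation” concrete, here is a minimal sketch in plain Python of a data quality rule paired with an action to take when a row doesn’t comply. This is not DLT’s SQL syntax or its API: the `Expectation` class, the `apply_expectations` function and the three action names (`warn`, `drop`, `fail`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict


@dataclass(frozen=True)
class Expectation:
    """A named data quality rule plus the action to take on violation.

    Hypothetical structure for illustration; not the DLT API.
    """
    name: str
    predicate: Callable[[Row], bool]
    on_violation: str  # assumed actions: "warn", "drop", or "fail"


def apply_expectations(rows, expectations):
    """Return (kept_rows, violation_counts) after checking every rule."""
    kept = []
    violations = {e.name: 0 for e in expectations}
    for row in rows:
        keep = True
        for e in expectations:
            if e.predicate(row):
                continue
            violations[e.name] += 1
            if e.on_violation == "fail":
                # Abort the whole run on the first violating row.
                raise ValueError(f"expectation {e.name!r} failed for row {row}")
            if e.on_violation == "drop":
                keep = False
            # "warn" only counts the violation and keeps the row.
        if keep:
            kept.append(row)
    return kept, violations


rows = [
    {"id": 1, "amount": 20.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -1.0},
]
expectations = [
    Expectation("valid_id", lambda r: r["id"] is not None, "drop"),
    Expectation("non_negative_amount", lambda r: r["amount"] >= 0, "warn"),
]
kept, violations = apply_expectations(rows, expectations)
```

In this toy run, the row with a missing `id` is dropped, while the row with a negative `amount` is kept but counted as a violation, which is the kind of per-rule choice the article describes.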
On the Python side, rather than writing imperative code, the developer leverages extensions to the DataFrame API with a declarative syntax for specifying calculations, destination table column names, filter conditions, and support for attributes that specify the same data quality “expectations” supported in SQL.
In Armbrust’s words: “In both cases… you are giving a declarative description of what tables should exist inside of your lakehouse, and then the system is figuring out how to create and keep those tables up-to-date.”
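The declarative pattern Armbrust describes can be sketched in a few lines of plain Python. This is a toy model, not DLT itself: the `table` decorator, the `TABLES` registry and the `materialize` function are hypothetical names, and inferring dependencies from function parameter names is an assumption of this sketch rather than how Databricks does it.

```python
import inspect

TABLES = {}  # table name -> function that defines its contents


def table(fn):
    """Register fn as the definition of a table named after the function."""
    TABLES[fn.__name__] = fn
    return fn


@table
def raw_orders():
    # Stand-in for ingested source data.
    return [
        {"id": 1, "region": "EU", "amount": 20.0},
        {"id": 2, "region": "US", "amount": 35.0},
        {"id": 3, "region": "EU", "amount": 5.0},
    ]


@table
def eu_orders(raw_orders):
    # Declares a dependency on raw_orders simply by naming it as a parameter.
    return [r for r in raw_orders if r["region"] == "EU"]


@table
def eu_revenue(eu_orders):
    return [{"region": "EU", "total": sum(r["amount"] for r in eu_orders)}]


def materialize(name, cache=None):
    """Build the named table, first building whatever it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = TABLES[name]
        deps = inspect.signature(fn).parameters
        inputs = {dep: materialize(dep, cache) for dep in deps}
        cache[name] = fn(**inputs)
    return cache[name]


result = materialize("eu_revenue")
```

The author of the three definitions never says in what order they should run. The toy “system” works that out from the declarations, which is the division of labor the quote points to.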
Execution Sans Naivete
Notebooks with DLT code can be scheduled as a special kind of job in Databricks, which triggers analysis of the notebook’s code and generation of an intelligent execution graph. The analysis permits parallel execution of subtasks that are determined not to have mutual dependencies and proper sequencing of subtasks that do. This allows Databricks to go beyond mere agnostic scheduling of the notebook’s code. As Ghodsi explained it, pipelines generated by other platforms whose execution might be orchestrated by Apache Airflow, for example, would not enjoy such boosted execution.
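The general idea of dependency-aware scheduling can be illustrated with Python’s standard `graphlib` module (Python 3.9 and later). The sketch below groups tasks into stages, where everything in a stage has no mutual dependencies and could run in parallel. The table names and the `parallel_stages` helper are hypothetical, and this is only an illustration of the concept, not a description of how DLT builds its execution graph.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into ordered stages; tasks within a stage are independent.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies satisfied.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical pipeline: two independent branches that join at the end.
pipeline = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "clean_customers": {"raw_customers"},
    "orders_by_customer": {"clean_orders", "clean_customers"},
}
stages = parallel_stages(pipeline)
```

Here the two raw tables land in the first stage, the two cleaned tables in the second, and the join in the third. A scheduler that simply ran the notebook top to bottom would execute all five steps one after another.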
The acceleration this brings is comparable to that of conventional SQL commands executed on a database with a query optimizer. In fact, Spark SQL’s query optimizer is responsible for generating the execution graph in the first place. This makes sense, because Armbrust also created Spark SQL. In addition, Delta Live Tables works for both streaming data and data-at-rest since Spark Streaming, also created by Armbrust, works with the same data access constructs used by the rest of the Databricks platform.
To date, most ETL implementations have involved completely code-driven efforts, or the use of a standalone ETL platform with a visual design surface. Delta Live Tables finds a middle ground, taking a code-based yet declarative approach. While the dbt platform takes a similar SQL-based declarative approach, it’s a standalone solution, whereas DLT’s engine is deeply integrated into the very same Databricks platform used for data science and analytics.
Meanwhile, there’s no reason that Databricks couldn’t create a visual designer for DLT that would generate the underlying SQL code. In fact, the Databricks workspace user interface generates a visualization of the execution graph when a job is built around a DLT notebook (as seen in the screenshot above). And while the graph visualization is a management/monitoring feature and not an authoring interface, there’s no reason it couldn’t work in both directions, generally speaking. Maybe that’s why I got the distinct feeling when speaking with Armbrust and Ghodsi that a visual designer might be on the horizon.
A Market Execution Engine, Too
For now, though, Databricks is focused on making its platform an omni-data workbench and execution environment that spans data ingest, exploration, storage, transformation, analytics, data science, machine learning and MLOps. And as Databricks continues to square off with Snowflake in the battle to be the dominant independent data cloud provider and ecosystem, its combination of functional breadth and technical depth makes a great deal of sense.