Databand: Observability for Data Pipelines

27 Apr 2020 12:17pm

For all the buzz around observability these days, most of it is focused on infrastructure operations, while data engineers are left to plug gaps in knowledge around how data pipelines work, or don’t.

Application performance management (APM) offerings and typical monitoring tools don’t provide the kind of insight that data engineers need into their pipelines, according to Josh Benamram, CEO at Databand.

Data teams will have their cloud, Docker and Kubernetes running and then, on top of that, the more specialized tools for processing data: different tools for streaming and for batch processes, such as Spark, Presto or Apache Airflow, he said.

“You can think about data in a common data pipeline in a business would be something like every single day at 12 p.m. I run this process if I’m a financial services company,” he explained.

“I run this process which takes in enormous amounts of data from 20 different exchanges, like stock exchanges, like NYSE, NASDAQ, whatever pulls in that data into my system every day … runs an hours-long process to aggregate extract features from the data, cleanse the data, pull it together into a single location that my data scientists use. The underlying infrastructure that powers something like that would be your cloud environment, maybe Kubernetes, Apache Airflow for scheduling your run every day at 12 p.m., Spark for prod, for ingesting the data at scale and doing large-scale processing, and then something like Snowflake or BigQuery, or Redshift for delivering the data into some data lake that other teams can use. So it’s just another order of complexity than what software engineers normally will work with.”

He explains in a blog post that typical APM tools focus on metrics, logs and traces, while data pipeline monitoring also requires insight into data flows (are there issues in data quality?), the schedules on which batch processes run, and the internal and external dependencies that link pipelines together.

“If you were running a data pipeline every day at 12 p.m. There’s the ‘ephemeral-ness,’ if that’s a word, of that process. The fact that it runs as a long-running batch process creates nuances around how you monitor it relative to a normal microservice or application, which is supposed to run 24/7 all the time, no downtime ever.

“If I’m running a batch process, it’s totally normal for a batch process to fail five, or six or seven or 10 times before it kicks on and successfully runs. And it’s very normal for these batch processes to have a really complicated web of dependencies,” he said.

“I might have one pipeline that delivers data into one location, and then another pipeline that reads the data from that place and delivers it to another place. You can just imagine this big web growing. And when you think about all those nuances, it really creates a huge need for a dedicated tool that understands this stuff.”
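The web of dependencies Benamram describes can be pictured as a directed graph. The following is only an illustrative sketch, not Databand’s implementation: the pipeline names are hypothetical, and it relies on Python’s standard `graphlib` module (Python 3.9 or later) to find a valid run order and to list which pipelines are affected when an upstream one fails.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines: each name maps to the set of pipelines it reads from.
DEPENDS_ON = {
    "ingest_exchanges": set(),
    "clean_prices": {"ingest_exchanges"},
    "build_features": {"clean_prices"},
    "risk_report": {"build_features"},
    "audit_log": {"ingest_exchanges"},
}


def run_order(depends_on):
    """Return an order in which every pipeline runs after its upstreams."""
    return list(TopologicalSorter(depends_on).static_order())


def downstream_of(failed, depends_on):
    """Return every pipeline that directly or indirectly reads from `failed`."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for name, upstreams in depends_on.items():
            if current in upstreams and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


order = run_order(DEPENDS_ON)
impacted = downstream_of("clean_prices", DEPENDS_ON)
print(order)
print(sorted(impacted))
```

In this toy graph, a failure in `clean_prices` affects `build_features` and `risk_report` but leaves `audit_log` untouched, which is the kind of context a generic monitoring tool would not know about.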

Tracking Data Pipelines

Databand co-founders Evgeny Shulman, Benamram and Victor Shafran met at tech community events in Tel Aviv. They launched the company in 2018. It’s based in New York, with engineering and R&D still centered in Tel Aviv.

They have released DBND, an open source framework for building and tracking data pipelines. It includes a Python library, a set of APIs and a CLI that can be used for data ingestion, preparation, machine learning model training and production. DBND can be used as an orchestrator for systems such as Airflow, providing deep tracking of pipeline metadata and decoupling of code from underlying compute and data systems. DBND requires Python 2.x or 3.x and supports Windows, macOS and Linux.

The Databand offering is billed as an observability solution plugged into the open source data ecosystem, providing deeper understanding of infrastructure performance, how much it’s costing, and how accurate the data is.

“Databand was founded by experienced data scientists and software developers who vividly appreciate the pain points of data science project management. Their backgrounds make them uniquely positioned to help data engineers and data scientists be more productive and more effective in using data in enterprise research and production environments,” David Magerman, managing partner and chief technology officer at Differential Ventures, one of Databand’s investors, said of the company in an email.

One of the issues Shulman points out in a post is that, because many data processes are long-running, a failure toward the end can be costly, as jobs must be restarted from the beginning.
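To make the cost of a late failure concrete, here is a minimal sketch of one common mitigation: recording completed stages so that a rerun skips them. This is not taken from Shulman’s post or from DBND. The stage names, the `run_with_checkpoints` helper and the JSON checkpoint file are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(stages, checkpoint_path):
    """Run (name, func) stages in order, skipping any already recorded as done."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else []
    executed = []
    for name, func in stages:
        if name in done:
            continue  # finished in an earlier attempt
        func()  # may raise; stages completed so far stay recorded on disk
        done.append(name)
        path.write_text(json.dumps(done))
        executed.append(name)
    return executed


def flaky_factory():
    """Build a stage that fails on its first call and succeeds afterwards."""
    calls = {"n": 0}

    def stage():
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("simulated late failure")

    return stage


checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
stages = [
    ("extract", lambda: None),
    ("aggregate", lambda: None),
    ("deliver", flaky_factory()),
]

try:
    first_attempt = run_with_checkpoints(stages, checkpoint)
except RuntimeError:
    first_attempt = None  # the run failed at the last stage

second_attempt = run_with_checkpoints(stages, checkpoint)
print(second_attempt)
```

Without the checkpoint file, the second attempt would repeat `extract` and `aggregate` before reaching the stage that failed; with it, only `deliver` runs again.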

The main competition to Databand is the in-house work that organizations cobble together to connect their data pipeline infrastructure and some standard monitoring solution, Benamram said.

The first difference is that Databand is designed specifically for data, with integrations for Airflow, Databricks, Spark, Kubernetes, MLflow and other tools.

“The second element that makes us different is we collect different kinds of information from these processes. Examples would be collecting a lot more metadata about the scheduler that you’re using to run your pipelines, collecting more metadata from the engine, like Spark, and collecting more metadata about the data itself, the actual structure of the data that you’re operating on within your pipelines. And if you wanted to meet the same kind of monitoring with a standard monitoring tool, there would just be a lot of middleware and logging that you would need to build as a company to get to the same place that we do out of the box,” he said.

The third differentiator is that monitoring information is presented in the context of the pipeline itself, he said.

“Within our system, you see your data pipeline, you see the different nodes of the data transformation tasks that you’re running. And you open up those nodes to understand what the data lineage looks like and how your data structure might be changing between runs of this process, and whether there’s problems in the data set or problems in the code that you’re executing within the pipeline,” he said.
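One way to picture “how your data structure might be changing between runs” is a comparison of column schemas from two consecutive runs. The sketch below is an assumption-laden illustration, not Databand’s API: the `diff_schemas` helper and the column names are hypothetical, and schemas are modeled as plain dictionaries mapping column name to type name.

```python
def diff_schemas(previous, current):
    """Compare two {column: type} mappings and report what changed."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        col
        for col in set(previous) & set(current)
        if previous[col] != current[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas captured from yesterday's and today's runs.
yesterday = {"ticker": "str", "price": "float", "volume": "int"}
today = {"ticker": "str", "price": "str", "exchange": "str"}

changes = diff_schemas(yesterday, today)
print(changes)
```

A check like this, run after each batch, would flag that `volume` disappeared and that `price` changed type before downstream consumers read the data.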

The world is still talking a lot about machine learning and AI, he said, and Databand is seeing those use cases.

“The most advanced teams in the world are building up systems now to do automated maintenance of their machine learning models through retraining processes. And we will help them to monitor and observe those kinds of systems,” he said.

“But we also see a lot of just standard ETL cases … A lot of the world is still maturing those classic ETL cases where you just need to make sure that you’re taking in data from as many sources as you need, transforming the data in the right way, and delivering it to the people that need it and making sure that all this system is held to the right quality standard.”

Image by JuraHeep from Pixabay.