Git-Based Machine Learning Tools for ML Engineers

4 Feb 2021 9:10am

While working as a data scientist at Microsoft, Dmitry Petrov decided that big, monolithic data platforms weren’t the way to go. He concluded that tools needed to be built on top of existing platforms, that they needed to be open source, and that machine learning engineers had particular needs that were not being met.

His solution was to create a San Francisco-based startup focused on managing machine learning models. Its two products, DVC (Data Version Control) and CML (Continuous Machine Learning), aim to bring engineering practices to data science and machine learning.

In the ever-growing ecosystem of DataOps enterprise software vendors, DVC joins the likes of TerminusDB, Dolt and Pachyderm in aiming to bring a Git-like experience to data science, but Petrov says the focus of DVC is narrow: versioning data and ML models.

In managing their data, companies initially decide they need to move it around, to colleagues’ laptops, to the cloud, to production systems, Petrov said. They need to know they’re working on the right version, especially when training a model.

“Our focus is ML modeling, ML process, so we can help people to build models, to share the model between the team, to collaborate on the model,” Petrov said.

An O’Reilly report on 2021 trends cites a lack of adequate tools for versioning data (though it calls DVC a start), as well as a lack of adequate tools for versioning models (though there it points to tools such as MLflow as a start).

Petrov said the needs of data analysts and data scientists differ from those of ML engineers, and the continuous integration/continuous delivery tools of the software engineering stack don’t necessarily meet those needs. Rather than build out a separate platform, however, he decided to build on top of GitHub, GitLab and, more recently, Bitbucket.

ML engineers, he said, tend to work with unstructured data — images, videos, text — while data scientists usually work with structured data, often from a data warehouse.

“ML engineers, they do write a code. Their models are usually complicated. They work in a team,” he said. “Data scientists and data analysts they work for usually in a relatively small project, like maybe two days, maybe one week at the best. They don’t need any advanced collaboration tool.

“ML engineers, they still need collaboration. They need GitHub for collaboration, they need this CI/CD system to resolve [issues] between each other, between the team and production system,” he said.

That’s where DVC and the CML library come in.

DVC offers a way to track changes in data, source code and ML models together to provide a single history of a project. It enables users to track the evolution of experiments, reproduce projects without model retraining and share projects.

DVC is built on top of Git. Users create lightweight metafiles that describe the ML artifacts to track, and the system uses this metadata to handle large files rather than storing them in Git. DVC relies on remote storage for large files, either in the cloud (S3, Azure, Google Cloud, etc.) or on on-premises network storage (via SSH, for example). That storage is treated as a key-value store, and DVC employs hardlinks/symlinks instead of copying files.
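To make the metafile idea concrete, here is a minimal sketch in Python. It is not DVC's implementation: the `track` function, the `.meta.json` file name, the cache layout and the choice of SHA-256 are all assumptions made for illustration. It shows only the general pattern described above, in which a large file is stored under a key derived from its content, a small metafile records that key, and the working copy is a hardlink rather than a second copy.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Store a file under a content-derived key and write a small metafile.

    Hypothetical helper for illustration only; not part of DVC.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        # Same content is already cached, so drop the duplicate.
        data_file.unlink()
    else:
        # Move the content into the cache; the key is its hash.
        data_file.replace(cached)
    # Hardlink the workspace path to the cached content instead of copying it.
    os.link(cached, data_file)
    # The metafile is tiny, so it is cheap to commit to Git.
    meta_path = data_file.with_name(data_file.name + ".meta.json")
    meta_path.write_text(json.dumps({
        "path": data_file.name,
        "sha256": digest,
        "size": cached.stat().st_size,
    }, indent=2))
    return meta_path


# Demo in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
data = workdir / "train.csv"
content = "id,label\n1,cat\n2,dog\n"
data.write_text(content)
meta_path = track(data, workdir / "cache")
meta = json.loads(meta_path.read_text())
cached_copy = workdir / "cache" / meta["sha256"][:2] / meta["sha256"][2:]
```

In this sketch only the small JSON file would go into version control, while the cache directory stands in for the key-value storage that holds the actual data.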

Versions of the data and models are stored as Git commits, enabling users to create snapshots, restore previous versions, reproduce experiments, and more. They can manage experiments with Git tags/branches and metrics tracking.

DVC defines rules and processes for collaborating as a team and for running a finished model in production. With push/pull commands, you can consistently move ML models, data and code into production or to other locales.

Lightweight pipelines connect versioned data sets, models and code. Pipelines are treated as a first-class citizen. They are language-agnostic and connect multiple steps into a directed acyclic graph (DAG).
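The DAG idea can be sketched with Python's standard library. The stage names and dependencies below are hypothetical and this is not how DVC itself resolves pipelines; the sketch only shows how declaring what each step depends on yields a valid execution order and rejects cycles. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
stages = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "prepare"},
}


def execution_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(stages)


def has_cycle(dependencies):
    """Report whether the stage graph is not a valid DAG."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False
```

Because the graph is acyclic, every stage runs only after the stages that produce its inputs, whatever language each stage happens to be written in.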

DVC can mark certain stage outputs as metrics, which help users compare models and data sets across versions. The plots feature displays the metrics in visual form.
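As a rough illustration of comparing versions by their metrics, the Python sketch below picks a version by a chosen metric. The version tags, metric names and values are invented for the example, and `best_version` is a hypothetical helper rather than a DVC function.

```python
# Hypothetical metrics recorded for three tagged versions of a model.
metrics_by_version = {
    "v1": {"accuracy": 0.81, "loss": 0.52},
    "v2": {"accuracy": 0.86, "loss": 0.41},
    "v3": {"accuracy": 0.84, "loss": 0.38},
}


def best_version(metrics, name, higher_is_better=True):
    """Return the version whose metric `name` is best under the given direction."""
    pick = max if higher_is_better else min
    return pick(metrics, key=lambda version: metrics[version][name])


best_by_accuracy = best_version(metrics_by_version, "accuracy")
best_by_loss = best_version(metrics_by_version, "loss", higher_is_better=False)
```

Here the two criteria select different versions, which is why having the metrics for every version side by side matters more than any single score.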

Just as DVC is an extension of Git, CML is an extension of CI/CD.

While some data engineering tools are “more focused on reliability and distributed data processing, our scenario is way more lightweight. … This is this scenario and navigation around models, when you build like 20 versions of your models, how you can find the best one? What does it mean to have a best model? Sometimes it’s not failure, sometimes it doesn’t mean like the best score or something, you need to have kind of a picture of what’s going on and how to find the best model in your repository. So this is another functionality that we need on top of Git,” Petrov said.

CML is a library for automating machine learning workflows, including model training and evaluation. With CML, you can generate reports that compare the current model with the production model, spot differences against the model on the master branch or at any stage of your project history, and monitor changing datasets. It auto-generates reports with metrics and plots in each Git pull request.
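The kind of comparison report described here could look roughly like the following Python sketch, which renders a Markdown table that a CI job might post to a pull request. It does not use CML's actual interface; `markdown_report` and the metric values are assumptions for illustration only.

```python
def markdown_report(current, production):
    """Render a Markdown table comparing metrics shared by two models."""
    lines = [
        "| metric | production | current | change |",
        "|---|---|---|---|",
    ]
    for name in sorted(current.keys() & production.keys()):
        delta = current[name] - production[name]
        lines.append(
            f"| {name} | {production[name]:.3f} | {current[name]:.3f} | {delta:+.3f} |"
        )
    return "\n".join(lines)


# Hypothetical metrics for the model in a pull request and the one in production.
current_metrics = {"accuracy": 0.86, "loss": 0.41}
production_metrics = {"accuracy": 0.81, "loss": 0.52}
report = markdown_report(current_metrics, production_metrics)
```

A CI step would then attach the resulting text to the pull request, so reviewers see the metric changes next to the code changes.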

In addition to its open source projects, the company has built enterprise features, such as enhanced security, and will be unveiling a SaaS product combining collaboration and visualization on top of DVC and CML in the next month or so, Petrov said.

Feature image by Gerd Altmann from Pixabay.
