
Why We Built an Open Source ML Model Registry with git

4 Aug 2022 10:00am, by Dmitry Petrov

In speaking with many machine learning teams, we’ve found that implementing a model registry has become a priority for AI-first organizations looking to address visibility and governance concerns. A model registry is a centralized model store for collaboratively managing the full lifecycle of ML models. This includes model lineage and versioning, moving models between stages from development to staging to production, and model annotation and discovery (e.g., timestamps, descriptions, labels). ML teams implement a model registry to get centralized visibility into, and management of, their models.

Model building and app development should be connected

But there are challenges to adopting a model registry that make it hard to build one that stays up to date and contains everything an organization needs. As we built our own model registry for machine learning teams, we took the challenges below into consideration. We found that addressing them required a model registry with a GitOps-based approach.

Building a Model Registry and Its Challenges

Dmitry Petrov
Dmitry Petrov is a former data scientist at Microsoft with a Ph.D. in computer science and an active open source contributor. He wrote and open sourced the first version of DVC.org, a machine learning workflow management tool. He also implemented the wavelet-based image hashing algorithm wHash in ImageHash, an open source Python library. Dmitry works on tools for machine learning and ML workflow management as co-founder and CEO of Iterative in San Francisco.

Working across Iterative customers, I noticed the problems around building a model registry arise from connections. There are three major disconnects that I see when an ML team sets up a model registry:

  1. Disconnect between model and code lineage. If a DevOps engineer updates an app’s code and forgets about the model, or a data scientist runs experiments and updates a model, manual updates are required in the model registry or in the DevOps tools the organization uses, respectively. If someone forgets to update, the link between the lineage of models and code is immediately broken. This has implications for compliance, auditing, and general operations when trying to find and manage the right models associated with specific apps.
  2. Disconnect between application deployment and model deployment. These usually end up being two different processes between MLOps and DevOps teams. The complexity, manual work, and resources involved end up impacting both teams — in fact, 87% of data science projects don’t even make it into production. With siloed deployment processes, teams must manage separate scripts and separate people with the correct expertise to successfully and effectively deploy models and apps into production. Getting to market is just that much harder!
  3. Disconnect between model registry tools and DevOps tools. Model registry products are usually a separate service, so teams need to set up a separate database, hardware, access provisioning, and so on for users. Even with SaaS, there’s management overhead and a need for specific expertise to set up and maintain the product. And without that connection, automating deployment and training workflows is limited.

Solution: Git as the Single Source of Truth and Using Infrastructure as Code

We found that the solution to the above challenges when building a model registry is to use git as the base, because all information around code is already in it, and adding ML models and data consolidates everything together. With that, organizations can take the infrastructure as code (IaC) framework and apply it as model registry as code (MRaC). Just like the rest of our tools, we built our recently released Studio Model Registry solution with this in mind.

To give a taste of what a GitOps approach looks like, I’ve detailed some of the building blocks for the model registry below.

The first is git tags for human-readable formats. As a collaboration tool, a model registry needs to contain information that everyone on the team can easily understand. That’s why git tags help: checksums on various models and data turn into actual version numbers (e.g., v1.0.0), and stage definitions become concrete (e.g., staging, production).
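
To make this concrete, here is a minimal sketch of how git tags can encode a model’s versions and stages. The model name (`churn-model`) and the `name@version` / `name#stage#N` tag patterns are illustrative assumptions; tools like our open source GTO define their own conventions.

```shell
set -e
# Set up a throwaway repo for the demo (paths and identities are placeholders)
rm -rf /tmp/model-registry-demo
mkdir /tmp/model-registry-demo
cd /tmp/model-registry-demo
git init -q
git config user.email "mlops@example.com"
git config user.name "MLOps Bot"
git commit -q --allow-empty -m "train churn model"

# Register version v1.0.0 of a model named "churn-model" at this commit
git tag -a "churn-model@v1.0.0" -m "register churn-model v1.0.0"

# Promote the same version to the production stage
git tag -a "churn-model#production#1" -m "promote churn-model to production"

# The registry is now readable by anyone with access to the repo
git tag -l "churn-model*"
```

Because the tags live in the repo itself, the model’s history travels with the code history: no separate database to provision or keep in sync.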

The second building block is CI/CD for production integration. Using tools like GitHub Actions or GitLab CI, ML teams can easily push models into production through CI/CD within their model registry. They can also move them between stages all in a single place, through the entire ML model lifecycle. Integration with these tools is critical to automating workflows and making them easier for both ML engineers and DevOps teams.
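
As a sketch of what that integration can look like, a GitHub Actions workflow could watch for promotion tags and deploy automatically. The `*#production#*` tag pattern and the `scripts/deploy_model.sh` path below are assumptions for illustration, not a prescribed convention.

```yaml
# Hypothetical workflow: redeploy whenever a git tag marks a model as
# promoted to production.
name: deploy-model
on:
  push:
    tags:
      - "*#production#*"   # fires on promotion tags like churn-model#production#1
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy the promoted model
        run: ./scripts/deploy_model.sh "$GITHUB_REF_NAME"
```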

The final building block is artifacts.yaml meta-files for large or multiple models, mono-repos, and annotations. To fully connect the machine learning and modeling side with the application and software development side, meta-files in YAML format can help: large ML models can be referenced by URL, multiple models and mono-repos can be organized into sections, and labels can be used for annotations and notes around models.
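
For instance, a hypothetical artifacts.yaml might combine all three uses. The field names and model names here are examples, not a fixed schema.

```yaml
# Illustrative artifacts.yaml; fields and names are assumptions for this sketch.
churn-model:
  type: model
  path: https://models.example.com/churn/model.pkl  # large model referenced by URL
  labels:                                           # annotations for discovery
    - classification
    - production-candidate
  description: Gradient-boosted churn classifier
fraud-model:           # a second section, as in a mono-repo with multiple models
  type: model
  path: models/fraud/model.pkl
  labels:
    - experimental
```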

Model Registry Is a Collaborative Tool

Using git as a single source of truth bridges the two worlds of software development and machine learning. Too often, tools separate teams into silos with no central visibility or management. At Iterative, we build tools on the software development stack so ML engineers live in the same world as software engineers. We find this approach helps organizations of all sizes build models faster and more reliably.

Feature image via Pixabay.