Delivering Production-Grade Machine Learning Outcomes with MLOps
With machine learning (ML) models taking about ninety days on average to deploy, enterprises are exploring MLOps to improve deployment speed, reliability, and success rates.
What Is MLOps?
Just as DevOps linked software development with operations, MLOps links two disparate areas of machine learning: developing the model and operating it in production. Linking these areas requires the right team and automated processes for continuous integration and deployment, both borrowed from DevOps. MLOps also requires broad collaboration among data scientists, infrastructure experts, and data engineers.
The Last-Mile Problems of Machine Learning
Enterprises building out machine learning capabilities have focused on two things:
- Recruiting and training data scientists.
- Identifying suitable business case candidates for machine learning.
While important, these are only first-mile problems for any enterprise. As long as data scientists have enough relevant data for training and testing algorithms, building a predictive model is a straightforward, iterative process. The model and its supporting code are only a small component of the overall ensemble needed to deploy and operate machine learning at scale.
Several last-mile problems must be overcome. Data collection and verification, deployment infrastructure, model performance monitoring, model analysis, and debugging, among other components, are essential to successfully deploying ML in the enterprise.
Today, these tasks are left to data scientists, and the costs are substantial. Getting ML models into production takes about ninety days on average, with only 11% of companies consistently deploying in fewer than seven days (source: Algorithmia’s 2021 Enterprise Trends in Machine Learning). According to the same report, most data scientists say that deployment takes up 25% of their time.
Why does it take so long? Putting the various pieces together goes well beyond the typical skill set of data scientists. They are not software engineers or infrastructure and operations specialists, yet deploying production-quality services requires the expertise those roles provide.
Compounding the production problem is model decay. In traditional software applications, the application’s code determines its behavior and output, and that behavior is validated with tests. In a machine learning system, data determines the behavior of the model. In the real world, the data your model consumes will drift away from the data it was trained on, and its results become unreliable. As production data diverges from training data, the model’s performance degrades over time, often with little explanation. Models must therefore be observed constantly and retrained as the data drifts from expectations. The need for constant redeployment exacerbates the earlier challenges around time to deploy.
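One way to make drift measurable is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI), a common drift statistic that this article does not itself prescribe. The function name, the bin count, and the 0.2 alert threshold are illustrative assumptions, not fixed standards.

```python
import math
from typing import List, Sequence

# Illustrative alert level (an assumption); tune it for your own data.
DRIFT_THRESHOLD = 0.2


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training sample and a production sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Production values outside the training range go into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion so that empty bins do not cause log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def has_drifted(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """True when the PSI exceeds the illustrative threshold."""
    return population_stability_index(expected, actual) > DRIFT_THRESHOLD
```

A scheduled job could run a check like this per feature and trigger retraining, or at least an alert, when `has_drifted` returns true.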
Observability and MLOps
Just as conventional applications are evolving beyond monitoring toward observability, ML systems must adopt observability capabilities. Collecting and storing the full range of logs and metrics emitted by an ML system allows data scientists and infrastructure and operations (I&O) engineers to quickly determine the cause of performance issues by actively interrogating the system’s behavior. Traditional monitoring solutions fall short here because they support only predefined dashboards over limited data volumes.
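As a rough illustration of what "emitting the full range of logs" can look like, the sketch below writes one structured JSON event per prediction using only the Python standard library. The logger name, the field names, and the `log_prediction` helper are hypothetical choices for this example. A real system would attach a handler that ships these events to whatever log store the team uses.

```python
import json
import logging
import time
import uuid
from typing import Any, Dict

# Hypothetical logger name; handlers and log levels are left to the application.
logger = logging.getLogger("ml.predictions")


def build_prediction_event(
    model_version: str,
    features: Dict[str, Any],
    prediction: Any,
    latency_ms: float,
) -> Dict[str, Any]:
    """Collect everything needed to ask questions about this prediction later."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }


def log_prediction(
    model_version: str,
    features: Dict[str, Any],
    prediction: Any,
    latency_ms: float,
) -> Dict[str, Any]:
    """Emit the event as a single JSON line and return it to the caller."""
    event = build_prediction_event(model_version, features, prediction, latency_ms)
    logger.info(json.dumps(event, sort_keys=True))
    return event
```

Because each event records the inputs, the output, and the model version together, questions that no predefined dashboard anticipated, such as "which model version produced this score, and from what inputs?", can still be answered after the fact.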
Another factor driving observability in ML systems is the need to understand why a given outcome occurred, even when the data is correct. Bias in ML systems has a massive societal impact, from criminal sentencing to résumé filtering. Businesses making decisions based on automated algorithms must be able to explain how and why a given outcome was reached, and show that it was not the result of bias. Being able to ask questions about the behavior of an ML system is essential, and that’s where observable ML systems come into play.
Structuring the MLOps Team
An effective MLOps team is cross-functional. It comprises skills and capabilities from five different roles:
Data Scientist/Machine Learning Researcher
The data scientist or ML researcher is responsible for the discovery and creation of the model using a combination of algorithms and data.
Data Engineer
Data engineers configure and maintain the data infrastructure and pipelines that support applications and information systems. This role has evolved significantly over the last three years, expanding into a standalone function in many enterprises.
Infrastructure and Operations Engineer
I&O engineers, also called DevOps engineers and sometimes site reliability engineers (SREs), are responsible for the reliability and resiliency of infrastructure and data. These are often the roles responsible for the deployment and monitoring of machine learning models, and the data they consume, in production environments.
Product Manager
The product manager identifies customer needs and the business case for machine learning, and manages the delivery of the product. The involvement of this role is often the determining factor in creating a successful and well-integrated machine learning application. Without product management, many ML projects remain irrelevant lab projects.
Stakeholder
Put simply, the stakeholder is the person (or group of people) with an interest in the outcome of the ML deployment. The budget for the project commonly comes from this person or group.
The Machine Learning Pipeline
With a cross-functional team in place to fund, develop and deploy models, the focus turns to the ML pipeline. The idea of an ML pipeline comes from the data engineering concept of a data pipeline. A data pipeline connects a data source and a destination and defines the data transformations in a graph of dependencies. In many cases, this is an evolution of traditional extract, transform and load (ETL), since it can go beyond batch processing to streaming and event-based consumption patterns. Data pipelines are also automated, freeing data engineers from repetitive, redundant work.
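The "graph of dependencies" idea can be sketched with Python's standard-library `graphlib` module (Python 3.9+). The step names and the toy transformations below are hypothetical. The point is only that each step declares what it depends on and the runner derives a valid execution order.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Set, Tuple

Context = Dict[str, Any]


def extract(ctx: Context) -> None:
    # Stand-in for reading from a real data source.
    ctx["raw"] = [" 3", "4 ", "x", "5"]


def clean(ctx: Context) -> None:
    # Keep only rows that parse as integers.
    ctx["rows"] = [int(r) for r in ctx["raw"] if r.strip().isdigit()]


def summarize(ctx: Context) -> None:
    ctx["total"] = sum(ctx["rows"])


# Hypothetical pipeline definition: step name -> function, and step name -> prerequisites.
STEPS: Dict[str, Callable[[Context], None]] = {
    "extract": extract,
    "clean": clean,
    "summarize": summarize,
}
DEPENDS_ON: Dict[str, Set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "summarize": {"clean"},
}


def run_pipeline(
    steps: Dict[str, Callable[[Context], None]],
    depends_on: Dict[str, Set[str]],
) -> Tuple[List[str], Context]:
    """Run every step after all of its prerequisites have run."""
    ctx: Context = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        steps[name](ctx)
    return order, ctx
```

Calling `run_pipeline(STEPS, DEPENDS_ON)` runs the three steps in dependency order. `TopologicalSorter` raises `CycleError` if the declared dependencies form a cycle, which catches a whole class of pipeline definition mistakes before any data is touched.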
Most ML environments, by contrast, still rely on manual processes, and for good reason. Today’s machine learning workflows consist of dozens of tools, each with its own language, user experience, performance characteristics, and skill-set assumptions. The data used by these tools is equally diverse, residing in data warehouses, object stores, and feature stores.
This is where the DevOps influence brings value to ML deployments. Given the number of tools used in machine learning pipelines, continuous integration tests must be automated across all of them, ensuring that every step validates its input and output and that the next step can consume the previous step’s output. This process must be reproducible, which is where version tracking of models and integration code comes in. Ideally, there would be solutions that allowed us to version control data as well, but the volumes used for training often make this impractical. There is no consensus yet on how to achieve data versioning for machine learning.
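A minimal sketch of both ideas, assuming rows are passed between steps as plain dictionaries: a schema check that a continuous integration test can run on each step's input and output, and a content hash that gives a model artifact and its integration code a reproducible version identifier. The schemas, the `build_features` step, and the helper names are all hypothetical.

```python
import hashlib
from typing import Any, Dict, List

Schema = Dict[str, type]
Row = Dict[str, Any]

# Hypothetical contracts for one pipeline step.
RAW_SCHEMA: Schema = {"user_id": int, "spend": float, "visits": int}
FEATURE_SCHEMA: Schema = {"user_id": int, "spend_per_visit": float}


def schema_errors(rows: List[Row], schema: Schema) -> List[str]:
    """Return a description of every missing column or type mismatch."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                found = type(row[column]).__name__
                errors.append(
                    f"row {i}: '{column}' is {found}, expected {expected_type.__name__}"
                )
    return errors


def build_features(rows: List[Row]) -> List[Row]:
    """Example step whose output the next step must be able to consume."""
    return [
        {"user_id": r["user_id"], "spend_per_visit": r["spend"] / r["visits"]}
        for r in rows
    ]


def artifact_version(*parts: bytes) -> str:
    """Short content hash over model bytes and code bytes, in the order given."""
    digest = hashlib.sha256()
    for part in parts:
        # A length prefix keeps the boundaries between parts unambiguous.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()[:12]
```

In a CI job, the test for each step would assert that `schema_errors` returns an empty list for both the step's input and its output. Recording `artifact_version(model_bytes, code_bytes)` alongside each deployment makes it possible to tell exactly which model and code combination produced a given result.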
Like other *Ops trends getting attention from vendors and end users, MLOps requires more than just technology investment. It requires collaboration across disparate parts of the organization to ensure the right problems are being solved with machine learning. Technology, working alongside a strong team, provides the environment to manage and observe deployed machine learning projects.