Machine Learning (ML) is becoming an increasingly important part of any tech-centered company. Software made a lot of new things possible, and machine learning is bringing us to new frontiers. According to a recent survey by Algorithmia, 76% of enterprises will prioritize machine learning in their 2021 IT budgets.
The interesting part, training ML models, is often discussed, but little is said about what happens before and after. A partial list of steps includes: collecting, storing, cleaning, curating, and labeling the data; using this data to train, evaluate, and validate the model; and finally, serving and monitoring the model.
Those steps are handled by different teams and professionals with diverse skills and backgrounds, using various tools. Miscommunication is likely, and maintaining quality and replicability, while essential, can be tricky. While the growth of the ML field is relatively recent, the challenges it faces are not.
What’s the Solution?
Software development has a similar set of issues, and for the last ten years, IT organizations have been using the DevOps methodology as a solution. The term comes from Development and Operations, the two teams that historically make up a software department and often struggle to collaborate. DevOps defines a set of processes that empower these teams to efficiently write, deploy, and run quality software as quickly as possible.
Taking inspiration from this practice, MLOps (Machine Learning Operations) is gradually becoming an industry standard. MLOps is a set of practices, processes, and tools designed to improve collaboration between the teams who manage the ML lifecycle. Think of a car assembly line: the steps to build the car are well defined, and multiple teams, each with their own tools and skills, work together toward the same goal: to make the best quality car in the least amount of time.
The purpose of an MLOps pipeline, a "machine learning assembly line," is to provide a reliable and reproducible way to create, manage, and serve machine learning models at scale.
How Does It Compare to DevOps?
MLOps and DevOps share many similarities across two main components: the processes and the professionals.
As with DevOps, MLOps will leverage Continuous Integration (CI), the process of making sure that the code still works every time changes are pushed to the codebase, and Continuous Deployment (CD), the process that ensures this code can be deployed and run in production. ML systems require a third concept, CT (Continuous Testing), the process of ensuring that ML models are behaving as expected.
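In practice, the CT step can start as simply as an automated check that fails the pipeline when a candidate model underperforms on held-out data. A minimal sketch in plain Python (the stub model, validation set, and threshold are all hypothetical placeholders):

```python
# Continuous Testing sketch: fail the pipeline if the model underperforms
# on a held-out validation set. The "model" here is a hypothetical stub.

def model_predict(x):
    # Placeholder model: predict positive when the feature sum is positive.
    return 1 if sum(x) > 0 else 0

def accuracy(samples):
    """Fraction of (features, label) pairs the model gets right."""
    correct = sum(1 for features, label in samples
                  if model_predict(features) == label)
    return correct / len(samples)

# Tiny held-out validation set: (features, expected label).
validation_set = [
    ([1.0, 2.0], 1),
    ([-1.0, -2.0], 0),
    ([0.5, 0.5], 1),
    ([-3.0, 1.0], 0),
]

ACCURACY_THRESHOLD = 0.9  # quality gate, chosen here for illustration

score = accuracy(validation_set)
assert score >= ACCURACY_THRESHOLD, f"model regressed: accuracy={score:.2f}"
print(f"CT gate passed: accuracy={score:.2f}")
```

In a real pipeline, a check like this would run automatically on every commit, alongside the usual CI tests for the surrounding code.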
Versioning, a key element in a software pipeline, is extended to more elements for ML processes: data, metadata, hyperparameters, code, and outputs all need to be versioned. We will cover that more extensively later on.
Following similar reasoning, an MLOps culture will include developers, the ones who write the code, and system administrators, the ones who ensure the code runs properly in production. ML systems bring a third type: machine learning engineers and data scientists. In some ways, they are consumers of the work done by the developers and system administrators.
An MLOps Shipping Pipeline
When it comes to using an MLOps shipping pipeline, there are three main ways to go about it. The easiest is to use an end-to-end service such as the ones offered by Amazon (SageMaker), Azure (Machine Learning Studio), Watson Studio (IBM), or Google (Cloud AI).
The in-between solution is to build a pipeline by assembling several existing tools; the other extreme is building everything in-house. In the assembled case, a task orchestration and workflow tool will be the central component; Airflow, Luigi, and Kubeflow are among the most reliable choices.
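At its core, what an orchestrator such as Airflow or Luigi provides is running dependent tasks in the right order (plus scheduling, retries, and observability on top). A toy sketch of that core idea using only the Python standard library, with illustrative task names:

```python
# Minimal illustration of task orchestration: declare a dependency graph
# of pipeline steps, then execute them in topological order. Task names
# are hypothetical examples of MLOps pipeline stages.
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set.
pipeline = {
    "extract_data": set(),
    "prepare_data": {"extract_data"},
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}

def run(task):
    # A real orchestrator would dispatch to workers, retry failures, etc.
    print(f"running {task}")

order = list(TopologicalSorter(pipeline).static_order())
for task in order:
    run(task)
```

Real orchestrators add far more (scheduling, parallelism, failure handling), but the dependency-graph model is the same.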
No matter which route is picked, an MLOps shipping pipeline should be able to manage three main parts of an ML process lifecycle:
- Data extraction, preparation, and storing.
- Model training and evaluation.
- Model deployment and monitoring.
Data Extraction, Preparation, and Storing
Data is the new oil, and it is undoubtedly a crucial part of any successful ML system. Acquiring data can be achieved in multiple ways: buy or leverage existing open-source datasets, scrape data (Scrapy, Scrapingbee, Octoparse), generate synthetic data (Scikit-learn, SymPy, OneView), or augment existing data (Labelbox, OpenCV, NLPAug).
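As a small illustration of the synthetic route, here is a hand-rolled generator for a toy binary classification dataset (in a real project, a library such as Scikit-learn provides this kind of generation out of the box; the feature and noise settings below are made up):

```python
import random

# Hand-rolled synthetic data generation sketch: two Gaussian features,
# a simple linear ground truth, and a little label noise so the
# resulting problem is not trivially separable.
def make_synthetic_dataset(n_samples, seed=42):
    """Generate (features, label) pairs for a toy binary problem."""
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for _ in range(n_samples):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        label = 1 if x1 + x2 > 0 else 0   # linear ground truth
        if rng.random() < 0.05:           # 5% label noise
            label = 1 - label
        samples.append(((x1, x2), label))
    return samples

data = make_synthetic_dataset(1000)
print(len(data), "samples;", sum(label for _, label in data), "positives")
```

Seeding the generator matters: the same seed must always produce the same dataset, otherwise the data version recorded by the pipeline is meaningless.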
Before a model can consume data, the data needs to be prepared. Raw data can contain errors, be corrupted, or be duplicated; fixes include correcting, deleting, or adjusting it. The cleaning process is also an opportunity to optimize the data for faster processing, through techniques such as discretization, one-hot encoding, or power transformations. Most of the time, the raw data cannot be exploited as-is and needs to be properly augmented with metadata.
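Two of the preparation steps mentioned above, deduplication and one-hot encoding, can be sketched in plain Python (in practice, libraries like pandas or Scikit-learn handle this at scale; the records below are made up):

```python
# Preparation sketch: remove duplicate records, then one-hot encode a
# categorical column so models can consume it as numbers.

raw_records = [
    {"city": "paris", "clicks": 3},
    {"city": "tokyo", "clicks": 5},
    {"city": "paris", "clicks": 3},   # duplicate to be removed
    {"city": "lima",  "clicks": 1},
]

# Deduplicate while preserving order.
seen, records = set(), []
for rec in raw_records:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        records.append(rec)

# One-hot encode the categorical "city" column: one binary column per
# observed category, alongside the untouched numeric column.
categories = sorted({rec["city"] for rec in records})
encoded = [
    {**{f"city_{c}": int(rec["city"] == c) for c in categories},
     "clicks": rec["clicks"]}
    for rec in records
]
print(encoded[0])
```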
“For supervised learning models, the training data needs to be annotated,” said Manu Sharma, CEO and co-founder of Labelbox, a Silicon Valley-based startup that provides a Training Data Platform. “Labeled data is your most important IP, so we believe you should control its creation.”
Finally, machine learning often requires a lot of data; therefore, the storage strategy needs to be well thought out. Some projects have to handle multi-petabyte datasets that are continuously updated and versioned. The choice of tool will come down to scalability, accessibility, latency, throughput, and, of course, cost. Data can flow between separate storage systems depending on the need: for example, the tool used for the data lake, the one used to feed the ML models, and the one used to monitor models will probably differ.
Model Training and Evaluation
All this data can now be used to feed and train models, which come in many shapes and sizes: linear and non-linear models, logistic models, unsupervised models, and time series models, among others. The goal here is to tune the parameters involved in training to come up with the model that best solves the problem.
The evaluation part is when the model is tested against a set of data that was not used for training. Getting the right model trained for a specific problem can take many iterations; training hundreds or thousands of models is not uncommon, with a feedback loop running between the training and evaluation parts.
Here is where versioning becomes extremely valuable. For every training run, every moving element must be documented: which data source was used and which version of that data, what hardware processed the job, and what hyperparameters were used. The conditions in which models perform best will be used to train the next generation of models, and so on, until a good enough model is trained.
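A minimal sketch of this kind of experiment tracking, assuming a content hash serves as the data version (the field names, hyperparameters, and metric values are illustrative; dedicated experiment-tracking tools do this in practice):

```python
import hashlib
import json
import time

# Experiment-tracking sketch: every training run records the data
# version, the hyperparameters, and the resulting metric, so the best
# conditions can be reproduced later.

def data_fingerprint(rows):
    """Content hash acting as a lightweight data version identifier."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def log_run(registry, data_rows, hyperparams, metric):
    registry.append({
        "data_version": data_fingerprint(data_rows),
        "hyperparams": hyperparams,
        "metric": metric,
        "timestamp": time.time(),
    })

runs = []
training_rows = [[0.1, 0.2, 1], [0.4, 0.3, 0]]  # toy dataset
log_run(runs, training_rows, {"lr": 0.1, "depth": 3}, metric=0.81)
log_run(runs, training_rows, {"lr": 0.01, "depth": 5}, metric=0.87)

# Pick the best run to seed the next generation of models.
best = max(runs, key=lambda r: r["metric"])
print("best run:", best["hyperparams"], best["metric"])
```

Because the data version is derived from the data's content rather than a manually bumped label, two runs trained on identical data are guaranteed to record the same version.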
“ML model training can be a tedious job as hundreds to thousands of attempts are often required to get the right one,” says Gabriela de Queiroz, program director of open source, data and AI technologies at IBM. “A solid MLOps pipeline will allow teams to achieve this scale of iterations gracefully.”
Model Deployment and Monitoring
Once the best model is picked, it needs to be deployed for its intended use. There are countless scenarios here, including deploying the model to process data as a backend job, baking it into the firmware of a self-driving car, or pushing it to a face filter app. In every case, a process will need to adapt to the specific tools, language, and shipping requirements.
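Whatever the target, the deployable unit is usually a serialized model artifact plus a predict entry point that the serving layer (a REST endpoint, a batch job, an embedded runtime) calls. A minimal sketch using Python's pickle, with a hypothetical threshold model standing in for a real one:

```python
import pickle

# Deployment sketch: serialize a trained model into a shippable
# artifact, then load it on the serving side and answer requests.
# The ThresholdModel is a hypothetical stand-in for a real model.

class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return 1 if sum(features) > self.threshold else 0

# "Training" produced this model; ship it as a binary artifact.
artifact = pickle.dumps(ThresholdModel(threshold=0.5))

# On the serving side, load the artifact once, then serve predictions.
model = pickle.loads(artifact)
print(model.predict([0.3, 0.4]))
```

In production, formats such as ONNX or a framework's native export are common instead of pickle, precisely because the artifact often has to cross language and hardware boundaries.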
One of the last parts of an MLOps lifecycle, but not the least, is making sure that the model keeps performing well. Unfortunately, a model that worked well one day may not work the day after. For example, a model predicting the stock market may become obsolete as other actors discover the same strategy.
For a self-driving car, road conditions may have changed: snow might make the scenery look completely different, and the model needs to adapt. That is why it is essential to monitor the model's activity in production and, once more, feed this information back into the training part of the lifecycle to ensure the models keep behaving properly.
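A monitoring check can start as simply as comparing a feature's distribution in live traffic against its training-time distribution and flagging large shifts. A toy sketch (the z-score rule, threshold, and numbers are illustrative; production systems use richer drift statistics):

```python
import statistics

# Drift-monitoring sketch: alert when the mean of a feature in live
# traffic moves too far from its training-time mean, measured in
# training-time standard deviations.

def drift_alert(training_values, live_values, max_z=3.0):
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    live_mu = statistics.mean(live_values)
    z = abs(live_mu - mu) / sigma
    return z > max_z

training = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
stable_traffic = [10.1, 9.9, 10.4]
shifted_traffic = [25.0, 27.0, 26.5]   # e.g. the scenery changed

print(drift_alert(training, stable_traffic))   # no alert
print(drift_alert(training, shifted_traffic))  # alert: time to retrain
```

When the alert fires, the feedback loop kicks in: the flagged production data is collected, labeled, and fed back into the training stage.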
Every company, product, and use case will require a different MLOps lifecycle depending on its needs. Automating a machine learning process will not only build trust, reliability, and velocity of iteration, but also allow organizations to focus on solving more complex problems.
Companies implementing MLOps pipelines will be able to produce consistent results. The ones that don't will waste time and resources on lengthy and complicated data work, considering that companies already fail nearly a third of their ML projects. In his Harvard Business Review article, Labelbox’s Manu Sharma notes that “teams spend less than a quarter of their time training and iterating machine-learning models,” but rather “spend more time building and maintaining the tools for AI systems.” An MLOps pipeline can avoid this type of scenario, and the best part is that these pipelines are like LEGO: they can always be improved, and more bricks can be added; the sky is the limit!
Amazon Web Services is a sponsor of The New Stack.