Tutorial: Manage Machine Learning Lifecycle with Databricks MLflow

In one of the past tutorials, I introduced MLflow, an open-source project from Databricks to manage, track, deploy, and scale machine learning models.
In this tutorial, I will show you how to integrate MLflow into your machine learning and deep learning projects. The environment setup is based on macOS 10.14 but can be easily extended to Microsoft Windows and Ubuntu.
Configure the Python Environment
Let’s start by upgrading the default Python environment to 3.6 and installing pip and virtualenv on our machine.
Start with the installation of Homebrew.
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install Python3 through Homebrew.
brew install python
Install pip, the Python package installer, followed by virtualenv.
sudo easy_install pip
sudo pip install virtualenv
Finally, install Miniconda for Python 3.x by downloading the PKG file.
Now we have Python 3 with pip, virtualenv, and Miniconda on our machine. It’s time to install the MLflow libraries and modules.
Install MLflow
MLflow is available as a PIP package. Run the below command to install it along with other dependencies.
pip install mlflow[extras]
Train the Model
We will train a simple linear regression model based on the Stack Overflow salary dataset. To get started, clone the GitHub repository.
For the background on the dataset and the regression model, refer to this tutorial.
Navigate to the train folder to explore the Scikit-learn-based model. Notice how MLflow is integrated into the standard linear regression training job.
We import the relevant modules of MLflow.
import mlflow
import mlflow.sklearn
Apart from printing the coefficients and metrics to STDOUT, we will also log them to the MLflow Tracking component.
mlflow.log_metric("Intercept", lm.intercept_)
mlflow.log_metric("Slope", lm.coef_[0])
mlflow.log_metric("MAE", mae)
mlflow.log_metric("MSE", mse)
mlflow.log_metric("RMSE", rmse)
Finally, we log the model file serialized as a Python PKL file to MLflow.
mlflow.sklearn.log_model(lm, "model")
From the root directory of the experiment, run the training job by passing the dataset as a parameter.
python train/sal_train.py data/sal_data.csv
The training job has logged all the coefficients and metrics to MLflow along with the final model.
Track the Project
MLflow comes with a powerful dashboard to visualize and track the metrics generated by each run.
Access the dashboard at http://localhost:5000 by launching the MLflow UI from the command line.
mlflow ui &
This dashboard is extremely useful in tracking hyperparameters used in deep learning. You can query, filter, and also download the metrics in a CSV file.
Package the Training Job
Now that we have our training code, we can package it so that other team members can consistently replicate the environment. Once packaged, training jobs can be deployed to cloud platforms and run remotely.
There are two files that carry the packaging information – MLproject and conda.yaml. The first one defines the entry point and parameters of the job while the second file contains the dependencies and modules needed by the training script.
name: salary
conda_env: conda.yaml

entry_points:
  main:
    parameters:
      data_file: path
    command: "python sal_train.py {data_file}"
name: salary
channels:
  - defaults
dependencies:
  - pip
  - python=3.6
  - numpy=1.14.3
  - pandas=0.22.0
  - scikit-learn=0.19.1
  - pip:
    - mlflow
These two files exist in the same directory as the training script.
We can now run the packaged job with the command below; MLflow creates a conda environment from conda.yaml and executes the entry point inside it:
mlflow run train -P data_file=data/sal_data.csv
Notice how the dataset is passed as a parameter via the mlflow run command. Each time mlflow run executes, it ensures that the dependencies defined in conda.yaml are met. If no environment exists from a previous run, a new one is created and the dependencies are installed into it.
Serve the Model
MLflow comes with a built-in model serving mechanism that exposes the trained model through a REST endpoint.
An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools — for example, real-time serving through a REST API or batch inference on Apache Spark.
We will now configure model serving to predict the salary based on the years of experience.
In the training code, after training the linear regression model, the call to mlflow.sklearn.log_model saved the model as an artifact within the run. To view this artifact, we can access the UI again.
Let’s point MLflow model serving tool to the latest model generated from the last run. We will also explicitly mention the port number 5050 for the REST endpoint.
mlflow models serve -m mlruns/0/f7aa700cc2cb4d1f98ea2a2fa6486a4b/artifacts/model -p 5050
Once this service is listening, we can invoke it by sending the datapoint as a JSON payload.
curl -X POST -H "Content-Type:application/json" \
  --data '{"columns":["x"],"data":[[22]]}' \
  http://127.0.0.1:5050/invocations

[143600.7452574526]
The above output shows the predicted salary of a developer with 22 years of experience.
MLflow is a flexible toolkit to manage, track, deploy, and scale machine learning projects, experiments, and models.
Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.