Databricks MLflow Aims to Simplify Management of Machine Learning Pipelines
Machine learning is one of the most popular technologies of this decade. But, along with the growing acceptance and adoption of ML, the complexity involved in managing ML projects is also increasing proportionally.
Unlike traditional software development, ML is all about experimentation. For each stage of the ML pipeline, there is a plethora of tools and open source projects available. Developers and data scientists experiment with multiple tools before settling on the best. The training process, hyperparameter tuning, scoring, and evaluation of a model are often repeated until the results are satisfactory.
The development, training, and inference environments need to be set up consistently. From the Python and R runtimes to the collection of modules, each environment has to run specific versions of the language, runtime, frameworks, and tools. The same environment should be available across development machines, CPU-based VMs in the data center, and GPU-backed VMs in the public cloud. A minor version difference in one of the dependent modules can wreak havoc during training and deployment.
MLflow from Databricks is an open source framework that addresses some of these challenges. The project aims to ease the pain involved in configuring environments, tracking experiments, and deploying trained models for inference.
Databricks recently released MLflow 1.0, which is ready for mainstream usage. A managed version of MLflow is also available on AWS and Azure.
MLflow is available for both Python and R environments. The framework can be installed with a single pip command (pip install mlflow) on Linux, macOS, and Windows. Once installed, the API can be integrated with existing and new ML projects based on popular frameworks including Scikit-learn, TensorFlow, Caffe2, PyTorch, MXNet, CNTK, and ONNX.
MLflow is a collection of three components: Tracking, Projects, and Models. Let’s take a closer look at them.
The MLflow Tracking component is an API and UI for logging training parameters, code versions, metrics, and output files when running machine learning code, and for visualizing the results later. The UI is a minimal dashboard for viewing the logged metrics.
The tracking API can be consumed in Python, R, and Java. For other languages, there is a REST API that can be invoked using standard HTTP libraries.
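As a rough illustration of the REST route, the sketch below builds a request that logs a single metric, without sending it. The server address and run ID are hypothetical placeholders, and the endpoint path and field names are assumptions based on the MLflow REST API documentation, so check them against the version you run.

```python
import json
import time
import urllib.request

# Hypothetical tracking server address; replace with your own.
TRACKING_SERVER = "http://localhost:5000"


def build_log_metric_request(run_id, key, value, step=0):
    """Build (but do not send) a REST request that logs one metric."""
    payload = {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
        "step": step,
    }
    return urllib.request.Request(
        url=f"{TRACKING_SERVER}/api/2.0/mlflow/runs/log-metric",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "hypothetical-run-id" stands in for the ID of an existing run.
request = build_log_metric_request("hypothetical-run-id", "rmse", 0.78)
print(request.get_method(), request.full_url)

# To send it to a running tracking server:
# urllib.request.urlopen(request)
```

Any language with an HTTP client and a JSON encoder could produce the same request.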
Developers can create one or more experiments with individual runs. Each run can record code version, start and end time, source file name, metrics sent as arbitrary key/value pairs, and even artifacts such as datasets, serialized objects and trained models.
The tracking API can log runs to local files, to a SQLAlchemy-compatible database, or remotely to a tracking server. By default, the MLflow Python API logs runs locally to files in an mlruns directory wherever the program is executed. Runs can be sent to a remote server by setting the MLFLOW_TRACKING_URI environment variable. Artifacts can be redirected to remote object storage and file servers such as Amazon S3, Azure Storage, Google Cloud Storage, FTP, NFS, and HDFS.
MLflow Tracking is a valuable tool for teams and individual developers to compare and contrast results from different experiments and runs.
MLflow Projects is a standard declarative format for packaging reusable data science code. A directory or a GitHub repo can contain a YAML file that defines the environment.
The project may contain a file named MLproject, which points to a standard conda.yaml file. MLflow relies on Conda to create consistent and repeatable environments. Besides the dependencies defined in conda.yaml, the project file also declares an entry point, which is typically the training job.
When a project is executed through the CLI, MLflow first configures the Conda virtual environment as defined in conda.yaml. It then activates the environment and executes the job mentioned in the entry point. This process can be repeated in multiple environments such as the local data center and public cloud.
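To make the layout concrete, here is a small sketch that scaffolds a project directory with an MLproject file and a conda.yaml file. The project name, environment name, parameter, and train.py script are hypothetical, and the pinned Python version is only an example.

```python
import pathlib
import tempfile

# Project definition: points at the Conda environment and declares one entry point.
MLPROJECT = """\
name: demo_project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
"""

# Conda environment definition listing the dependencies of the project.
CONDA_YAML = """\
name: demo_env
channels:
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - mlflow
"""

project_dir = pathlib.Path(tempfile.mkdtemp())
(project_dir / "MLproject").write_text(MLPROJECT)
(project_dir / "conda.yaml").write_text(CONDA_YAML)

print(sorted(path.name for path in project_dir.iterdir()))
```

With a real train.py in place, the project could then be launched from the CLI with something like mlflow run followed by the directory path, optionally overriding the parameter with -P alpha=0.4.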
The project can also be built as a Docker container image by including an image definition, which becomes the base image for the containerized environment. Each project file may also contain multiple entry points to create a multi-step workflow. This feature is useful when evaluating multiple algorithms for the same problem or performing hyperparameter tuning.
MLflow Projects deliver consistent, idempotent, and repeatable environments for data science and machine learning projects.
MLflow Models simplifies inference through a consistent model serving mechanism. It is a standard format for packaging machine learning models that can be used in a variety of downstream tools such as Apache Spark.
Each MLflow Model is a directory containing arbitrary files, together with an MLmodel file in the root of the directory that can define multiple flavors that the model can be viewed in. The flavor is associated with a specific framework such as Scikit-learn. It can also target ML platforms in the public cloud such as Amazon SageMaker and Azure ML. All of the flavors that a particular model supports are defined in its MLmodel file in YAML format.
MLflow provides several flavors for serving models generated from mainstream frameworks including TensorFlow, Spark MLlib, PyTorch, Keras, and ONNX. A standard Python or R function may be used for performing inference based on NumPy or Pandas data points.
MLflow can deploy models locally as REST API endpoints or use them to score files directly. In addition, MLflow can package a model as a self-contained Docker image that exposes the REST API endpoint. The image can be used to deploy the model to various environments such as Kubernetes and Mesosphere.
The MLflow Models API brings a consistent mechanism to deploy models for inferencing.
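As a sketch of what calling such an endpoint might look like, the snippet below builds a scoring request for a locally served model without sending it. The address, column names, and values are hypothetical. The payload uses the pandas "split" orientation accepted by MLflow 1.x servers. Later releases changed the expected JSON layout, so treat the exact format as an assumption to verify against your version.

```python
import json
import urllib.request

# Hypothetical address of a locally served model.
SCORING_URL = "http://127.0.0.1:5000/invocations"


def build_scoring_request(columns, rows):
    """Build (but do not send) a scoring request in pandas 'split' orientation."""
    payload = {"columns": columns, "data": rows}
    return urllib.request.Request(
        url=SCORING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
        method="POST",
    )


# Two made-up rows with two made-up feature columns.
scoring_request = build_scoring_request(
    ["alcohol", "pH"],
    [[9.4, 3.51], [9.8, 3.2]],
)
print(scoring_request.get_method(), scoring_request.full_url)

# To score against a running model server:
# predictions = json.load(urllib.request.urlopen(scoring_request))
```

The same request shape should work whether the endpoint runs as a local process or inside the Docker image, since both expose the same REST interface.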
In one of the upcoming tutorials, I will walk you through the steps involved in integrating MLflow with an existing machine learning project. Stay tuned.
Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.