
Build Repeatable ML Workflows with Azure Machine Learning Pipelines

8 May 2020 8:27am

Machine learning (ML) involves a complex workflow of data preparation, transformation, training, tuning, evaluation, deployment, and inference. Each step is unique and independent of the others.

For projects that deal with smaller datasets, each of the phases of the ML workflow translates to a Python function. These functions are called in a sequence to execute an end-to-end pipeline. Developer tools such as Visual Studio Code or PyCharm are used for creating the functions and the scripts that execute them.

Apart from the IDEs, Jupyter Notebooks are used by data scientists and ML engineers for all the phases of the workflow.

In production, a single notebook or script running on a workstation is not ideal. It may run out of resources when the workflow deals with larger datasets and parallelized operations.

Imagine a scenario where videos from multiple cameras are collected in a central object storage bucket. First, these videos need to be split into individual frames, a task that requires massive compute power. Next, these images are normalized by applying filters and resizing them to a standard size. Then the images are transformed and passed on to a convolutional neural network (CNN) that trains a model to perform object detection.

Each stage mentioned in the above workflow demands a dedicated fleet of servers to accelerate the processing. For example, during the data preparation phase of a large dataset, it’s common to use an Apache Spark cluster. For training the model and hyperparameter tuning, a fleet of GPU-enabled virtual machines is deployed to speed up the process. For scoring and evaluating the model, a different cluster may be used. Finally, a Kubernetes cluster that runs the containerized model is leveraged for model inference.

Each stage in a pipeline or workflow has its own set of hardware and software requirements. Whether the workflow runs on a developer workstation with a smaller dataset or on a cluster of GPU-enabled VMs, the environment must satisfy these dependencies.

The workflow created during the training and evaluation phases may have to be reused for retraining models. This typically happens when the data source has changed significantly, forcing the training and deployment of a new version of the model.

Developers and data scientists need a loosely coupled, consistent, repeatable workflow for building machine learning models. The pipelines created for the ML workflow should execute across different environments with no changes to the code. Each stage of the pipeline should be updated independently without impacting the other stages of the workflow. Finally, the pipeline should deliver consistent results every time it is run.

Overview of Azure Machine Learning Pipelines

Azure Machine Learning is a robust ML Platform as a Service (PaaS) that has end-to-end capabilities for building, training, and deploying ML models. The platform takes advantage of various Azure building blocks such as object storage (Azure Storage), block devices (Azure Disks), shared file systems (Azure Files), compute (Azure VMs), and containers (Azure Container Registry and Azure Kubernetes Service).

Azure ML pipelines provide an independently executable workflow of a complete machine learning task that makes it easy to utilize the core services of Azure ML PaaS. An Azure ML pipeline is a collection of multiple stages where each stage is responsible for a specific task. Each task is expected to do one thing and only one thing.

A task has a well-defined set of dependencies and hardware requirements that Azure ML attempts to provide. Azure Machine Learning automatically orchestrates all of the dependencies between pipeline steps. This orchestration might include spinning up and down Docker images, attaching and detaching compute resources, and moving data between the steps in a consistent and automatic manner.

Azure ML pipelines are designed to reuse the output. The runtime environment will decide which step will run and which may be reused from the previous run. This capability not only speeds up the execution but also saves the compute resources and thus the overall cost.
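The reuse decision can be pictured as a cache keyed on everything that could change a step's output. The following is only a conceptual sketch in plain Python, not how Azure ML implements reuse; the names `step_fingerprint` and `StepCache` are hypothetical and exist purely for illustration.

```python
import hashlib
import json


def step_fingerprint(script_source, arguments, input_versions):
    """Hash the script text, its arguments, and the versions of its inputs."""
    payload = json.dumps(
        {"script": script_source, "args": arguments, "inputs": input_versions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    """Run a step only when its fingerprint has not been seen before."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0  # counts how many times a step really ran

    def run(self, script_source, arguments, input_versions, execute):
        key = step_fingerprint(script_source, arguments, input_versions)
        if key not in self._outputs:
            # Something changed (or this is the first run): do the work.
            self.executions += 1
            self._outputs[key] = execute()
        # Otherwise the stored output from the earlier run is returned.
        return self._outputs[key]
```

In this sketch, calling `run` twice with the same script, arguments, and input versions executes the step once; changing any of the three triggers a fresh execution.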

Pipelines are an integral part of the Azure ML workspace, which means they have access to the available resources such as experiments, datasets, compute, models, and endpoints. For background on Azure ML architecture and a step-by-step guide, refer to my previous article and tutorial.

Azure ML pipelines can be built either through the Python SDK or the visual designer available in the enterprise edition. The Python SDK provides more control through customizable steps.

A Closer Look at an Azure ML Pipeline

An Azure ML pipeline runs within the context of a workspace. So, the very first step is to attach the pipeline to the workspace.
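A minimal sketch of that first step, assuming the Azure ML Python SDK (the `azureml-core` package) is installed and a `config.json` file has been downloaded from the Azure portal. The helper name `connect_to_workspace` is hypothetical, and the import is deferred so the function can be defined even where the SDK is missing.

```python
def connect_to_workspace(config_path=None):
    """Load the workspace described by a config.json file.

    With config_path=None the SDK searches the current directory
    and its parents for the file.
    """
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.core import Workspace

    return Workspace.from_config(path=config_path)


# Typical use in a notebook cell:
#   ws = connect_to_workspace()
#   print(ws.name, ws.resource_group, ws.location)
```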

The code that connects to the workspace can run from a notebook on a developer workstation.

Once the workspace is configured, the next step is to access the storage shared by all the stages of the pipeline. Each workspace comes with two datastores: one backed by an Azure Blob container and the other by an Azure file share. The default datastore is simply the Azure Storage blob container associated with the workspace.
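As a rough sketch, the datastores can be looked up from a workspace object. It assumes `ws` is an `azureml.core.Workspace` and that the workspace still carries the built-in names `workspaceblobstore` and `workspacefilestore`; the helper name is hypothetical.

```python
def get_workspace_datastores(ws):
    """Return the default, blob, and file datastores of a workspace."""
    # ws.datastores maps datastore names to datastore objects.
    registered = ws.datastores
    return {
        "default": ws.get_default_datastore(),
        "blob": registered["workspaceblobstore"],
        "file": registered["workspacefilestore"],
    }
```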

Once the datastore is available, it can be populated with files that act as data sources. This is done either by downloading from a public URL or uploading from a developer workstation.
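The upload path might look like the sketch below. It assumes `datastore` is the blob datastore of the workspace (SDK v1), and the target folder name `train-dataset` is only a placeholder.

```python
def upload_training_files(datastore, local_paths, target_path="train-dataset"):
    """Copy local files into a folder of the blob datastore."""
    return datastore.upload_files(
        files=list(local_paths),   # paths on the workstation
        target_path=target_path,   # folder inside the datastore
        overwrite=True,            # replace files left by earlier uploads
        show_progress=True,
    )
```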

From the uploaded file, we can now create a dataset that can be referenced in any of the steps of the pipeline.
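A hedged sketch of the dataset definition follows. It assumes a file dataset (rather than a tabular one) so that the data can later be mounted on a compute target; the helper name and the example path in the comment are hypothetical.

```python
def make_file_dataset(datastore, relative_path):
    """Create an unregistered FileDataset that points at a path in a datastore."""
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.core import Dataset

    # Each (datastore, path) tuple names one file or folder in the datastore.
    return Dataset.File.from_files(path=[(datastore, relative_path)])


# Example (path is a placeholder):
#   dataset = make_file_dataset(blob_store, "train-dataset/data.csv")
```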

We will also define an intermediary storage location used by various stages of the pipeline. For example, the data preparation step takes the above-defined dataset as input and writes to the location defined in the below snippet. This connects one stage with the other where the output of one step becomes the input for the other.
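One way to declare such an intermediary location is `PipelineData` from the `azureml-pipeline-core` package. This is a sketch under that assumption; the output name `prepped_data` is a placeholder.

```python
def make_intermediate_output(datastore, name="prepped_data"):
    """Declare a named location that one step writes and a later step reads."""
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.pipeline.core import PipelineData

    return PipelineData(name, datastore=datastore)
```

Passing the same object as an output of one step and an input of another is what lets the pipeline infer the execution order.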

With the dataset defined, we are now ready to define the data preparation stage of the pipeline. Since it runs in the context of a compute environment, we need to first define it.

Azure ML pipelines support a variety of compute targets including Azure ML compute instance, Azure ML compute cluster, an existing Azure data science VM, Azure Databricks, Azure Data Lake Analytics, Azure HDInsight, and Azure Batch.

Any step in the pipeline can either start or reuse a compute target from the above-mentioned environments.

The below code snippet registers an Azure ML compute instance as a target if it is available or it creates one if it doesn’t exist.
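A sketch of the get-or-create pattern, with several assumptions: it provisions an Azure ML compute cluster through `AmlCompute` (one of the targets listed above), and the target name, VM size, and node count are placeholders to adapt.

```python
def get_or_create_compute(ws, name="aml-compute", vm_size="STANDARD_D2_V2", max_nodes=4):
    """Return the named compute target, creating a cluster if it is missing."""
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.core.compute import AmlCompute, ComputeTarget

    existing = ws.compute_targets  # dict of name -> compute target
    if name in existing:
        return existing[name]

    # min_nodes=0 lets the cluster scale down to zero when idle.
    config = AmlCompute.provisioning_configuration(
        vm_size=vm_size, min_nodes=0, max_nodes=max_nodes
    )
    target = ComputeTarget.create(ws, name, config)
    target.wait_for_completion(show_output=True)
    return target
```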

We have configured storage and compute for the pipeline. It’s time to create a few steps before running it.

Assuming you have a script that accepts the dataset, transforms it, and writes the result to a given path, we can use it as the first step: the data prep stage.
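For illustration, such a script could be as small as the sketch below. It is a hypothetical `prep.py`, not the script used in this article: it assumes the input arrives as the path of a single CSV file and the output as a directory path, and its only "transformation" is dropping rows that contain empty cells.

```python
import argparse
import csv
import os


def prepare(input_csv, output_dir):
    """Copy input_csv to output_dir/prepped.csv, skipping rows with empty cells."""
    os.makedirs(output_dir, exist_ok=True)
    output_csv = os.path.join(output_dir, "prepped.csv")
    with open(input_csv, newline="") as src, open(output_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and all(cell.strip() for cell in row):
                writer.writerow(row)
    return output_csv


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical data prep step")
    parser.add_argument("--input", required=True, help="path of the raw CSV file")
    parser.add_argument("--output", required=True, help="directory for the result")
    args = parser.parse_args(argv)
    return prepare(args.input, args.output)


# When saved as prep.py, finish the file with a call to main().
```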

A PythonScriptStep is a basic, built-in step that runs a Python script on a compute target. It takes a script name and, optionally, other parameters such as arguments for the script, the compute target, inputs, and outputs. If no compute target is specified, the default compute target for the workspace is used.
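A sketch of a data prep step built this way, assuming the `azureml-pipeline-steps` package. Here `dataset` is taken to be a file dataset, `output` a `PipelineData` object, and `compute_target` an existing compute target; the step name, input name, and `--input`/`--output` flags are placeholders that must match whatever the script parses.

```python
def build_prep_step(dataset, output, compute_target, source_directory="."):
    """Wrap prep.py in a pipeline step that reads a dataset and writes output."""
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.pipeline.steps import PythonScriptStep

    return PythonScriptStep(
        name="prepare data",
        script_name="prep.py",
        source_directory=source_directory,  # folder that contains prep.py
        arguments=[
            # Mount the dataset and pass the mount path to the script.
            "--input", dataset.as_named_input("raw_data").as_mount(),
            # Pass the intermediary location as the output path.
            "--output", output,
        ],
        outputs=[output],
        compute_target=compute_target,
        allow_reuse=True,  # reuse earlier results when nothing has changed
    )
```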

The above step mounts the dataset and passes its path as the input parameter to the script. The script writes the transformed output to the intermediary output folder, which is also mounted from the datastore.

Note that prep.py is on our development workstation or on the Azure VM that’s running the notebook or Python code. The SDK copies the file to the workspace before proceeding further.

Reusing previous results, enabled by allow_reuse=True, is key when using pipelines in a collaborative environment, since eliminating unnecessary reruns keeps the team agile.

Assuming we defined the steps for preparation, training, and evaluation, we can construct the pipeline with them.
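Constructing the pipeline could look like this sketch, assuming the `azureml-pipeline-core` package; `ws` is the workspace and `steps` is any iterable of step objects defined elsewhere.

```python
def build_pipeline(ws, steps):
    """Assemble pipeline steps into a Pipeline bound to the workspace."""
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.pipeline.core import Pipeline

    # Steps that share data are ordered by those data dependencies,
    # not by their position in this list.
    return Pipeline(workspace=ws, steps=list(steps))
```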

Finally, let’s build, validate, and submit the pipeline.
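A hedged sketch of that last step, assuming `pipeline` is an `azureml.pipeline.core.Pipeline` and `ws` the workspace. The experiment name is a placeholder, and the `force_rerun` flag is a hypothetical convenience that maps onto the SDK's `regenerate_outputs` option.

```python
def validate_and_submit(ws, pipeline, experiment_name="ml-pipeline-demo", force_rerun=False):
    """Validate the pipeline graph, submit it as an experiment run, and wait."""
    # Deferred import: the SDK is only needed when the function is called.
    from azureml.core import Experiment

    # validate() returns a list of problems found in the step graph.
    errors = pipeline.validate()
    if errors:
        raise ValueError("Pipeline validation failed: {}".format(errors))

    experiment = Experiment(workspace=ws, name=experiment_name)
    # regenerate_outputs=True ignores reusable results and reruns every step.
    run = experiment.submit(pipeline, regenerate_outputs=force_rerun)
    print("Track the run at:", run.get_portal_url())
    run.wait_for_completion(show_output=True)
    return run
```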

You can monitor the run in the Azure ML workspace through the logs written to the portal.

We can easily extend this to deploy the model and publish a REST endpoint for inference.

In the upcoming tutorials, I will show you how to build Azure ML pipelines with the Python SDK and the visual designer. Stay tuned.

Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.

