
Tutorial: Build an End-to-End Azure ML Pipeline with the Python SDK

27 May 2020 12:52pm, by

In the third part of the series on Azure ML Pipelines, we will use a Jupyter Notebook and the Azure ML Python SDK to build a pipeline for training and inference. For background on the concepts, refer to the previous article and tutorial (part 1, part 2).

We will use the same Pima Indian Diabetes dataset to train and deploy the model. To demonstrate how to reuse the training-time data transformation at inference time, I will serialize scikit-learn's MinMaxScaler from the data preparation stage and use it for scoring.

Setting up the Environment

Start by creating a new ML workspace in one of the supported Azure regions. Make sure you choose the enterprise edition of the workspace, as the designer is not available in the basic edition.

Configure a virtual environment with the Azure ML SDK. Run the commands below to install the Python SDK and launch a Jupyter Notebook, then start a new Python 3 kernel from Jupyter.
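As a rough sanity check after installation, the first notebook cell can confirm that the kernel actually sees the SDK. This sketch assumes the SDK was installed from the azureml-sdk package on PyPI; the helper name sdk_installed is illustrative and not part of the SDK.

```python
import importlib.util


def sdk_installed(package="azureml.core"):
    """Return True if the named package can be found by this kernel."""
    try:
        return importlib.util.find_spec(package) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (here, azureml) is missing.
        return False


print("azureml.core available:", sdk_installed())
```

If this prints False, the notebook is probably running on a kernel outside the virtual environment.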

Next, on your development workstation, create the below directory structure:

pipeline
– data
– prep
– train
– model

Each of these directories will contain the Python scripts and artifacts used by each stage of the pipeline.
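If you prefer to create the layout from Python rather than by hand, a minimal sketch could look like the following. The function name create_layout is hypothetical; the directory names come from the listing above.

```python
from pathlib import Path

# Sub-directories named in the listing above.
STAGES = ("data", "prep", "train", "model")


def create_layout(root="pipeline"):
    """Create root/<stage> for every stage and return the created paths."""
    root_path = Path(root)
    created = []
    for stage in STAGES:
        stage_dir = root_path / stage
        # exist_ok makes the call safe to repeat on an existing tree.
        stage_dir.mkdir(parents=True, exist_ok=True)
        created.append(stage_dir)
    return created
```

Calling create_layout() from the folder that holds the notebook creates pipeline/data, pipeline/prep, pipeline/train, and pipeline/model.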

Copy the diabetes dataset in CSV format to the data directory. The final directory structure should look like the screenshot below.

Create the below script for preparing the data (prep.py) under the prep directory.

This script is responsible for the below tasks:

  • Receive the CSV file stored in the default workspace storage as an input.
  • Split the CSV file into training (67%) and test (33%) datasets.
  • Save the datasets to the default workspace storage.
  • Apply the MinMaxScaler to the training dataset.
  • Serialize and save the scaler object to the default workspace storage.
  • Log the start and end time of the task to Azure ML workspace.
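The tasks above can be sketched roughly as follows. This is not the original prep.py: the argument names (--input, --train, --test, --scaler), the output file names, the random seed, and the choice to scale the splits before saving them are all assumptions, as is the layout of the CSV (a header row, with the label in the last column). Each output location is treated as a folder.

```python
import argparse
import os
import pickle
from datetime import datetime


def parse_args(argv=None):
    """Parse the input and output locations handed over by the pipeline step."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="path to the source CSV")
    parser.add_argument("--train", required=True, help="folder for the training split")
    parser.add_argument("--test", required=True, help="folder for the test split")
    parser.add_argument("--scaler", required=True, help="folder for the pickled scaler")
    return parser.parse_args(argv)


def prepare(args, test_size=0.33):
    # Third-party imports stay inside the function so that parse_args can be
    # exercised on a machine without the SDK installed.
    import pandas as pd
    from azureml.core import Run
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    run = Run.get_context()
    run.log("prep_start", str(datetime.now()))

    df = pd.read_csv(args.input)
    train, test = train_test_split(df, test_size=test_size, random_state=42)
    train, test = train.copy(), test.copy()

    # Assumption: the label is the last column; all others are features.
    features = list(df.columns[:-1])

    # Fit on the training split only, then reuse the fitted scaler on the test split.
    scaler = MinMaxScaler()
    train[features] = scaler.fit_transform(train[features])
    test[features] = scaler.transform(test[features])

    for folder in (args.train, args.test, args.scaler):
        os.makedirs(folder, exist_ok=True)
    train.to_csv(os.path.join(args.train, "train.csv"), index=False)
    test.to_csv(os.path.join(args.test, "test.csv"), index=False)
    with open(os.path.join(args.scaler, "scaler.pkl"), "wb") as f:
        pickle.dump(scaler, f)

    run.log("prep_end", str(datetime.now()))


# In the real script, the last line would be: prepare(parse_args())
```

Fitting the scaler on the training split alone keeps information from the test rows out of the transformation, and pickling it is what lets the inference code apply exactly the same scaling later.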

Create the below training script (train.py) under the train directory.

This script is responsible for the below tasks:

  • Receive the train and test datasets stored in the default workspace storage as an input.
  • Separate the features and label from the train and test datasets.
  • Train and score the model.
  • Serialize the model file and save it to the default workspace.
  • Log the score, start, and end time of the task to Azure ML workspace.
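A minimal sketch of such a training script is shown below. The original model type is not stated here, so LogisticRegression is a stand-in; the argument names, the file names train.csv, test.csv, and model.pkl, and the label-in-last-column layout are likewise assumptions.

```python
import argparse
import os
import pickle
from datetime import datetime


def parse_args(argv=None):
    """Parse the dataset and model locations handed over by the pipeline step."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", required=True, help="folder holding train.csv")
    parser.add_argument("--test", required=True, help="folder holding test.csv")
    parser.add_argument("--model", required=True, help="folder for the pickled model")
    return parser.parse_args(argv)


def split_features_label(df):
    """Return (features, label), assuming the label is the last column."""
    return df.iloc[:, :-1], df.iloc[:, -1]


def train(args):
    # Imported here so the argument parsing works without the SDK installed.
    import pandas as pd
    from azureml.core import Run
    from sklearn.linear_model import LogisticRegression

    run = Run.get_context()
    run.log("train_start", str(datetime.now()))

    X_train, y_train = split_features_label(pd.read_csv(os.path.join(args.train, "train.csv")))
    X_test, y_test = split_features_label(pd.read_csv(os.path.join(args.test, "test.csv")))

    # Stand-in classifier; swap in whichever estimator you prefer.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # score() returns mean accuracy on the held-out split.
    run.log("accuracy", float(model.score(X_test, y_test)))

    os.makedirs(args.model, exist_ok=True)
    with open(os.path.join(args.model, "model.pkl"), "wb") as f:
        pickle.dump(model, f)

    run.log("train_end", str(datetime.now()))


# In the real script, the last line would be: train(parse_args())
```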

These files will be used to build the two-step pipeline that will be executed by the Azure ML Pipelines environment.

The pipeline we are building will look like the below illustration:

Building the Pipeline

Start by creating a new Jupyter Notebook and follow the below steps. Run each of these blocks in a separate Notebook cell.

Let’s import all the modules needed by the pipeline.

Let’s configure the workspace and the default storage.
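One way to do this, assuming a config.json downloaded from the Azure portal sits next to the notebook, is sketched below. The wrapper function connect_workspace is only there for illustration; in a notebook the two calls would normally sit directly in a cell.

```python
def connect_workspace():
    """Return the workspace and its default datastore."""
    # Imported lazily so the cell can be defined before the SDK is loaded.
    from azureml.core import Workspace

    # from_config() looks for config.json in the current folder or its parents.
    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()
    return ws, datastore
```

Usage in a notebook cell would be ws, def_blob_store = connect_workspace(), where the variable names are arbitrary.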

We are now ready to upload the dataset from the local directory to the workspace default storage.
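A small sketch of the upload follows. It relies on the datastore's upload_files method; the local file name diabetes.csv and the target folder data are assumptions, and upload_dataset is a hypothetical helper.

```python
def upload_dataset(datastore, local_csv="pipeline/data/diabetes.csv", target_path="data"):
    """Upload the local CSV into target_path on the given datastore."""
    datastore.upload_files(
        files=[local_csv],
        target_path=target_path,
        overwrite=True,       # replace a copy left over from an earlier run
        show_progress=True,
    )
    # Path of the file relative to the datastore root, useful for later references.
    return f"{target_path}/{local_csv.split('/')[-1]}"
```

Pass in the datastore returned by the workspace's get_default_datastore() call; the return value is the path the file ends up under on the datastore.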

We are now ready to associate the pipeline with a compute environment. The snippet below will either launch a new compute cluster or attach to an existing one.
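The attach-or-create pattern can be sketched like this. The cluster name cpu-cluster, the VM size, and the node limits are placeholder values, and get_or_create_cluster is a hypothetical helper, so adjust them to your subscription's quota.

```python
def get_or_create_cluster(ws, name="cpu-cluster", vm_size="STANDARD_D2_V2", max_nodes=2):
    """Attach to the named cluster if it exists, otherwise provision it."""
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Succeeds only when a compute target with this name already exists.
        cluster = ComputeTarget(workspace=ws, name=name)
    except ComputeTargetException:
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size,
            min_nodes=0,          # scale down to zero when idle
            max_nodes=max_nodes,
        )
        cluster = ComputeTarget.create(ws, name, config)
        cluster.wait_for_completion(show_output=True)
    return cluster
```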

The next step is to define the software environment and dependencies for the pipeline. For this, we will build a custom Docker image with the appropriate pip and Conda packages. The steps of the pipeline will use this image at runtime.
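A sketch of such a run configuration is given below. The package lists are assumptions based on what the two scripts need, and the docker.enabled flag reflects the SDK as it stood when this tutorial was written; newer SDK releases may flag it as deprecated.

```python
def build_run_config(compute_target):
    """Return a RunConfiguration that builds a Docker image with Conda and pip packages."""
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import RunConfiguration

    run_config = RunConfiguration()
    run_config.target = compute_target

    # Run the steps inside a Docker container built by Azure ML.
    run_config.environment.docker.enabled = True

    # Let Azure ML manage the Python environment from the lists below.
    run_config.environment.python.user_managed_dependencies = False
    run_config.environment.python.conda_dependencies = CondaDependencies.create(
        conda_packages=["pandas", "scikit-learn"],   # assumed script dependencies
        pip_packages=["azureml-defaults"],
    )
    return run_config
```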

We now have the compute infrastructure in place. The next step is to define the datasets that act as input and output to the stages of the pipeline.

Let’s first define the CSV file as the input dataset.

Let’s also define the intermediary datasets and the output from each step.
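One possible way to express both the input and the intermediate data is sketched here, using DataReference for the uploaded CSV and PipelineData for everything the steps produce. The reference names and the path data/diabetes.csv are assumptions, and define_data is a hypothetical helper.

```python
def define_data(datastore):
    """Return the CSV input reference and a dict of intermediate/output data objects."""
    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import PipelineData

    # Points at the CSV already uploaded to the datastore.
    raw_csv = DataReference(
        datastore=datastore,
        data_reference_name="diabetes_csv",
        path_on_datastore="data/diabetes.csv",
    )

    # One PipelineData per artifact passed between or produced by the steps.
    outputs = {
        name: PipelineData(name, datastore=datastore)
        for name in ("train", "test", "scaler", "model")
    }
    return raw_csv, outputs
```

Because the prep step lists train and test as outputs and the training step lists them as inputs, Azure ML can infer the execution order from the data dependencies.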

With everything in place, we are ready to define the data prep and training steps.

Notice how the input and output values are sent to the script.
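As an illustration of how the values reach the scripts, the two steps might be declared as in the sketch below. It assumes raw_csv is a DataReference to the CSV, outputs is a dict of PipelineData objects keyed by train, test, scaler, and model, and that the scripts accept the hypothetical --input, --train, --test, --scaler, and --model arguments.

```python
def build_steps(raw_csv, outputs, compute_target, run_config):
    """Return the (prep_step, train_step) pair wired together through outputs."""
    from azureml.pipeline.steps import PythonScriptStep

    prep_step = PythonScriptStep(
        name="prepare data",
        script_name="prep.py",
        source_directory="pipeline/prep",
        # Each data object in arguments is resolved to a path at runtime.
        arguments=["--input", raw_csv,
                   "--train", outputs["train"],
                   "--test", outputs["test"],
                   "--scaler", outputs["scaler"]],
        inputs=[raw_csv],
        outputs=[outputs["train"], outputs["test"], outputs["scaler"]],
        compute_target=compute_target,
        runconfig=run_config,
    )

    train_step = PythonScriptStep(
        name="train model",
        script_name="train.py",
        source_directory="pipeline/train",
        arguments=["--train", outputs["train"],
                   "--test", outputs["test"],
                   "--model", outputs["model"]],
        # Consuming the prep outputs makes this step run after prep_step.
        inputs=[outputs["train"], outputs["test"]],
        outputs=[outputs["model"]],
        compute_target=compute_target,
        runconfig=run_config,
    )
    return prep_step, train_step
```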

We are all set to create the pipeline composed of the above two steps, validate it, and finally submit to Azure ML.
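A compact sketch of that final cell follows. The experiment name diabetes-pipeline is a placeholder and run_pipeline is a hypothetical wrapper; steps is expected to be a sequence of already-defined pipeline steps.

```python
def run_pipeline(ws, steps, experiment_name="diabetes-pipeline"):
    """Assemble the steps into a pipeline, validate it, submit it, and wait for the result."""
    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline

    pipeline = Pipeline(workspace=ws, steps=list(steps))

    # validate() reports graph problems such as unconnected inputs before anything runs.
    pipeline.validate()

    pipeline_run = Experiment(ws, experiment_name).submit(pipeline)
    pipeline_run.wait_for_completion(show_output=True)
    return pipeline_run
```

The returned run object is the same one the portal shows under the experiment, so its metrics can also be inspected from the notebook.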

After you run all the cells in the notebook, you can switch to the Azure portal and see the metrics from both steps.

Clicking on the Run ID link takes you to the page where you can access the metrics and output from each step.

You can download both the scaler object and the model from Azure storage to your workstation.

With the downloaded files, you can either host the model locally or register them as models in Azure for inference.
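For local hosting, a minimal scoring sketch could look like this. It assumes both artifacts were saved with pickle and that the scaler and model expose scikit-learn's transform and predict methods; the file names in the usage note are placeholders.

```python
import pickle


def load_pickle(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def score(scaler, model, rows):
    """Scale raw feature rows with the training-time scaler, then predict."""
    scaled = scaler.transform(rows)
    return list(model.predict(scaled))
```

A call such as score(load_pickle("scaler.pkl"), load_pickle("model.pkl"), rows) takes a list of unscaled feature rows in the same column order as the training data. Applying the saved scaler first is what keeps inference consistent with training.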

In the next part of this series, we will explore Azure AutoML to train the model without writing code. Stay tuned!

Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.

