Train and Deploy Machine Learning Models with Amazon SageMaker Autopilot

Launched at AWS re:Invent 2019, Amazon SageMaker Autopilot simplifies the process of training machine learning models while providing an opportunity to explore the data and try different algorithms. Unlike most AutoML platforms, it lets you download the data exploration and candidate notebooks, which provide insights into the data preparation, feature engineering, model training, and hyperparameter tuning.
In this end-to-end tutorial, I will walk you through the steps involved in training a binary classification model from the SageMaker Studio IDE.
Setting up the Environment
Assuming that you have an active AWS account, follow the SageMaker Studio onboarding process described in the documentation. This creates a new IAM role with the appropriate permissions to access S3 buckets and the SageMaker environment.
Once SageMaker Studio is ready, you can access the IDE and get started with the experiments.
From the IDE, launch a Python 3 notebook and rename it to acquire.ipynb. We will use this notebook to acquire and split the dataset.
Start by initializing the environment: import the modules and get the default S3 bucket used by SageMaker Studio.
import sagemaker
import boto3
import pandas as pd
from sagemaker import get_execution_role

region = boto3.Session().region_name
session = sagemaker.Session()
bucket = session.default_bucket()
print(bucket)

prefix = 'sagemaker/termdepo'
role = get_execution_role()

sm = boto3.Session().client(service_name='sagemaker', region_name=region)
Let’s start by downloading the bank marketing dataset from Datahub.io, which contains historical data on customers’ decisions to buy a new term deposit.
Feel free to explore the schema of the dataset explained at Datahub. The last column (Class) has a value of 1 or 2, where 1 represents a no and 2 represents a yes.
Our goal is to train a model that can predict customer decisions based on historical data.
!wget -N https://datahub.io/machine-learning/bank-marketing/r/bank-marketing.csv
local_data_path = 'bank-marketing.csv'
We will now split the data into a training set (80%) and a test set (20%). For the test dataset, we will remove the label (Class), since this is the dataset we will use for inferencing.
data = pd.read_csv(local_data_path)
train_data = data.sample(frac=0.8, random_state=200)
test_data = data.drop(train_data.index)
test_data = test_data.drop(columns=['Class'])
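As an optional sanity check before uploading anything, you can confirm the split sizes and the distribution of the target column with a couple of lines like these:

# Optional sanity check on the split created above
print(train_data.shape, test_data.shape)    # roughly an 80/20 split of the full dataset
print(train_data['Class'].value_counts())   # distribution of the 1 (no) and 2 (yes) labels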
The final step is to upload the train and test datasets to the default S3 bucket.
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train dataset uploaded to: ' + train_data_s3_path)

test_file = 'test_data.csv'
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test dataset uploaded to: ' + test_data_s3_path)
You can verify in the S3 bucket that the train and test folders and the CSV files have been created.
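If you prefer to verify this from the notebook rather than the S3 console, a short listing such as the sketch below (reusing the bucket and prefix variables defined earlier) does the same job:

# List the objects uploaded under the experiment prefix
s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
    print(obj['Key'])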
Now, we are set to create the Autopilot job to train the model.
Creating an Autopilot Experiment
Select the Experiments icon in the sidebar and click on the Create Experiment button.
We need to provide the following details:
Experiment name: An arbitrary name to identify the experiment.
S3 location of input data: S3 path to the training dataset.
Target attribute name: The label used for prediction. In our case, this is ‘Class’.
S3 location for output data: S3 path for storing the models and other artifacts generated by the experiment.
Machine learning problem type: This can be left as Auto, but we will choose Binary Classification.
Objective metric: The metric used to select the best candidate. We will use F1 as the metric.
Running a complete experiment: It is possible to generate just the notebooks. We will run the complete experiment for this tutorial.
Finally, click on the Create Experiment button located in the bottom right corner to launch the experiment.
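For reference, the same settings map to the create_auto_ml_job API of the SageMaker client created in the first notebook. The sketch below is not needed for this tutorial, since the Studio UI does all of this for us; the job name is just an illustrative example:

# Hypothetical example: creating the same Autopilot job programmatically
auto_ml_job_name = 'termdepo-autopilot-job'   # arbitrary example name
sm.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=[{
        'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
                                        'S3Uri': train_data_s3_path}},
        'TargetAttributeName': 'Class'
    }],
    OutputDataConfig={'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)},
    ProblemType='BinaryClassification',
    AutoMLJobObjective={'MetricName': 'F1'},
    RoleArn=role
)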
The pipeline moves through the analyzing data, feature engineering, and model tuning phases before it is marked as completed. The entire process takes about 60 minutes, during which we can explore the trials generated by the experiment.
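The job's progress can also be watched from the notebook with describe_auto_ml_job. The loop below is a minimal sketch that assumes the auto_ml_job_name variable from the previous snippet:

import time

# Poll the Autopilot job until it finishes, printing the current phase
while True:
    job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    status = job['AutoMLJobStatus']
    print(status, '-', job['AutoMLJobSecondaryStatus'])
    if status in ('Completed', 'Failed', 'Stopped'):
        break
    time.sleep(60)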
Once the first phase is done, we can access the candidate generation notebook and data exploration notebook.
The data exploration notebook analyzes the quality of the dataset and recommends possible actions to improve it.
The candidate definition notebook lets you customize the pipeline for each candidate and execute the workflow.
For each trial, SageMaker Autopilot creates the processing, transformation, training, and tuning jobs that make up the pipeline. Typically, an Autopilot experiment generates up to 250 trials, from which the best candidate is selected. The output folder in the S3 bucket will contain the artifacts generated by each job.
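The trials can also be inspected programmatically. As a sketch, again assuming the job name used above, list_candidates_for_auto_ml_job returns each candidate along with its objective metric score:

# List candidates sorted by the objective metric (best first)
candidates = sm.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    SortBy='FinalObjectiveMetricValue',
    SortOrder='Descending'
)['Candidates']

for c in candidates:
    print(c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])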
You may wait until the experiment concludes or stop it once the objective metric score has reached an acceptable level.
Deploying the Model from the Best Candidate
Choose the trial marked as the best, indicated by a star next to it, and click on the Deploy Model button, which takes us to the model deployment UI. Give the endpoint a name, choose an instance type, and click on the deploy button at the bottom.
Once the model is deployed, it shows up in the Endpoints section. Wait till the status becomes InService.
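Deployment can also be scripted once the experiment has completed. The sketch below assumes the sm client, role, and auto_ml_job_name used earlier and uses example resource names; it takes the best candidate's inference containers and creates a model, an endpoint configuration, and an endpoint:

# Hypothetical programmatic deployment of the best candidate
best = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']

model_name = 'termdepo-best-model'   # example name
ep_config_name = 'termdepo-epc'      # example name
ep_name = 'term-depo-demo-1'         # endpoint name used for inference below

sm.create_model(ModelName=model_name,
                Containers=best['InferenceContainers'],
                ExecutionRoleArn=role)

sm.create_endpoint_config(EndpointConfigName=ep_config_name,
                          ProductionVariants=[{'VariantName': 'AllTraffic',
                                               'ModelName': model_name,
                                               'InstanceType': 'ml.m5.large',
                                               'InitialInstanceCount': 1}])

sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=ep_config_name)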
Performing Inferencing with the Model
Switch to the launcher and create another notebook called infer.ipynb for performing the inference.
Import the modules and initialize the client for SageMaker runtime.
import sagemaker
import boto3
from sagemaker import get_execution_role
region = boto3.Session().region_name
sm_rt = boto3.Session().client('runtime.sagemaker', region_name=region)
We can now invoke the endpoint for inferencing.
l = "43,technician,divorced,unknown,no,4389,no,no,telephone,2,jul,632,2,85,1,success"
ep_name = "term-depo-demo-1"

response = sm_rt.invoke_endpoint(EndpointName=ep_name, ContentType='text/csv',
                                 Accept='text/csv', Body=l)
response = response['Body'].read().decode("utf-8")
print(response)
Try experimenting with values from the test dataset to see different classification results from the model. It’s also possible to perform batch inferencing by sending the entire test dataset. We will explore this in the next part of the tutorial on Friday, where we use the SageMaker Python SDK for Autopilot. Stay tuned.