
Train, Deploy Machine Learning Models with Amazon SageMaker

2 Nov 2018 3:00am, by

This article is a part of the series where we explore cloud-based machine learning services. After covering Azure ML Services and the Google Cloud ML Engine, we will take a closer look at Amazon SageMaker.

Announced at re:Invent 2017, Amazon SageMaker is a managed machine learning service from AWS. It supports both training and hosting machine learning models in the cloud. Customers can run training jobs on clusters backed by NVIDIA Tesla K80 and V100 GPUs. The outcome of a training job, a model ready for inferencing, can be exposed as a REST API that delivers scalable predictions.

The service also supports hyperparameter tuning, where data scientists and developers rely on the service to find the hyperparameter values best suited to a given algorithm and business problem. SageMaker treats the search itself like a regression problem: it makes guesses about which hyperparameter combinations are likely to get the best results, and runs training jobs to test these guesses. After testing the first set of hyperparameter values, hyperparameter tuning uses regression to choose the next set of hyperparameter values to test.
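Tuning of this kind is configured rather than hand-coded. As a rough sketch, the configuration section of a request to boto3's `create_hyper_parameter_tuning_job` call might be assembled as follows. The metric name, hyperparameter names, ranges, and job limits are illustrative assumptions, not values from this article.

```python
def build_tuning_config(metric_name, max_jobs, max_parallel_jobs):
    """Assemble a HyperParameterTuningJobConfig dict for boto3's SageMaker client."""
    return {
        # "Bayesian" is the strategy that picks new guesses from earlier results.
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": metric_name,
        },
        # Caps on how many training jobs the tuner may launch in total and at once.
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel_jobs,
        },
        # Range bounds are passed as strings. Names and bounds here are assumed.
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "learning_rate", "MinValue": "0.0001", "MaxValue": "0.1"},
            ],
            "IntegerParameterRanges": [
                {"Name": "mini_batch_size", "MinValue": "100", "MaxValue": "1000"},
            ],
        },
    }


# Assumed metric name; the real one depends on the algorithm being tuned.
tuning_config = build_tuning_config("validation:objective_loss", 20, 2)
print(tuning_config["Strategy"])
```

The full call also needs a job name and a training job definition, which are left out here to keep the sketch short.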

One of the best design decisions of Amazon SageMaker is using Jupyter Notebooks as the development tool. Given the familiarity and popularity of Notebooks among data scientists, the entry barrier is low. AWS has built a native Python SDK that can be mixed and matched with standard modules like NumPy, Pandas, and Matplotlib.

Amazon SageMaker is tightly integrated with relevant AWS services to make it easy to handle the lifecycle of models. Through Boto3, the AWS SDK for Python, datasets can be stored in and retrieved from Amazon S3 buckets. Data can also be imported from Amazon Redshift, the AWS data warehouse in the cloud. The service is integrated with IAM for authentication and authorization. Spark clusters running on Amazon EMR can be integrated with SageMaker. AWS Glue is the preferred service for data transformation and preparation.
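A minimal sketch of the S3 round trip with Boto3 might look like this. The bucket name, object key, and helper function names are hypothetical, and the two functions that touch AWS assume credentials are already configured.

```python
def s3_uri(bucket, key):
    """Build the s3:// URI that SageMaker jobs take as a dataset location."""
    return "s3://{}/{}".format(bucket, key)


def upload_dataset(local_path, bucket, key):
    """Upload a local file to S3 and return its URI."""
    import boto3  # imported lazily so s3_uri() works without AWS libraries installed

    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


def download_dataset(bucket, key, local_path):
    """Fetch an object from S3 to a local file."""
    import boto3

    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


# Hypothetical bucket and key, used only to show the URI shape.
print(s3_uri("example-ml-bucket", "datasets/train.csv"))
```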

Docker containers play a key role in SageMaker’s architecture. AWS provides container images for popular algorithms such as linear regression, logistic regression, principal component analysis, text classification, and object detection. Developers are expected to pass the location of the dataset and a set of parameters to the containers before starting the training job. However, the high-level Python API abstracts the steps involved in dealing with containers. Finally, the trained model is also packaged as a container image that is used for exposing the prediction API. SageMaker relies on Amazon Elastic Container Registry (ECR) for storing the images and Amazon EC2 for hosting the models.

There are three essential components to Amazon SageMaker: hosted Jupyter Notebooks, distributed training jobs, and model deployments that expose prediction endpoints.

Let’s take a closer look at the steps involved in training and predicting from a machine learning model deployed in Amazon SageMaker.

Data Preparation and Exploration

Amazon SageMaker expects the dataset to be available in an S3 bucket. Before uploading the data, customers may choose to perform ETL operations in external services such as AWS Glue, AWS Data Pipeline, or Amazon Redshift.

Data scientists can use familiar tools, including Pandas and Matplotlib, to explore and visualize data.

After preparing and exploring the data, the dataset is transformed into a format expected by SageMaker models. Since the platform has strong roots in Apache MXNet, it uses the Tensor data type defined in the framework. NumPy arrays and Pandas DataFrames need to be serialized into MXNet Tensors before the dataset is uploaded to an S3 bucket.
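Several built-in algorithms also accept plain CSV, with the label in the first column and no header row. The sketch below shows that simpler route using only the standard library. It is a stand-in for the tensor serialization described above, not the same technique, and the function name and sample values are made up for illustration.

```python
import csv
import io


def rows_to_csv(features, labels):
    """Serialize feature rows to header-less CSV with the label in the first column."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label] + list(row))
    return buffer.getvalue()


# Two made-up samples with two features each.
payload = rows_to_csv([[5.1, 3.5], [6.2, 2.9]], [0, 1])
print(payload)
```

The resulting string can be written to a file and uploaded to S3 like any other dataset.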

Model Selection and Training

Amazon SageMaker has built-in algorithms that abstract the low-level details of training a model. Each algorithm is available as an API that takes the dataset and metrics as parameters. This removes the confusion involved in choosing the right framework for training. Once the developer decides which algorithm to use, training comes down to invoking the API mapped to that specific algorithm.

Behind the scenes, SageMaker uses the Apache MXNet and Gluon frameworks to translate the API call into the multiple steps needed to create the job. These algorithms are packaged as container images stored in Amazon ECR.

Apart from Apache MXNet, SageMaker exposes TensorFlow as a native framework. Developers can write code for creating custom TensorFlow models.

It is also possible to use custom frameworks such as PyTorch and Scikit-learn. SageMaker expects these frameworks to be encapsulated in container images. Amazon has published prescriptive guides that contain the Dockerfile and helper scripts for creating custom images. Just before initiating a training job through the low-level Python API, Amazon SageMaker can be pointed to the custom image instead of a built-in image.

Model Training

Amazon SageMaker’s training jobs run in a distributed environment based on Amazon EC2 instances. The API expects the number of instances along with the instance type to run the training job. For training complex artificial neural networks, GPU-backed instance types are recommended, such as ml.p2.xlarge (NVIDIA K80) or ml.p3.2xlarge (NVIDIA V100) and larger.
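To make the instance count and instance type parameters concrete, here is a sketch of a request for boto3's low-level `create_training_job` call. The job name, image URI, role ARN, S3 locations, volume size, time limit, and hyperparameters are all placeholder assumptions.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, instance_type, instance_count,
                               hyperparameters):
    """Assemble keyword arguments for the SageMaker create_training_job call."""
    return {
        "TrainingJobName": job_name,
        # The container image can be a built-in algorithm or a custom image in ECR.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The size of the training cluster is set here.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,  # assumed volume size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # assumed limit
        # The API expects hyperparameter values as strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


request = build_training_job_request(
    job_name="demo-training-job",
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<image>:<tag>",  # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder
    train_s3_uri="s3://example-ml-bucket/datasets/train/",
    output_s3_uri="s3://example-ml-bucket/models/",
    instance_type="ml.p3.2xlarge",
    instance_count=2,
    hyperparameters={"epochs": 10, "learning_rate": 0.01},
)
# With AWS credentials configured, the job could then be started with:
# boto3.client("sagemaker").create_training_job(**request)
```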

When initiated from a Jupyter Notebook, a training job runs synchronously: the call displays basic progress logs and returns only after training completes.
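The blocking behaviour can be approximated with a polling loop around boto3's `describe_training_job`. This is a sketch of the idea, not necessarily how the SageMaker SDK implements it. The `fake_describe` function below is a stand-in so the loop can run without an AWS account.

```python
import time

# Statuses after which a training job will not change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a describe-style callable until the job reaches a terminal status."""
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        print("{}: {}".format(job_name, status))
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Stand-in for boto3.client("sagemaker").describe_training_job.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_fake_statuses)}


final_status = wait_for_training_job(fake_describe, "demo-training-job", poll_seconds=0)
```

Against a real account, `describe` would be `boto3.client("sagemaker").describe_training_job` and the poll interval would stay at its default or higher.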

Model Deployment

Deploying a model in Amazon SageMaker is a two-step process. The first step creates an endpoint configuration that specifies the ML compute instances used to deploy the model. The second step launches the ML compute instances, deploys the model, and exposes the URI for predictions.

The endpoint configuration API accepts the ML instance type and the initial count of instances. For inferencing with neural networks, the configuration may include GPU-backed instance types. The endpoint API then provisions the infrastructure as defined in the previous step.
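The two steps map onto two boto3 calls, `create_endpoint_config` and `create_endpoint`. The sketch below builds both requests. The endpoint, model, and variant names are hypothetical, and it assumes the trained model has already been registered with SageMaker under `model_name`.

```python
def build_endpoint_requests(endpoint_name, model_name, instance_type,
                            initial_instance_count):
    """Return (endpoint config request, endpoint request) for the two-step deployment."""
    config_name = endpoint_name + "-config"
    config_request = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # assumed variant name
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": initial_instance_count,
            }
        ],
    }
    endpoint_request = {
        "EndpointName": endpoint_name,
        "EndpointConfigName": config_name,
    }
    return config_request, endpoint_request


def deploy(config_request, endpoint_request):
    """Send both requests. Requires boto3 and configured AWS credentials."""
    import boto3

    client = boto3.client("sagemaker")
    client.create_endpoint_config(**config_request)  # step 1: describe the instances
    client.create_endpoint(**endpoint_request)       # step 2: provision and expose


config_request, endpoint_request = build_endpoint_requests(
    "demo-endpoint", "demo-model", "ml.m4.xlarge", 1
)
```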

Amazon SageMaker supports both online and batch predictions. A batch transform job uses a trained model to get inferences on a dataset stored in Amazon S3, and saves the inferences to an S3 bucket that is specified when the job is created.
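Both prediction modes can be sketched with boto3. The first function builds a request for `create_transform_job`, and the second calls a deployed endpoint through the `sagemaker-runtime` client. The job and model names, S3 paths, instance type, and CSV content type are assumptions chosen for illustration.

```python
def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri,
                                instance_type="ml.m4.xlarge", instance_count=1):
    """Assemble keyword arguments for the SageMaker create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Where the batch reads its input from, and how records are split.
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        # Where the inferences are written.
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


def predict_online(endpoint_name, csv_row):
    """Send one CSV record to a live endpoint. Requires boto3 and AWS credentials."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=csv_row
    )
    return response["Body"].read().decode("utf-8")


batch_request = build_transform_job_request(
    "demo-batch-job",
    "demo-model",
    "s3://example-ml-bucket/batch/input/",
    "s3://example-ml-bucket/batch/output/",
)
```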

Compared to Google Cloud ML Engine and Azure ML Service, Amazon SageMaker lacks the ability to use local compute resources for training and testing models. Developers are expected to create hosted notebooks and provision instances for training and prediction even for simple ML projects, which makes the service expensive.

Amazon is expected to announce multiple enhancements to SageMaker at re:Invent this year.

In the next part of this series, I will introduce another ML PaaS. Stay tuned!

