Series: A Close Look at Cloud-Based Machine Learning Platforms
Machine learning has become one of the key managed services offered by public cloud providers. ML as a Service (MLaaS) is built on the fundamental building blocks of the cloud: compute, storage, network, and databases. Customers signing up for ML managed services directly or indirectly consume additional services such as object storage, virtual machines, containers, data warehouses, data lakes, and business intelligence. Realizing this opportunity, cloud providers have invested in ML platforms.
This series of articles will take a closer look at the MLaaS offerings from the top public cloud providers – Amazon Web Services, Google, IBM, Microsoft, and Oracle. The objective is to map the components of the managed ML services to the typical workflow used by data scientists, developers, and DevOps in building and deploying a machine learning model. This guide doesn’t provide an evaluation or comparison of the services to stack up individual players.
Lifecycle of a Machine Learning Model
Before exploring the MLaaS offerings, let me set the stage by introducing the broad framework adopted by data scientists and developers for building production-grade ML models. After establishing the typical workflow, it becomes easy to categorize and map the components of ML platforms to each of the milestones.
Whether you are developing a computer vision AI, a conversational AI, or a time-series model, there are five stages involved in machine learning:
Data preparation is the first and most crucial part of model building. Data scientists identify various sources of data and define a mechanism to pre-process the data. This includes transforming streaming data ingested in real time and processing historical data stored in data lakes and data warehouses. Data labeling and feature engineering take place in this phase, producing a prepared dataset with well-defined labels and features, which is critical for building a model. A variety of cloud-based services, such as object storage, event streaming, data processing, and data exploration, are used to prepare the data.
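To make "a prepared dataset with well-defined labels and features" concrete, here is a minimal sketch in plain Python. The record fields (`amount`, `country`, `is_fraud`) and the `prepare_dataset` helper are hypothetical and chosen only for illustration; a real pipeline would typically lean on a managed data processing service rather than hand-written loops.

```python
from statistics import mean, pstdev

# Hypothetical raw records, e.g. historical rows pulled from a data lake.
raw_records = [
    {"amount": 120.0, "country": "US", "is_fraud": 0},
    {"amount": 80.0, "country": "DE", "is_fraud": 0},
    {"amount": 400.0, "country": "US", "is_fraud": 1},
]


def prepare_dataset(records):
    """Return (features, labels): a standardized amount plus a one-hot country."""
    amounts = [r["amount"] for r in records]
    mu = mean(amounts)
    sigma = pstdev(amounts) or 1.0  # fall back to 1.0 if all amounts are equal
    countries = sorted({r["country"] for r in records})

    features, labels = [], []
    for r in records:
        scaled_amount = (r["amount"] - mu) / sigma
        one_hot = [1.0 if r["country"] == c else 0.0 for c in countries]
        features.append([scaled_amount] + one_hot)
        labels.append(r["is_fraud"])
    return features, labels


features, labels = prepare_dataset(raw_records)
```

Each row in `features` now has the same numeric shape, and `labels` carries the target the model will learn to predict.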
In the build phase, developers use their favorite development tools such as a Jupyter Notebook or PyCharm to write code in Python or R to apply an algorithm to the processed dataset. Developers often use a smaller subset of the original dataset to build and test the model either in the local environment or in the public cloud. During this phase, developers, data scientists, and domain experts collaborate to ensure that the feature engineering and chosen algorithms are aligned with the business goals.
Once the algorithm or the neural network architecture is decided and tested by the developer, the training phase starts. In this phase, developers experiment with critical parameters that influence the precision and accuracy of the model. For neural networks, hyperparameter tuning takes place in this phase. To assist developers in accelerating the training, ML platforms offer AutoML and auto-tuning of hyperparameters. Training typically takes place on a fleet of high-end virtual machines powered by GPUs or AI accelerators. Public cloud providers take advantage of containers and Kubernetes to orchestrate the training jobs across a GPU cluster.
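Automated hyperparameter tuning can be illustrated with a small random search. This is only a sketch of the general idea, not how any cloud provider's tuner works: the `validation_loss` function is a made-up stand-in for a real training run, and the parameter names and ranges are assumptions.

```python
import random


def validation_loss(learning_rate, batch_size):
    """Stand-in for a real training run (hypothetical loss surface).

    The minimum sits at learning_rate=0.1, batch_size=64.
    """
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2


def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best trial."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = validation_loss(**params)
        if best is None or loss < best["loss"]:
            best = {"params": params, "loss": loss}
    return best


best_trial = random_search(n_trials=50)
```

On a managed platform, each trial would be a separate training job on its own instance, which is why tuning benefits so much from an elastic GPU fleet.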
Once the model is finalized and frozen, it is deployed in production as an inference service. Cloud providers expose APIs that abstract the process of containerizing the model, adding the custom inference code, deploying the container to a VM or Kubernetes cluster, and securely exposing it as an API endpoint. Developers can call the API available in the SDK to move a model from training to deployment.
The lifecycle of a model doesn’t end with deployment. Like any production API, the endpoint needs to be monitored for any errors. Specifically for ML models, we also need to monitor the quality of predictions through model drift management. By constantly comparing the predictions with an expected outcome, we can identify model decay and automatically trigger model training and deployment. Model management also includes concepts such as blue/green deployments to measure the quality of new versus old models.
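The idea of comparing predictions with expected outcomes to spot model decay can be sketched as a rolling accuracy check. The `DriftMonitor` class, the window size, and the threshold below are all hypothetical; production systems use richer statistics, but the trigger logic follows the same pattern.

```python
from collections import deque


class DriftMonitor:
    """Track rolling accuracy of predictions against expected outcomes."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.window = deque(maxlen=window)  # keeps only the most recent results
        self.min_accuracy = min_accuracy

    def record(self, prediction, expected):
        self.window.append(prediction == expected)

    def accuracy(self):
        return sum(self.window) / len(self.window) if self.window else 1.0

    def needs_retraining(self):
        # Decide only once the window is full, to avoid noisy early triggers.
        window_full = len(self.window) == self.window.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = DriftMonitor(window=5, min_accuracy=0.8)
for prediction, expected in [(1, 1), (0, 0), (1, 0), (1, 0), (0, 0)]:
    monitor.record(prediction, expected)

retrain = monitor.needs_retraining()  # 3 of 5 correct, below the 0.8 threshold
```

When `needs_retraining()` returns `True`, an automated workflow could kick off a new training job and, after validation, a fresh deployment.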
After defining the key milestones and phases of ML lifecycle management, let’s see how they are implemented by the public cloud providers.
Launched in 2017, Amazon SageMaker is one of the most comprehensive cloud-based machine learning platforms available today. Let’s understand the key components of Amazon SageMaker and how they map to the five stages discussed in the previous section.
Amazon SageMaker has multiple services to help data scientists pre-process datasets. One that stands out is Amazon SageMaker Ground Truth, which offers assistive labeling features such as automatic 3D cuboid snapping and auto-segmentation. It can even label data automatically using a machine learning model.
SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning. It can import data from services such as S3, Athena, Redshift, and AWS Lake Formation.
Amazon SageMaker Feature Store, a recent addition to the SageMaker platform, provides a central repository for data features. Features can be stored, retrieved, discovered, and shared through SageMaker Feature Store for easy reuse across models and teams with secure access and control.
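To show what a central feature repository buys you, here is a toy in-memory version. It is not the SageMaker Feature Store API; the `FeatureStore` class, its methods, and the feature names are hypothetical and exist only to illustrate storing features once and reusing them across models.

```python
class FeatureStore:
    """In-memory stand-in for a central feature repository."""

    def __init__(self):
        self._groups = {}  # group name -> {record id -> feature dict}

    def put(self, group, record_id, features):
        self._groups.setdefault(group, {})[record_id] = dict(features)

    def get(self, group, record_id, names=None):
        """Return all features for a record, or only the requested names."""
        record = self._groups[group][record_id]
        if names is None:
            return dict(record)
        return {name: record[name] for name in names}

    def list_groups(self):
        """Let other teams discover which feature groups exist."""
        return sorted(self._groups)


store = FeatureStore()
store.put("customers", "c-42", {"tenure_months": 18, "avg_order_value": 52.5})

# Two different models reuse the same stored features.
churn_features = store.get("customers", "c-42", ["tenure_months"])
upsell_features = store.get("customers", "c-42")
```

The point is that the feature engineering happens once; each model simply asks for the subset it needs.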
Amazon SageMaker Studio is an end-to-end, multi-user IDE based on the popular Jupyter Notebooks for building ML models in Python and R. Through the SageMaker Python SDK, developers can programmatically access various managed services of AWS without leaving the IDE.
SageMaker Studio is integrated with over 150 popular open source models and over 15 pre-built solutions for common use cases such as churn prediction and fraud detection. It supports mainstream deep learning frameworks including Apache MXNet, TensorFlow, and PyTorch.
Amazon SageMaker relies on containers and EC2 instances for training machine learning models. Developers can also use local compute resources from the same Python SDK.
SageMaker Experiments provides a way to track the iterative process of training models. Each experiment records the input parameters, configurations, and results of every iteration.
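The bookkeeping behind experiment tracking can be sketched in a few lines. The `Experiment` and `Trial` classes below are hypothetical and unrelated to the actual SageMaker SDK; they only show the shape of the data: parameters in, metrics out, one entry per iteration.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    parameters: dict
    metrics: dict


@dataclass
class Experiment:
    name: str
    trials: list = field(default_factory=list)

    def log_trial(self, parameters, metrics):
        """Record the inputs and results of one training iteration."""
        self.trials.append(Trial(parameters, metrics))

    def best_trial(self, metric, higher_is_better=True):
        """Return the trial with the best value for the given metric."""
        pick = max if higher_is_better else min
        return pick(self.trials, key=lambda t: t.metrics[metric])


experiment = Experiment("churn-model")
experiment.log_trial({"max_depth": 3}, {"accuracy": 0.81})
experiment.log_trial({"max_depth": 6}, {"accuracy": 0.86})
best = experiment.best_trial("accuracy")
```

Keeping every trial, not just the winner, is what makes it possible to compare runs later and reproduce a result.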
SageMaker Debugger captures real-time metrics including confusion matrices and learning gradients that influence model accuracy. It can also monitor and profile system resources such as CPU, GPU, network, and memory in real-time to provide recommendations on the re-allocation of these resources.
SageMaker Studio includes JumpStart and Autopilot, which bring transfer learning and AutoML techniques to developers. JumpStart supports vision and language-based models, while Autopilot is meant for structured data stored in a tabular format.
SageMaker Pipelines is one of the new features of the platform that helps users fully automate ML workflows from data preparation through model deployment. SageMaker Pipelines comes with a Python SDK that connects to SageMaker Studio, so users can take advantage of the visual interface to interactively build the steps involved in the workflow.
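Conceptually, an ML pipeline is an ordered list of steps that pass results to one another. The sketch below is plain Python, not the SageMaker Pipelines SDK; the step functions, the shared `context` dictionary, and the toy "model" are all assumptions made for illustration.

```python
def prepare(context):
    # Toy preprocessing: scale every raw value.
    context["dataset"] = [x * 2 for x in context["raw"]]
    return context


def train(context):
    # Toy "model": the mean of the prepared dataset.
    context["model"] = sum(context["dataset"]) / len(context["dataset"])
    return context


def deploy(context):
    # Pretend deployment: derive an endpoint name from the version.
    context["endpoint"] = f"model-v{context['version']}"
    return context


def run_pipeline(steps, context):
    """Run steps in order, passing a shared context and recording what ran."""
    context["completed"] = []
    for step in steps:
        context = step(context)
        context["completed"].append(step.__name__)
    return context


result = run_pipeline([prepare, train, deploy], {"raw": [1, 2, 3], "version": 1})
```

A managed pipeline service adds what this sketch leaves out: retries, caching of unchanged steps, lineage tracking, and execution on remote compute.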
A trained model can be registered in a model registry that maintains multiple versions of the same model. SageMaker Neo, which optimizes models for the target hardware, can be used to deploy optimized models in the cloud or on an edge device. Cloud-based models are exposed through an HTTPS endpoint that serves the inference requests.
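A registry that maintains multiple versions of the same model can be pictured as follows. This is a hypothetical in-memory sketch, not the SageMaker model registry API; the `ModelRegistry` class and the artifact dictionaries are illustrative only.

```python
class ModelRegistry:
    """In-memory stand-in for a registry that keeps every version of a model."""

    def __init__(self):
        self._versions = {}  # model name -> list of artifacts, oldest first

    def register(self, name, artifact):
        """Store a new version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(artifact)
        return len(versions)

    def get(self, name, version=None):
        """Return a specific version, or the latest when version is None."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]


registry = ModelRegistry()
registry.register("churn", {"accuracy": 0.81})
latest_version = registry.register("churn", {"accuracy": 0.86})
rollback_target = registry.get("churn", version=1)
```

Because older versions stay addressable, rolling back a bad release is a lookup rather than a retraining job.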
Amazon SageMaker Model Monitor is designed to detect and help remediate concept drift in deployed ML models. It detects drift automatically and provides detailed alerts that help identify the source of the problem. These metrics can be integrated with CloudWatch for visualization and analysis.
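One simple way to picture drift monitoring is to compute baseline statistics from the training data and flag features whose live values stray too far from them. The sketch below uses a plain mean-shift check on input features; it is an assumption-laden illustration, not the algorithm Model Monitor actually uses, and the function names, feature names, and `max_shift` threshold are hypothetical.

```python
from statistics import mean, pstdev


def build_baseline(training_data):
    """Compute per-feature mean and standard deviation from training data."""
    return {
        name: {"mean": mean(values), "std": pstdev(values)}
        for name, values in training_data.items()
    }


def find_drifted_features(baseline, live_data, max_shift=2.0):
    """Flag features whose live mean moved more than max_shift baseline std devs."""
    drifted = []
    for name, stats in baseline.items():
        std = stats["std"] or 1.0  # avoid dividing by zero for constant features
        shift = abs(mean(live_data[name]) - stats["mean"]) / std
        if shift > max_shift:
            drifted.append(name)
    return drifted


baseline = build_baseline(
    {"age": [30, 35, 40, 45, 50], "income": [40, 50, 60, 70, 80]}
)
alerts = find_drifted_features(
    baseline, {"age": [39, 41, 40], "income": [150, 160, 170]}
)
```

Here `age` still looks like the training data while `income` has shifted sharply, so only `income` would raise an alert pointing at the source of the problem.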
In the next part of this series, we will explore Azure Machine Learning services and the Vertex AI platform from Google.