A Close Look at Cloud-Based Machine Learning Platforms: Microsoft Azure ML, Google Vertex AI
This is the second part of the ML PaaS series where we explore Azure Machine Learning services and Google’s Vertex AI platform.
We follow the same framework of classifying the features and services of these platforms into the five stages of machine learning.
Azure Machine Learning is one of the first cloud-based ML PaaS offerings. Since its launch in 2016, Microsoft has steadily added new features and capabilities to the Azure ML service. In its current form, Azure ML is one of the most complete and robust ML platforms available in the public cloud.
Azure ML Studio delivers the user experience for managing end-to-end machine learning tasks within Azure Portal.
Data in Azure ML can be ingested through Azure Data Factory from a variety of data sources including Azure Blob Container, Azure Data Lake, Azure SQL Database, and Databricks File System. Once ingested, datasets can be easily processed through the Pandas module. For large datasets that need parallelization, Apache Spark, Modin, or Dask can be used for preprocessing.
Azure ML has a concept of a workspace that contains all the assets of a machine learning project including datasets, notebooks, models, and deployments. Data scientists and developers can launch a Jupyter Notebook within a workspace. Azure ML’s Python SDK provides programmatic access to storage, compute, and other managed services.
Azure ML Studio comes with a low-code/no-code environment based on a visual designer for building ML pipelines. With a drag-and-drop interface, the visual designer simplifies building ML models.
Azure ML decouples storage and compute resources from the training environment. Once a processed dataset is accessed and loaded, a pre-provisioned compute cluster can be leveraged to start the training job.
Azure ML also supports using compute resources on the developer workstation or a remote data center for training.
Each training job runs in the context of an experiment. An experiment contains one or more runs of a training job. Each run represents one iteration of the training, which results in a set of metrics and a trained model. Developers can choose the best model from among the runs based on metrics such as accuracy and precision.
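To make the relationship between experiments and runs concrete, here is a minimal sketch in plain Python. It does not use the Azure ML SDK: the `Run` and `Experiment` classes, the `log_run` and `best_run` helpers, and the metric values are all hypothetical and exist only to illustrate the idea of picking the best run by a chosen metric.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One iteration of a training job and the metrics it produced."""
    run_id: str
    metrics: Dict[str, float]


@dataclass
class Experiment:
    """A named container for one or more runs of a training job."""
    name: str
    runs: List[Run] = field(default_factory=list)

    def log_run(self, run_id: str, **metrics: float) -> Run:
        """Record a finished run together with its metrics."""
        run = Run(run_id=run_id, metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric: str) -> Run:
        """Return the run with the highest value for the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run logged the metric {metric!r}")
        return max(candidates, key=lambda r: r.metrics[metric])


# Made-up metric values, used only to exercise the helpers.
experiment = Experiment("example-experiment")
experiment.log_run("run-1", accuracy=0.81, precision=0.77)
experiment.log_run("run-2", accuracy=0.86, precision=0.74)
experiment.log_run("run-3", accuracy=0.84, precision=0.80)

best_by_accuracy = experiment.best_run("accuracy")
best_by_precision = experiment.best_run("precision")
print(best_by_accuracy.run_id)   # run-2
print(best_by_precision.run_id)  # run-3
```

Note that the "best" run depends on the metric chosen: in this made-up data, the most accurate run is not the most precise one.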
Azure ML can launch a cluster of Azure VMs running high-end CPU and GPU infrastructure. This cluster can be accessed from the workspace to schedule training jobs.
The AutoML capabilities of Azure ML provide automated feature engineering, model selection, and hyperparameter tuning. AutoML in Azure supports classification, regression, and forecasting tasks.
Trained machine learning models are deployed as web services in the cloud or locally. Azure ML customers can also deploy models to Azure IoT Edge devices. Deployments can use CPUs, GPUs, or field-programmable gate arrays (FPGAs) for inferencing.
Azure ML takes advantage of containers and Kubernetes to deploy models. Models can be exposed as web services running in Azure Container Instances, Azure Kubernetes Service, or local compute environments.
Deployment can be initiated through the Python SDK, Azure Portal, or the CLI.
Azure ML supports the detection of both data drift and model drift.
Model monitoring enables customers to understand what data is being sent to the model and what predictions it returns. Data collected from models deployed in Azure Kubernetes Service or Azure Container Instances can then be used to detect model drift.
Collected data and metrics can also be sent to Azure Application Insights for real-time monitoring.
At the recently held I/O 2021 conference, Google launched Vertex AI, a revamped version of its ML PaaS running on Google Cloud. Vertex AI brings multiple AI-related managed services under one umbrella. Until now, Google Cloud had two separate AI services: AutoML, and custom model management offered through the Cloud AI Platform. Apart from unifying these two services, Vertex AI adds brand new features, including Edge Manager, Feature Store, Model Monitoring, and Vizier. The new capabilities plug critical gaps that existed in Google’s AI portfolio.
Let’s find out more about Vertex AI.
Vertex AI has a unified data preparation tool that supports image, tabular, text, and video content. Uploaded datasets are stored in a Google Cloud Storage bucket that acts as an input for both AutoML and custom training jobs.
AI Platform Data Labeling Service lets customers work with human labelers to generate highly accurate labels for a collection of data that can then be used in machine learning models. The labeling workflow involves uploading the raw dataset, a label set, and instructions that tell the human labelers how to identify and apply labels to the dataset.
Vertex AI has a Python SDK that can be used to access the datasets from a Jupyter Notebook, a Colab environment, or even an on-premises environment.
Vertex AI provides Docker container images that developers run as prebuilt containers for custom training. These containers, which are organized by machine learning (ML) framework and framework version, include common dependencies that can be used in training code. It’s also possible to build a custom container and upload it to Google Container Registry. A custom job specification in Vertex AI includes the dataset, a prebuilt or custom container, a Google Compute Engine machine type, and the custom training code.
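The ingredients of a custom job can be sketched as a plain Python dictionary. The field names below (`displayName`, `jobSpec`, `workerPoolSpecs`, `machineSpec`, `containerSpec`) follow our understanding of the Vertex AI custom job resource and should be checked against the current API reference before use. The `build_custom_job_spec` helper, the `--dataset-uri` flag, the image URI, and the bucket path are hypothetical placeholders; in particular, passing the dataset location as a command-line argument to the training code is an assumption of this sketch.

```python
import json
from typing import Any, Dict, List, Optional


def build_custom_job_spec(
    display_name: str,
    image_uri: str,
    machine_type: str,
    dataset_uri: str,
    extra_args: Optional[List[str]] = None,
    replica_count: int = 1,
) -> Dict[str, Any]:
    """Assemble a custom training job description as a plain dictionary."""
    if not dataset_uri.startswith("gs://"):
        raise ValueError("dataset_uri is expected to be a Cloud Storage path")
    # Assumption: the training code reads the dataset location from a
    # hypothetical --dataset-uri flag.
    args = [f"--dataset-uri={dataset_uri}"] + list(extra_args or [])
    return {
        "displayName": display_name,
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "machineSpec": {"machineType": machine_type},
                    "replicaCount": replica_count,
                    "containerSpec": {"imageUri": image_uri, "args": args},
                }
            ]
        },
    }


spec = build_custom_job_spec(
    display_name="example-training-job",
    image_uri="gcr.io/example-project/example-trainer:latest",  # placeholder
    machine_type="n1-standard-4",
    dataset_uri="gs://example-bucket/data/train.csv",  # placeholder
    extra_args=["--epochs=5"],
)
print(json.dumps(spec, indent=2))
```

The sketch only builds and prints the specification; submitting it would go through the Google Cloud Console, the CLI, or the Python SDK.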
Vertex AI comes with an integrated Jupyter Notebook environment that can be based on a GCE instance backed by a GPU. The notebook environment may be created based on prebuilt or custom container images.
Vertex AI provides Docker container images that developers run as prebuilt containers for serving predictions from trained model artifacts. These containers, which are organized by machine learning (ML) framework and framework version, provide HTTP prediction servers that can be used to serve predictions with minimal configuration.
Google provides prebuilt containers for TensorFlow, XGBoost, and scikit-learn prediction. For other frameworks, developers can build a custom prediction container image. Before publishing an endpoint, developers upload the container image along with the inference code to the platform. The inference code performs the preprocessing and post-processing of the data sent to and received from the model.
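The role of the inference code can be illustrated with a small, framework-free sketch. It assumes the request body carries a list under `instances` and the response returns a list under `predictions`, which is our understanding of the online prediction format. Everything else is hypothetical: the feature names, the `LABELS` list, the `stub_model` function standing in for a trained model, and the `handle_request` helper.

```python
from typing import Any, Callable, Dict, List, Sequence

LABELS = ["negative", "positive"]  # hypothetical class names


def preprocess(instance: Dict[str, Any]) -> List[float]:
    """Turn one raw JSON instance into the numeric vector the model expects."""
    return [float(instance["length"]) / 100.0, float(instance["exclamations"])]


def postprocess(score: float) -> Dict[str, Any]:
    """Turn a raw model score into a labeled prediction."""
    label = LABELS[1] if score >= 0.5 else LABELS[0]
    return {"label": label, "score": round(score, 3)}


def stub_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model; returns a score clipped to [0, 1]."""
    raw = 0.2 + 0.3 * features[0] + 0.2 * features[1]
    return max(0.0, min(1.0, raw))


def handle_request(
    body: Dict[str, List[Dict[str, Any]]],
    model: Callable[[Sequence[float]], float] = stub_model,
) -> Dict[str, List[Dict[str, Any]]]:
    """Preprocess each instance, call the model, and post-process the result."""
    predictions = [postprocess(model(preprocess(i))) for i in body["instances"]]
    return {"predictions": predictions}


# Made-up request used only to exercise the handler.
request = {
    "instances": [
        {"length": 40, "exclamations": 0},
        {"length": 120, "exclamations": 2},
    ]
}
response = handle_request(request)
print([p["label"] for p in response["predictions"]])  # ['negative', 'positive']
```

In a real custom container, a function like `handle_request` would sit behind an HTTP server, and `stub_model` would be replaced by the loaded model artifact.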
The container image and the inference code are scheduled in a Compute Engine VM for performing online or batch predictions.
Deployment can be done from the Google Cloud Console or through the Python SDK of the Cloud AI Platform.
Vertex AI supports traffic splitting for performing A/B tests on two different versions of the same model. This is helpful in evaluating the accuracy of new models before completely switching from the old model.
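Conceptually, traffic splitting is weighted routing: each incoming request is sent to one model version with a probability given by that version's share of traffic. The sketch below simulates this in plain Python. The `pick_model_version` helper and the version names are hypothetical, and the 90/10 split is just an example; the platform performs this routing itself once the split is configured on the endpoint.

```python
import random
from collections import Counter
from typing import Dict


def pick_model_version(traffic_split: Dict[str, int], rng: random.Random) -> str:
    """Pick a deployed model version according to percentage weights."""
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must add up to 100")
    versions = list(traffic_split)
    weights = [traffic_split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


# Send 10% of requests to the new version, 90% to the current one.
traffic_split = {"model-v1": 90, "model-v2": 10}
rng = random.Random(42)  # seeded so the simulation is repeatable

total_requests = 10_000
counts = Counter(
    pick_model_version(traffic_split, rng) for _ in range(total_requests)
)
share_v2 = counts["model-v2"] / total_requests
print(f"model-v2 received {share_v2:.1%} of simulated requests")
```

Starting the new version at a small percentage limits the impact of a regression while still collecting enough predictions to compare the two versions.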
After a model is deployed in production, there are often changes in the input data provided to the model for predictions. When the prediction input data deviates from the data that the model was trained on, the performance of the model can deteriorate, even though the model itself has not changed.
Vertex AI Monitoring supports feature skew and drift detection for categorical and numerical features. Training-serving skew occurs when the feature data distribution in production is different from the distribution of feature data that was used to train the model. Prediction drift occurs when feature data distribution in production changes significantly over time, which affects overall model performance.
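Both skew and drift come down to comparing two feature distributions and flagging the feature when a distance score crosses a threshold. To our understanding, Vertex AI uses L-infinity distance for categorical features and Jensen-Shannon divergence for numerical features. The sketch below implements both in plain Python under that assumption; the sample data, the bin edges, and the thresholds are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def categorical_distribution(values: Sequence[str]) -> Dict[str, float]:
    """Share of instances for each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def l_infinity_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute difference in share across all categories."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized histogram; out-of-range values are clamped into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with log base 2, bounded in [0, 1]."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Categorical feature: compare training data with serving data (made-up values).
training_country = ["US"] * 6 + ["DE"] * 3 + ["IN"] * 1
serving_country = ["US"] * 3 + ["DE"] * 3 + ["IN"] * 4
country_distance = l_infinity_distance(
    categorical_distribution(training_country),
    categorical_distribution(serving_country),
)

# Numerical feature: bucket both samples with the same bin edges first.
edges = [20, 30, 40, 50, 60, 70]
training_age = [22, 25, 31, 38, 41, 45, 52, 58]
serving_age = [41, 47, 52, 55, 58, 61, 64, 69]
age_divergence = jensen_shannon_divergence(
    histogram(training_age, edges), histogram(serving_age, edges)
)

THRESHOLD = 0.3  # hypothetical alerting threshold
print(f"country distance: {country_distance:.3f}")
print(f"age divergence:   {age_divergence:.3f}, alert: {age_divergence > THRESHOLD}")
```

The same functions cover both cases described above: comparing serving data against training data measures training-serving skew, while comparing two serving windows from different points in time measures prediction drift.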
Vertex AI exports model prediction metrics to Cloud Monitoring. Customers can use Cloud Monitoring to create dashboards or configure alerts based on these metrics. Predictions per second, prediction error percentage, and total latency are some of the metrics captured by model monitoring.
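As a rough illustration of what these metrics measure, the sketch below derives them from a small log of prediction requests. It does not call Cloud Monitoring; the `PredictionRecord` class, the `summarize` helper, and the sample records are hypothetical, and latency is reported as a simple mean over the window, which is only one of several ways to aggregate it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PredictionRecord:
    """One served prediction request."""
    timestamp: float   # seconds since the start of the window
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request ended in an error


def summarize(records: List[PredictionRecord], window_seconds: float) -> Dict[str, float]:
    """Aggregate request records into throughput, error rate, and latency."""
    total = len(records)
    errors = sum(1 for r in records if not r.ok)
    return {
        "predictions_per_second": total / window_seconds,
        "error_percentage": 100.0 * errors / total if total else 0.0,
        "mean_latency_ms": sum(r.latency_ms for r in records) / total if total else 0.0,
    }


# Made-up records covering a two-second window.
records = [
    PredictionRecord(timestamp=0.1, latency_ms=20.0, ok=True),
    PredictionRecord(timestamp=0.7, latency_ms=30.0, ok=True),
    PredictionRecord(timestamp=1.2, latency_ms=40.0, ok=False),
    PredictionRecord(timestamp=1.9, latency_ms=50.0, ok=True),
]
summary = summarize(records, window_seconds=2.0)
print(summary)
```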
In the final part of this series, we will take a look at IBM Cloud Pak for Data and Oracle Machine Learning. Stay tuned.