Prisma Cloud from Palo Alto Networks is sponsoring our coverage of AWS re:Invent 2021.
Launched at the company’s re:Invent 2021 user conference earlier this month, Amazon Web Services’ Amazon SageMaker Serverless Inference is a new inference option for deploying machine learning models without configuring and managing the underlying compute infrastructure. It brings some of the attributes of serverless computing, such as scale-to-zero and consumption-based pricing.
With serverless inference, SageMaker automatically launches additional instances based on request concurrency and the utilization of existing compute resources. The fundamental difference between serverless inference and the other mechanisms is how the compute infrastructure is provisioned, scaled, and managed: you don’t need to choose an instance type or define minimum and maximum capacity.
Amazon SageMaker Serverless Inference joins existing deployment mechanisms, including real-time inference, elastic inference, and asynchronous inference.
The Workflow of Deploying Models in SageMaker
At a high level, there are four steps involved in deploying models in SageMaker. Let’s take a look at them.
1) Creating a Model — Whether you trained the model within SageMaker or brought in an external pre-trained model, the first step is to register it with the platform. Amazon SageMaker expects the model artifact to be stored in an S3 bucket. The artifact is a tarball of a TensorFlow SavedModel, a Keras HDF5 file, a PyTorch .pth file, or an ONNX model. The artifact is then combined with a container image containing the pre-configured inference code. SageMaker provides containers for built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. When creating a model, the tarball is uncompressed, and the model artifacts are copied to the /opt/ml/model directory, where the inference code expects to find them. This container image becomes the fundamental unit of deployment for inference.
2) Defining the Endpoint Configuration — Once the model is registered with SageMaker, the next step is to associate it with the hosting environment defined through the endpoint configuration, which acts as the blueprint for the endpoint and may optionally support auto-scaling. Think of the SageMaker endpoint configuration as the launch configuration of an Amazon EC2 auto-scaling group. An endpoint configuration identifies the model and the associated infrastructure: the model variant; an optional Elastic Inference accelerator type, such as ml.eia1.medium or ml.eia2.xlarge; an instance type, such as ml.t2.medium or ml.c5.4xlarge; and the initial number of instances.
3) Creating an Endpoint — While the previous step associated the model with the compute resources (container and instance type), this step creates the actual HTTP(S) endpoint used for invoking the model. Creating an endpoint is as simple as assigning an identifier and pointing it to the endpoint configuration defined in the previous step.
4) Invoking an Endpoint — Once the endpoint is published, it can be invoked using the Python SDK or the AWS CLI. It can also be integrated with AWS Lambda and Amazon API Gateway to expose the model as a standard REST API for clients to consume.
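As a rough sketch of the four steps above, using the AWS SDK for Python (Boto3): the helper functions, naming convention, and the image URI, S3 path, and IAM role you would pass in are all placeholders of our own, not values from the article.

```python
def realtime_variant(model_name, instance_type="ml.t2.medium", count=1):
    """Build the production-variant section of an endpoint configuration."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": count,
    }


def deploy(model_name, image_uri, model_data_url, role_arn):
    """Steps 1-3: register the model, define the config, create the endpoint."""
    import boto3  # imported here so the helper above works without AWS access

    sm = boto3.client("sagemaker")
    sm.create_model(  # 1) model = artifact tarball in S3 + inference container
        ModelName=model_name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    sm.create_endpoint_config(  # 2) the blueprint for the endpoint
        EndpointConfigName=f"{model_name}-config",
        ProductionVariants=[realtime_variant(model_name)],
    )
    sm.create_endpoint(  # 3) the actual HTTP(S) endpoint
        EndpointName=f"{model_name}-endpoint",
        EndpointConfigName=f"{model_name}-config",
    )


def invoke(endpoint_name, payload):
    """Step 4: call the endpoint once its status is InService."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    return response["Body"].read()
```

In practice you would also wait for the endpoint status to become InService (for example, with Boto3’s built-in waiter) before invoking it.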
Changes with Serverless Inference
Luckily, the workflow doesn’t change when switching between the conventional real-time inference endpoint and the new serverless inference endpoint. The key difference comes in the second step of the workflow, where we define the endpoint configuration.
Instead of manually selecting an instance type, we let SageMaker pick the best compute resources for us. The selection is based on the memory size specified in the endpoint configuration. A serverless endpoint has a minimum memory size of 1024 MB (1 GB) and a maximum of 6144 MB (6 GB); the allowed values are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, and 6144 MB. Serverless inference assigns compute resources proportional to the memory you select: larger memory sizes result in more vCPUs for the container.
You can choose the endpoint’s memory size based on the model size. The rule of thumb is that the memory size should be at least as large as your model.
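That rule of thumb can be expressed as a small helper that picks the smallest allowed memory tier that fits the model. The function is our own illustration, not part of the SageMaker API:

```python
# Allowed memory sizes for a serverless endpoint, in MB.
MEMORY_TIERS = (1024, 2048, 3072, 4096, 5120, 6144)


def pick_memory_size(model_size_mb):
    """Return the smallest memory tier at least as large as the model."""
    for tier in MEMORY_TIERS:
        if model_size_mb <= tier:
            return tier
    raise ValueError("model exceeds the 6144 MB serverless maximum")
```

For example, a 2.5 GB model would land on the 3072 MB tier.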
The other parameter that significantly affects compute resource allocation is concurrency. Serverless endpoints have a quota for how many invocations can be processed simultaneously. If the endpoint is invoked again before it finishes processing the first request, it handles the second request concurrently.
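In Boto3 terms, the memory size and maximum concurrency go into the ServerlessConfig block of a production variant, which takes the place of the instance type and instance count used by real-time endpoints. A minimal sketch, with placeholder model and endpoint names of our own:

```python
def serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Production variant for a serverless endpoint configuration."""
    if memory_mb not in (1024, 2048, 3072, 4096, 5120, 6144):
        raise ValueError("memory_mb must be one of the allowed tiers")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        # Replaces InstanceType / InitialInstanceCount for serverless endpoints.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


def create_serverless_endpoint(model_name):
    import boto3  # imported here so the helper above works without AWS access

    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(
        EndpointConfigName=f"{model_name}-serverless-config",
        ProductionVariants=[serverless_variant(model_name)],
    )
    sm.create_endpoint(
        EndpointName=f"{model_name}-serverless",
        EndpointConfigName=f"{model_name}-serverless-config",
    )
```

The rest of the workflow, including invocation, is unchanged from the real-time case.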
Like other serverless environments, SageMaker serverless inference endpoints suffer from the latency of cold starts. Because compute resources are provisioned on demand, an endpoint that has not received traffic for a while must spin up fresh resources when new requests arrive. A cold start can also occur when concurrent requests exceed the number the endpoint is currently serving. The cold start time depends on the model size, how long it takes to download the model, and the start-up time of the container with the inference code.
SageMaker serverless inference endpoints cannot be managed through the Amazon SageMaker Python SDK during the preview. However, you can use the AWS SDK for Python (Boto3), for example from a Jupyter notebook, to automate the creation of endpoints.
SageMaker serverless inference is available in preview in US East (Northern Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Tokyo), and Asia Pacific (Sydney).
In the next part of this series, we will look at the steps involved in publishing a SageMaker serverless inference endpoint for a TensorFlow model. Tune in tomorrow for the next installment.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.
Amazon Web Services is a sponsor of The New Stack.
Feature image by congerdesign from Pixabay.