When we think of machine learning, what comes to mind are datasets, algorithms, deep learning frameworks, and the training of neural networks. While these play an important role in the lifecycle of a model, there is more to it. The most crucial step in a typical machine learning operations (MLOps) implementation is deploying and monitoring models, yet it is often an afterthought.
A common misconception is that deploying models is as simple as wrapping them in a Flask or Django API layer and exposing them through a REST endpoint. Unfortunately, this is not the most scalable or efficient approach to operationalizing ML models. We need a robust infrastructure for managing the deployment and inference of the models.
With containers becoming the de facto standard for deploying modern applications, the infrastructure for serving models should integrate well with the cloud native platforms such as Kubernetes and Prometheus.
What Is a Model Server?
If you have consumed cloud-based AI services such as Amazon Rekognition, Azure Cognitive Services, and Google Cloud AI Services, you will appreciate the simplicity and convenience of those APIs. Simply put, a model server lets you build a similar platform to deliver inference as a service.
A model server is to machine learning models what an application server is to binaries. Just like an application server provides the runtime and deployment services for WAR/JAR files, DLLs, and executables, a model server provides the runtime context for machine learning and deep learning models. It then exposes the deployed models as REST/gRPC endpoints.
Since a model server effectively decouples the inference code from the model artifact, it scales better than a self-hosted Flask or Django web API. This decoupling enables MLOps engineers to deploy new versions of the model without changing the client inference code.
TensorFlow Serving, TorchServe, Multi Model Server, OpenVINO Model Server, Triton Inference Server, BentoML, Seldon Core, and KServe are some of the most popular model servers. Though some of them were designed for a specific framework or runtime, their architecture is extensible enough to support multiple machine learning and deep learning frameworks.
Model Server Architecture
A typical model server loads the model artifacts and dependencies from a centralized location which could be a shared filesystem or an object storage bucket. It then associates the model with the corresponding runtime environment such as TensorFlow or PyTorch before exposing it as a REST/gRPC endpoint. The model server also captures the metrics related to API invocation and inference output. These metrics are useful for monitoring the performance of each model and also the health of the overall model serving infrastructure.
Let’s take a look at each of the components of a model server:
The client is a web, desktop, or mobile application that consumes the model exposed by the model server through APIs. Any client capable of making an HTTP request can interact with the model server. For performance and scalability, clients can use the gRPC endpoint instead of REST. Model servers also publish client SDKs that simplify the integration of ML APIs with applications.
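To make the client side concrete, here is a minimal sketch of how a client might prepare a REST prediction request using only the Python standard library. The URL layout follows the convention used by TensorFlow Serving's REST API; other model servers use different paths and payloads. The host, port, model name, and the helper `build_predict_request` are assumptions made for illustration.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances, version=None):
    """Build (but do not send) a REST prediction request.

    The path follows the TensorFlow Serving convention:
    /v1/models/<name>[/versions/<n>]:predict
    """
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin the request to one model version instead of the server default.
        path += f"/versions/{version}"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}:predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoint and model name.
    req = build_predict_request("http://localhost:8501", "resnet", [[1.0, 2.0, 3.0]])
    print(req.get_method(), req.full_url)
    # To call a running server, send the request:
    # with urllib.request.urlopen(req, timeout=5) as resp:
    #     print(json.load(resp))
```

Because the model name and version travel in the URL, the client code stays the same when a new version of the model is deployed behind the endpoint.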
The model server is responsible for loading the models, reading the associated metadata, then instantiating the endpoints. It routes the client requests to an appropriate version of the model. The most important function of a model server is to efficiently manage the compute resources by dynamically mapping and unmapping the active models. For example, the model server may load and unload a model from the GPU depending on the request queue length and the frequency of invocation. This technique makes it possible to utilize the same GPU for multiple models without locking the resources.
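The idea of loading and unloading models on demand can be sketched as a small least-recently-used cache. This is a toy illustration, not how any particular model server is implemented: production servers also weigh queue length, invocation frequency, and memory footprint, while this sketch considers recency only. The class `ModelCache` and its `loader` callback are hypothetical names.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache standing in for a model server's residency manager."""

    def __init__(self, loader, capacity):
        self._loader = loader            # callable: model name -> loaded model
        self._capacity = capacity        # max models resident at the same time
        self._resident = OrderedDict()   # name -> model, least recently used first
        self.evicted = []                # names unloaded so far, for inspection

    def get(self, name):
        """Return the model, loading it (and evicting another) if needed."""
        if name in self._resident:
            # Mark as most recently used.
            self._resident.move_to_end(name)
            return self._resident[name]
        if len(self._resident) >= self._capacity:
            # Unload the least recently used model to free the slot.
            old_name, _ = self._resident.popitem(last=False)
            self.evicted.append(old_name)
        model = self._loader(name)
        self._resident[name] = model
        return model

    def resident(self):
        """Names of currently loaded models, least recently used first."""
        return list(self._resident)
```

With a capacity of two, requesting a third model unloads whichever of the first two was used least recently, which is how a single GPU can be shared across more models than fit in memory at once.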
A model server may support one or more frameworks and runtimes. Its pluggable architecture makes it possible to bring new frameworks and runtimes into the stack. For example, Nvidia’s Triton Inference Server supports multiple frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, XGBoost, and Scikit-learn.
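One simple way to picture a pluggable architecture is a registry that maps a framework name to a loader function. The sketch below is an assumption made for illustration and does not mirror the backend API of Triton or any other server; the names `BACKENDS`, `register_backend`, and `load_model`, and the two stand-in "frameworks", are hypothetical.

```python
from typing import Callable, Dict

# Registry mapping a framework name to a function that loads a model artifact
# and returns a callable that runs inference.
BACKENDS: Dict[str, Callable[[str], Callable]] = {}


def register_backend(framework):
    """Decorator that plugs a loader into the registry under a framework name."""
    def decorator(load_fn):
        BACKENDS[framework] = load_fn
        return load_fn
    return decorator


@register_backend("echo")
def load_echo(artifact_path):
    # Stand-in framework: the "model" returns its input unchanged.
    return lambda inputs: inputs


@register_backend("sum")
def load_sum(artifact_path):
    # Stand-in framework: the "model" adds up its inputs.
    return lambda inputs: sum(inputs)


def load_model(framework, artifact_path):
    """Look up the backend for a framework and load the artifact with it."""
    try:
        loader = BACKENDS[framework]
    except KeyError:
        raise ValueError(f"no backend registered for {framework!r}") from None
    return loader(artifact_path)
```

Adding support for a new framework then means registering one more loader, without touching the code that routes requests.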
The model registry is a centralized persistent layer to store model artifacts and binaries. It is accessed by the model server to load a specific version of the model requested by a client. A model registry may store multiple models and multiple versions of the same model. Each model also contains additional metadata describing the runtime requirements, input parameters, data types, and output parameters. It may optionally include a text/JSON file with the labels that can be used to associate the inference output with a meaningful label.
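As a rough sketch of how a server might pick a model version from the registry, assume a layout of `<registry>/<model name>/<numeric version>/`, a convention shared by the model repositories of TensorFlow Serving and Triton. The function `resolve_model_version` is a hypothetical helper, and it works on a local directory only; an object storage bucket would need the equivalent listing call of its client library.

```python
from pathlib import Path


def resolve_model_version(registry_root, model_name, requested=None):
    """Return the directory of the requested version, or of the highest one.

    Assumes each version lives in a subdirectory whose name is an integer.
    """
    model_dir = Path(registry_root) / model_name
    # Map version number -> directory, ignoring anything non-numeric.
    versions = {
        int(p.name): p
        for p in model_dir.iterdir()
        if p.is_dir() and p.name.isdigit()
    }
    if not versions:
        raise FileNotFoundError(f"no versions found for model {model_name!r}")
    # Default to the latest version; compare numerically so 10 > 2.
    chosen = max(versions) if requested is None else int(requested)
    if chosen not in versions:
        raise FileNotFoundError(f"version {chosen} of {model_name!r} not found")
    return versions[chosen]
```

Comparing versions as integers rather than strings matters here: a lexical sort would rank version 2 above version 10.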
Though the model registry could be a directory on the filesystem, an object storage bucket is preferred. When multiple instances of the model server are run, a shared object storage layer serves them better than a local filesystem.
For a detailed explanation and a step-by-step tutorial, refer to my guide on using MinIO as the model store for Nvidia Triton Inference Server running on Kubernetes.
The model server exposes a metrics endpoint that can be scraped by a metrics server such as Prometheus. Apart from monitoring the health of the model serving infrastructure, the metrics service can be used for tracking API metrics such as the number of concurrent requests, the current request queue length, and latency.
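To show what a scrape of such an endpoint returns, the sketch below renders a few metrics in the Prometheus text exposition format using only the standard library. The metric names and labels are hypothetical, and the helper `render_metrics` skips details such as label value escaping; a real service would more likely rely on an official Prometheus client library.

```python
def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format.

    `metrics` is a list of (name, type, help_text, samples) tuples, where
    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = []
    for name, metric_type, help_text, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metrics for one model version.
    print(render_metrics([
        ("inference_requests_total", "counter", "Total inference requests.",
         [({"model": "resnet", "version": "2"}, 42)]),
        ("inference_queue_length", "gauge", "Requests waiting in the queue.",
         [({}, 3)]),
    ]), end="")
```

Labelling each sample with the model name and version is what lets the metrics server break down request counts and latency per model rather than per server.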
In the upcoming articles, we will take a closer look at some of the open source and commercial model servers available in the market. Stay tuned.
AWS Cloud is a sponsor of The New Stack.
Feature Image by Peter H from Pixabay.