MLOps Needs a Better Way to Manage GPUs

GPUs are a necessity for deep learning and other large-scale forms of machine learning, yet we don’t have the tools to manage them as effectively as we can regular CPUs. And with GPU prices being what they are these days, you want to make sure you get the most for your money.
Two Run.AI software engineers — Natasha Romm and Raz Rotenberg (Software Team Lead) — have been investigating ways to improve GPU utilization. They presented their findings at Kubernetes AI Day, part of the KubeCon+CloudNativeCon conference held last week in Detroit.
“GPUs must be provisioned in a smarter way,” Rotenberg said.
Today, GPUs are allocated statically and with little nuance, usually per user or per AI workload. What is needed is finer-grained allocation, fractions of a GPU rather than whole devices, so that capacity can be shared across tasks more effectively.
“GPU provisioning is not a term we use on a daily basis,” Rotenberg admitted. As more departments start using GPUs, the admin may put in a request for more hardware. But the existing GPUs are probably way underutilized. “The truth is, you don’t always need more GPUs,” he said.
Admins should be able to overprovision, assigning more workloads than there are GPUs to run them. Overprovisioning is routinely done with CPUs, memory and even storage, but Kubernetes can’t overprovision GPUs.
Most users don’t require an entire GPU for their work. And much of the time, they probably aren’t running the GPUs at all. The researchers may be off on a coffee break, or even away on holiday.
Kubernetes can allocate GPUs dynamically. However, it allocates one whole GPU per pod. Once a GPU is assigned to a pod, it can’t be used elsewhere, even if the GPU itself is sitting idle.
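To see the difference in Kubernetes terms, here is a minimal sketch of a pod spec (the pod name and image are illustrative). CPU and memory requests can be set below their limits, so a node can be overcommitted, but the nvidia.com/gpu resource exposed by Nvidia’s device plugin is handed out only in whole units and can’t be overcommitted:

    # Illustrative only: CPU/memory can be overcommitted (request < limit),
    # while the GPU is allocated as a whole device.
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-demo                 # hypothetical name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          requests:
            cpu: "1"                 # requests below limits let the node be overcommitted
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: 1        # whole GPU only; no fractions, no overcommit
    EOF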
Tools for Better Management
The Run.AI engineers have created an open source utility called genv (GPU Environment Management), which can be used to control, configure and monitor GPU resources. The tool works for workloads running directly on bare metal, or on machines accessed over SSH.
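As a rough sketch of what that looks like in practice, a genv session on a shared GPU server goes something like the following; the exact commands and flags may differ between versions, so check the project’s README at github.com/run-ai/genv for the current syntax:

    pip install genv        # assumes the CLI is installed from PyPI under this name
    genv activate           # turn the current shell into a GPU "environment"
    genv config gpus 1      # declare how many GPUs this environment should get
    genv attach             # reserve a device for the environment
    genv devices            # list devices and which environments hold them

Behind the scenes, genv steers workloads to their reserved devices largely by managing environment variables such as CUDA_VISIBLE_DEVICES, which is why it works for plain shells and SSH sessions without any cluster machinery.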
For Kubernetes deployments, take a look at Nvidia’s DCGM Exporter, which exposes operational metrics from Nvidia GPUs. It can be run as a standalone container or deployed as a DaemonSet on GPU nodes in a Kubernetes cluster. It is usually deployed by Nvidia’s GPU Operator, so if you use that operator, you probably already have this capability built in.
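For reference, the standalone-container route and the DaemonSet route look roughly like this; the image tag placeholder and the Helm repository follow Nvidia’s published instructions, but verify current versions before using them:

    # Run dcgm-exporter directly on a GPU host; it serves Prometheus metrics on port 9400.
    docker run -d --rm --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
    curl localhost:9400/metrics

    # Or deploy it as a DaemonSet on GPU nodes via Nvidia's Helm chart.
    helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
    helm repo update
    helm install dcgm-exporter gpu-helm-charts/dcgm-exporter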
To build GPU-monitoring dashboards, the Run.AI folks also use Prometheus and Grafana on top of the Nvidia exporter. The resulting dashboard shows GPU usage as a percentage.
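The underlying metric, assuming the exporter’s default names (labels can vary by version and deployment), is the DCGM_FI_DEV_GPU_UTIL gauge, which already reports utilization on a 0 to 100 scale. A query like the following can back a Grafana panel or be run ad hoc against the Prometheus HTTP API (the hostname and port here are assumptions):

    # Average utilization per GPU, per node.
    curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL)'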
With this information, an administrator can approach the owners of the GPUs — identifiable by the Kubernetes namespace — and ask them to relinquish the GPUs sitting at 0% utilization.
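One rough way to produce that list is to look for GPUs whose utilization has stayed at zero over a longer window; when dcgm-exporter runs with its Kubernetes pod mapping enabled, the resulting series carry namespace and pod labels that point at the owner (again, exact label names depend on the exporter version):

    # GPUs that have averaged 0% utilization over the past day,
    # returned with the namespace/pod labels of whoever holds them.
    curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[1d]) == 0'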
Further Down the Road
Genv was created as a way to introduce AI users to the idea of better managing GPUs. The next steps are smart utilization and smart provisioning, Romm said.
Run.AI has built an orchestration layer for AI resources, aimed at managing GPUs in particular.
“We help organizations get more out of their expensive hardware using smart scheduling algorithms and deep core capabilities,” Rotenberg wrote in a follow-up e-mail.
The company engineers have built Linux-level capabilities “that allow better management of GPUs, such as memory limitation, memory swapping, time-sharing and prioritization, rerouting running pods to idle GPUs,” he wrote. The Linux-level capabilities are integrated into the Kubernetes-level layer, which provides the scheduling and other key areas of support.
The Run.AI platform also offers features for managing AI workloads by creating projects and departments, as well as for managing users and enforcing more sophisticated quotas.