HPC Kubernetes: AI Training on 3,500 GPUs
But with such a premium being put on GPUs for large-scale machine learning these days, Kubernetes could provide a more dynamic way of managing vast fleets of GPUs, with a little help from tools that originated in the HPC space.
One cloud provider showing what can be done is CoreWeave, which specializes in accelerating GPU workloads.
In June, the company aced round three of MLCommons' MLPerf, a benchmark for measuring and comparing system performance on training and inference tasks. CoreWeave spun up a cluster of 3,500 (recently released) Nvidia H100 GPUs that trounced other Kubernetes clusters by up to a factor of 29.
“Building an ecosystem around Kubernetes makes it very easy for us to plug in new things. And get metrics out without having to build a bunch of glue between proprietary systems and Kubernetes itself,” Salanki said.
Kubernetes on Bare Metal
All the GPUs were located in a single data center, with each server housing eight GPUs on an Intel Sapphire Rapids platform. They were all tethered by 400 miles of InfiniBand fiber (for the lowest possible interconnect latency) and 40,000 connections.
That scale is important to note because large ML workloads, the kind MLPerf is meant to model, can span all the available GPUs for maximum performance. But if any one of these components goes down, the whole job must be restarted from the last checkpoint.
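To make the cost of a single failure concrete, here is a minimal Python sketch of checkpoint-and-resume. It is an illustration only, not CoreWeave's or MLPerf's code: the `train` function, the `fail_at` parameter and the step counts are all hypothetical.

```python
def train(total_steps, checkpoint_every, fail_at=None, start_step=0):
    """Simulate a training loop that saves a checkpoint every N steps.

    Returns (last_checkpoint_step, completed). If a simulated failure
    hits at step `fail_at`, work since the last checkpoint is lost.
    """
    last_checkpoint = start_step
    for step in range(start_step + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            # Any single component failure stops the whole job.
            return last_checkpoint, False
        if step % checkpoint_every == 0:
            last_checkpoint = step  # a real job would write model state here
    return total_steps, True


# First attempt fails at step 250; the last checkpoint was at step 200.
resume_from, done = train(total_steps=1000, checkpoint_every=100, fail_at=250)
lost_steps = 250 - 1 - resume_from  # steps 201..249 must be redone

# Second attempt resumes from the checkpoint and finishes.
final_step, done_after_retry = train(1000, 100, start_step=resume_from)
```

In this toy run the job loses only the steps since its last checkpoint, but on a cluster of thousands of GPUs every one of them sits idle while the job restarts, which is why node and fabric health matter so much.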
“Any individual failure can be catastrophic to a job,” Salanki said. “So ensuring that your nodes are healthy and your entire fabric is healthy. That is critical to not lose performance.”
Everything is booted statelessly — the servers do not have any operating systems on them.
“The systems are delivered without any OS. We don’t want them to come with any OS from a vendor because things change constantly. We have new kernels to deploy and new CPUs, so we can’t really expect anything that is preloaded in the factory to work,” Salanki said.
When booted, the data processing unit (DPU) downloads a trimmed Ubuntu image with little more than GPU and InfiniBand drivers, and a kubelet. It then asks for a join token and joins a Kubernetes cluster. (The DPU also provides VPC isolation for each workload, to support a multitenant environment.)
“Everything is stateless,” Salanki said. “It’s fully ephemeral, which means we can plug in new nodes and get them up and running on a Kubernetes cluster immediately.”
Kubernetes as the System of Record
Kubernetes serves as the system of record for each cluster, Salanki noted. Every action that happens is logged. All the performance metrics are captured.
In this setup, the Kubernetes API server is central. “Every action flows through Kubernetes. There is no path that does not go through Kubernetes,” he said. An admin who wants to reboot a node sets a condition on the node, which triggers a reboot by the node controller. The whole flow is captured by event logging.
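The condition-driven flow described above can be sketched as a small reconcile loop. This is a pure-Python simulation under stated assumptions, not CoreWeave's controller: the `Node` class, the `reconcile` function and the `RebootRequested` condition name are all hypothetical, and a real controller would watch the Kubernetes API server rather than an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    conditions: dict = field(default_factory=dict)  # condition type -> status


def reconcile(node, events):
    """One controller pass: act on the node's declared conditions.

    `RebootRequested` is a made-up condition type for this sketch.
    Every action is appended to `events`, mimicking event logging.
    """
    if node.conditions.get("RebootRequested") == "True":
        events.append(f"{node.name}: reboot triggered")
        # A real controller would reboot the machine here.
        node.conditions["RebootRequested"] = "False"
        events.append(f"{node.name}: reboot complete")
        return True
    return False


events = []
node = Node("gpu-node-01")

# The admin never touches the machine directly; they only set a condition.
node.conditions["RebootRequested"] = "True"

acted = reconcile(node, events)        # controller sees the condition and acts
acted_again = reconcile(node, events)  # nothing left to do on the second pass
```

The point of the pattern is that the request, the action and the outcome all live in one place, so the audit trail comes for free.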
“By centralizing the entire management flow on Kubernetes, we can get a lot of stuff for free,” including a programming model that many developers already know, he said.
Slurm on Kubernetes
To run MLPerf, CoreWeave used Slurm, an HPC scheduler that is well known to researchers, though rarely used in a K8s environment.
So the company created a Helm chart for scheduling Slurm on Kubernetes (SUNK), which will be released as open source in early 2024. All the Slurm components are containerized, including the daemon, controllers and logging nodes.
With SUNK, Slurm acts as a plug-in scheduler for Kubernetes. On the same cluster, a training job could run on Slurm alongside long-running production inference workloads, which are handled more effectively by Kubernetes itself and could even preempt Slurm jobs.
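The preemption behavior can be illustrated with a toy priority scheduler. This is a sketch under stated assumptions rather than SUNK's actual logic: the `Job` class, the `schedule` function, the priority values and the GPU counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int
    priority: int  # higher number wins; values here are arbitrary


def schedule(running, incoming, capacity):
    """Admit `incoming`, evicting lower-priority jobs if GPUs run short.

    Returns (new_running_list, preempted_list). If the job cannot fit
    even after evicting every lower-priority job, nothing changes.
    """
    free = capacity - sum(j.gpus for j in running)
    # Candidates for eviction, lowest priority first.
    victims = sorted(
        (j for j in running if j.priority < incoming.priority),
        key=lambda j: j.priority,
    )
    preempted = []
    for victim in victims:
        if free >= incoming.gpus:
            break
        preempted.append(victim)
        free += victim.gpus
    if free < incoming.gpus:
        return running, []  # cannot fit; leave the cluster untouched
    kept = [j for j in running if j not in preempted]
    return kept + [incoming], preempted


training = Job("slurm-training", gpus=8, priority=1)
inference = Job("k8s-inference", gpus=4, priority=10)

running, preempted = schedule([training], inference, capacity=8)
```

Here the higher-priority inference job displaces the training job on a fully occupied eight-GPU node. A preempted training job would then restart from its last checkpoint once capacity frees up.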
In his talk, Salanki also went into detail about the two node controllers, node testing and automatic remediation for failures.