
HPC Kubernetes: AI Training on 3,500 GPUs

K8s brings many advantages to managing fleets of GPUs, said CoreWeave's Peter Salanki, during a talk at KubeCon+CloudNativeCon 2023.
Dec 4th, 2023 10:20am
“By centralizing the entire management flow on Kubernetes, we can get a lot of stuff for free,” — Peter Salanki, CoreWeave

To date, Kubernetes has largely steered clear of the high-performance computing (HPC), or supercomputing, space.

But with such a premium being put on GPUs for training large machine learning models these days, Kubernetes could provide a more dynamic way to manage vast fleets of GPUs, with a little help from tools that originated in the HPC space.

One cloud provider showing what can be done is CoreWeave, which specializes in accelerating GPU workloads.

In June, the company aced round three of MLCommons' MLPerf, a benchmark suite for measuring and comparing system performance on training and inference tasks. CoreWeave spun up a cluster of 3,500 recently released Nvidia H100 GPUs that trounced other Kubernetes clusters by up to a factor of 29.

Unlike traditional HPC systems, CoreWeave does not run services directly on bare metal but rather runs Kubernetes over the bare metal.

K8s brings many advantages to managing GPUs, said Peter Salanki, CoreWeave director of engineering, during a talk at KubeCon+CloudNativeCon 2023.

“Building an ecosystem around Kubernetes makes it very easy for us to plug in new things. And get metrics out without having to build a bunch of glue between proprietary systems and Kubernetes itself,” Salanki said.

Kubernetes on Bare Metal

All the GPUs were located in a single data center: each server houses eight GPUs on an Intel Sapphire Rapids platform. They were all tethered by 400 miles of InfiniBand fiber (for the lowest possible interconnect latency) and 40,000 connections.

That number is important because large ML workloads, which MLPerf models, can span all the available GPUs for maximum performance. But if any one of these components goes down, the whole job must be restarted from the last checkpoint.

“Any individual failure can be catastrophic to a job,” Salanki said. “So ensuring that your nodes are healthy and your entire fabric is healthy. That is critical to not lose performance.”

Everything is booted statelessly — the servers do not have any operating systems on them.

“The systems are delivered without any OS. We don’t want them to come with any OS from a vendor because things change constantly. We have new kernels to deploy and new CPUs, so we can’t really expect anything that is preloaded in the factory to work,” Salanki said.

Each server comes with an Nvidia BlueField Data Processing Unit (DPU), a processor on a network card that is also managed by Kubernetes.

When booted, the DPU downloads a trimmed Ubuntu image containing little more than GPU and InfiniBand drivers and a kubelet. It then requests a join token and joins a Kubernetes cluster. (The DPU also provides VPC isolation for each workload, to support a multitenant environment.)
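The talk did not specify the exact join mechanism, but the standard Kubernetes way to let a freshly netbooted node authenticate itself is a short-lived bootstrap token. As a sketch, such a token is a Secret of type `bootstrap.kubernetes.io/token` in the `kube-system` namespace (the token values and expiration below are placeholders):

```yaml
# Hypothetical sketch: a bootstrap token a stateless node could use to
# join the cluster. This is the stock Kubernetes mechanism, not
# necessarily CoreWeave's exact flow.
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-abcdef      # must be bootstrap-token-<token-id>
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  token-id: abcdef                  # placeholder token ID
  token-secret: 0123456789abcdef    # placeholder secret
  expiration: "2023-12-31T00:00:00Z"
  usage-bootstrap-authentication: "true"  # allow kubelet to authenticate
  usage-bootstrap-signing: "true"         # allow signing cluster-info
```

Because the token expires and the node holds no other state, replacing or reimaging a server is just a matter of booting it again with a fresh token.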

“Everything is stateless,” Salanki said. “It’s fully ephemeral, which means we can plug in your nodes and get them up and running on a Kubernetes cluster immediately.”

Kubernetes as the System of Record

Kubernetes serves as the system of record for each cluster, Salanki noted. Every action that happens is logged. All the performance metrics are captured.

In this setup, the Kubernetes API server is central. “Every action flows through Kubernetes. There is no path that does not go through Kubernetes,” he said. An admin who wants to reboot a node sets a condition on the node, which triggers a reboot by the node controller. The whole flow is captured by event logging.
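The condition-driven flow described above can be pictured as a custom condition in the node's status that a controller watches for. The condition name and node name below are assumptions for illustration, not CoreWeave's actual identifiers:

```yaml
# Hypothetical sketch: an admin-set node condition that a node controller
# could watch and act on by rebooting the machine. "RebootRequired" is an
# assumed condition type, not a documented CoreWeave name.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-042                # assumed node name
status:
  conditions:
  - type: RebootRequired            # custom condition the controller acts on
    status: "True"
    reason: AdminRequested
    message: Scheduled reboot for kernel upgrade
    lastTransitionTime: "2023-11-07T12:00:00Z"
```

Because both the condition change and the resulting reboot surface as Kubernetes events, the audit trail comes along for free, which is exactly the "system of record" property Salanki described.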

“By centralizing the entire management flow on Kubernetes, we can get a lot of stuff for free,” including a programming model that many developers already know, he said.

Slurm on Kubernetes

To run MLPerf, CoreWeave used Slurm, a scheduler well known to researchers in the HPC space, though rarely used in a K8s environment.

So the company created a Helm chart for scheduling Slurm on Kubernetes (SUNK), which it plans to release as open source in early 2024. All the Slurm components are containerized, including the daemons, controllers and login nodes.

With SUNK, Slurm acts as a plug-in scheduler for Kubernetes. On the same cluster, a training job can run under Slurm alongside long-running production inference workloads, which Kubernetes itself handles more effectively and which can even preempt Slurm jobs.
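Kubernetes supports plug-in schedulers through the pod's `schedulerName` field, so a training pod could opt into Slurm scheduling along these lines. The scheduler name, image and pod name below are assumptions for illustration, not SUNK's documented values:

```yaml
# Hypothetical sketch: routing a training pod to a Slurm-backed scheduler
# via the standard schedulerName mechanism. "slurm" is an assumed
# scheduler name, and the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: mlperf-train-rank0
spec:
  schedulerName: slurm              # send this pod to the plug-in scheduler
  containers:
  - name: trainer
    image: registry.example.com/mlperf-train:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8           # one full eight-GPU server per pod
```

Inference pods that omit `schedulerName` fall through to the default Kubernetes scheduler, which is what allows the two workload types to coexist on one cluster.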

In his talk, Salanki also went into detail about the two node controllers, node testing and automatic remediation of failures. Here is the full talk:
