CUDA 12 Harnesses Nvidia’s Speedier GPU Architecture
GPU maker Nvidia will soon release the next version of the CUDA parallel-programming framework, version 12, to accompany the release of its new GPU architecture code-named Hopper.
“It’s the biggest release we’ve ever done,” said Stephen Jones, CUDA architect at Nvidia, during a breakout session at Nvidia’s GPU Technology Conference, which was held virtually earlier this month.
CUDA debuted in June 2007 as a way to program Nvidia’s graphics chips for general-purpose computing. The framework is currently on version 11.7, with one more update, version 11.8, due before the move to version 12.
Jones didn’t provide an exact ship date for CUDA 12, but past release timelines suggest version 12 will be available for download either late this year or early next year.
Nvidia typically releases a new version of CUDA with every new GPU architecture. This is the first time in two years that CUDA users will experience a major version change.
GPUs were initially popular for graphics, but the chips’ ability to compute in parallel planted the seed for Nvidia’s hardware to be used in non-graphics applications. Today, Nvidia’s GPUs dominate the market as accelerators for AI, simulation, graphics and supercomputing. But the proprietary CUDA parallel-programming model works best only on Nvidia’s GPUs, which effectively ties customers to the company’s hardware.
Nvidia is now trying to shift gears to grow in the software business by selling AI software applications developed in CUDA. The company sees a $1 trillion market opportunity in software, with CUDA-based applications going into self-driving cars, robots, medical devices, and other AI systems.
A typical CUDA program has a GPU code section, which includes the code for execution on graphics cores, and a CPU code section, which sets up the execution environment that includes memory allocation and hardware management. CUDA also has a runtime system that includes libraries and a compiler that compiles the code into an executable.
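That division of labor is easiest to see in a minimal program. The sketch below (kernel and variable names are illustrative) has a GPU section — the kernel — and a CPU section in `main()` that allocates memory and launches the work; NVCC compiles both into one executable.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// GPU code section: a kernel that runs in parallel across many threads.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) c[i] = a[i] + b[i];
}

// CPU code section: sets up the execution environment, then hands work to the GPU.
int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));       // memory allocation
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);        // launch on the GPU
    cudaDeviceSynchronize();                        // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```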
CUDA binaries have CPU and GPU sections, and a separate PTX assembly code section, which acts as a backward and, to some extent, forward compatibility layer for all versions of CUDA dating back to the first edition in 2007.
But CUDA 12 applications will break on CUDA 11. Starting with CUDA 11, Nvidia included a compatibility layer so APIs don’t break across minor versions; for example, an application built on CUDA 11.5 will work with CUDA 11.1. But that compatibility layer doesn’t extend across major versions.
“You can’t run CUDA 12 applications, say, on a system with 11.2 installed because API signatures may have changed across a major version,” Jones said, adding: “This means two things. First, you need to care what major version of CUDA is running on your [system], and second, some APIs and data structures will change.”
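One way an application can “care what major version” it is running on is to check at startup. A minimal sketch using the standard `cudaRuntimeGetVersion` and `cudaDriverGetVersion` calls, whose return values encode the version as major × 1000 + minor × 10 (so CUDA 11.2 reports 11020 and CUDA 12.0 reports 12000):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int runtime = 0, driver = 0;
    cudaRuntimeGetVersion(&runtime);   // version of the CUDA runtime linked in
    cudaDriverGetVersion(&driver);     // latest version the installed driver supports
    printf("runtime %d.%d, driver supports up to %d.%d\n",
           runtime / 1000, (runtime % 100) / 10,
           driver / 1000, (driver % 100) / 10);
    if (driver / 1000 < 12)
        printf("CUDA 12 applications cannot run on this system\n");
    return 0;
}
```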
CUDA 12 is specifically tuned to the new GPU architecture called Hopper, which replaces the two-year-old architecture code-named Ampere that CUDA 11 supported. The flagship Hopper-based GPU, called the H100, has been measured at up to five times faster than the previous-generation Ampere flagship, the A100. The speed comes from a host of new features, including higher memory throughput, faster interconnect technology, and faster tensor cores for AI as well as vector and floating-point operations.
Hopper has 132 streaming multiprocessors, PCIe Gen5 support, HBM3 memory, 50MB of L2 cache and the new NVLink interconnect with 900GB/s of bandwidth.
The best performance from Hopper will come only through CUDA 12. Nvidia keeps its hardware and software tightly coupled, and users of Khronos’ OpenCL, AMD’s ROCm and other parallel-programming tools won’t be able to harness the full power of Hopper.
The Hopper H100 GPU focuses on keeping data local and reducing the time it takes to execute code. The H100 has 132 streaming-multiprocessor (SM) units, up from 15 in the Kepler generation of ten years ago. Scaling across the SMs is central to CUDA 12, Jones said.
The CUDA programming model, at its core, asks users to break up work — like processing an image — into blocks, which are organized next to each other in a grid. Each block runs on the GPU as if it were a separate program, and Hopper can run several thousand blocks at once. Each block, working on its own problem, is further broken down into threads.
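For the image example, that hierarchy maps naturally onto pixels and tiles: each block handles one tile of the image and each thread one pixel. A minimal sketch (kernel name and tile size are illustrative):

```cuda
__global__ void brighten(unsigned char* img, int width, int height, int delta) {
    // Grid-block-thread hierarchy: the grid tiles the image with blocks,
    // and each thread inside a block handles one pixel of its tile.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int v = img[y * width + x] + delta;
        img[y * width + x] = v > 255 ? 255 : (unsigned char)v;
    }
}

// Launch: a grid of 16x16-thread blocks covering the whole image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height, 40);
```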
Nvidia has extended that grid-block-thread hierarchy with a new layer called the “thread block cluster.” A thread block cluster is, in effect, an interconnected mini-grid of blocks that sits between the block and the grid, and the clusters together make up the larger grid. The change is driven by the GPU’s massive scale: “We’ve taken the concept of a grid made up of wholly independent blocks of work” a step further, Jones said.
The SMs are organized into that hierarchy of thread block clusters, which exchange data in a synchronized way. A cluster of 16 blocks can run close to 16,384 threads simultaneously, which is a huge amount of concurrency, Jones said, adding that every block in a cluster can read and write the shared memory of every other block in the cluster.
“What we’ve made is a way to target a localized subset of your grid to a localized set of execution resources that opens up more opportunities for programmability and performance,” Jones said.
The thread block cluster feature comes with new syntax in the programming model that lets developers define the launch size of a cluster and the resources it needs for a task, instead of relying on the CPU to get it right.
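Based on Nvidia’s public CUDA 12 documentation, that syntax shows up as a compile-time `__cluster_dims__` kernel attribute plus a cluster group in the cooperative-groups library; the sketch below assumes a Hopper-class GPU (sm_90) and uses illustrative names and sizes.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Declare in the launch syntax that this kernel runs in clusters of 4 blocks.
__global__ void __cluster_dims__(4, 1, 1) clusterKernel(float* out) {
    cg::cluster_group cluster = cg::this_cluster();

    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;  // fill this block's own shared memory
    cluster.sync();                          // all blocks in the cluster rendezvous

    // Distributed shared memory: any block in the cluster can read and write
    // the shared memory of any other block -- here, block rank 0's tile.
    float* peer = cluster.map_shared_rank(tile, 0);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer[threadIdx.x];
}
```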
Another new Hopper feature is the asynchronous transaction barrier, which reduces the back-and-forth of data for quicker execution of code. The asynchronous transaction barrier works like a waiting room in which threads doze off until the data from other threads arrives to complete a transaction. That reduces the energy, effort and bandwidth required to move data.
“You just say ‘Wake me up when the data has arrived.’ I can have my thread waiting … expecting data from lots of different places and only wake up when it’s all arrived,” Jones said.
In chips, work is commonly broken up into threads, which have to coordinate with one another. With conventional barriers, threads have to track where data is coming from and which source they are synchronizing with; in Hopper, an update is just a single one-sided write operation.
“The asynchronous memory copy knows how many bytes it’s carrying. The barrier knows how many it is expecting. When the data arrives, it just counts itself in. These are one-sided memory copies and they are seven times faster [communication] because they just go one way and don’t have to go back and forth,” Jones said.
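In code, this “wake me up when the data has arrived” pattern is exposed through the libcu++ `cuda::barrier` and `cuda::memcpy_async`: the copy carries its byte count to the barrier and counts itself in. A hedged sketch, with buffer sizes illustrative and assuming one-element-per-thread staging:

```cuda
#include <cuda/barrier>

__global__ void asyncConsume(const float* global_in, float* global_out) {
    __shared__ float staging[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (threadIdx.x == 0) init(&bar, blockDim.x);  // expect one arrival per thread
    __syncthreads();

    // One-sided copy: the barrier tracks the bytes in flight, so the thread
    // doesn't have to watch the source it is synchronizing with.
    cuda::memcpy_async(&staging[threadIdx.x],
                       &global_in[blockIdx.x * blockDim.x + threadIdx.x],
                       sizeof(float), bar);

    bar.arrive_and_wait();  // doze off; wake up when all the data has arrived
    global_out[blockIdx.x * blockDim.x + threadIdx.x] = staging[threadIdx.x] * 2.0f;
}
```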
Hopper also has a new processing unit called the Tensor Memory Accelerator, which the company has classified as a data movement engine. The engine enables bidirectional movement of large data blocks between the global and shared memory hierarchy. The TMA also takes over asynchronous memory copy between thread blocks in a cluster.
“You call [TMA] and it goes off to do the copy, which means the hardware is taking over the job of calculating addresses and strides, checking boundaries, all that kind of stuff. It can cut out a section of data … and just drop it into shared memory or put it back the other way. You don’t have to write a single line of code,” Jones said.
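A portable way to request that kind of hardware-driven bulk copy is the cooperative-groups `memcpy_async` call; whether the TMA actually services it depends on the compiler and the target GPU, so the sketch below (which assumes 1,024-thread blocks) is only a hint to the hardware, not a direct TMA invocation.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void bulkCopy(const float* global_in, float* global_out) {
    __shared__ float tile[1024];
    cg::thread_block block = cg::this_thread_block();

    // Ask for a block-wide asynchronous bulk copy; on hardware with a copy
    // engine such as Hopper's TMA, address calculation and the transfer are
    // offloaded, freeing the threads for other work in the meantime.
    cg::memcpy_async(block, tile, &global_in[blockIdx.x * 1024],
                     sizeof(float) * 1024);
    // ...independent work can overlap with the copy here...
    cg::wait(block);  // block until the tile has landed in shared memory

    global_out[blockIdx.x * 1024 + threadIdx.x] = tile[threadIdx.x];
}
```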
Hopper has new DPX instructions for dynamic programming, a technique that efficiently finds the solution to a large problem by recursively solving overlapping subproblems. This could make CUDA 12 relevant to applications that search over many overlapping partial solutions, such as route mapping or robot path planning.
“It’s very similar to a divide and conquer approach …. except it is overlapping data which is harder to solve,” Jones said.
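The flavor of operation DPX targets can be seen in a single cell update from sequence alignment, a textbook dynamic-programming workload: the best of three overlapping subproblems, clamped at zero. The sketch below is plain CUDA C++ with an illustrative function name; on Hopper, this fused max-and-add pattern is what the DPX instructions accelerate, with the mapping left to the compiler.

```cuda
// One cell of a Smith-Waterman-style recurrence: each cell combines the
// results of three overlapping subproblems (diagonal, up, left) and takes
// the maximum, clamped at zero for local alignment.
__device__ int swCell(int diag, int up, int left, int match, int gap) {
    int best = diag + match;        // extend the diagonal (match or mismatch)
    best = max(best, up + gap);     // open/extend a gap in one sequence
    best = max(best, left + gap);   // open/extend a gap in the other
    return max(best, 0);            // never go negative in local alignment
}
```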
Nvidia has also enhanced the concept of dynamic parallelism, which allows the GPU to launch a new kernel directly without invoking the CPU. “By adding some special mechanisms to the dynamic parallel programming model, we’ve been able to speed up the launch performance by a factor of three,” Jones said.
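Dynamic parallelism looks like an ordinary kernel launch, except it happens inside device code. A minimal sketch (kernel names illustrative; requires compiling with relocatable device code, `nvcc -rdc=true`):

```cuda
#include <cstdio>

__global__ void child(int parentBlock) {
    printf("child launched by block %d, thread %d\n", parentBlock, threadIdx.x);
}

// The parent kernel launches more GPU work directly, with no round trip
// through the CPU to set up and issue the new kernel.
__global__ void parent() {
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(blockIdx.x);  // device-side kernel launch
    }
}
```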
An Nvidia moderator didn’t say whether the dynamic-parallelism enhancements would advance into the OpenMP or OpenACC standards: “Whether it makes its way into the standards as an explicit language feature depends on the committees.”
Nvidia is also actively trying to upstream some features of the CUDA toolkit into standard C++ releases. CUDA has its own compiler, NVCC, which is designed for GPUs, and a Runtime API with a simple C++-like interface that is built on top of a lower-level driver API. GPUs typically have computing elements, like vector processors, that are better suited to applications such as AI.