Nvidia’s CUDA 12 Is Here to Bring out the Animal in GPUs
Programmers will finally be able to harness the full computing power of Nvidia’s latest GPUs, code-named Hopper, with new software tools released by the company on Monday.
Nvidia is now shipping its CUDA 12 programming tools, which are the driving force behind the company’s future in artificial intelligence and graphics. CUDA’s footprint spans most of Nvidia’s hardware and software products.
“CUDA in some ways is an interface to the GPU. But it is an interface which is a programming model, a whole set of tools, and a large array of libraries,” said Stephen Jones, CUDA architect at Nvidia, in an exclusive interview with The New Stack.
CUDA provides the core foundation of communicating with the GPU. Nvidia has built layers of software above that. Without CUDA 12, developers will not be able to take advantage of Nvidia’s latest H100 GPU, which is the company’s fastest graphics processor to date.
“As soon as the hardware was ready, we wanted to go out and put CUDA 12 into people’s hands,” Jones said.
Nvidia typically releases a new version of CUDA with each new GPU architecture. The previous version was CUDA 11, which was released with its previous-generation GPUs based on the Ampere architecture.
The first version of CUDA came out in 2007 as a set of programming tools to write applications that could take advantage of faster calculations on GPUs. Nvidia’s GPUs have now become a mainstay in AI computing, and CUDA’s popularity as a software stack has surged.
The release of CUDA 12 lines up with last week’s release of Nvidia’s AI Enterprise 3.0 software, which includes pre-configured AI tools and models for applications in areas such as robotics, security, autonomous cars and health care. Some of the new tools have been built on CUDA 12.
To get the best of CUDA 12, programmers will need to have systems with Hopper GPUs, which are not widely available yet. The GPU is expected to ship in volume soon.
The CUDA 12 framework was developed as the hardware was developed, and takes advantage of the parallelism of the chip to speed up graphics and AI.
Hopper can handle 16,384 tasks in parallel and has 132 streaming multiprocessors, up from just 15 a decade ago in the Kepler architecture. Data also moves faster within the GPU with support for the PCIe Gen5 interconnect, the NVLink interconnect with 900GB/s bandwidth, and HBM3 memory.
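A program can read figures such as the streaming multiprocessor count at run time instead of hard-coding them. The sketch below uses the CUDA runtime call cudaGetDeviceProperties; the helper name report_sm_count is made up for illustration, and the values printed depend entirely on the GPU installed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the article): prints a few properties of the
// given device and returns its streaming multiprocessor count, or -1 on error.
int report_sm_count(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("  streaming multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("  max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return prop.multiProcessorCount;
}
```

Calling report_sm_count(0) from main and building with nvcc is enough to try it on whatever GPU is available.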
A CUDA program has separate sections of code to execute on GPUs and CPUs, and includes memory allocation and hardware management in the execution environment.
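As a rough illustration of that split, here is a minimal sketch of a vector addition. The function names add_vectors and add_on_gpu are hypothetical, error checking is omitted for brevity, and both inputs are assumed to have the same length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Device section: runs on the GPU, one thread per element.
__global__ void add_vectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Host section: runs on the CPU, manages GPU memory and launches the kernel.
// Assumes a.size() == b.size(); error checks are left out of this sketch.
std::vector<float> add_on_gpu(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) {
        return c;
    }
    size_t bytes = n * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover every element
    add_vectors<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

The __global__ function is the GPU-side section; everything in add_on_gpu is ordinary C++ running on the CPU.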
CUDA’s runtime system includes libraries and a compiler to turn code into an executable. CUDA has an assembly code section called PTX, which provides both forward and backward compatibility layers for all versions of CUDA all the way down to version 1.0.
One of the biggest advances in CUDA 12 is to make GPUs more self-sufficient and to cut the dependency on CPUs. Furthermore, Hopper cuts the distance that data travels, and reduces the number of times that data is exchanged with memory. CUDA 12 takes advantage of those hardware features for faster calculations and training of AI models.
“There’s no way to bring new hardware in without updating data structures. New device properties on Hopper that didn’t exist before, we’ve got to go and update those,” Jones said.
GPUs have typically relied heavily on CPUs for tasks like decompression and image processing. Nvidia said the constant reliance could be cut by moving more of the computing to GPUs. CUDA 12 and Hopper can do that dynamically without exiting the GPU, a technology that Nvidia calls “dynamic parallelism.”
“The idea is to keep the GPU full and busy and operating on its own,” Jones said.
For example, a neural network on a GPU can define the precision that would be most efficient based on the input of data and the range of values in the data.
“Those types of things are very dynamic decisions and the ability to do them entirely on the GPU … keeps the GPU busy more, it keeps the CPU from having to constantly pay attention to and run the control code,” Jones said.
Generating work historically has been expensive with GPUs, but now it is down to a level where CPU and GPU have the same cost to generate work, Jones said, adding “now the decision-making on the GPU can be very fast.”
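A minimal sketch of GPU-side work generation follows. A parent kernel reads a count that already lives in GPU memory and launches a child kernel sized to it, so the CPU never sees the value. All names are hypothetical, error checks are omitted, and the example assumes a build with relocatable device code (for instance nvcc -rdc=true file.cu -lcudadevrt).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Child kernel: doubles the first `count` entries.
__global__ void double_values(int* data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        data[i] *= 2;
    }
}

// Parent kernel: a single thread makes the decision on the GPU and launches
// the child grid itself, without returning control to the CPU.
__global__ void decide_and_launch(int* data, const int* active_count) {
    int count = *active_count;  // value resides in GPU memory
    if (count > 0) {
        int threads = 128;
        int blocks = (count + threads - 1) / threads;
        double_values<<<blocks, threads>>>(data, count);
    }
}

// Hypothetical host wrapper: doubles the first k values on the GPU.
std::vector<int> double_first_k(const std::vector<int>& values, int k) {
    std::vector<int> result = values;
    int n = static_cast<int>(values.size());
    if (n == 0 || k < 0 || k > n) {
        return result;
    }
    int *d_data = nullptr, *d_count = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_data, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, &k, sizeof(int), cudaMemcpyHostToDevice);

    decide_and_launch<<<1, 1>>>(d_data, d_count);
    cudaDeviceSynchronize();  // the parent is not done until its child grid is done

    cudaMemcpy(result.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_count);
    return result;
}
```

In a real pipeline the count would typically be produced by an earlier kernel rather than copied in from the host; it is copied here only to keep the sketch short.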
The CUDA programming model breaks up work like processing images into smaller blocks placed next to each other. Each block runs like a separate program. When the processing is complete, the results are then combined into an answer. Hopper can solve separate problems over thousands of blocks and break them down further into their own threads. The Hopper GPU can run 16,384 threads simultaneously.
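The block-then-combine pattern can be sketched with a sum. Each block adds up its own tile of the input in shared memory, and one thread per block then folds that partial result into a single total. The names block_sum and sum_on_gpu are illustrative, error checks are omitted, and integers are used so the result is exact.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; must be a power of two here

// Each block reduces its own tile, then contributes one partial sum.
__global__ void block_sum(const int* in, int n, int* total) {
    __shared__ int partial[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Combine the per-block results into the final answer.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host wrapper around the kernel above.
int sum_on_gpu(const std::vector<int>& values) {
    int n = static_cast<int>(values.size());
    if (n == 0) {
        return 0;
    }
    int *d_in = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    int blocks = (n + kThreads - 1) / kThreads;
    block_sum<<<blocks, kThreads>>>(d_in, n, d_total);

    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return total;
}
```

The blocks never talk to each other while reducing, which is what lets the hardware schedule them independently across its streaming multiprocessors.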
“When we recognize the GPUs got a lot bigger, there’s now more scope to bring more threads to be alive, and at the same time to bear on a bigger problem because I’ve got more processing power, more resources, more threads,” Jones said.
CUDA 12 takes advantage of the redesigned Hopper, which is organized into a new grid hierarchy that Nvidia calls the “thread-block cluster.” The new structure gives Hopper more independent blocks of work and threads to handle demanding tasks.
“It guarantees concurrency, which is the key piece. I can synchronize between them, I can exchange data between them. This is one of the places where we do a lot of co-design between the software and the hardware,” Jones said.
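The synchronizing and data exchange Jones describes can be sketched with the cooperative groups API. In this example, two blocks form a cluster and each block reads the other's shared memory directly. The kernel and wrapper names are hypothetical, the cluster size of two is an arbitrary choice, and the code assumes a Hopper-class GPU built with nvcc -arch=sm_90.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kThreads = 128;  // threads per block

// Two blocks per cluster. Each block stages its tile in shared memory and
// then reads the tile held by the other block in the same cluster.
__global__ void __cluster_dims__(2, 1, 1) swap_with_neighbor(const int* in, int* out) {
    __shared__ int tile[kThreads];
    cg::cluster_group cluster = cg::this_cluster();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    cluster.sync();  // both blocks in the cluster have filled their tiles

    unsigned int partner = cluster.block_rank() ^ 1u;
    int* partner_tile = cluster.map_shared_rank(tile, partner);
    out[i] = partner_tile[threadIdx.x];

    cluster.sync();  // keep shared memory alive until the partner has read it
}

// Hypothetical host wrapper. Returns an empty vector if the input size is not
// a multiple of one cluster's worth of elements or if the launch fails.
std::vector<int> swap_neighbor_tiles(const std::vector<int>& values) {
    int n = static_cast<int>(values.size());
    if (n == 0 || n % (2 * kThreads) != 0) {
        return {};
    }
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    swap_with_neighbor<<<n / kThreads, kThreads>>>(d_in, d_out);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? result : std::vector<int>();
}
```

The second cluster.sync() matters: without it, one block could exit and release its shared memory while its partner is still reading from it.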
CUDA 12 can also execute applications by localizing transactions and reducing the distance that data travels. The streaming multiprocessors are next to each other, which means electrons don’t have to travel far. The greater parallelism means better performance, and CUDA 12 programmers do not have to go to CPUs for resources.
“Exploiting this locality is the thing that the hardware was trying to bring out. And providing graded parallelism is the thing that the programming model is getting,” Jones said.
Hopper has a feature called an asynchronous transaction barrier, where threads can wait until data from other threads arrives to complete a transaction.
Typically, threads synchronize and coordinate with each other and track data movement, but Hopper saves the time it would typically take to look for the data, which in turn saves energy, effort and bandwidth involved in moving data.
The ability to asynchronously move data around, and to wait on the completion signal rather than having to monitor it all the time, is a more efficient way to decouple work among the threads.
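The wait-on-a-signal pattern can be sketched with cuda::barrier and cuda::memcpy_async from the CUDA C++ standard library. This is the general asynchronous-barrier API, not necessarily the exact Hopper hardware path described here, and the kernel and wrapper names are invented for the example. Error checks are omitted.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kThreads = 128;  // threads per block

// Each block starts an asynchronous copy of its tile into shared memory and
// waits on a barrier for completion instead of polling for the data.
__global__ void scale_tiles(const float* in, float* out, int n, float factor) {
    __shared__ float tile[kThreads];
    __shared__ cuda::barrier<cuda::thread_scope_block> ready;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&ready, block.size());  // one expected arrival per thread
    }
    block.sync();

    int base = blockIdx.x * blockDim.x;
    int count = min(static_cast<int>(blockDim.x), n - base);

    // Collective call: the copy's completion is tracked by the barrier.
    cuda::memcpy_async(block, tile, in + base, sizeof(float) * count, ready);

    // Independent work that does not need the tile could run here.

    ready.arrive_and_wait();  // returns once all threads arrived and the data landed

    if (threadIdx.x < count) {
        out[base + threadIdx.x] = tile[threadIdx.x] * factor;
    }
}

// Hypothetical host wrapper around the kernel above.
std::vector<float> scale_on_gpu_async(const std::vector<float>& values, float factor) {
    int n = static_cast<int>(values.size());
    std::vector<float> result(n);
    if (n == 0) {
        return result;
    }
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + kThreads - 1) / kThreads;
    scale_tiles<<<blocks, kThreads>>>(d_in, d_out, n, factor);

    cudaMemcpy(result.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The point of the design is the gap between issuing the copy and waiting on the barrier: threads are free to do unrelated work there rather than checking whether the data has arrived.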
“It is exploiting the locality factor that these things are very close to each other. It is very fast to send … these counting signals around. It is also putting a lot of that management into hardware, which is both faster and avoids needing to spend cycles while you’re computing, and stopping and looking at where your data is,” Jones said.
A new processing unit called the Tensor Memory Accelerator — which Nvidia has called the data movement engine — allows bidirectional movement of large data blocks between the global and shared memory hierarchy. The TMA also takes over asynchronous memory copy between thread blocks in a cluster.
CUDA 12.0 supports the C++20 standard with host compilers such as GCC 10, Clang 11 and Arm C/C++ 22.x. Nvidia has its own ARM CPU that it is pairing with its GPUs, and as a result, CUDA’s improvements are heavily centered around the ARM architecture.
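As a small sketch of what C++20 support allows, the kernel below is constrained with the standard std::floating_point concept, so instantiating it with an integer type is rejected at compile time. The names are hypothetical, error checks are omitted, and the example assumes a build with nvcc -std=c++20 and a C++20-capable host compiler.

```cuda
#include <concepts>
#include <cuda_runtime.h>
#include <vector>

// C++20 concept on a kernel template: only float, double and the like compile.
template <std::floating_point T>
__global__ void scale(T* data, int n, T factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical host wrapper with the same constraint.
template <std::floating_point T>
std::vector<T> scale_on_gpu(std::vector<T> values, T factor) {
    int n = static_cast<int>(values.size());
    if (n == 0) {
        return values;
    }
    T* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(T));
    cudaMemcpy(d_data, values.data(), n * sizeof(T), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<T><<<blocks, threads>>>(d_data, n, factor);

    cudaMemcpy(values.data(), d_data, n * sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return values;
}
```

A call such as scale_on_gpu(std::vector<int>{1, 2}, 2) would fail to compile with a constraint error instead of producing a long template backtrace.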
Intel has its oneAPI tool for parallel programming, which also includes the SYCL tool that can strip proprietary CUDA-specific code so applications can run on its hardware. AMD is backing ROCm, an open parallel programming framework, for its GPUs. Khronos is backing the OpenCL parallel programming framework.