Gunslinging AMD Talks Tough on Software as Developers Balk

AMD is going gangbusters in selling hardware, but it has not been as dynamic on software. The company is now sending a message that it is getting its software act together.
In 2021, prominent software developer George Hotz cold-emailed Nvidia CEO Jensen Huang on a GPU driver problem. The issue was resolved in under 12 hours.
“This is why they are the king,” Hotz said in a video posted this month, praising Nvidia’s software commitment. By comparison, Hotz’s requests to AMD’s top brass and his contacts on fixing GPU software drivers either went unanswered or were not addressed immediately.
Frustration
The episode illustrated Hotz’s frustration with AMD’s broken software approach, which he saw as bogged down by bureaucracy. Stable GPU drivers are critical to running AI and other applications on computers.
Hotz’s lengthy criticism did not reflect well on AMD, which has built its business on hardware products, including its GPUs and PC and server chips. Hotz recently posted a follow-up blog about AMD’s response, and how the company worked with him on fixing the driver issue.
AMD took another opportunity to convince developers of its software commitment during a data center event on June 13. The hardware event was headlined by a new GPU called the MI300X, which is positioned as a rival to Nvidia’s dominant A100 and H100 AI GPUs.
The event devoted a portion to software, which, perhaps unintentionally, turned into a slinging match against Nvidia’s proprietary approach to AI.
HuggingFace CEO Clem Delangue and PyTorch creator Soumith Chintala, who appeared on stage, added fuel to the fire set by AMD executives on an emerging controversy of open AI and science versus closed AI models and development. They also shared how their companies were optimizing the top-to-bottom AI software stack for AMD’s MI300X GPU.
GPU Gatekeepers
GPUs are emerging as a gatekeeper for AI software to work. Right now, Nvidia’s GPUs are the top option, and AI response times are faster when tuned to work on Nvidia’s GPUs. Rivals Intel and AMD have introduced GPUs, AI chips, and corresponding software frameworks, but are still playing catch up with Nvidia.
Nvidia’s high-end GPUs, the A100 and newer H100, are in short supply. Microsoft is installing H100 GPUs in its next Azure AI supercomputer. Google Cloud at last month’s I/O conference announced the A3 supercomputer, which has 26,000 H100 GPUs. Nvidia’s market cap is hovering at around $1 trillion.
AMD’s MI300X challenges Nvidia’s grip on AI hardware and is an option for companies looking to buy alternatives off the shelf.
AMD also positioned MI300X as being hardware on which an open AI approach could blossom. By comparison, Nvidia has a proprietary approach, and developers turn to its CUDA software framework to harness the full performance of its GPUs.
The Stacks
AMD’s AI software strategy is pinned on the success of ROCm, which is a parallel programming framework that is the company’s equivalent to Nvidia’s CUDA. ROCm is in its fifth generation.
A significant portion of the ROCm stack is open, including the drivers, runtimes, debuggers, and libraries. It also supports the open AI ecosystem, which includes open frameworks, models, and tools. ROCm supports many large language models (LLMs), data types like FP8, and tools like OpenAI’s Triton.
During his onstage appearance, Delangue said the company will start supporting open ML models on AMD hardware including the MI300X GPU and Ryzen CPUs.
HuggingFace hosts half a million open ML models, including StableDiffusion, a popular text-to-image model that has inspired many forks. Some 15,000 companies use HuggingFace’s AI software.
“I’m super excited, in particular, about the ability of AMD to power large language models in data centers, thanks to the memory capacity and bandwidth advantage,” Delangue said.
HuggingFace offers instances with Nvidia’s older GPUs such as the T4, which are also available for free in Google’s cloud via Colab. But AMD is poised to become a major part of its AI offerings.
“We will … include AMD hardware in our regression tests for some of our most popular libraries like transformers and our CI/CD to ensure that new models, like the 5,000 that have been added last week, are natively optimized for AMD platforms,” Delangue said.
Level the Playing Field
He also said that an open approach to models, which is HuggingFace’s approach, will level the playing field in AI.
“By doing so, most of the time with customized specialized smaller models, it makes AI faster, cheaper, and better to run. It also makes it safer,” Delangue said.
PyTorch creator Chintala spoke in support of AMD’s Instinct GPUs. PyTorch is fundamental software through which most of the AI — neural networks, training, and inference — happens. The new PyTorch 2.0, which is powered by OpenAI’s Triton language and compiler, delivers speedups of roughly 50% to 100%.
Meta, which developed PyTorch, partnered with AMD to validate the ROCm stack across a broad set of operators and integrations, robustly testing the whole stack.
Chintala said moving workloads from a “single dominating vendor” — an indirect reference to Nvidia — to alternative hardware requires a lot of software work, such as porting neural network workloads from one platform to another. PyTorch’s integration with AMD’s ROCm means developers could switch between AI hardware without breaking a sweat.
“Developers… are going to have a huge productivity boost as they try to switch to the AMD backend of PyTorch versus the … TPU or the Nvidia backend. I am excited about the overall productivity developers would have when they’re switching to the AMD backend overall,” Chintala said.
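The portability Chintala describes can be sketched in device-agnostic PyTorch. On ROCm builds of PyTorch, AMD GPUs are exposed through the same `torch.cuda` API (backed by HIP under the hood), so the same script runs on Nvidia or AMD hardware without changes. A minimal sketch, falling back to CPU when no GPU is present:

```python
import torch

# Device-agnostic setup: ROCm builds of PyTorch surface AMD GPUs through
# the familiar torch.cuda API, so this line picks up an Nvidia GPU, an
# AMD GPU, or the CPU, whichever is available, with no code changes.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A small example model; the backend-specific work (kernels, memory
# management) is handled entirely by PyTorch below this API surface.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
).to(device)

x = torch.randn(8, 16, device=device)
with torch.no_grad():
    y = model(x)

print(y.shape)  # torch.Size([8, 4]) regardless of backend
```

The model, dimensions, and layer choices here are illustrative, not from the article; the point is that only the `device` selection touches hardware at all, which is what makes switching backends a low-friction change.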
Open vs. Lock-in
OpenAI, Microsoft, and Google, which made contributions to early AI research, have now locked down their transformer models. OpenAI is generating revenue by charging customers for access to GPT-4, while Google has cited safety as one reason to lock down models such as PaLM-2, which powers AI capabilities in its search tools.
AMD’s software story is a weak link in the company’s business strategy, said Jim McGregor, an analyst at Tirias Research.
“They don’t have the size that Nvidia and Intel have, and they are years behind. They were slow to go after AI and this market has reached critical mass, and AMD was not there when it happened,” McGregor said.
AMD did not have a footing in markets like automotive and telco, where AI was most relevant, but that changed when it acquired FPGA provider Xilinx, which has a strong presence in those markets.
By comparison, Nvidia saw an early opportunity in AI, and quickly pivoted by building a supporting software and services ecosystem around its GPUs. AMD’s software development team is also dwarfed by Intel’s, which has been a major contributor to the Linux kernel.
AMD is leveraging OpenAI’s Triton, which could be a way around Nvidia’s CUDA. But developing a full software stack is beyond AMD’s capacity, and it should sign on to support something like Intel’s OneAPI, which has a wide range of development options, including SYCL, which allows AI code to be ported across a wide range of hardware beyond GPUs, McGregor said.
AMD’s top priority is recapturing chip market share from Intel, which it has done well. AMD may not be first out of the gate in software, but being first to market isn’t always an advantage, as results are not guaranteed.
But when AMD commits and puts a roadmap out, they execute, and there is a good chance that AMD will stick to its software commitments, McGregor said.