Year in Review: GenAI Exposed Silicon Valley Chip Antiquity
“We actually view AI as the single most transformational technology over the last 50 years. Maybe the only thing that has been close has been the introduction of the internet,” said AMD CEO Lisa Su at a December event.
The concept of AI has been around for decades, but 2023 will be remembered as the year GPUs made full-fledged, user-facing applications possible. Nvidia was ready with its hardware and software and slipped smoothly into the generative AI mania.
At the start of the year, cloud providers scrambled to upgrade data centers with brigades of GPUs and cooling systems to handle the epic AI rush. Nvidia GPUs successfully powered Microsoft’s accelerated rush into generative AI with Bing.
Intel and AMD, which have built their businesses around CPUs for decades, realized almost overnight that their traditional business model was not suited to the new computing landscape. AI required accelerators to handle large volumes of low-precision computation, which CPUs were not efficient at.
Generative AI exposed the high costs of running inference and training in data centers, and Microsoft and Meta kickstarted their own AI chip strategies to cut data-center costs.
Nvidia’s CEO Jensen Huang has become a rockstar and has been credited with making AI possible. The company’s market value has soared, exceeding $1 trillion.
Nvidia’s H100 GPUs are hot, and the wait list is long. The company was also well prepared on software with CUDA, which was originally released in 2007 as a set of programming tools that could take advantage of faster calculations on GPUs.
CUDA’s popularity as a software stack has surged with AI, and Nvidia has created ready-made AI packages for verticals such as medical, automotive, and engineering. Intel and AMD are still trying to sort out their software stacks.
Nvidia has also attracted negative attention for trying to skirt around U.S. export restrictions to ship its GPUs to China, which has been historically a very important market for the company.
The U.S. last year imposed export restrictions around powerful GPUs and AI chips to choke China’s attempts to advance its AI infrastructure, but Nvidia switched specifications to make compliant GPUs for the China market. Earlier this year, the U.S. imposed further restrictions to stop Nvidia exports to China, but the company already has chips that are compliant with the new export restrictions.
Open Source versus Closed Source
There has also been a backlash against proprietary technology in AI represented by closed models such as GPT-4, and development stacks like Nvidia’s CUDA.
Intel and AMD appear to be hoping customers turn against closed models and adopt open source models like Llama 2, which removes the barrier to entry for customers adopting AMD and Intel AI accelerators. Llama 2 already runs on AMD’s recently introduced MI300X GPU and on Intel’s Gaudi 2, which the company is now shipping.
“There’s a lot of doubling down on open source frameworks for model development, things like [AMD’s] ROCm and [Intel’s] OneAPI. There’s also increased investment in ML frameworks from Apple, such as MLX, for MacOS on Apple Silicon,” said James Sanders, principal analyst for CCS Insight.
Nvidia’s rivals are also building tools so developers can just deploy and run models without worrying about the hardware running in the background. Intel’s SYCLomatic, for example, can migrate CUDA code to SYCL so it can run on whatever hardware is best suited to the calculations.
But what Intel and AMD want to achieve is still complex: it starts with downloading tuned models from Hugging Face, and then pointing the code at the correct hardware.
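The hardware-matching step those tools try to hide amounts to a preference list over available accelerator backends. The sketch below is a hypothetical illustration (the name `pick_device` and the device strings are assumptions, not any vendor's actual API); frameworks such as PyTorch expose similar runtime detection.

```python
def pick_device(available_backends):
    """Return the most capable accelerator backend from those detected.

    The preference order mirrors common framework defaults: an Nvidia GPU
    ("cuda"), then an Intel accelerator ("xpu"), then an AMD GPU ("hip"),
    falling back to the CPU when no accelerator is present.
    """
    for backend in ("cuda", "xpu", "hip"):
        if backend in available_backends:
            return backend
    return "cpu"

# A machine with only an AMD GPU and a CPU is matched to "hip".
print(pick_device({"hip", "cpu"}))
```

In practice the same idea shows up as a one-line device query at model-load time, which is exactly the step the vendors want to make invisible.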
As the frameworks mature, and as foundation models are ported over, there will be more product messaging from hardware vendors about the cost-effectiveness of using their systems, and the availability and flexibility of open source AI/ML models on those platforms, Sanders said.
Cloud Providers as AI Chip Makers
Nvidia’s chips will not be going anywhere, but cloud providers are not putting all their eggs in one basket. Toward the end of the year, the top three cloud providers introduced their latest AI chips in quick succession.
Microsoft’s Azure AI stack has largely been built around Nvidia’s GPUs, but this year the company introduced its Maia 100 AI accelerator, which handles both training and inferencing.
Microsoft’s CFO Amy Hood has talked about the cost-per-transaction on AI, and how hardware and software tuning has led to higher GPU utilization rates, which has helped generate more income. Microsoft’s investment in homegrown silicon is to boost performance while cutting down the cost of using Azure, according to Microsoft.
Google recently introduced its TPU v5 chips in its cloud service for internal and external use. The TPU v5p, which is for training, will ultimately be used to train Google’s future transformer models, said Mark Lohmeyer, vice president and general manager of compute and machine learning infrastructure at Google.
The TPU v5 chip is deployed in an AI supercomputer that Google calls a “Hypercomputer.”
“It’s designed for great performance, great efficiency, very cost-effective, for the most common deployment scenarios that we see out there,” Lohmeyer said.
Changing Chips and Roadmaps
Chip makers have also spent time fine-tuning roadmaps for generative AI. Nvidia wants to take advantage of its early lead and release new GPUs every year through 2025, a change from its typical two-year cadence. Nvidia adopted HBM3e with its newest GPU, the H200, which will ship next year, while AMD’s MI300X stacks up memory chips to increase capacity.
Intel scrapped multiple GPUs and is targeting 2025 for its next GPU release, called Falcon Shores. The original Falcon Shores design integrated a CPU and GPU on a single die, but Intel scrapped that design because customers showed more interest in a discrete GPU for generative AI.
Besides GPUs, the chip makers have made memory capacity and bandwidth top priorities in chip design. Queries are typically stored in memory until a session ends; as questions pile up within a single session, more memory is required, and higher throughput is needed for quicker responses.
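The memory pressure from a long session can be estimated with back-of-the-envelope arithmetic. The sketch below computes the size of the key/value cache a transformer holds per conversation; the formula is the standard one (one key and one value vector per token, per head, per layer), and the shapes shown are roughly those of a Llama-2-7B-class model, used here only as an illustration.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    """Memory held per sequence in a transformer's key/value cache.

    The factor of 2 covers the separate key and value tensors;
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

# Roughly Llama-2-7B shapes: 32 layers, 32 heads of dimension 128.
# A single 4,096-token session in fp16 holds 2 GiB of keys and values.
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30, "GiB")
```

Multiply that by the number of concurrent sessions a data center serves and the priority on capacity and bandwidth is easy to understand.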
A new system architecture around the Compute Express Link (CXL) specification will become standard in system designs to meet the memory and bandwidth requirements.
What to Expect in 2024?
“You’re going to see more custom chips from hyperscalers for more specific workloads,” said Jim McGregor, principal analyst at Tirias Research.
Generative AI is mostly a server activity, but will move out to client devices for better latency, McGregor said.
The offload to client devices will raise the relevance of neural chips or GPUs on client devices, which will serve as local accelerators. Laptops will be optimized to take on generative models for practical applications. There are already examples, such as Microsoft integrating GPT-4 with Copilot, and Adobe integrating generative AI in Firefly.
“These companies are not trying to run these generative models in the cloud. The focus will be on ‘how do you take the general models and how to optimize them for applications, how do you do knowledge distillation — for legal and medical applications etcetera,’” McGregor said.
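The knowledge distillation McGregor mentions can be sketched in a few lines: a small “student” model is trained to match the softened output distribution of a large “teacher.” The loss below is the standard temperature-scaled KL divergence from Hinton et al.’s distillation recipe; the toy logits are illustrative, not drawn from any real model.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution, softened by temperature."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's,
    scaled by T^2 to keep gradient magnitudes comparable across temperatures."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student))
    return temperature ** 2 * kl

# A student that exactly matches the teacher incurs zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
```

Minimizing this loss over a domain-specific dataset is how a compact legal or medical model inherits behavior from a much larger general one, small enough to run on a laptop’s neural accelerator.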