Need for Speed: Cloud Power Moves Expand AI Supercomputing
Even Elon Musk and his companies’ billions of procurement dollars cannot acquire Nvidia’s latest machine-learning GPUs fast enough. That is how badly demand for GPUs is outstripping supply.
However, there are clues as to where these scarce AI training GPUs, such as the H100 and A100, are ending up: in new cloud-based AI supercomputers announced in quick succession over the last month.
Cloud providers are stepping up investments in data-center equipment after a relative lull in upgrades during the first half of this year. Hyperscalers are rewiring systems in new ways to meet the processing and power demands of AI applications.
GPUs are a cornerstone of machine learning and are interconnected with high-bandwidth memory, networking, and high-capacity storage. Newer data centers also use data processing units (DPUs) to move data between systems at unprecedented speed.
Customers will have to fork out $37,000 to access Nvidia’s latest GPUs in its DGX Cloud, which became generally available in July. That is just the starting price, per instance per month, and it goes up for more memory, storage, and faster GPUs. There are cheaper options — users can access virtual machines with Nvidia’s A100 GPU on Microsoft’s Azure cloud for close to half the price.
Nvidia’s DGX Cloud is deployed in its own data centers in the U.S. and U.K., and in Oracle’s cloud infrastructure. Each node of DGX Cloud has eight H100 or A100 GPUs, 640GB of GPU memory per node, and Nvidia’s proprietary NVLink interconnect.
Nvidia CEO Jensen Huang likens DGX Cloud to an “AI factory”: an assembly line in which data is churned inside a proprietary black box, with usable data as the output. Customers need only worry about the results, not the hardware and software.
But DGX Cloud already has some competitors on the horizon.
Andrew Feldman, CEO of AI chip maker Cerebras Systems, is not an Nvidia fan. His company makes the world’s largest chip, the WSE-2, which can train large language models with billions of parameters on a single chip; such models typically require multiple GPUs to train.
Cerebras has been in business since 2015, but only recently found a commercial adopter of its AI systems in G42, a Middle Eastern cloud provider, which will deploy the hardware in three U.S. data centers by the end of this year. That is a breakthrough for Cerebras, especially when other AI chip makers are struggling to get companies even to sample their products.
Cerebras’ CG-1, CG-2, and CG-3 systems will be hooked up to form a large AI supercomputer. Each system will deliver four exaflops of performance. The number of systems deployed will expand to nine by the end of the year, delivering a total of 36 exaflops of performance.
“We support up to 600 billion parameters extensible to 100 trillion, fed by 36,000 AMD Epyc cores,” Feldman said. The Cerebras chips are not GPUs, but specialized AI chips with 850,000 cores, 40GB of on-chip SRAM, and 20PBps of on-chip throughput.
Cerebras’ hardware offers some AI chip variety outside of Nvidia’s GPUs, which currently dominate the market. Until now, Cerebras’ AI supercomputers were largely experiments at U.S. government labs, but they are now set for wider commercial use. Companies including GlaxoSmithKline and TotalEnergies have put the systems through stress tests.
Programming Nvidia’s GPUs can be complicated, requiring thousands of lines of code to fully exploit the processors’ computing power. Feldman said it takes just a few lines of Python code to get training going on Cerebras chips.
“It’s talked about as three-dimensional parallelism. That is where you have to do tensor model parallel and pipeline model parallel. That is where you have all these complicated tricks to break up work and spread it over a large number of GPUs. That is the additional 27,000 lines of code. And that is exactly the code that we don’t need,” Feldman said.
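Feldman’s “27,000 lines” refers to the scaffolding needed to shard one model across many GPUs. A minimal, hypothetical sketch of the tensor-parallel idea in plain Python (no vendor API, and assuming the number of columns divides evenly among devices): each “device” multiplies the input by its own column shard of a layer’s weight matrix, and an all-gather step reassembles the full output.

```python
# Illustrative toy of tensor (model) parallelism -- not any vendor's API.
# One layer's weight matrix is split column-wise across "devices"; each
# computes a partial output that is concatenated back together.

def matmul(x, w):
    """Multiply batch x by weight matrix w (both lists of lists)."""
    return [[sum(xi * wij for xi, wij in zip(row, col)) for col in zip(*w)]
            for row in x]

def split_columns(w, num_devices):
    """Shard w into contiguous column blocks (assumes even division)."""
    cols = list(zip(*w))
    step = len(cols) // num_devices
    shards = [cols[i * step:(i + 1) * step] for i in range(num_devices)]
    return [[list(row) for row in zip(*s)] for s in shards]  # back to row-major

def tensor_parallel_matmul(x, w, num_devices):
    shards = split_columns(w, num_devices)
    partials = [matmul(x, shard) for shard in shards]  # one per "device"
    # All-gather: concatenate partial outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]                       # batch of 2 activations
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]   # one layer's weights
out = tensor_parallel_matmul(x, w, num_devices=2)
assert out == matmul(x, w)  # identical to the single-device result
```

On real clusters, each shard lives on a different GPU and the concatenation is a network collective; orchestrating that, plus pipeline stages, is where the extra code Feldman describes comes from.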
Mojo Dojo for Tesla
Tesla recently announced it had started production of its Dojo supercomputer, which will train on video to ultimately allow the company to deploy autonomous driving systems in its cars. Dojo uses Tesla’s self-developed D1 chip, which delivers 22.6 teraflops of performance.
During a recent earnings call, Musk said the company will spend close to $1 billion on Dojo through the end of 2024.
“We think we may reach in-house neural net training capability of 100 exaflops by the end of next year” with GPUs and Dojo, Musk said.
Musk is a big fan of GPUs for video training to support AI in Tesla’s electric vehicles. The company collects visual data from cameras in its cars and uses it to train AI systems that will ultimately improve driver safety.
But the GPU shortage is slowing down machine-learning work at Tesla, which plans to deploy 300,000 Nvidia A100 GPUs by the end of next year.
“We’re using a lot of Nvidia hardware. We’ll continue to — we’ll actually take Nvidia hardware as fast as Nvidia will deliver it to us. Tremendous respect for [CEO Jensen Huang] and Nvidia. They’ve done an incredible job,” Musk said in an earnings call this month.
“Frankly, I don’t know if they could deliver us enough GPUs,” he added.
In late July, Amazon Web Services launched the EC2 P5 instances, which will bring Nvidia’s latest H100 GPUs to the cloud service.
P5 instances are the fastest VMs in AWS’s portfolio, company representatives said at AWS Summit in New York. P5 will be six times faster than its predecessor, P4, and cut training costs by up to 40%, they said.
The P5 instances will be interlinked into UltraScale clusters, and up to 20,000 GPUs can be interconnected to create a mammoth AI training cluster.
“This enables us to deliver 20 exaflops of aggregate compute capability,” said Swami Sivasubramanian, vice president of database, analytics, and machine learning at AWS, during a keynote at the summit.
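The quoted aggregate checks out with back-of-envelope arithmetic, assuming roughly 1 petaflop of dense half-precision throughput per H100 (a ballpark spec assumed here, not a figure from AWS):

```python
# Back-of-envelope check of AWS's aggregate compute claim.
# The ~1 PFLOPS-per-GPU figure is an approximate H100 dense half-precision
# rating assumed for illustration, not a number from the article.
flops_per_gpu = 1e15              # ~1 petaflop per H100
num_gpus = 20_000                 # maximum UltraScale cluster size
aggregate_exaflops = flops_per_gpu * num_gpus / 1e18
print(aggregate_exaflops)         # 20.0
```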
Typically, training tasks are cut into smaller parts, which are shared among the GPUs in a large cluster. Processing and response times need to be carefully synchronized so the outputs can be coordinated in time.
The EC2 P5 instances use a 3,200Gbps interconnect to synchronize weights for quicker training. The faster GPUs and higher throughput ensure companies can work with larger models or cut infrastructure costs on same-sized models.
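The role of that interconnect can be sketched as data-parallel training in miniature (an illustrative toy, not AWS’s or Nvidia’s actual code): each worker computes gradients on its own slice of the batch, then all workers average them in an all-reduce step before applying the same weight update.

```python
# Toy data-parallel step: every worker holds identical weights, computes
# gradients on its own batch shard, then averages gradients across workers.
# That element-wise average is the all-reduce whose traffic a fast
# interconnect (such as the P5 instances' 3,200Gbps fabric) is built to carry.

def worker_gradient(weights, batch):
    """Gradient of mean-squared error for a toy linear model y = w . x."""
    grads = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grads[j] += 2 * err * xj / len(batch)
    return grads

def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (the all-reduce)."""
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]

weights = [0.0, 0.0]
shards = [                                      # each worker's batch slice
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],     # worker 0
    [([1.0, 1.0], 3.0), ([2.0, 0.0], 2.0)],     # worker 1
]
grads = all_reduce_mean([worker_gradient(weights, s) for s in shards])
weights = [w - 0.1 * g for w, g in zip(weights, grads)]  # identical update everywhere
```

Because every worker must wait for the averaged gradients before its next step, the all-reduce sits on the critical path of each iteration, which is why interconnect bandwidth translates directly into training speed.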
AWS is taking a different approach to AI than rivals Google Cloud and Microsoft, which are trying to lure companies to their own large language models: PaLM 2 and GPT-4, respectively. Software companies are paying to link up to OpenAI’s GPT models via API.
Amazon also wants AWS to be a storefront that dishes out a wide variety of the latest AI models, with its cloud service providing the computing horsepower. The company announced access to Anthropic’s Claude 2 and Stability AI’s Stable Diffusion XL 1.0, which compete with transformer models from Google and Microsoft.
“Models are just one part of the equation, you need to have the right infrastructure. You need to provide the right workflow support… the right enterprise security for every little piece of the workflow. That is where we’ll focus on … going forward,” said Vasi Philomin, vice president and general manager for generative AI at Amazon.