Meta Building Massive AI Supercomputer for Metaverse
Facebook parent company Meta is building what will be a massive GPU-driven supercomputer to run its artificial intelligence (AI)- and machine learning-driven workloads that will be key to the development of the metaverse.
The company already has put Phase 1 of the AI Research SuperCluster (RSC) online and company researchers have begun using it to train large machine learning models in such areas as natural language processing (NLP) and computer vision, according to a blog post written by Shubho Sengupta, a software engineer with Facebook AI Research (FAIR), and Kevin Lee, technical program manager at Meta.
The goal is to complete the supercomputer by June, when it will have 16,000 Nvidia A100 Tensor Core GPUs, with the chip maker’s DGX A100 systems serving as compute nodes connected via InfiniBand. It will provide a caching and storage system that can serve 16 terabytes per second of training data and will scale up to an exabyte of capacity. The RSC currently comprises 760 Nvidia DGX A100 systems, with 6,080 GPUs.
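The GPU counts above can be checked with a little arithmetic. The sketch below assumes eight A100 GPUs per DGX A100 system, which is Nvidia's standard configuration; the function names are illustrative only.

```python
GPUS_PER_DGX_A100 = 8  # assumption: standard DGX A100 configuration


def gpus_in_cluster(num_dgx_systems: int) -> int:
    """Total GPUs provided by a given number of DGX A100 systems."""
    return num_dgx_systems * GPUS_PER_DGX_A100


def dgx_systems_needed(num_gpus: int) -> int:
    """DGX A100 systems required to host num_gpus (ceiling division)."""
    return -(-num_gpus // GPUS_PER_DGX_A100)


phase1_gpus = gpus_in_cluster(760)            # Phase 1, as reported
target_systems = dgx_systems_needed(16_000)   # planned full build-out

print(phase1_gpus)                    # 6080
print(target_systems)                 # 2000
print(f"{phase1_gpus / 16_000:.0%}")  # 38%
```

Under that assumption, Phase 1 accounts for a little over a third of the planned GPUs, and the finished system would need about 2,000 DGX A100 nodes.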
The system will be able to train machine learning models with trillions of parameters. The RSC will be completed five years after Meta — then known as Facebook — built a system in 2017 that has 22,000 V100 GPUs from Nvidia in a single cluster that runs 35,000 training jobs a day.
“In early 2020, we decided the best way to accelerate progress was to design a new computing infrastructure from a clean slate to take advantage of new GPU and network fabric technology,” Sengupta and Lee wrote. “We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte — which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video.”
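As a rough sanity check on that comparison, the sketch below works out the video bitrate implied by fitting 36,000 years of footage into one exabyte. It assumes a decimal exabyte (10^18 bytes) and a 365.25-day year; neither figure comes from the blog post.

```python
EXABYTE_BYTES = 10**18                  # assumption: decimal exabyte
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year length
VIDEO_YEARS = 36_000                    # figure quoted by Sengupta and Lee


def implied_bitrate_mbps(total_bytes: int, years: float) -> float:
    """Average bitrate, in megabits per second, of video filling total_bytes."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1e6


bitrate = implied_bitrate_mbps(EXABYTE_BYTES, VIDEO_YEARS)
print(f"{bitrate:.1f} Mbit/s")
```

The result is about 7 Mbit/s, which is in the range commonly used for high-definition streaming, so the quoted equivalence is plausible.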
A Fast AI System
Meta officials predict that the RSC will be 20 times faster than the system built in 2017, nine times faster running Nvidia’s Collective Communication Library (NCCL) and three times faster running large-scale NLP workloads. They predict that once complete, the system will be the fastest AI supercomputer in the world.
Dan Olds, chief research officer for Intersect360 Research, said that by his calculations, once completed, the RSC will have more than 110 million GPU cores, far more than the accelerator-heavy Tianhe-2A supercomputer in China, which uses more than 4.5 million Matrix-2000 cores. Tianhe-2A was the world’s fastest supercomputer between 2013 and 2015 and still sits at No. 7 on the Top500 list.
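One way to arrive at a figure like Olds’ is to multiply the planned GPU count by the number of CUDA cores on each A100. The sketch below assumes that is how the estimate was made and uses Nvidia’s published figure of 6,912 FP32 CUDA cores per A100; Olds did not spell out his method.

```python
CUDA_CORES_PER_A100 = 6_912   # Nvidia's published FP32 CUDA core count
PLANNED_GPUS = 16_000         # planned RSC size, per Meta
TIANHE_2A_CORES = 4_500_000   # approximate figure cited in the article

total_cores = CUDA_CORES_PER_A100 * PLANNED_GPUS
ratio = total_cores / TIANHE_2A_CORES

print(f"{total_cores:,}")  # 110,592,000
print(f"{ratio:.1f}x")     # 24.6x
```

CUDA cores and Matrix-2000 cores are not directly comparable units of compute, so the ratio is best read as a sense of scale rather than a performance comparison.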
“That’s just monstrous, an order of magnitude higher than the biggest GPU-accelerated system today,” Olds told The New Stack. “This is going to [grow] accelerator technology an order of magnitude higher than we’ve seen before. In terms of an all-AI machine — which is what it’s going to be obviously — it’s again an order of magnitude larger than anything we’ve seen before. And to have a private company do this, that’s also a paradigm-breaker.”
Lessons for Developers
Developers need to take note of Meta and its RSC supercomputer and the increasing role that GPUs will have in the future of computing, according to Olds.
“This is … orders of magnitude a bigger and better system for them to use,” he said. “They’ll be able to do machine learning at a scale that no one has done before. For developers in general, what it should tell them is that if it already hasn’t become blindingly obvious is that they need to have deep and rich skills in things like CUDA [Nvidia’s platform for using GPUs in general-purpose computing] and don’t go to sleep on AMD’s MI 200 GPU, which offers even higher performance than Nvidia’s A100 right now.”
Olds added that “the GPU wars are on, but developers are going to have to take advantage of this power in order to build sophisticated AI models and teach them in anything like a decent timescale.”
Phase 1 Is in and More’s Coming
The RSC’s first phase includes not only the A100 GPUs but also 175 petabytes (PB) of bulk storage, 46PB of cache storage and 10PB of NFS storage, as well as an InfiniBand interconnect that delivers 200Gb/s of bandwidth per GPU with no oversubscription, according to Meta. Olds said there are still questions that need to be answered, including how the company is cooling such a large and powerful system.
However, given the massive number of GPUs in the supercomputer, “it doesn’t matter what the CPUs are and how many CPU cores,” he said. “It’s not a CPU game anymore. It hasn’t been for quite a while, but it’s an accelerator game. They’re building this thing at the very edges of what’s possible today.”
“Designing and building something like RSC isn’t a matter of performance alone but performance at the largest scale possible, with the most advanced technology available today,” the Meta officials wrote in their blog post. “All this infrastructure must be extremely reliable, as we estimate some experiments could run for weeks and require thousands of GPUs. Lastly, the entire experience of using RSC has to be researcher-friendly so our teams can easily explore a wide range of AI models.”
Ongoing partnerships also are key, they wrote. Penguin Computing, Pure Storage and Nvidia all worked with Meta to build the first-generation AI infrastructure. Penguin now is helping with the architecture, hardware integration for deploying the cluster, setup of the control plane and managed services; Pure provided the scalable storage solution; and Nvidia supplied not only the DGX systems, GPUs and InfiniBand fabric, but also software stack components like NCCL.
Pandemic and Chip Shortages Add Challenges
George Niznik, sourcing manager at Meta, said in a video released by Meta that in 2020 it was decided that another supercomputer was needed, but planning and building it wasn’t an easy task.
“One does not simply buy and power on a supercomputer,” Niznik said. “RSC was designed and executed under extremely pressed timelines and without the benefit of a traditional product release cycle. Additionally, the pandemic and a major industry chip supply shortage hit at precisely the wrong moment in time. We had to fully utilize all of our collective skills and experiences to solve these difficult constraints.”
Due to COVID-19, the project began remotely, with a simple shared document, according to Sengupta and Lee. Components from GPUs to optics were difficult to get, and construction materials had to be shipped under new safety protocols. The RSC also was designed from scratch, which meant creating new conventions specific to the company’s needs. There were new rules for such data center design issues as cooling, power, rack layout and networking, including a new control plane.
Meta also created a storage service dubbed AI Research Store (AIRStore) to address highly scalable storage needs.
“To optimize for AI models, AIRStore utilizes a new data preparation phase that preprocesses the data set to be used for training,” they wrote. “Once the preparation is performed one time, the prepared data set can be used for multiple training runs until it expires. AIRStore also optimizes data transfers so that cross-region traffic on Meta’s interdatacenter backbone is minimized.”
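The prepare-once, reuse-until-expiry pattern described in that quote can be illustrated with a small sketch. This is not AIRStore’s API or implementation, which Meta has not published in detail; the class, the `toy_prepare` function and the one-hour lifetime are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class PreparedDatasetCache:
    """Hypothetical prepare-once cache; not AIRStore's real interface."""

    def __init__(self, prepare: Callable[[str], List[str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._prepare = prepare
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps dataset id -> (expiry time, prepared shards).
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.prepare_calls = 0

    def get(self, dataset_id: str) -> List[str]:
        """Return prepared data, running the preparation step only if needed."""
        now = self._clock()
        entry = self._entries.get(dataset_id)
        if entry is not None and now < entry[0]:
            return entry[1]  # still valid: reuse without re-preparing
        self.prepare_calls += 1
        prepared = self._prepare(dataset_id)
        self._entries[dataset_id] = (now + self._ttl, prepared)
        return prepared


def toy_prepare(dataset_id: str) -> List[str]:
    """Stand-in for an expensive preprocessing step (hypothetical)."""
    return [f"{dataset_id}-shard-{i}" for i in range(3)]


cache = PreparedDatasetCache(toy_prepare, ttl_seconds=3600.0)
for _ in range(5):  # five training runs against the same data set
    shards = cache.get("images-v1")
print(cache.prepare_calls)  # 1
```

Five training runs trigger a single preparation pass; a new pass happens only after the prepared copy expires.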
A Belief in the Metaverse
The new supercomputer is the latest indication (as if Facebook changing its name to Meta wasn’t enough) of Meta’s belief in the metaverse, the mixed-reality world that Meta, Nvidia, Microsoft and others are betting will be the future of business and social life, with AI and machine learning playing a key role. The RSC is Meta “doubling and tripling down on that vision and putting the machine horsepower that they’re going to need to get there into their data centers,” Olds said.
The metaverse was a driving force behind Microsoft’s $68.7 billion planned acquisition this month of gaming company Activision Blizzard and Meta’s ongoing development of its Oculus VR technology, said Olds, a metaverse skeptic.
“If we do find that killer app for the metaverse, that’s going to soak up so much hardware and so much programming … that it will make our industry spin,” the analyst said. “But without that killer app, what’s the point? They’re going to keep this system busy doing a lot of machine learning, honing their advertising to ever greater heights, and it is going to be a great big, huge, fat reference client. That’s going to give them air cover for every deal they want to do.”
Nvidia’s Place in Server Universe Grows
For Nvidia, being the project’s central compute technology provider puts the company alongside Dell EMC, Hewlett Packard Enterprise and its Cray business, Fujitsu and others as a major server provider, Olds said. Nvidia is trying to buy Arm in a $40 billion deal that is getting substantial pushback from regulators in the United States and elsewhere, as well as from some competitors who also use the Arm architecture in their chips.
Nvidia has used Arm’s architecture in the past and can do it again. It doesn’t have to buy Arm, he said, adding that the company has “the intellectual horsepower and the money to build their own Arm variant, just like Fujitsu did for Fugaku [the fastest supercomputer in the world, according to the Top500 list]. They don’t have to own Arm in order to build Arm.”