Nvidia GPUs Nudge HPE Supercomputer into the Exascale

Once Polaris is assembled and put online in early 2022, it will have 44 petaFLOPS peak performance, ranking as the world's ninth-fastest supercomputer.
Aug 30th, 2021 5:00am by
Feature Photo by Javier Esteban on Unsplash.

Enterprise IT systems provider Hewlett Packard Enterprise will start delivering and installing the computing components that will make up Polaris, a massive supercomputer that will be housed at the Argonne National Laboratory in Illinois and serve as a testbed for artificial intelligence (AI) and other projects for the lab’s upcoming Aurora exascale system.

Once Polaris is assembled and put online in early 2022, it will deliver more than four times the performance of the supercomputers currently being run at Argonne and, at 44 petaFLOPS (44 quadrillion floating-point operations per second) peak performance, would rank as the ninth-fastest system on the twice-yearly Top500 list of the world’s fastest supercomputers, based on the most recent list released in June.

The supercomputer, which will include 2,240 A100 Tensor Core GPU accelerators from Nvidia, also will drive almost 1.4 exaFLOPS of theoretical AI performance, based on mixed-precision compute capabilities.
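Both headline figures can be roughly reproduced from per-GPU peak rates. The sketch below is a back-of-the-envelope check, not an official HPE, Nvidia or Argonne calculation. The GPU count comes from the announcement; the per-GPU rates (19.5 teraFLOPS for FP64 Tensor Core and 624 teraFLOPS for FP16 Tensor Core with structured sparsity) are assumptions taken from Nvidia's published A100 peak specifications, and CPU contributions are ignored.

```python
# Back-of-the-envelope check of Polaris's headline peak figures.
# GPU count is from the article; the per-GPU rates are ASSUMPTIONS taken from
# Nvidia's published A100 peak specifications, not from the announcement.

GPU_COUNT = 2_240                  # A100 GPUs in Polaris (from the article)
A100_FP64_TENSOR_TFLOPS = 19.5     # assumed: A100 peak FP64 Tensor Core rate
A100_FP16_SPARSE_TFLOPS = 624.0    # assumed: A100 peak FP16 Tensor Core rate with sparsity


def system_peak_tflops(gpu_count: int, per_gpu_tflops: float) -> float:
    """Theoretical peak: every GPU running at its peak rate at the same time."""
    return gpu_count * per_gpu_tflops


# Convert teraFLOPS to petaFLOPS (divide by 1e3) and to exaFLOPS (divide by 1e6).
fp64_petaflops = system_peak_tflops(GPU_COUNT, A100_FP64_TENSOR_TFLOPS) / 1_000
ai_exaflops = system_peak_tflops(GPU_COUNT, A100_FP16_SPARSE_TFLOPS) / 1_000_000

print(f"FP64 peak: {fp64_petaflops:.2f} petaFLOPS")
print(f"AI peak:   {ai_exaflops:.3f} exaFLOPS")
```

Under these assumptions the totals come to roughly 43.7 petaFLOPS and 1.4 exaFLOPS, in line with the quoted figures. Both are theoretical peaks rather than measured results.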

Polaris “is going to allow [Argonne’s] developers, their application holders, their engineers to start building out capabilities for accelerated computing at a grand scale,” Dion Harris, technical marketing leader at Nvidia, said during a press briefing about the supercomputer. “This is a very performant system, both in terms of AI as well as classic FP64 for first principles-based simulation. Therefore, we expect this to accelerate their core applications, as well as to set them up to have an incredible AI system, even when Aurora is brought online.”

Exascale on the Horizon

Aurora is one of three exascale supercomputers – along with El Capitan at the Lawrence Livermore National Lab and Frontier at Oak Ridge National Lab – being built in the United States and expected to go online in the next year or two. Exascale computing promises to enable researchers to run increasingly complex high-performance computing (HPC) workloads that current systems can’t handle and to help enterprises that are being overwhelmed with data and emerging workloads like AI and data analytics.

Other countries and regions, including China, Japan and the European Union, also are building exascale supercomputers as they and the United States compete to establish themselves as leaders in exascale and supercomputing. Those that do will have an edge in areas ranging from scientific research and the military to health care and the economy.

In mid-2019, the U.S. Department of Energy (DOE) awarded longtime supercomputer vendor Cray the $600 million contract to build El Capitan. The company had already been named to build Aurora and Frontier. HPE became the system vendor for all three when it bought Cray in September 2019 for $1.3 billion, a move that greatly expanded its presence in HPC.

Polaris Takes Shape

Now the company is building Polaris. It will be based on 280 Apollo Gen10 Plus systems, which were created for HPC and AI environments and built with the exascale era in mind. The systems will use 560 2nd and 3rd Gen Epyc server processors from AMD for improved modeling, simulation and data-intensive workflows, and the GPUs will help drive the supercomputer’s capabilities for running AI workloads.
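Taken together with the 2,240 A100 GPUs cited earlier, these counts imply a per-system configuration. The sketch below simply divides the article's totals. It assumes that processors and GPUs are spread evenly across the 280 systems, which the announcement does not state explicitly.

```python
# Per-system breakdown implied by the totals quoted in the article.
# ASSUMPTION: CPUs and GPUs are distributed evenly across all systems.

SYSTEM_COUNT = 280   # Apollo Gen10 Plus systems (from the article)
CPU_COUNT = 560      # AMD Epyc processors (from the article)
GPU_COUNT = 2_240    # Nvidia A100 GPUs (from the article)


def per_system(total: int, systems: int) -> int:
    """Return the even share per system, or raise if the total does not divide evenly."""
    share, remainder = divmod(total, systems)
    if remainder:
        raise ValueError(f"{total} does not divide evenly across {systems} systems")
    return share


cpus_per_system = per_system(CPU_COUNT, SYSTEM_COUNT)
gpus_per_system = per_system(GPU_COUNT, SYSTEM_COUNT)
gpus_per_cpu = gpus_per_system // cpus_per_system

print(f"{cpus_per_system} CPUs and {gpus_per_system} GPUs per system "
      f"({gpus_per_cpu} GPUs per CPU)")
```

On that assumption, each system would pair two Epyc processors with eight A100 GPUs.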

Aurora, like Polaris, will be an accelerated system, though using Intel’s Xeon Scalable processors and the chipmaker’s upcoming Xe-HPC “Ponte Vecchio” GPUs.

Polaris will run HPE’s CrayOS operating system and use HPE’s Slingshot, an Ethernet fabric designed for HPC and AI environments that the vendor inherited when it bought Cray, as its high-speed interconnect. HPE Performance Cluster Manager will monitor and manage the supercomputer to ensure optimal performance. For storage, the supercomputer will use Eagle and Grand, both 100-petabyte Lustre systems developed last year by the Argonne Leadership Computing Facility (ALCF), a DOE science site, and supported by HPE’s Cray ClusterStor E1000 platform. The Eagle system enables data sharing within the scientific community, according to Argonne.

Polaris has been discussed for roughly a year, but the Aug. 25 announcement brought with it many more details.

“Polaris is well equipped to help move the ALCF into the exascale era of computational science by accelerating the application of AI capabilities to the growing data and simulation demands of our users,” ALCF Director Michael Papka said in a statement. “Beyond getting us ready for Aurora, Polaris will further provide a platform to experiment with the integration of supercomputers and large-scale experiment facilities … making HPC available to more scientific communities. Polaris will also provide a broader opportunity to help prototype and test the integration of HPC with real-time experiments and sensor networks.”

Accelerated Computing is Key

Nvidia’s Harris said supercomputing has driven efforts to push the boundaries of what technology can do to solve a broad array of challenges, including finding cures for cancer, exploring fusion energy and addressing climate change. However, researchers have been hindered in recent years by the slowing of Moore’s Law at a time when the size of problems and the amount of data keep growing. The arrival of AI, and its widespread use in internet applications, drew interest from scientists in how the technology could be used for their research.

Leveraging GPU accelerators like those from Nvidia will help drive the performance of such workloads running on Polaris, he said.

“When we talk about how the technology is going to be used, it’s really exciting to see that scientists can get started now,” Harris said. “Once we deploy the system early next year, they’ll be able to start working on these applications [and] to port them to an accelerated model. They can start leveraging AI to really build out the capabilities of how they can look at this converged HPC-plus-AI model. Then again, they can start testing some of those theories by getting a head start on that leveraging Polaris.”

Once online, Polaris initially will be used by researchers participating in such programs as the DOE’s Exascale Computing Project and ALCF’s Aurora Early Science Program. The latter was created to enable scientists, engineers and other users to prepare key applications to run on a system of Aurora’s architecture and scale, to get libraries and infrastructure in place for other production applications, and to tackle projects that current supercomputers can’t.

Those include projects ranging from advancing cancer treatments and addressing the United States’ energy security while reducing the impact on the climate to conducting particle collision research in the ATLAS experiment, which employs the Large Hadron Collider particle accelerator in Switzerland.

Within a few months of Polaris going online, it likely will be opened up to the wider research community, Harris said.
