If you want to know why the coming exascale computing era will be so important, look no further than the past two years and the COVID-19 pandemic, according to Justin Hotard, senior vice president and general manager for Hewlett Packard Enterprise’s HPC and AI Business Group.
“It’s an example of where the world has gotten more complex and we need our scientists and researchers to have more power to solve these problems that are in front of us: a global pandemic, the virology research coming out of that, the preparation for a vaccination,” Hotard told The New Stack. “But we’re also seeing the potential of what better predictive models would have done for care and some of the preventative actions that were taken. That’s a great example of where a system like this can only help us save lives and protect our citizens.”
The high-performance computing (HPC) industry has spent a decade planning for the arrival of exascale systems, supercomputers that can process at least one exaflop, or a quintillion (a billion billion, or 10^18) calculations per second. After years of planning, innovations and missed deadlines, the world is ready to fully embrace exascale computing.
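To put those prefixes in perspective, here is a small Python sketch of the scale arithmetic. The constant and function names are illustrative only and do not come from any vendor tool.

```python
# Rates in floating-point operations per second (FLOPS).
PETAFLOP = 10**15  # petascale: a million billion operations per second
EXAFLOP = 10**18   # exascale: a billion billion operations per second


def exaflops(rate_flops):
    """Express a raw operations-per-second rate in exaflops (illustrative helper)."""
    return rate_flops / EXAFLOP


# One exaflop equals a thousand petaflops.
petaflops_per_exaflop = EXAFLOP // PETAFLOP
print(petaflops_per_exaflop)

# A quintillion is a billion times a billion.
print(EXAFLOP == 10**9 * 10**9)
```

In other words, a system has to run a thousand times faster than a one-petaflop machine to cross the exascale line.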
Faster Workloads, More Data Crunching
The transition from petascale to exascale computing will enable scientists, developers, HPC organizations and enterprises to run their huge and complex workloads (drug discovery, climate prediction and energy exploration, for example) faster and more efficiently, and even to do computing jobs that were impossible with petascale and pre-exascale systems. Combining artificial intelligence (AI) with exascale computing ramps up the supercomputers’ capabilities to crunch the massive amounts of data being generated.
The vast capabilities of these systems not only have technical implications for a broad range of sectors, from military and scientific research to business and the economy, but have also sparked competition among nations to lead in the exascale field, particularly the United States and China, but also Japan and the European Union.
Strides Made in 2021
This year saw a number of steps in this direction.
China in 2021 reportedly brought two exascale systems online and has another in the works. One, called OceanLight, is the successor to the massive TaihuLight developed by Sunway and offers 1.05 exaflops of performance, according to reports out of the SC21 supercomputing show in November. Another, Tianhe-3, comes 10 years after its predecessor, Tianhe-1A, reached number one on the Top500 list of the world’s fastest supercomputers. Tianhe-3 delivers 1.3 exaflops, with a peak performance of 1.7 exaflops.
A third, being built by system maker Sugon, is still in development.
Intel CEO Pat Gelsinger at a company event in November surprised the industry when he announced that Aurora, one of three U.S.-based exascale systems that will launch between now and 2023, will exceed 2 exaflops in peak performance, twice what had been expected. (Gelsinger also said Intel’s goal was to get to zettascale computing by 2027.)
Company officials attributed Aurora’s performance boost to Intel’s new “Sapphire Rapids” CPUs and better-than-expected capabilities of its “Ponte Vecchio” GPUs.
In addition, Intel in October said it is collaborating with European chip maker SiPearl in a joint venture to combine Ponte Vecchio GPUs with SiPearl’s Rhea CPUs to create a high-performance node for exascale system deployments in the EU.
Acceleration Coming in 2022
Moving into 2022, the United States will add to that momentum. The first of three planned exascale systems, Frontier, which will be powered by AMD Epyc processors and Radeon Instinct MI200 GPUs, is being assembled at the Oak Ridge National Laboratory and is expected to deliver 1.5 exaflops of performance. On the heels of that will come Aurora, which will run on Intel’s new 4th Generation Xeon Scalable Sapphire Rapids CPUs and Xe-HPC Ponte Vecchio GPUs. It’s expected to be completed later in 2022 at the Argonne National Lab.
In 2023, El Capitan, powered by AMD CPUs and GPUs, will launch at the Lawrence Livermore Lab and is expected to exceed 2 exaflops in performance.
All three systems are being built by HPE, which solidified its place in the supercomputing field when it bought systems maker Cray — which had already won the contracts from the Department of Energy for the exascale systems — in 2019 for $1.4 billion.
Part Evolution, Part Revolution
Jeff McVeigh, vice president and general manager of Intel’s Super Compute Group, told The New Stack that the transition to exascale computing is part evolutionary and part revolutionary.
“Could you say you can do that job at half an exascale but it just takes twice as long? True,” McVeigh said. “That’s why it’s a bit of an evolution from where we are today. But it also sets a new bar for what’s possible for those workloads and when developers realize that a system like Aurora, which is open for others to have access to, that they can utilize that for new discoveries, that really changes how people approach it. It’s both. We’ve evolved over time, but then once we hit a big marker for a number, it helps to rethink how we approach things.”
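McVeigh’s half-an-exascale remark is simple rate arithmetic, sketched below in Python under an idealized assumption: runtime equals total operations divided by sustained rate, ignoring the memory, I/O and interconnect limits that real workloads hit. The job size and the function name are hypothetical.

```python
EXAFLOP = 10**18  # operations per second


def ideal_runtime_seconds(total_operations, rate_flops):
    """Idealized time to solution: work divided by sustained rate."""
    return total_operations / rate_flops


# Hypothetical job: one hour of work at a sustained exaflop.
job_operations = 3600 * EXAFLOP

full_rate = ideal_runtime_seconds(job_operations, EXAFLOP)
half_rate = ideal_runtime_seconds(job_operations, EXAFLOP // 2)

print(f"At 1.0 exaflop: {full_rate / 3600:.1f} hours")
print(f"At 0.5 exaflop: {half_rate / 3600:.1f} hours")
```

Under this simplification, halving the rate doubles the runtime, which is the evolutionary half of McVeigh’s point. The revolutionary half, jobs that only become practical at the higher rate, is not something a linear model like this captures.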
There have been a number of developments over the past decade that brought the industry to the doorstep of exascale computing. The rise of accelerators (particularly GPUs from Nvidia, AMD and now Intel, though others like field-programmable gate arrays, or FPGAs, and data processing units, or DPUs, are seeing growing use) has enabled increasingly powerful CPUs to do more compute by offloading some tasks. In addition, interconnect fabrics, like the Slingshot mesh HPE inherited via the Cray acquisition, have helped drive performance.
“What we’re able to do with standard Ethernet and the actual packet we put in that Ethernet is really differentiated in the market,” Hotard said. “We’ve pretty much seen it in every system we’ve launched. Even within the first-generation Slingshot systems, which were a combination between a Mellanox NIC and an HPE switch, we see massive performance advances leveraging that interconnect fabric.”
McVeigh said Intel aimed to make it easier for enterprises to work with its CPUs and GPUs, noting the disaggregated nature of Intel’s chip architecture, which includes using advanced packaging technologies and common software across both Sapphire Rapids and Ponte Vecchio.
“I can use what I have on my Xeon processor, move it over to the GPU with a common set of APIs and function calls, and then even supporting multiple architectures,” he said, pointing to the Intel oneAPI programming model for simplifying development across multiple architectures, including CPUs, GPUs and FPGAs.
Getting to Exascale for All
As these systems come online, a challenge will be making them broadly accessible. HPC was for a long time the private playground of scientists, research institutions and the largest enterprises. But the need for more compute power, in an enterprise IT environment awash in data and leveraging technologies like AI, machine learning and automation, now extends to most organizations, regardless of size.
The cloud will play a role in enabling greater access, as will the increasingly open and distributed nature of compute. The industry will have to make progress in lowering the power and cooling costs of these systems to make them more efficient as well, both Hotard and McVeigh said. Once that happens, the benefits of having exascale capabilities will reach a wider audience.
“These systems are enormously expensive to get to exascale today,” McVeigh said. “The energy efficiency also applies to the cost-efficiency. Eventually, these types of computing will be within the reach of everybody, but it’s going to require us to bring down the power and the cost to do so.”
Every new level of computing opens up ideas of how to architect the software, how to use larger data sets and how to take advantage of emerging technologies, he said.
“Until we had much more compute as well as a lot of data available, we were stuck in a very stagnant place with deep learning and machine learning,” he said. “Now that both are available, we’re seeing dramatic increases and we can expect the same with exascale. Once you get that next level [of performance], you start rethinking your problems that you might have overly simplified in the past.”