High Performance Computing Is Due for a Transformation
Back in 1994 — yes, almost 30 years ago! — Thomas Sterling and Donald Becker built a computer at NASA called the Beowulf.
The architecture of this computer (aka the Beowulf cluster) comprised a network of inexpensive personal computers strung together in a local area network so that processing power could be shared among them. This was a groundbreaking example of a computer that was specifically designed for high-performance computing (HPC) and that was exclusively composed of commodity parts and freely available software.
The Beowulf cluster could be used for parallel computations, in which many calculations or processes are carried out simultaneously across many computers and coordinated with message-passing software. This was the beginning of Linux and open source for HPC, and that made the Beowulf truly revolutionary. For the next 10ish years, more and more people followed the Beowulf model. In 2005, Linux took the No. 1 position at top500.org, and it’s been the dominant operating system for HPC ever since.
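In practice this message passing is done with MPI from C or Fortran, but the coordination pattern itself is simple. As a rough, stdlib-only sketch of the idea, here is a Python stand-in: a "root" process scatters work to worker processes, each computes a partial result, and the results are gathered back as messages (the function and worker counts are illustrative, not MPI itself):

```python
# Sketch of coordinated parallel computation via message passing,
# using Python's stdlib multiprocessing as a stand-in for MPI.
from multiprocessing import Process, Queue

def worker(rank, chunk, results):
    # Each "node" computes a partial sum and sends it back as a message.
    results.put((rank, sum(chunk)))

def parallel_sum(data, nworkers=4):
    results = Queue()
    # Scatter: split the input across workers (round-robin striding).
    chunks = [data[i::nworkers] for i in range(nworkers)]
    procs = [Process(target=worker, args=(r, chunks[r], results))
             for r in range(nworkers)]
    for p in procs:
        p.start()
    # Gather: collect one message per worker, then reduce.
    partial = [results.get() for _ in procs]
    for p in procs:
        p.join()
    return sum(s for _, s in partial)

if __name__ == "__main__":
    print(parallel_sum(list(range(100))))  # prints 4950
```

A real MPI program follows the same scatter/compute/gather shape, just across physical nodes on the cluster's private network rather than local processes.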
The basic architecture of a Beowulf cluster starts with one or more interactive control nodes where users log in and interact with the system. The compute, storage and other resources are all connected to a private network (or networks). The software stack includes Linux, operating system management/provisioning (e.g., Warewulf), message passing (MPI), other scientific software with optimized libraries, and a batch scheduler to manage users’ jobs.
Over time, these systems have become more complicated, with multiple tiers of storage and groups of compute resources, but the basic Beowulf framework has remained the same for nearly 30 years. So, too, has the HPC workflow; from a user perspective, we have not made lives easier for HPC consumers in over three decades! Every HPC user has to follow roughly the same steps on every HPC system:
- SSH into interactive node(s).
- Research and understand the storage system configuration and mount points.
- Download source code to the right storage path.
- Compile the source code taking into consideration the system or optimized compilers, math libraries (and locations), MPI, and possibly storage and network architecture.
- Upload input data to the right storage path (which might be different from the source code path above).
- Research the resource manager queues, accounts, and policies.
- Test and validate the compiled software against test data.
- Monitor job execution and verify proper functionality.
- Validate job output.
- Repeat as necessary.
- Download the resulting data for post-processing or further research.
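The compile, upload and submit steps above typically funnel into a batch script that is handed to the resource manager. As a minimal sketch of what that looks like (assuming a Slurm-style scheduler; the partition, account, module and command names are hypothetical placeholders), a small helper can assemble such a script:

```python
# Minimal sketch: assemble a Slurm-style batch script for the workflow above.
# The partition, account, module name and command are hypothetical placeholders.
def make_job_script(job_name, command, nodes=2, ntasks=8,
                    partition="compute", account="research"):
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --account={account}",
        "module load mpi   # site-specific; adjust to your system",
        f"srun {command}   # launch the job under the scheduler",
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # The user would then submit this with: sbatch job.sh
    print(make_job_script("sim01", "./my_simulation input.dat"))
```

Note how much system-specific knowledge (partitions, accounts, module names, storage paths) the user must research and encode before a single job runs; that is exactly the barrier to entry described above.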
The Ever-Growing Cost of Using a 30-Year-Old HPC Architecture
Our continued use of the legacy HPC framework is exacting a costly toll on the scientific community by way of lost opportunities, unclaimed economies of scale and shadow IT costs.
Lost opportunities include the researchers and organizations that cannot make use of the legacy HPC computing architecture and instead are stuck using non-supportable, non-scalable and non-professionally maintained architectures. For example, I’ve met multiple researchers using their laptops as their computing infrastructure.
Other lost opportunities include the inability to accommodate modern workloads, many of which are insufficiently supported by the legacy HPC architecture. For example, it is nearly impossible to securely integrate the traditional HPC system architecture into CI/CD pipelines for automated training and analytics; simpler development and resource front-ends such as Jupyter (discussed later); jobs of ever-increasing diversity; and multi-prem, off-prem and even cloud resources.
Also, many enterprises have demonstrated resistance to legacy system architectures like Beowulf. “We don’t want our system administrators using Secure Shell (SSH) anymore, and Beowulf requires all users to use SSH to interface with the system!”
When IT teams have to build custom systems for particular needs and usage (which is what is happening now at many scientific centers), they cannot leverage the hardware investments effectively because each “system” exists as an isolated pool of resources. We are seeing this now with centers building completely separate systems for compute-based services and Jupyter with Kubernetes. Going unclaimed are the economies of scale that could be achieved if HPC resources properly supported all of these use cases.
Moreover, in far too many cases research teams are trying to build their own systems or using cloud instances outside of IT purview, because they feel IT is not providing them the tools that they need for their research. While the cloud has made some forms of computation easy, it doesn’t always make sense compared to local on-prem resources, especially if you’re locked into a single cloud vendor.
These unfortunate truths are stifling research and scientific advancements.
Hints of Progress?
Certainly, a few things have come along that have made the experience for HPC users a bit easier. Open OnDemand, for example, is a fantastic way to encapsulate the entire Beowulf architecture and give it back to the user as an HTTP-based (i.e., web-based) graphical interface. OnDemand offers great value in providing a more modern user interface (UI) than SSH, but many sites have found that it has not significantly lowered the barrier to entry, because the user still has to understand all of the same steps outlined above.
Another improvement is the Jupyter Notebook, which has been a huge leap in terms of making life better for researchers and developers. Often used in academia for teaching purposes, Jupyter helps researchers do real-time development and run “notebooks” using a more modern, interactive, web-based interface. With Jupyter, we’re finally seeing the user’s experience evolve — the list of steps is simplified.
However, Jupyter is not generally compatible with the traditional HPC architecture, so it has not been possible to integrate it cleanly with existing HPC systems. As a matter of fact, a number of traditional HPC centers run their traditional HPC systems on one side and a separate Jupyter system on the other, running on top of Kubernetes and enterprise-focused infrastructure. True, you can use Open OnDemand plus Jupyter to merge these approaches, but that recomplicates the process for users — adding more and different steps that make the process difficult.
Containers Lead the Way to a More Modern HPC World
Containers have served as a “Pandora’s Box” (in a good way!) to the HPC world by demonstrating that there are numerous innovations that have occurred in the non-HPC spaces which can be quite beneficial to the HPC community.
Containers arrived in the enterprise via Docker and the like, but those implementations required privileged root access to operate, so letting non-privileged users run containers would have opened up security risks on HPC systems. That’s why I created the first general-purpose container system for HPC — Singularity — which was immediately adopted by HPC centers worldwide due to massive, previously unmet demand. I have since moved Singularity into the Linux Foundation to guarantee that the project will always be for the community, by the community and free of all corporate control. As part of that move, the project was renamed Apptainer.
Apptainer has changed how people think about reproducible computing. Applications are now much more portable and reusable between systems, researchers and infrastructures. Containers have simplified the process of building custom applications for HPC systems, since an application can now easily be encapsulated into a container that includes all of its dependencies. Containers have been instrumental in starting the process of HPC modernization, but they are just the first step toward making lives better for HPC users. Imagine what comes next as this transformation drives the next generation of HPC environments.
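To make the encapsulation idea concrete, here is an illustrative Apptainer definition file (the base image and package names are placeholders, not a prescribed recipe). The whole software environment, OS layer, dependencies and application, is captured in one file that builds into a single image file, which any user can then run unprivileged:

```
Bootstrap: docker
From: rockylinux:9

%post
    # Install the application's dependencies inside the container.
    dnf -y install python3
    # ... build or install the scientific application here ...

%runscript
    # Executed when the container image is run.
    exec python3 "$@"
```

Built with `apptainer build app.sif app.def`, the resulting `app.sif` image carries its dependencies with it, so it behaves the same on a laptop, an on-prem cluster or a cloud instance.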
What Is to Come?
It is time for the computing transformation: the generalized HPC architecture needs to be modernized to be able to better provide for a wider breadth of applications, workflows and use cases. Taking advantage of modern infrastructure innovations (cloud architecture, hardware such as GPUs, etc.), we must build HPC systems that support not only the historical/legacy use cases but also the next generation of HPC workloads.
At CIQ, we’re currently working on this and have been developing a solution that will make HPC approachable for users of all experience levels. The vision is to provide a modern, cloud native, hybrid, federated infrastructure that will run clusters on-premises and across premises, in one cloud or many, even across multiple availability regions in multiple clouds.
A gigantic distributed computing architecture will be stitched together with a single API, offering researchers total flexibility in locality, mobility, gravity and data security. In addition, we aim to abstract away all the complexity of operation and minimize the steps involved in running HPC workflows.
Our goal is to enable science by modernizing HPC architecture — both to support a greater breadth of job diversity and to lower the barrier of entry to HPC to more researchers, optimizing the experience for all.