Containers for High Performance Computing
With a few adjustments to the technology, Docker containers could bring heretofore unseen efficiencies to supercomputers and high-performance computing (HPC), predicts Docker Technical Account Manager Christian Kniep in a series of recent talks at ContainerDays Hamburg and elsewhere.
To date, the supercomputing world has taken scarce notice of containers, even though they could bring greater efficiencies to an ecosystem obsessed with performance. To a certain extent, this is due to architectural differences — the tightly coupled model of supercomputing is at odds with the loosely coupled “microservices” architectures that containerization lends itself to. Security concerns play a role as well: the prospect of running a Docker daemon on worker nodes spooks some administrators.
But some jobs naturally lend themselves to the HPC model, such as work in artificial intelligence, which can take advantage of the muscular vector-processing capabilities that supercomputers offer through their plentiful GPUs, Kniep notes. This week, the TOP500, a twice-yearly roundup of the world’s fastest supercomputers, announced that the combined power of all the computers on the list exceeded one exaflop, at 1.22 exaflops. An exaflop is a thousand petaflops, or a quintillion floating point operations per second. GPUs were used in 110 of these systems, 98 of which relied on Nvidia hardware.
Supercomputers are designed to run large workloads — such as weather forecasting or molecular modeling — as a single application across hundreds or even thousands of servers. Despite the initial similarities to cloud computing, some fundamental differences do exist between the two architectures, Kniep points out. And these differences must be accounted for if Docker or other cloud-native vendors wish to service this market.
For instance, to take advantage of specific acceleration hardware such as GPUs or high-speed interconnects, HPC applications are often host-specific, designed for one particular platform. In this sense, they take a hardware-specific approach, rather than the hardware-agnostic one that Docker adheres to. Secondly, HPC programs tend to be tightly coupled: they may have multiple components, but all of them need access to a shared memory space.
“HPC needs to run on shared resources,” as Kniep said in a talk at the 2018 HPC Advisory Council Swiss Conference (he gave a similar talk at ContainerDays EU last week). Unlike most Docker setups today, in an HPC setup, multiple nodes will be reading and writing to a single shared file system. Each node must also securely maintain its own set of data, without intrusion from other nodes.
Typically, Docker gives the container whatever user and group privileges are set within the container itself. This practice is insecure, however, since someone can spin up a container with somebody else’s ID and gain the full permissions of that user. For this reason, user and group permissions set by the container should not be trusted for security purposes.
As a proof of concept to solve this issue, Kniep developed a mini-proxy for the Docker engine that overwrites the container’s permissions with those of the user who started the container. In this way, the permissions of the actual user are attached to the container, rather than relying on what the container itself claims.
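The same idea can be sketched with Docker’s stock `--user` flag, without Kniep’s proxy (the image and mount path here are placeholders): passing the host user’s UID and GID at launch overrides whatever user the image declares, so files written to a shared volume carry the real user’s ownership.

```shell
# Run the container as the invoking host user, ignoring any user
# baked into the image; /shared/scratch stands in for an HPC
# shared file system (hypothetical path).
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v /shared/scratch:/scratch \
  alpine:3.19 id -u
```

Note this only pins the numeric IDs; a full solution, as Kniep’s proxy shows, has to enforce this on every container start rather than trusting users to pass the flag.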
Another big aspect of HPC is kernel bypass: to speed performance, applications commonly communicate directly with specialized hardware such as GPUs or interconnects, skipping the operating system kernel altogether.
An initial solution to this issue might be to move host-specific drivers into the container, but that would bloat the container image (the Nvidia CUDA driver alone can be as large as 1GB). A better option would be to keep the drivers on the host, and then map them into the container itself. Each host could then map its own characteristics into the container, using pre-defined “mount points.”
Kniep’s Docker engine proxy also handles this task, requiring only input from the user about the hardware requirements (i.e., a GPU or an InfiniBand interconnect) as well as the location of the shared library on the host.
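As a rough sketch of the mount-point approach using plain Docker flags (the device nodes, library path, and image name are illustrative; actual paths vary by host), the host’s driver files can be exposed to an otherwise hardware-agnostic image at run time:

```shell
# Expose the GPU device nodes and bind-mount the host's CUDA driver
# library into the container read-only, so the image itself ships
# no driver. All paths and names are hypothetical examples.
docker run --rm \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/libcuda.so.1:ro \
  myorg/hpc-app:latest ./run_simulation
```

A proxy like Kniep’s would assemble these flags automatically from the user’s stated hardware requirements, rather than requiring each user to know the host’s paths.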
Work Needs to be Done
With these changes, you could use Docker to stage all containerized workloads on HPC systems, both those that require shared spaces and those that don’t. This approach would offer all the flexibility of virtual machines, but at the wire-speed performance favored by HPC enthusiasts.
Several open source initiatives have already tackled the problem of bringing containers to the HPC space, most notably Shifter and Singularity. These efforts, Kniep warned, don’t follow the specifications of the Open Container Initiative (OCI), and may be at architectural odds with the cloud-native tools built on top of OCI’s runc and containerd.
Beyond accommodating for shared memory and kernel bypass, more follow-up work would still be required, Kniep added. Work would need to be done to fit containerized workloads into HPC workload schedulers, most notably Slurm, which would pave the way for other HPC schedulers. These workload schedulers handle MPI, the complex Message Passing Interface that provides communications across different nodes running the same job.
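To make the scheduler integration concrete, a containerized MPI job submitted through Slurm might look roughly like the following batch script (the image name, task counts, and networking choices are assumptions; real integration also requires the MPI library inside the image to cooperate with Slurm’s process-management interface):

```shell
#!/bin/bash
#SBATCH --job-name=containerized-mpi   # hypothetical Slurm job script
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# srun launches one container per MPI rank across the allocation;
# host networking lets ranks reach one another, but the wire-up
# between the in-container MPI runtime and Slurm is the hard part.
srun docker run --rm --network host myorg/mpi-app:latest ./solver
```

This is exactly the glue work Kniep describes: today nothing guarantees the container engine, the MPI runtime, and the scheduler agree on ranks, hosts, and interconnects.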
Another nice-to-have feature would be support for diskless nodes, to save deployment time. A job’s container image could be copied once to the shared file system, so that a container engine on each node can spin up its own instance. This would be far more efficient than having thousands of container engines all download the same image from the same repository at once and then each extract the same file system.
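One way to sketch this pattern with today’s Docker CLI (paths and image name are placeholders): export the image once to the shared file system, then have every node load it from there instead of pulling from a registry.

```shell
# On one node: write the image tarball once to the shared
# parallel file system (hypothetical path).
docker save myorg/hpc-app:latest -o /shared/images/hpc-app.tar

# On each compute node: load from the shared copy rather than
# downloading and extracting the same layers over the network.
docker load -i /shared/images/hpc-app.tar
```

Even this still extracts the layers per node; the fully diskless case Kniep envisions would run containers directly off the shared copy.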