Achieving Cloud-Native HPC Capabilities in a Mixed Workload Environment
Evolving Enterprise High-Performance Workloads
As the lines blur between traditional workloads and new applications like deep learning, containers are becoming a hot topic in enterprise HPC as well. Not surprisingly, like their internet colleagues deploying cloud-scale services, HPC architects see value in cloud-native approaches. HPC developers have been building distributed applications since before clusters were cool, open source is in their DNA, and they also appreciate the elegance of parallelism, resilience and horizontal scaling. While the term CI/CD didn’t originate in HPC, some HPC admins face the same challenge as their DevOps colleagues, needing to deploy new functionality quickly and reliably.
Barriers to Adoption
So why don’t we see a mass migration to the cloud and widespread availability of containerized HPC applications expressed as Kubernetes YAML templates? As is often the case, the answer is complicated.
There are several issues, but we explore two in more detail below.
- Significant investments in certified and trusted application workflows.
- Technical considerations related to workload management.
After decades of enterprise investment in HPC, there are thousands of thoroughly exercised math libraries and domain-specific applications, from linear algebra and differential equation routines to the explicit parallel solvers used in vehicle crash simulation. These form the core of innovation in the industry verticals where they are used and are thus mandatory parts of enterprise HPC architectures.
Like the applications themselves, workload managers have been refined over decades to address recurring workload patterns. Among these patterns are interactive jobs, MPI parallel jobs, multi-step workflows, and parametric sweeps. Particularly in enterprise HPC environments, where time is money, users rely on features like hierarchical resource sharing, topology-aware scheduling, backfill scheduling, and policy-based pre-emption to balance high utilization of expensive resources against business priorities that may reduce throughput. Interestingly, new workloads in Artificial Intelligence (AI) increasingly rely on the computational power of GPUs, which perform the matrix arithmetic needed to train neural networks faster. Whether these workloads run in containers, in VMs or on bare metal, they rely on techniques pioneered in HPC to manage GPUs as schedulable, consumable resources.
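To make two of these ideas concrete, here is a minimal Python sketch of backfill scheduling with GPUs treated as a consumable resource. It is an illustration only and not how Univa Grid Engine or any other workload manager is implemented. The `Job` class, the `backfill` function and the simplified rule (a lower-priority job may jump ahead only if it fits in idle GPUs and its estimated runtime ends before the blocked job's reservation) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int      # GPUs requested; consumed while the job runs
    runtime: int   # user-supplied runtime estimate, in minutes


def backfill(queue, free_gpus, running):
    """Return the names of the jobs to start now.

    queue     -- pending jobs, highest priority first
    free_gpus -- number of GPUs idle right now
    running   -- (minutes_until_finish, gpus) for jobs already executing
    """
    queue = list(queue)
    started = []

    # Start jobs strictly in priority order for as long as they fit.
    while queue and queue[0].gpus <= free_gpus:
        job = queue.pop(0)
        free_gpus -= job.gpus
        running = running + [(job.runtime, job.gpus)]
        started.append(job.name)
    if not queue:
        return started

    # The head job is blocked. Find the earliest time at which enough
    # GPUs will have been released for it: this is its reservation.
    head = queue[0]
    available = free_gpus
    reservation = None  # stays None if the head job can never fit
    for finish, gpus in sorted(running):
        available += gpus
        if available >= head.gpus:
            reservation = finish
            break

    # Backfill: a lower-priority job may start now only if it fits in the
    # idle GPUs and is expected to finish before the reservation, so the
    # head job is not delayed.
    for job in queue[1:]:
        if job.gpus <= free_gpus and (reservation is None or job.runtime <= reservation):
            free_gpus -= job.gpus
            started.append(job.name)
    return started
```

For example, with two idle GPUs, a running job that releases two more in 30 minutes, and a queue of `Job("big", 4, 60)`, `Job("small", 1, 10)` and `Job("long", 1, 120)`, the sketch starts only `small`: it finishes before `big` can begin, whereas `long` would still be holding a GPU at that point. Production schedulers apply richer rules, but the trade-off between utilization and priority is the same.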
Evolution vs. Revolution
Despite the barriers, change is in the air. Increasingly, users are leveraging public cloud resources and embracing containerization to enable new service-delivery models. While it’s still early days, some users are in a “revolutionary” mood when it comes to new workloads, seeking to solve problems by embracing cloud-native approaches. As Kubernetes becomes ubiquitous in the cloud (GKE, AKS, EKS, etc.), this viewpoint is slowly gaining momentum.
Others are more pragmatic, thinking in terms of “evolution” — keeping applications as they are but using containers to make applications easier to share, more portable across clouds and more resource efficient as they chart their course to cloud-native computing.
Managing Co-Existence Is Key
As enterprise HPC users experiment with new cloud-native software, existing applications need to be integrated. And the last thing a CIO — or even CFO! — wants is separate, replicated clusters for HPC applications, Kubernetes applications and Hadoop MapReduce or Spark workloads. Replicating environments inhibits resource sharing and causes infrastructure and management costs to soar.
To get around this, we see customers taking two distinct approaches to enable co-existence.
- The first approach is to run containerized and non-containerized workloads on existing HPC clusters. HPC workload managers increasingly support containerized applications while preserving the HPC-oriented scheduling semantics. This approach supports containers but does not support the software services required by native Kubernetes applications.
- A second approach is to devise ways of running HPC workloads on local or cloud-resident Kubernetes clusters. While less common and more cutting-edge, this solution is now viable and likely to become more common as the number of native Kubernetes applications grows.
A good example of a site using the second, multi-tenant strategy is the Institute for Health Metrics and Evaluation (IHME), a leading, independent health research center at the University of Washington that houses the world’s largest population health data repository. IHME operates a significant HPC environment for various statistical modeling applications and has also embraced Kubernetes for micro-service applications. Using software solutions including Navops Command (for resource management and multitenancy) and Univa Grid Engine (for workload management and scheduling), existing HPC applications can be scheduled to run transparently inside of Univa Grid Engine as a service while using Kubernetes as the underlying substrate. This approach side-steps a significant barrier to cloud-native technology adoption by allowing traditional HPC workloads to run on a shared Kubernetes cluster. In a case study presented at KubeCon + CloudNativeCon in December 2017, IHME described how it has been able to reduce infrastructure requirements, preserve software investments and modernize applications at its own pace while delivering improved service levels.
Helping HPC Users Cross the Cloud-Native Chasm
As HPC users embrace new cloud-native development and deployment models, they are looking for solutions that can help them evolve to not only hybrid-cloud environments but hybrid application environments as well. Enterprise HPC users need the capacity to run traditional HPC workloads, simple containers, micro-service-based applications, and even Mesos frameworks on a cost-efficient, shared infrastructure.
Regardless of how users plot their course toward cloud-native HPC, Univa has the tools and expertise to help. Readers interested in learning more can visit http://univa.com.
Univa is the leading independent provider of software-defined computing infrastructure and workload orchestration solutions. Univa offers a variety of solutions that can help HPC users on their journey to cloud-native computing. Key offerings mentioned are listed below.
- Univa Grid Engine is a widely used distributed resource manager optimized for both traditional and containerized workloads.
- Navops Command brings advanced scheduling capabilities to Kubernetes including support for multistep workflows and hierarchical resource sharing based on users, groups, and projects. When used together with Univa Grid Engine it allows customers to run existing workloads without modification on a variety of Kubernetes environments.
- Univa’s Universal Resource Broker (URB) is an open source project available for both Grid Engine and Kubernetes clusters. It enables a variety of popular Mesos-compatible frameworks (Hadoop, Spark, Storm, Marathon, and others) to run alongside traditional and containerized workloads.
- Navops Launch automates the deployment of local and cloud-based clusters based on re-usable templates. It provides multiple cloud-specific adapters allowing HPC sites to implement “cloud bursting” strategies, seamlessly shifting workloads to the cloud based on configurable policies in a fashion that is transparent to users.
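
As a rough illustration of what a policy-driven “cloud bursting” decision can look like, the Python sketch below computes how many cloud nodes to add when the local cluster cannot absorb the pending work. It does not reflect Navops Launch's actual policy format or API. The `BurstPolicy` fields, the `nodes_to_add` function and the thresholds used are hypothetical names and values chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class BurstPolicy:
    max_wait_minutes: int  # burst only once the oldest pending job has waited this long
    slots_per_node: int    # job slots provided by one cloud node
    max_cloud_nodes: int   # hard cap on cloud nodes, to bound spend


def nodes_to_add(pending_slots, oldest_wait_minutes, free_local_slots,
                 cloud_nodes_running, policy):
    """Return how many cloud nodes to launch under the given policy."""
    # Do not burst while the queue wait is still within tolerance.
    if oldest_wait_minutes < policy.max_wait_minutes:
        return 0

    # Only the demand that local capacity cannot absorb goes to the cloud.
    shortfall = pending_slots - free_local_slots
    if shortfall <= 0:
        return 0

    # Round up to whole nodes, then respect the configured cap.
    wanted = math.ceil(shortfall / policy.slots_per_node)
    headroom = policy.max_cloud_nodes - cloud_nodes_running
    return max(0, min(wanted, headroom))
```

With a policy of `BurstPolicy(max_wait_minutes=30, slots_per_node=16, max_cloud_nodes=10)`, 100 pending slots, 20 free local slots and a 45-minute wait, the sketch asks for five nodes when no cloud nodes are running, and only two when eight are already up. Keeping the cap and the wait threshold in the policy object rather than in the function is what makes the behavior configurable without code changes.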