Kubernetes Evolution: From Microservices to Batch Processing Powerhouse
Kubernetes has come a long way since its inception in 2014.
Initially focused on supporting microservice-based workloads, Kubernetes has evolved into a powerful and flexible tool for building batch-processing platforms. This transformation is driven by the growing demand for machine learning (ML) training capabilities, the shift of high-performance computing (HPC) systems to the cloud, and the evolution towards more loosely coupled mathematical models in the industry.
A good illustration of this trend is recent work by PGS, which used Kubernetes to build a compute platform with 1.2M vCPUs, equivalent to the seventh-ranked supercomputer in the world but running in the cloud on Spot VMs.
In its early days, Kubernetes was primarily focused on building features for microservice-based workloads. Its strong container orchestration capabilities made it ideal for managing the complexity of such applications.
However, users with batch workloads frequently preferred other frameworks such as Slurm, Mesos, HTCondor, or Nomad. These frameworks provided the features and scalability that batch processing requires, but they lacked the vibrant ecosystem, community support, and integration capabilities offered by Kubernetes.
In recent years, the Kubernetes community has recognized the growing demand for batch processing support and has made significant investments in this direction. One such investment is the formation of the Batch Working Group, which has undertaken several initiatives to enhance Kubernetes’ batch processing capabilities.
The Batch Working Group has made numerous improvements to the Job API, making it more robust and flexible so that it supports a wider range of batch processing workloads. The revamped API lets users manage batch jobs more easily and brings scalability, performance, and reliability enhancements.
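To make the Job API more concrete, here is a minimal sketch of a batch/v1 Job manifest built as a Python dictionary. The job name, image, and command are hypothetical placeholders, and the field values are illustrative assumptions rather than recommendations. The fields shown (completions, parallelism, completionMode, backoffLimit) are part of the Job spec; in Indexed completion mode each pod receives its index through the JOB_COMPLETION_INDEX environment variable, which a worker could use to pick its shard of the input.

```python
import json


def build_indexed_job(name, image, command, completions, parallelism, backoff_limit=3):
    """Return a batch/v1 Job manifest as a plain dict (illustrative sketch)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,    # total number of successful pods required
            "parallelism": parallelism,    # maximum pods running at the same time
            "completionMode": "Indexed",   # each pod gets a distinct completion index
            "backoffLimit": backoff_limit, # retries before the Job is marked failed
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "worker", "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical name, image, and command, used only for illustration.
job = build_indexed_job(
    name="example-batch-job",
    image="registry.example.com/batch-worker:latest",
    command=["python", "process_shard.py"],
    completions=10,
    parallelism=4,
)
manifest_json = json.dumps(job, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and submitted with kubectl apply -f, since kubectl accepts JSON as well as YAML manifests.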
Kueue (https://kueue.sigs.k8s.io/) is a new job scheduler developed by the Batch Working Group, designed specifically for Kubernetes batch processing workloads. It offers advanced features such as job prioritization, backfilling, resource flavor orchestration, and preemption, ensuring efficient and timely execution of batch jobs while keeping resource utilization high.
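The ideas of prioritization and backfilling can be illustrated with a toy sketch. This is not Kueue's actual algorithm or API: the PendingJob type, the admit_with_backfill function, and the single CPU quota are all hypothetical simplifications, and preemption is not modeled. Jobs are considered in priority order, and a job that does not fit in the remaining quota is skipped so that smaller, lower-priority jobs can use the leftover capacity.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    name: str
    priority: int  # higher value means more important
    cpus: int      # CPUs requested by the job


def admit_with_backfill(pending, free_cpus):
    """Admit jobs by descending priority; jobs that do not fit wait,
    and smaller lower-priority jobs backfill the remaining capacity."""
    admitted = []
    waiting = []
    for job in sorted(pending, key=lambda j: -j.priority):
        if job.cpus <= free_cpus:
            admitted.append(job.name)
            free_cpus -= job.cpus
        else:
            waiting.append(job.name)
    return admitted, waiting


# Hypothetical workload mix and quota, for illustration only.
jobs = [
    PendingJob("train-large", priority=10, cpus=64),
    PendingJob("report", priority=5, cpus=8),
    PendingJob("simulate", priority=8, cpus=48),
    PendingJob("cleanup", priority=1, cpus=4),
]
admitted, waiting = admit_with_backfill(jobs, free_cpus=80)
print("admitted:", admitted)  # ['train-large', 'report', 'cleanup']
print("waiting:", waiting)    # ['simulate']
```

In this example the second-highest-priority job has to wait because it does not fit, while two smaller jobs fill the remaining 16 CPUs. A real queueing system has to balance this kind of utilization gain against the risk of starving large jobs.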
The team is now working on integrating Kueue with frameworks such as Kubeflow, Ray, Spark, and Airflow. These integrations allow users to leverage the power and flexibility of Kubernetes while utilizing the specialized capabilities of these frameworks, creating a seamless and efficient batch-processing experience.
The group is also looking to deliver a number of other capabilities. These include job-level provisioning APIs in autoscaling, scheduler plugins, node-level runtime improvements, and many others.
As Kubernetes continues to invest in batch processing support, it becomes an increasingly competitive option for users who previously relied on other frameworks. Kubernetes brings a number of advantages to the table, including:
- Extensive Multitenancy Features: Kubernetes provides robust security, auditing, and cost allocation features, making it an ideal choice for organizations managing multiple tenants and heterogeneous workloads.
- Rich Ecosystem and Community: Kubernetes boasts a thriving open-source community, with a wealth of tools and resources available to help users optimize their batch-processing tasks.
- Managed Hosting Services: Kubernetes is available as a managed service on all major cloud providers. These services integrate tightly with the providers' compute stacks, letting users take advantage of unique capabilities and simplifying the orchestration of scarce, harder-to-use resources such as Spot VMs or accelerators. Using these services can lead to faster development cycles, more elasticity, and a lower total cost of ownership.
- Compute Orchestration Standardization and Portability: Enterprises can choose a single API layer to wrap their computational resources and mix their batch and serving workloads. They can use Kubernetes to reduce lock-in to a single provider and gain the flexibility to use the best of what the current cloud market has to offer.
Usually, a user’s transition to Kubernetes also involves containerizing their batch workloads. Containers themselves have revolutionized the software development process, and for computational workloads they greatly accelerate release cycles, leading to much faster innovation.
Containers encapsulate an application and its dependencies in a single, self-contained unit that runs consistently across different platforms and environments. They eliminate the “it works on my machine” problem and enable rapid prototyping and faster iteration cycles. Combined with cloud hosting, containers provide the agility that helps HPC and ML-oriented companies innovate faster.
The Kubernetes community still needs to solve a number of challenges, including the need for more advanced controls of the runtime on each host node, and the need for more advanced Job API support. HPC users are accustomed to having more control over the runtime.
Setting up large-scale platforms using Kubernetes on premises still requires a significant amount of skill and expertise. There is currently some fragmentation in the batch processing ecosystem, with different frameworks re-implementing common concepts (like Job, Job Group, Job Queueing) in different ways. Going forward we’ll see these addressed with each Kubernetes release.
The evolution of Kubernetes from a microservices-focused platform to a powerful tool for batch processing demonstrates the adaptability and resilience of the Kubernetes community. By addressing the growing demand for ML training capabilities and the migration of HPC to the cloud, Kubernetes has become an increasingly attractive option for batch-processing workloads.
Kubernetes’ extensive multitenancy features, rich ecosystem, and managed hosting services on major cloud providers make it a great choice for organizations seeking to optimize their batch-processing tasks and tap into the power of the cloud. If you want to join the Batch Working Group and help contribute to Kubernetes then you can find all the details here. We have regular meetings, a Slack channel and an email group that you can join.