Where are you using WebAssembly?
Wasm promises to let developers build once and run anywhere. Are you using it yet?
At work, for production apps
At work, but not for production apps
I don’t use WebAssembly but expect to when the technology matures
I have no plans to use WebAssembly
No plans and I get mad whenever I see the buzzword
AI / Hardware / Kubernetes / Operations

Enhance Kubernetes Scheduling for GPU-Heavy Apps with Node Templates

Kubernetes' preferred methods of scheduling workloads are not cost-effective for those jobs requiring expensive GPUs. Here's a better way to run these jobs.
Jun 7th, 2023 10:00am by
Featued image for: Enhance Kubernetes Scheduling for GPU-Heavy Apps with Node Templates
Image by Andreas Lischka from Pixabay.

Kubernetes scheduling ensures that pods are matched to the right nodes so that the Kubelet can run them.

The whole mechanism promotes availability and performance, often with great results. However, the default behavior is an anti-pattern from a cost perspective. Pods running on half-empty nodes equal higher cloud bills. This problem becomes even more acute with GPU-intensive workloads.

Perfect for parallel processing of multiple data sets, GPU instances have become a preferred option for training AI models, neural networks, and deep learning operations. They perform these tasks faster, but also tend to be costly and lead to massive bills when combined with inefficient scheduling.

This issue challenged one of CAST AI’s users — a company developing an AI-driven security intelligence product. Their team overcame it with our platform’s node templates, an autoscaling feature that boosted the provisioning and performance of workloads requiring GPU-enabled instances.

Learn how node templates can enhance Kubernetes scheduling for GPU-intensive workloads.

The Challenge of K8s Scheduling for GPU Workloads

Kube-scheduler is Kubernetes’ default scheduler running as part of the control plane. It selects nodes for newly created and yet unscheduled pods. By default, the scheduler tries to spread these pods evenly.

Containers within pods can have different requirements, so the scheduler filters out any nodes that don’t meet the pod’s specific needs.

It identifies and scores all feasible nodes for your pod, then picks the one with the highest score and notifies the API server about this decision. Several factors impact this process, for example, resource requirements, hardware and software constraints, affinity specs, etc.

Fig. 1 Kubernetes scheduling in overview

The scheduler automates the decision process and delivers results fast. However, it can be costly as its generic approaches may get you to pay for resources that are suboptimal for different environments.

Kubernetes doesn’t care about the cost. Sorting out expenses — determining, tracking and reducing them — is up to engineers, and this is particularly acute in GPU-intensive applications, as their rates are steep.

Costly Scheduling Decisions

To better understand their price tag, let’s look at Amazon EC2 P4d designed for machine learning and high-performance computing apps in the cloud.

Powered by NVIDIA A100 Tensor Core GPUs, it delivers top throughput and low latency networking and support for 400 Gbps instance networking. P4d promises to lower the cost of training ML models by 60% and provide 2.5x better performance for deep learning than earlier P3 instance generations.

While it sounds impressive, it also comes at an hourly on-demand price exceeding the cost of a popular instance type like C6a several hundred times. That’s why it’s essential to control the scheduler’s generic decisions precisely.

Fig. 2 Price comparison of p4d and c6a

Unfortunately, when running Kubernetes on GKE, AKS or Amazon Web Services‘ Elastic Kubernetes Service (EKS), you have minimal impact on adjusting scheduler settings without using components such as MutatingAdmissionControllers.

That’s still not a bulletproof solution, as when authoring and installing webhooks, you need to proceed with caution.

Node Templates to the Rescue

This was precisely the challenge one of CAST AI users faced. The company develops an AI-powered intelligence solution for the real-time detection of threats from social and news media. Its engine analyzes millions of documents simultaneously to catch emerging narratives, but it also enables the automation of unique Natural Language Processing (NLP) models for intelligence and defense.

The volumes of classified and public data that the product uses are ever-growing. That means its workloads often require GPU-enabled instances, which incur extra costs and work.

Much of that effort can be saved using node pools (Auto Scaling groups). But while helping streamline the provisioning process, node pools can also be highly cost-ineffective, leading you to pay for the capacity you don’t need.

CAST AI’s autoscaler and node templates improve on that by providing you with tools for better cost control and reduction. In addition, thanks to the fallback feature, node templates let you benefit from spot instance savings and guarantee capacity even when spots become temporarily unavailable.

Node Templates in Action

The workloads of the CAST AI client now run on predefined groups of instances. Instead of having to select specific instances manually, the team can broadly define their characteristics, for example “CPU-optimized,” “Memory-optimized” and “GPU VMs,” then the autoscaler does the rest.

This feature has given them far more flexibility, as they can use different instances more freely. As AWS adds new, highly performant instance families, CAST AI automatically enrolls you for them, so you don’t need to enable them additionally. This isn’t the case with node pools, which require you to keep track of new instance types and update your configs accordingly.

By creating a node template, our client could specify general requirements — instance types, the lifecycle of the new nodes to add, and provisioning configs. They additionally identified constraints such as the instance families they didn’t wish to use (p4d, p3d, p2) and the GPU manufacturer (in this case, NVIDIA).

For these particular requirements, CAST AI found five matching instances. The autoscaler now follows these constraints when adding new nodes.

Fig. 3 Node template example with GPU-enabled instances

Once the GPU jobs are done, the autoscaler decommissions GPU-enabled instances automatically.

Moreover, thanks to spot instance automation, our client can save up to 90% of hefty GPU VMs costs without the negative consequences of spot interruptions.

As spot prices can vary dramatically for GPUs, it’s essential to pick the most optimal ones at the time. CAST AI’s spot instance automation takes care of this. It can also ensure the right balance between the most diverse and cheapest types.

And on-demand fallback can be a blessing in mass spot interruptions or low spot availability. For example, an interrupted, not properly saved training process in deep learning workflows can lead to severe data loss. If AWS happens to withdraw at once all EC2 G3 or p4d spots your workloads have been using, an automated fallback can save you a lot of hassle.

How to Create a Node Template for Your Workload

Creating a node template is relatively quick, and you can do it in three different ways.

First, by using CAST AI’s UI. It’s easy if you have already connected and onboarded a cluster. Enter your product account and follow the screen instructions.

After naming the template, you need to select if you wish to taint the new nodes and avoid assigning pods to them. You can also specify a custom label for the nodes you create using the template.

Fig. 4 Node template from CAST AI

You can then link the template to a relevant node configuration, but you can also specify if you wish your template to use only spot or on-demand nodes only.

You also get a choice of processor architecture and the option to use GPU-enabled instances. If you select this preference, CAST AI will automatically run your workloads on relevant instances, including any new families added by your cloud provider.

Finally, you can also use restrictions such as :

  • Compute-optimized: helps to pick instances for apps requiring high-performance CPUs.
  • Storage Optimized: selects instances for apps that benefit from high IOPS.
  • Additional constraints, such as Instance Family, minimum and maximum CPU and memory limits.

But the hard fact is that the fewer constraints you add, the better matches and the higher cost savings you will get. CAST AI’s engine will take care of that.

You can also create node templates with Terraform (you can find all details in GitHub) or use API (check the documentation).


Kubernetes scheduling can be challenging, especially when it comes to GPU-heavy applications. Although the scheduler automates the provisioning process and delivers fast results, it can often prove too generic and expensive for your application’s needs.

With node templates, you get better performance and flexibility for GPU-intensive workloads. The feature also ensures that once a GPU instance is no longer necessary, the autoscaler decommissions it and gets a cheaper option for your workload’s new requirements.

We found that this quality helps build AI apps faster and more reliably — and we hope it will support your efforts, too.

Group Created with Sketch.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.