
Getting the Most from Kubernetes Autoscaling

Using Horizontal and Vertical Pod Autoscaling to ensure apps have the resources to meet demand while reducing costs and avoiding over-provisioning
Mar 21st, 2023 7:06am by

In cloud computing, autoscaling is the process of dynamically adjusting the number of instances or resources in response to changes in demand or workload. This process improves resource use and reduces costs while increasing the application’s reliability, availability and scalability.

Horizontal autoscaling is a process by which the number of running instances of a service or application is automatically increased or decreased (“scaling out” and “scaling in,” respectively). Meanwhile, vertical autoscaling involves dynamically adjusting the resources allocated to an instance, such as the amount of memory, CPU or disk storage.

Kubernetes implements two major autoscalers: Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). Both help ensure that applications have the resources they need to meet demand while avoiding over-provisioning and reducing costs.

This article explores how to use HPA and VPA effectively for autoscaling. We’ll also discuss why HPA alone isn’t enough and how HPA and VPA complement each other.

Understanding the Intricacies of Horizontal and Vertical Pod Autoscaling

HPA scales a workload horizontally by dynamically adjusting the number of pod replicas based on observed CPU use and other metrics like memory and incoming network traffic.
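To make this concrete, here is a minimal sketch of an HPA definition built as a Python dictionary and printed as JSON. The field names follow the `autoscaling/v2` API; the names `web-hpa` and `web`, the replica bounds and the 60 percent CPU target are illustrative assumptions, not values from this article.

```python
import json


def build_hpa_manifest(name, deployment, min_replicas, max_replicas, cpu_target_percent):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # The workload whose replica count HPA will adjust.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        # Target average CPU utilization, as a percentage
                        # of the CPU requested by the pods.
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_target_percent,
                        },
                    },
                }
            ],
        },
    }


# Hypothetical names and values, chosen only for illustration.
hpa = build_hpa_manifest("web-hpa", "web", 2, 10, 60)
print(json.dumps(hpa, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -` in a test cluster.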

VPA is a separate controller that runs in the cluster and watches the resources our workloads actually consume. When pod resource usage approaches a certain limit, VPA responds differently than HPA: instead of adding more replicas to our workloads, it scales them vertically by allocating more CPU or memory to the existing pods.
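As a sketch of what a VPA definition can look like, the snippet below builds one as a Python dictionary. The field names follow the `autoscaling.k8s.io/v1` VerticalPodAutoscaler resource; the object names and the minimum and maximum bounds are assumptions made for illustration. The update mode defaults to `Off` here so that VPA only produces recommendations and does not restart pods.

```python
import json


def build_vpa_manifest(name, deployment, update_mode="Off"):
    """Return an autoscaling.k8s.io/v1 VerticalPodAutoscaler manifest as a dict."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # The workload whose pod resource requests VPA will manage.
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # "Off" only records recommendations; "Auto" applies them.
            "updatePolicy": {"updateMode": update_mode},
            "resourcePolicy": {
                "containerPolicies": [
                    {
                        "containerName": "*",  # apply to every container
                        # Illustrative bounds, not recommendations.
                        "minAllowed": {"cpu": "100m", "memory": "128Mi"},
                        "maxAllowed": {"cpu": "2", "memory": "2Gi"},
                    }
                ]
            },
        },
    }


# Hypothetical names, chosen only for illustration.
vpa = build_vpa_manifest("web-vpa", "web")
print(json.dumps(vpa, indent=2))
```

Starting in `Off` mode is one way to review what VPA would change before letting it act on running pods.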

HPA and VPA require us to use a Kubernetes Metrics Server or another source that implements the Kubernetes metrics API. The Metrics Server aggregates container resource metrics from our cluster and exposes them for Kubernetes autoscaling pipelines.

How HPA and VPA Work

Kubernetes’s HPA automatically increases or decreases the number of replicas of a pod. But HPA must act before all resources are consumed, so it uses a target value for each configured metric to decide when to add more replicas. For example, when the average memory utilization across the pods exceeds the configured target percentage, HPA increases the number of pods running in the cluster.

Besides memory, we can also configure CPU usage to trigger HPA scaling when desired.
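The calculation behind this decision can be sketched in a few lines. The Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with a default tolerance of 0.1 around the target. The function below is a simplified illustration of that formula, not the controller’s actual code: it ignores pod readiness, missing metrics and stabilization windows, and the default replica bounds are assumptions.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the HPA replica calculation."""
    ratio = current_value / target_value
    # Within the tolerance band, HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))   # 6: usage is 1.5x the target
print(desired_replicas(4, 63, 60))   # 4: within the 10 percent tolerance
print(desired_replicas(4, 30, 60))   # 2: usage is half the target
print(desired_replicas(4, 300, 60))  # 10: capped at max_replicas
```

The same calculation applies whether the metric is memory or CPU utilization. Only the metric being compared against its target changes.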

However, even with HPA and VPA, sudden increases in activity, known as load bursts, can be too much for our configured metrics and disrupt our system. Therefore, we need to set our metric targets low enough that scaling starts early and our pods can handle the increased load while new capacity comes online. Otherwise, the application’s users might notice degradation or unavailability of the service.
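A rough back-of-the-envelope calculation shows why the target level matters for bursts. The sketch below assumes that resource usage grows linearly with load and that load is spread evenly across pods; both are simplifications, and the function name is ours. Under those assumptions, a utilization target leaves a fixed growth factor before average usage reaches 100 percent of the requested resources.

```python
def burst_headroom(target_utilization_percent):
    """Factor by which load can grow before average usage hits 100% of requests.

    Assumes usage scales linearly with load and is evenly spread across pods.
    """
    if not 0 < target_utilization_percent <= 100:
        raise ValueError("target must be in the range (0, 100]")
    return 100.0 / target_utilization_percent


print(burst_headroom(50))  # 2.0: load can double before pods saturate
print(burst_headroom(80))  # 1.25: only 25 percent growth is absorbed
```

A lower target buys more time for new replicas to start, at the price of running more pods during normal load.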

If usage drops below the target, our pods might have more CPU and memory than the current workloads need. Kubernetes VPA can then scale down our workloads by lowering their CPU or memory requests.

Kubernetes autoscaling policies are defined based on specific metrics, like CPU utilization, memory usage or incoming network traffic. These set policies trigger the addition or removal of instances or adjust resources accordingly.

VPA and HPA serve different purposes. Applications with variable and unpredictable resource requirements benefit more from VPA than HPA because VPA adjusts the resource requests of individual pods, like CPU or memory, and provides more fine-grained control over resource allocation.

On the other hand, applications with a predictable and consistent demand for resources benefit more from HPA because it scales the number of pod replicas, balancing the load across multiple pods and nodes.

Simply adding more pods (horizontal scaling) isn’t always ideal. Certain applications will benefit more from having more resources available directly (vertical scaling) versus having additional instances available (horizontal scaling).

In general, vertical scaling is beneficial for applications with high resource requirements and a large number of users that need to be served in real time, like a database server.

On the other hand, horizontal scaling is typically beneficial when our applications are divided into smaller, independent parts that can run in parallel. This type of scaling is often used for applications that handle many requests, as it allows for better load balancing and improved reliability. A web server that serves static content to many users is a typical example.

Both vertical and horizontal scaling have their trade-offs, and the best approach depends on the specific requirements of an application. For many workloads, combining both scaling techniques gives the best performance and reliability.

Using Both HPA and VPA

It’s generally recommended not to use VPA and HPA on the same Kubernetes deployment. According to the Vertical Pod Autoscaler README in the Kubernetes autoscaler repository:

“Vertical Pod Autoscaler should not be used with the Horizontal Pod Autoscaler (HPA) on CPU or memory at this moment. However, we can use VPA with HPA on custom and external metrics.”

The fundamental difference between custom and external metrics is whether the application originating the metric runs inside or outside our Kubernetes cluster, respectively. Examples of custom metrics our cluster provides include reads-per-second, writes-per-second and network performance. An external metric could be the number of undelivered messages in a pub/sub queue originating from an application outside our cluster.
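In an `autoscaling/v2` HPA, these two kinds of metrics appear as different entry types in the `metrics` list. The helpers below sketch one entry of each type as Python dictionaries. The metric names `writes_per_second` and `queue_messages_undelivered`, the label and the target values are hypothetical: the names that are actually available depend on the metrics adapter installed in the cluster.

```python
def pods_metric(name, average_value):
    """Custom per-pod metric entry (type "Pods") for an HPA metrics list."""
    return {
        "type": "Pods",
        "pods": {
            "metric": {"name": name},
            "target": {"type": "AverageValue", "averageValue": average_value},
        },
    }


def external_metric(name, average_value, match_labels=None):
    """External metric entry (type "External") for an HPA metrics list."""
    metric = {"name": name}
    if match_labels:
        # Optional label selector to pick one series, such as one queue.
        metric["selector"] = {"matchLabels": match_labels}
    return {
        "type": "External",
        "external": {
            "metric": metric,
            "target": {"type": "AverageValue", "averageValue": average_value},
        },
    }


# Hypothetical metric names and targets, for illustration only.
metrics = [
    pods_metric("writes_per_second", "100"),
    external_metric("queue_messages_undelivered", "30", {"queue": "orders"}),
]
print(metrics)
```

An HPA whose `metrics` list contains only entries like these scales on signals other than CPU or memory.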

Studies show that most organizations prefer HPA over VPA in Kubernetes because HPA is easier to set up and use, provides more control over the scaling process and integrates better with other Kubernetes features, such as custom and external metrics.

According to research by Datadog from October 2021, 40 percent of organizations running Kubernetes in production use HPA, while less than 1 percent use VPA. In short, for most companies, it’s either HPA or nothing. As a result, these organizations overspend their cloud budgets, and the large number of underused containers in production contributes to cloud waste.

Fortunately, it’s possible to configure HPA and VPA to work together. This process requires separating the events that trigger the autoscaling so that they aren’t triggered simultaneously. If they are, it can result in both vertical and horizontal scaling occurring, potentially resulting in unnecessary pods (horizontal) or over-allocation of resources (vertical).
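One way to check that the triggers are separated is to compare the two definitions before applying them. The function below is a simplified sketch of such a check, written for this article rather than taken from any Kubernetes tool. It takes an HPA and a VPA manifest as dictionaries and returns the resources (CPU, memory) that both would act on for the same workload. It ignores namespaces and assumes the documented defaults: a VPA with no update policy runs in `Auto` mode, and a container policy with no `controlledResources` covers both CPU and memory.

```python
def overlapping_resources(hpa, vpa):
    """Return the set of resources that both autoscalers would scale on."""
    h = hpa["spec"]["scaleTargetRef"]
    v = vpa["spec"]["targetRef"]
    # Different workloads cannot conflict.
    if (h["kind"], h["name"]) != (v["kind"], v["name"]):
        return set()
    # A VPA in "Off" mode only recommends and never changes pods.
    if vpa["spec"].get("updatePolicy", {}).get("updateMode", "Auto") == "Off":
        return set()

    hpa_resources = set()
    for m in hpa["spec"].get("metrics", []):
        if m["type"] == "Resource":
            hpa_resources.add(m["resource"]["name"])
        elif m["type"] == "ContainerResource":
            hpa_resources.add(m["containerResource"]["name"])

    policies = vpa["spec"].get("resourcePolicy", {}).get("containerPolicies", [])
    vpa_resources = set() if policies else {"cpu", "memory"}
    for p in policies:
        vpa_resources.update(p.get("controlledResources", ["cpu", "memory"]))

    return hpa_resources & vpa_resources


# Hypothetical manifests, trimmed to the fields the check reads.
hpa = {"spec": {
    "scaleTargetRef": {"kind": "Deployment", "name": "web"},
    "metrics": [{"type": "Resource", "resource": {"name": "cpu"}}],
}}
vpa = {"spec": {
    "targetRef": {"kind": "Deployment", "name": "web"},
    "updatePolicy": {"updateMode": "Auto"},
    "resourcePolicy": {"containerPolicies": [
        {"containerName": "*", "controlledResources": ["memory"]},
    ]},
}}
print(overlapping_resources(hpa, vpa))  # set(): HPA on CPU, VPA on memory
```

An empty result means the two autoscalers react to different signals. A non-empty result names the resources where both could scale at once.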

Previously, we had to choose between vertically or horizontally scaling our Kubernetes clusters. But that changed when StormForge introduced bi-dimensional Kubernetes pod autoscaling with StormForge Optimize Live.

Bi-Dimensional Autoscaling with StormForge Optimize Live

StormForge Optimize Live enables simultaneous vertical and horizontal pod autoscaling. It’s an easier, more effective approach than going through the typical process of configuring HPA and VPA to work together.

StormForge Optimize Live applies machine learning (ML) algorithms to the application’s telemetry data based on real-time resource utilization and demand. It determines a pod’s ideal CPU, memory and the optimal number of pod replicas in a cluster.

StormForge Optimize Live can also detect whether an HPA is on and recommends where our scale point should be with that HPA configuration. This ensures we’re always very near the usage curve to maintain peak performance and reliability while reducing cloud waste and costs. With Optimize Live, we can automatically optimize resource use and minimize over-provisioning without needing to tweak, tune and configure our systems manually.

Optimize Live analyzes captured observability data and recommends optimized configurations for CPU and memory on a container basis and in real time.

Understand Your Application Before Autoscaling

Understanding an application’s requirements is crucial to selecting the best autoscaling option, as it helps ensure that the system can handle the application’s varying demands efficiently. Some critical points to consider include traffic patterns, resource use, performance goals and cost constraints.

With a clear understanding of these requirements, we can select the optimal autoscaling strategy, whether that’s horizontal scaling, vertical scaling or a combination of both, and configure the system accordingly.


Kubernetes’s autoscaling mechanisms allow our workloads to respond automatically to changes in resource demand. Kubernetes HPA scales workloads horizontally by adjusting the number of pod replicas based on configured thresholds. In contrast, VPA adjusts the pods vertically by adding or removing memory or CPU capacity. The problem is that most organizations only do horizontal autoscaling, while VPA remains underused. As a consequence, many resources are wasted.

HPA and VPA are essential techniques for efficiently managing resources in a Kubernetes cluster. HPA ensures we have enough replicas to handle traffic spikes while reducing costs during low-traffic periods. VPA ensures our applications have an optimal amount of resources to run while minimizing costs. In combination, HPA and VPA can help us achieve more efficient resource usage, reduce costs and improve the performance and scalability of our applications.

TNS owner Insight Partners is an investor in: StormForge, Pragma, Simply.