Why Your Kubernetes Ship Is Sunk without Machine Learning
With the rise of containerized services based on service-oriented architecture (SOA), the need for orchestration software like Kubernetes is rapidly increasing. Kubernetes is ideally suited for large-scale systems, but its complexity and lack of transparency can result in increased cloud costs, deployment delays and frustration among stakeholders. Large enterprises use it to scale their applications and underlying infrastructure vertically and horizontally to meet varied loads, but the fine-grained control that makes Kubernetes so adaptable also makes it challenging to tune and optimize effectively.
The Kubernetes architecture makes autonomous workload allocation decisions within a cluster. However, Kubernetes in itself doesn’t ensure high availability. It will easily operate in a production environment with only one primary node. Similarly, Kubernetes doesn’t assist in cost optimization. It doesn’t give an alert or warning if, for example, the servers in a cluster are only at 20% utilization, which could signal that we are wasting money on over-provisioned infrastructure.
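The kind of utilization check that Kubernetes does not perform for us can be sketched in a few lines. The node names and utilization figures below are hypothetical, and the 20% threshold simply mirrors the example above; in practice the numbers would come from a monitoring system.

```python
from typing import Dict, List


def underutilized_nodes(cpu_utilization: Dict[str, float],
                        threshold: float = 0.20) -> List[str]:
    """Return the names of nodes whose CPU utilization is at or below the threshold.

    Utilization is expressed as a fraction of allocatable CPU (0.0 to 1.0).
    """
    return sorted(name for name, used in cpu_utilization.items()
                  if used <= threshold)


# Hypothetical average utilization per node.
sample = {"node-a": 0.18, "node-b": 0.65, "node-c": 0.20}
flagged = underutilized_nodes(sample)
print(flagged)
```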
Optimizing our Kubernetes clusters to balance performance and reliability with the cost of running those clusters is essential. In this article, we’ll learn ways to optimize Kubernetes with the help of machine learning (ML) techniques.
Kubernetes Complexity Makes Manual Optimization Futile
By default, Kubernetes allocates considerable computing and memory resources to prevent slow performance and out-of-memory errors during runtime. However, constructing a cluster of nodes with default values results in wasted cloud costs and poor cluster utilization without ensuring adequate performance. Also, as the number of containers grows, so does the number of variables (CPU, RAM, requests, limits and replicas) to be considered.
In a K8s cluster, we must configure several parameters, including those outlined below. But, as the following sections show, manually optimizing these parameters is time-consuming and ineffective due to Kubernetes’ complexity.
CPU and Memory
CPU defines the compute processing resources, while memory defines the memory units available to the pod. We can configure a request value for the CPU and memory the pod can consume. If the node running the pod has available resources, the pods can consume them up to their set CPU and memory limits.
Setting up CPU and memory limits is essential, but it isn’t easy to find the right setting to ensure efficiency. To optimize these limits, we need to predict our future CPU and memory needs, which is challenging to calculate. Then we have to fine-tune the values, which is tedious and time-consuming.
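To make the manual approach concrete, here is a minimal sketch of one common heuristic: take a high percentile of observed usage as the request and add headroom on top of the observed peak for the limit. The 90th percentile, the 20% headroom and the sample values are all assumptions chosen for illustration, not recommendations from this article.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct between 0 and 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_resources(samples: Sequence[float],
                      request_pct: float = 90.0,
                      headroom: float = 1.2) -> Dict[str, float]:
    """Suggest a request (percentile of usage) and a limit (peak plus headroom)."""
    return {
        "request": percentile(samples, request_pct),
        "limit": max(samples) * headroom,
    }


# Hypothetical CPU usage samples in millicores, including one spike.
cpu_millicores = [120, 150, 130, 400, 180, 160, 140, 170, 155, 165]
suggestion = suggest_resources(cpu_millicores)
print(suggestion)
```

Even this simple rule has two knobs per resource, which hints at why hand-tuning many containers quickly becomes tedious.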
Besides the Kubernetes technical components, such as CPU or memory, we should also look at the application-specific parameters. These include heap size, worker threads, database connection pools and garbage collection, to name a few, as these can also have a significant impact on efficient resource utilization.
Take a Java application as an example. Configuring your JVM (Java Virtual Machine), which involves determining the available memory and the heap size, plays a crucial role in sizing. Performance benchmarks for Java applications show that with a memory allocation of 256Mi (mebibytes) or 512Mi, the heap size is around 127Mi in both cases. There’s no immediate reason for allocating 512Mi in this setup, since the heap stays the same size as with 256Mi, where it is about 50% of the allocation. However, once we go above 512Mi, the heap size also grows exponentially.
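The arithmetic behind those figures can be sketched as follows. The helper below only multiplies a container memory allocation by a heap percentage; it does not reproduce the JVM's own sizing logic. The percentages are assumptions chosen to match the roughly 127Mi heap quoted above, and `-XX:MaxRAMPercentage` is the HotSpot flag for setting the maximum heap as a percentage of available memory explicitly.

```python
def heap_mib(container_memory_mib: int, ram_percentage: float) -> float:
    """Heap size in MiB if the heap is a fixed percentage of container memory."""
    return container_memory_mib * ram_percentage / 100


def max_ram_flag(ram_percentage: float) -> str:
    """Build the JVM option that pins the maximum heap to a percentage of memory."""
    return f"-XX:MaxRAMPercentage={ram_percentage:.1f}"


# About half of 256Mi and about a quarter of 512Mi give the same heap.
print(heap_mib(256, 50.0))
print(heap_mib(512, 25.0))
print(max_ram_flag(75.0))
```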
In addition to heap size, garbage collection is another setting that affects performance and must be configured, so knowing how to tune it is also key. Typically, if your memory size settings are off, the garbage collector will also run inefficiently. In other words, the better the JVM heap size is tuned, the more optimally the garbage collector should run.
Access to System Resources
Containerized applications typically have access to all system-level resources, but that doesn’t mean a single pod uses them optimally at runtime. It might be beneficial to run multiple threads of the same application instead of allocating larger CPU or memory values to a single container.
Besides the application container itself, other components, such as a database, can affect resource performance. Performance might be fine when a single app container talks to the database, but it can become a problem when multiple pods connect to the database simultaneously. Database connection pooling can help here.
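As a rough illustration of why the pod count matters for the database, the sketch below divides a database's connection budget across replicas. The function name, the reserved-connection parameter and the numbers are hypothetical.

```python
def max_pool_size_per_pod(db_max_connections: int,
                          replicas: int,
                          reserved: int = 0) -> int:
    """Largest per-pod connection pool that keeps the total within the database limit.

    `reserved` holds back connections for admin tools, migrations and similar uses.
    """
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    usable = db_max_connections - reserved
    return max(0, usable // replicas)


# A database allowing 100 connections, 10 of them reserved, shared by 8 pods.
pool_size = max_pool_size_per_pod(100, replicas=8, reserved=10)
print(pool_size)
```

If an autoscaler later doubles the replica count, the same pool size per pod would exceed the database limit, so the two settings need to be tuned together.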
Monitoring the health state of your containerized applications in a Kubernetes environment is done using Kubernetes probes. We can set up liveness, readiness and startup probes in the K8s configuration.
The liveness probe checks the health of the application. It’s especially helpful for detecting an application that is still running but can no longer make progress, for example because of a deadlock. The liveness probe not only checks for the running state of the container but also tries to guarantee the application within the container is up and running. The pod might be ready, but that doesn’t mean the application is ready. The easiest liveness probe type is an HTTP GET request, which succeeds when the response has a status code from 200 to 399.
The readiness probe checks whether the application is ready to accept traffic. If the readiness probe is in a failed state, the pod’s IP address is removed from the endpoints of the corresponding service, so no traffic is routed to the pod. The readiness probe helps ensure that the application running within the container is ready to be used before it receives requests. As with the liveness probe, an HTTP readiness probe treats any status code from 200 to 399 as success, with HTTP 200 OK being the typical healthy response.
The startup probe checks whether the application in the container has started. This probe runs first, and the other two probes are disabled until the startup probe succeeds.
Configuring Kubernetes Probes
Kubernetes probes provide several parameters that can be configured. The key is fine-tuning the following settings, which apply to both liveness and readiness probes:
- timeoutSeconds is the number of seconds after which the probe times out. The default is one second. If this parameter is set too low, healthy but slow containers can be flagged as failed; if it is set too high, real failures take longer to detect. Either way, users may receive error messages when trying to connect to the workload.
- periodSeconds is how often (in seconds) the probe check is performed. The default is 10 seconds. As with timeoutSeconds, finding an accurate setting is important. If you check too frequently, it might saturate the application workload. If you don’t check frequently enough, a failing application workload might go undetected for too long.
- failureThreshold is the number of consecutive failed probes after which the check is considered failed. The default is three. This means that, by default, a container could be flagged as failed after roughly 30 seconds, assuming periodSeconds is left at its default of 10 seconds.
- initialDelaySeconds is how long to wait after the container has started before probing begins. The default is zero, meaning the container is probed immediately after a successful startup.
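The sketch below builds an HTTP probe definition as a plain Python dictionary, using the Kubernetes field names, and estimates how long a failing container takes to be flagged. The `/healthz` path and port 8080 are hypothetical, and the estimate is a rough one that ignores the timeout and the exact moment the failure begins.

```python
def http_probe(path: str, port: int, *,
               initial_delay_seconds: int = 0,
               period_seconds: int = 10,
               timeout_seconds: int = 1,
               failure_threshold: int = 3) -> dict:
    """Return a probe definition shaped like a Kubernetes container probe."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay_seconds,
        "periodSeconds": period_seconds,
        "timeoutSeconds": timeout_seconds,
        "failureThreshold": failure_threshold,
    }


def approx_seconds_to_failure(probe: dict) -> int:
    """Rough time for consecutive failures to reach the failure threshold."""
    return probe["failureThreshold"] * probe["periodSeconds"]


default_probe = http_probe("/healthz", 8080)
fast_probe = http_probe("/healthz", 8080, period_seconds=2)
print(approx_seconds_to_failure(default_probe))
print(approx_seconds_to_failure(fast_probe))
```

Shortening the period detects failures sooner but sends five times as many health requests to the application, which is the trade-off described above.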
Horizontal Pod Autoscaling (HPA)
The Horizontal Pod Autoscaler scales the workload by deploying more pods to meet the increased demand. When the load decreases, it terminates some pods to meet the decreased demand.
By default, HPA scales out (adds pods) or scales in (removes pods) based on target CPU utilization. Alternatively, we can configure it based on memory utilization or a custom usage metric.
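The scaling rule HPA applies is documented as the current replica count multiplied by the ratio of the current metric value to the target value, rounded up, with no change when that ratio is within a tolerance of 1 (0.1 by default). A minimal sketch of that calculation, with hypothetical utilization figures:

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA scaling formula."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, current_metric=90, target_metric=60))  # scale out
print(desired_replicas(4, current_metric=30, target_metric=60))  # scale in
print(desired_replicas(4, current_metric=63, target_metric=60))  # within tolerance
```

This sketch leaves out parts of the real controller, such as minimum and maximum replica bounds and stabilization windows.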
Although adding more pods (scaling out) might appear to result in better application performance, that’s not always the case. As we saw earlier when discussing JVM heap size and garbage collector tuning, sometimes adding more pods won’t improve the service’s performance. As with the other sizing decisions already discussed, fine-tuning the horizontal scaling of container workloads can be challenging to perform manually.
Vertical Pod Autoscaling (VPA)
The opposite of horizontal scaling is vertical scaling, which involves resizing underperforming pods with larger CPU and memory limits or reducing CPU and memory limits for underutilized pods.
The challenges of right-sizing HPA also apply to right-sizing VPA. Workloads are typically dynamic. A change in active users, peak seasonal load, unplanned outages of certain cluster components, and so on, are all factors to consider when performing sizing and tuning. As a result, we can define a VPA configuration for adjusting a pod’s CPU and memory limits, but it’s difficult to determine the new values.
It should be noted that, by default, VPA can’t be combined with HPA, as both scale according to the same metric (CPU target utilization).
Replicas indicate the number of identical running pods required for a workload. We can define the value of replicas in the K8s configuration. An HPA can also control the number of replicas for a pod.
It’s difficult to determine the exact number of replicas that should be configured for a pod because, if the workload of the pod changes, some replicas could become under-utilized. Additionally, it’s tedious to update the pod configuration manually.
Manually configuring and fine-tuning these parameters becomes progressively more challenging as the complexity of the cluster increases. The diagram below illustrates the different Kubernetes parameters we can tune to optimize resource usage.
Optimizing Kubernetes with Help from Machine Learning
With minimal insight into the actual operational behaviors of the containers, it’s challenging for the DevOps team to determine the optimal values for resources. We can use ML at various levels of container resource optimization.
We can understand usage patterns with the help of state-of-the-art ML algorithms. By gathering granular container data from cloud monitoring frameworks like Prometheus, learning the activity patterns and applying sophisticated algorithms to generate optimal results, ML can produce precise and automatable recommendations. This approach replaces static resource specifications with dynamic specifications derived from ML-backed analyses of usage patterns.
Approaches to ML-Based Optimization
Optimizing Kubernetes applications is a multi-objective optimization problem. Here, the resource configuration settings act as input variables, while performance, reliability and cost of running the application act as outputs. ML-based optimization can be approached in two ways: through experimentation or through observation.
We perform experimentation-based optimization in a non-production environment, with various test cases to simulate potential production scenarios. We can run any test case, evaluate the results, change our variables and rerun the test. The benefits of experimentation-based optimization include the flexibility to examine any scenario and the ability to perform deep-dive analysis of the results. However, these experiments are limited to a simulated environment, which may not incorporate real-world situations.
Experimentation-based optimization typically includes the following five steps.
Define the Input Variables
The input variables include, but aren’t limited to: compute, storage, memory, requests, limits, number of replicas and application-specific parameters such as heap size, garbage collection and error handling. In short, they include any configuration setting that may affect the outputs or goals.
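One simple way to write down such input variables is as a search space of named ranges. The parameter names and bounds below are hypothetical and would differ for every application.

```python
import random
from typing import Dict, Tuple

# Hypothetical search space: each parameter maps to an inclusive (low, high) range.
SEARCH_SPACE: Dict[str, Tuple[int, int]] = {
    "cpu_request_millicores": (100, 2000),
    "memory_request_mib": (128, 4096),
    "replicas": (1, 10),
    "jvm_heap_percent": (25, 75),
}


def sample_configuration(space: Dict[str, Tuple[int, int]],
                         rng: random.Random) -> Dict[str, int]:
    """Draw one candidate configuration uniformly from the search space."""
    return {name: rng.randint(low, high) for name, (low, high) in space.items()}


candidate = sample_configuration(SEARCH_SPACE, random.Random(42))
print(candidate)
```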
Define the Optimization Objectives
We specify the metrics to minimize or maximize in this step. We can also prioritize the variables we’re trying to optimize, emphasizing some objectives more than others. For example, we may consider increasing performance for computationally intensive tasks without paying much attention to cost.
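Prioritizing some objectives over others can be expressed, in its simplest form, as a weighted sum. The sketch below assumes each metric has already been normalized to a 0-to-1 scale where lower is better; the candidate names, metric values and weights are invented for illustration, and weighted sums are only one of several ways to handle multiple objectives.

```python
from typing import Dict


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine normalized lower-is-better metrics into a single score."""
    return sum(weights[name] * metrics[name] for name in weights)


def pick_best(candidates: Dict[str, Dict[str, float]],
              weights: Dict[str, float]) -> str:
    """Return the name of the candidate with the lowest weighted score."""
    return min(candidates, key=lambda name: weighted_score(candidates[name], weights))


# Hypothetical normalized results for two configurations.
candidates = {
    "large-pods": {"latency": 0.2, "cost": 0.9},
    "small-pods": {"latency": 0.6, "cost": 0.3},
}
performance_first = {"latency": 0.8, "cost": 0.2}
cost_first = {"latency": 0.2, "cost": 0.8}

print(pick_best(candidates, performance_first))
print(pick_best(candidates, cost_first))
```

The same measurements lead to different winners depending on the weights, which is why the objectives have to be agreed on before any experiments run.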
Although ML-based optimization is helpful, it’s still up to the ops and business teams to identify the possible or required (expected) optimization objectives. Optimization objectives could be, for example, to use historical observability information to help optimize performance. Similarly, you might use service-level objectives and other key performance indicators to optimize scale and reliability.
From a business perspective, you may want to optimize cost, which could involve using a fixed cost per month, or knowing what budget to forecast for exceptional or peak load during seasonal timeframes.
Set Up the Optimization Scenarios
Once the optimization objectives have been defined and agreed on, we need to identify the different possible scenarios to be optimized before running the experiments. Instead of optimizing all scenarios our system could encounter, we should focus on those with the most significant performance and business impact.
Suppose our objective is to optimize performance and allow for accurate autoscaling sizing as part of an expected peak load. In that case, we’ll use different data sets from the past to run a forecast. For example, in the case of e-commerce platforms, these peak loads might occur following the Super Bowl and leading up to Thanksgiving sales, Boxing Day or the holiday shopping rush. If our objective is to get a better view of cost optimization, those parameters and expected scenarios to run will be different.
Once the scenarios have been identified, we can set them up and build load tests for them. The load tests will help us mimic the production load during the experimentation phase. For our load testing, we can use several open source or commercial tools designed for Kubernetes environments.
Perform the Experiment
We use automation to deploy the application in the cluster with baseline parameters. This automation then runs the benchmark test to apply load to the system. Once the benchmark is completed, metrics are collected and sent to the ML service for analysis. The ML service then creates a new set of parameter values to test under load, and the experimentation process continues.
With each iteration, the algorithm develops a more complete understanding of the application’s parameter space and gets closer to the goal of an optimal configuration for the Kubernetes cluster.
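The loop described above can be sketched as follows. Two substitutions are made to keep the example self-contained: `run_benchmark` is a synthetic stand-in for deploying the application and load testing it, and plain random search stands in for the ML service that proposes new parameter values. All names, formulas and numbers are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Config = Dict[str, int]


def run_benchmark(config: Config) -> Dict[str, float]:
    """Synthetic stand-in for a deploy-and-load-test cycle."""
    capacity = config["cpu_millicores"] * config["replicas"]
    return {"latency_ms": 50 + 200_000 / capacity, "cost": capacity / 1000}


def score(metrics: Dict[str, float]) -> float:
    """Single objective to minimize: latency plus a cost penalty."""
    return metrics["latency_ms"] + 20 * metrics["cost"]


def propose(rng: random.Random) -> Config:
    """Random search step; a real ML service would use past results here."""
    return {"cpu_millicores": rng.randrange(100, 2001, 100),
            "replicas": rng.randint(1, 10)}


def optimize(baseline: Config, trials: int,
             seed: int = 0) -> Tuple[Config, float, List[Tuple[Config, float]]]:
    """Start from the baseline, test new candidates, and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_score = baseline, score(run_benchmark(baseline))
    history = [(baseline, best_score)]
    for _ in range(trials):
        candidate = propose(rng)
        candidate_score = score(run_benchmark(candidate))
        history.append((candidate, candidate_score))
        if candidate_score < best_score:
            best_config, best_score = candidate, candidate_score
    return best_config, best_score, history


baseline = {"cpu_millicores": 2000, "replicas": 10}
best_config, best_score, history = optimize(baseline, trials=50)
print(best_config, round(best_score, 1))
```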
Analyze the Results
Once the experiment is over, we can do more analysis. By producing charts that illustrate the relationship between inputs and the desired outcomes, we can discover which parameters substantially affect outcomes and which matter less.
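A first pass at this kind of analysis, before drawing any charts, is to rank parameters by how strongly they correlate with an outcome. The sketch below uses Pearson correlation, which only captures linear relationships, and the trial data is invented for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_parameters(trials: List[Dict[str, float]],
                    outcome: str) -> List[Tuple[str, float]]:
    """Sort parameters by the absolute strength of their correlation with the outcome."""
    outcome_values = [trial[outcome] for trial in trials]
    names = [name for name in trials[0] if name != outcome]
    scores = {name: pearson([trial[name] for trial in trials], outcome_values)
              for name in names}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical experiment results.
trials = [
    {"cpu_millicores": 250, "replicas": 2, "latency_ms": 400},
    {"cpu_millicores": 500, "replicas": 4, "latency_ms": 300},
    {"cpu_millicores": 750, "replicas": 2, "latency_ms": 200},
    {"cpu_millicores": 1000, "replicas": 4, "latency_ms": 100},
]
ranking = rank_parameters(trials, "latency_ms")
print(ranking)
```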
Observation-based optimization can be performed either in or out of production by observing actual system behavior. It may be optimal for dynamic conditions such as highly fluctuating user traffic. It typically includes these three phases:
The first phase is configuration. Depending on our optimization method, various parameters can be considered, such as:
- Providing the namespace to limit the scope of our algorithm.
- Determining the values for K8s parameters to be tuned, such as CPU, memory and HPA target utilization.
- Finally, specifying configuration parameters such as recommendation frequency and deployment strategy (manual versus automatic).
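The settings listed above could be captured in a small configuration object. This is a hypothetical structure for illustration only; the field names and defaults are assumptions and do not correspond to any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObservationConfig:
    """Hypothetical settings for an observation-based optimization run."""
    namespace: str  # limits the scope of the algorithm
    tuned_parameters: List[str] = field(
        default_factory=lambda: ["cpu", "memory", "hpa_target_utilization"])
    recommendation_interval_hours: int = 24  # how often recommendations are produced
    auto_deploy: bool = False  # False means recommendations are applied manually

    def validate(self) -> None:
        """Reject settings that cannot describe a meaningful run."""
        if not self.namespace:
            raise ValueError("namespace must not be empty")
        if self.recommendation_interval_hours <= 0:
            raise ValueError("recommendation interval must be positive")
        if not self.tuned_parameters:
            raise ValueError("at least one parameter must be tuned")


config = ObservationConfig(namespace="checkout", recommendation_interval_hours=12)
config.validate()
print(config)
```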
In the second phase, the ML engine analyzes data from real-time observability tools such as Prometheus and Datadog to determine resource utilization and application performance patterns. After that, the system recommends configuration updates at the interval specified.
The final stage is to implement the recommendations generated by the ML analysis. Whether these recommendations are deployed automatically or manually is decided during configuration.
These three steps are then repeated at a frequency that makes sense depending on the variability of your particular workload.
In conclusion, experimentation-based optimization allows more detailed analysis, while observation-based optimization provides value faster with less effort in real-world scenarios. Both approaches can bridge the gap between production and development environments.
Kubernetes Optimization with StormForge
Kubernetes optimization at scale cannot be done manually and requires intelligent automation, and optimization can be challenging even for small environments. We can close the gap between automation and optimization with the help of ML tools and techniques. One such ML-driven Kubernetes optimization solution is StormForge.
StormForge provides ML tools to optimize performance, ensure reliability and increase efficiency while reducing operating costs. It automates the process of optimization at scale using both experimentation-based and observation-based approaches. It’s easy to use and can easily integrate with CI/CD pipelines for automatic deployment.
Application containerization using Kubernetes and related tools for continuous deployment, monitoring and maintenance is the new paradigm of software development and deployment. ML algorithms allow many configurable parameters to be controlled automatically, with predictive models checked against real behavior and configurations optimized to meet specific business requirements.
With the power of ML, automation can alleviate the complexities of configuring multiple Kubernetes parameters, optimizing the trade-off between performance and cost.