Observing and Experimenting: Enhanced Kubernetes Optimization
As organizations increasingly adopt Kubernetes for their infrastructure, understanding and optimizing its performance becomes vital.
However, best practices and configuration guides for Kubernetes optimization only go so far. Researchers wrote more than 900 scholarly articles on factors influencing the optimization of Kubernetes in 2022 alone. Many variables can affect the speed and efficiency of applications running in a Kubernetes environment, but to gather the insights that lead to effective optimizations, you need to consistently observe and experiment.
In this article, we’ll look at how to develop a system of observation, experiments and feedback loops to continually improve the performance of applications running in Kubernetes environments.
Combine Observation and Experimentation
IT infrastructure is analogous to a manufacturing environment, comprising many components that move, process and store data, so it can be useful to the organization. For the system to produce optimal throughput, components must be secured, configured, instrumented and linked to one another and to the data being processed. However, merely setting up the components and letting them run isn’t sufficient. Instead, we must observe the process by collecting metrics on each component in the system and their interactions.
Learning from Observation
In the application development environment, things change daily. Network traffic, data quality and new software or security patches can all affect performance and, in turn, the health of your application deployment. You need proper metrics to notice these changes and optimize application performance.
There are many categories of metrics. Common examples include Kubernetes cluster and node usage metrics, container and application metrics, and deployment and pod metrics, such as the number of running pods or the pods’ resource requests. You might also consider environmental metrics like traffic, network state and new deployments, since adding new services or features to an application can change infrastructure efficiency and effectiveness.
For example, website traffic is often seasonally variable: Retailers know that traffic increases over the holiday season and decreases after. To avoid delays and lost customers in peak season, or overpaying for capacity in the off-season, they need accurate metrics to configure their application scaling.
A key consideration is that your application and the Kubernetes infrastructure it runs on are not static. Variables that influence its performance change over time. To find opportunities for improvement, you need baseline metrics detailing how the infrastructure runs under average conditions. By establishing and understanding these metrics, you can analyze the effects of traffic spikes, outages and other events.
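As a rough illustration of what a baseline makes possible, the sketch below summarizes latency samples collected under normal conditions and scores a later reading against them. The sample values, the choice of p95 latency and the function names are all assumptions made for this example, not measurements from a real cluster.

```python
from statistics import mean, stdev

def baseline(samples):
    """Summarize samples collected under average conditions."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def deviation_from_baseline(value, base):
    """Return how many standard deviations `value` sits from the baseline mean."""
    if base["stdev"] == 0:
        # No variation was observed, so a score is not meaningful.
        return 0.0
    return (value - base["mean"]) / base["stdev"]

# Hypothetical p95 latency samples (ms) from a normal week.
normal_week_ms = [120, 118, 125, 122, 119, 121, 123]
base = baseline(normal_week_ms)

# A hypothetical reading taken during a traffic spike.
spike_score = deviation_from_baseline(180, base)
print(f"baseline mean={base['mean']:.1f} ms, spike is {spike_score:.1f} stdevs away")
```

A large score suggests the event pushed the system well outside its usual range. What counts as "large" is a judgment call that depends on the metric and the workload.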
Once you have established which metrics you wish to observe, you need tools and a repository to capture and store this information. A common choice is the open source tool Prometheus.
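A minimal sketch of working with Prometheus from a script might look like the following. It builds an instant-query URL for the Prometheus HTTP API and parses a response of the documented shape. The server address, the pod name and the sample response are hypothetical, and the PromQL expression assumes the cluster exposes the cAdvisor metric container_cpu_usage_seconds_total. No network call is made here.

```python
from urllib.parse import urlencode

def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def parse_instant_vector(payload):
    """Map each series' labels to its float value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # value arrives as a string
        results[labels] = float(value)
    return results

# Hypothetical server address and query.
url = build_query_url(
    "http://prometheus.example:9090",
    "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)",
)

# Hand-written sample in the shape of an instant-query response.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"pod": "web-1"}, "value": [1700000000, "0.25"]}],
    },
}
parsed = parse_instant_vector(sample_response)
print(url)
print(parsed)
```

In practice you would fetch the URL with an HTTP client and pass the decoded JSON to the parser.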
Learning from Experimentation
Once you’ve established metrics and tooling, you can start learning from the data. You can simply observe your metrics over the course of a week or month, which allows you to see how your application performs under various circumstances. You might be able to pick up on simple patterns and make changes based on these patterns, and then watch for another week or month to see how those changes affect your metrics.
This is a slow and imprecise form of experimentation. It relies on human observation over long periods, during which your application, the traffic to it or the environment it runs in might also be changing and introducing noise into your experiment.
Let’s go back to the example of the holiday season and retail. Suppose high volumes of application use at peak season don’t affect throughput. This might mean that your configuration and scaling are just right, or that you have over-provisioned. To find out which it is, you need controlled experiments. You can discover the most cost-effective configuration by systematically varying the traffic, application and service configurations and closely observing the results. By changing the different parameters in many combinations and tracking the results, you can eliminate the noise introduced by simply observing the application over time.
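The idea of varying parameters in combination can be sketched as a small grid search. Everything below is illustrative. The parameter ranges are made up, and run_trial is a stand-in with invented formulas where a real experiment would run a load test and read the resulting metrics.

```python
from itertools import product

# Hypothetical parameter ranges to vary in a controlled experiment.
REPLICAS = [2, 4, 8]
CPU_PER_POD = [0.5, 1.0, 2.0]   # cores requested per pod
TRAFFIC_RPS = [100, 400]        # simulated requests per second

def run_trial(replicas, cpu, rps):
    """Stand-in for a real load test; the formulas are invented for illustration."""
    capacity_rps = replicas * cpu * 100          # assumed: 100 rps per core
    utilization = rps / capacity_rps
    latency_ms = 50 / (1 - min(utilization, 0.99))
    cost = replicas * cpu                        # total cores as a cost proxy
    return latency_ms, cost

def cheapest_config(latency_target_ms):
    """Return the lowest-cost (cost, replicas, cpu) meeting the target at every traffic level."""
    best = None
    for replicas, cpu in product(REPLICAS, CPU_PER_POD):
        trials = [run_trial(replicas, cpu, rps) for rps in TRAFFIC_RPS]
        worst_latency = max(latency for latency, _ in trials)
        cost = trials[0][1]
        if worst_latency <= latency_target_ms and (best is None or cost < best[0]):
            best = (cost, replicas, cpu)
    return best

print(cheapest_config(latency_target_ms=200))
```

Even this toy grid needs 18 trials (nine configurations at two traffic levels). Each added parameter multiplies that count.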
However, it quickly becomes difficult, if not impossible, to perform these experiments manually because of the number of variables, and even worse, the possible combinations of variables. The solution is to apply automation and machine learning (ML).
ML is a subset of artificial intelligence that enables a computer to ingest data, learn from it and develop a model that predicts and responds to new inputs. In a typical workflow, the data is divided randomly into two sets. The first set is used to train a model; the model is then applied to the second set to predict outcomes. Next, the predicted outcomes are compared to the actual outcomes to assess the accuracy of the model. If the accuracy falls short, the model or its training settings are adjusted, a new model is trained and the whole process repeats.
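The train-and-evaluate loop described above can be sketched with a deliberately simple model. The example below fits a straight line by ordinary least squares; this is a swapped-in technique chosen for brevity, not a recommendation for real workloads. The observations are synthetic, generated from an assumed relationship between request rate and CPU use.

```python
import random

def split(data, test_fraction=0.25, seed=0):
    """Randomly divide data into a training set and a test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Average absolute gap between predicted and actual outcomes."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)

# Synthetic observations: (requests per second, CPU cores used), with noise.
rng = random.Random(42)
observations = [(rps, 0.01 * rps + 0.5 + rng.uniform(-0.1, 0.1))
                for rps in range(50, 450, 10)]

train, test = split(observations)
model = fit_line(train)                      # train on the first set
test_error = mean_abs_error(model, test)     # evaluate on the held-out set
print(f"slope={model[0]:.4f}, test MAE={test_error:.3f} cores")
```

If the error on the held-out set were too high, the next step would be to change the model or its inputs and repeat.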
Observed behaviors or conditions can and should lead to experiments to gain a deeper understanding. Continuing with the traffic example, let’s say you want to include additional variables in a new model or simulate conditions not present in the original data set. By increasing traffic levels above what was observed in the data set, you can better learn how robust the model is under additional traffic.
Benefits of Experimentation
This kind of experimentation is all about learning how a dynamic system behaves so you can optimize its efficiency. To understand the capabilities and limitations of specific environments, or combinations of configurations, you need to experiment.
Experiments should aim to discover what you should do differently. Are you over- or under-provisioning resources? Are you spending more than you should? One experiment may not answer all the questions, but it can point the way to the next test and an eventual solution.
If your initial application configuration is based on best practice guidelines, you may want to validate those guidelines for your specific situation. You can vary application configuration, traffic patterns, horizontal pod autoscaling (HPA) configuration, etc. You can conduct a variety of A/B tests of different configurations.
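When planning experiments on HPA settings, it helps to reason about how a target value translates into replica counts. The Kubernetes documentation describes the core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below applies that formula with a min/max clamp. It is a simplification that ignores the tolerance band, stabilization windows and pod readiness handling, and the input numbers are hypothetical.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA scaling formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical state: 4 replicas averaging 90% CPU utilization.
# Compare how different utilization targets would change the replica count.
for target in (50, 60, 80):
    print(f"target={target}% -> {desired_replicas(4, 90, target)} replicas")
```

Comparing candidate targets this way can narrow down which configurations are worth testing under real load.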
Experimental findings don’t always yield improvement. However, finding that a specific setting is robust to changes in traffic is also valuable, as it means we can limit our experimentation in this area.
Kubernetes is highly flexible when it comes to automating container operations across environments, with many configuration options. When it comes to optimizing configuration, the best practice guidelines only go so far. To push the boundaries, we need to establish metrics, experiment with the environment and observe closely.
The number of variables and interactions often makes it impractical to configure Kubernetes environments manually. Automation and ML make the experimentation process viable. Observability processes and analytics allow us to conduct planned experiments and understand how environmental variables affect performance.