Training an ML Model to Forecast Kubernetes Node Anomalies
This is part of a series of contributed articles leading up to KubeCon + CloudNativeCon on Oct. 24-28.
It’s no surprise that using artificial intelligence to improve IT system operations is in the spotlight, considering the five benefits experts attribute to it: proactive management, faster remediation, improved productivity, efficient collaboration and better application performance.
Using machine learning to forecast system anomalies and to reduce alert noise is considered a key way to improve the performance of IT operations. The growing use of open source/standard stacks such as Kubernetes and Prometheus, which enable the collection of high-quality data such as metrics and logs, along with the increasing accuracy of machine learning, is driving the push to adopt it.
However, to increase the accuracy of machine learning, organizations need proper data sets to train the model. That means various types of outages must actually occur, and metrics, events and logs from the relevant monitoring targets must be continuously collected and fed to the models to improve their accuracy.
Even if an individual organization collects data sets continuously, it needs large-scale data gathered over a fairly long period to reach a useful accuracy level in the machine learning model. It also takes effort to tag whether the monitoring targets are in an anomalous state, and validating the model’s results is difficult and labor intensive.
To overcome these challenges and apply machine learning to forecast anomalies in computing resources, my team decided to use a Bayesian network approach to secure training data at the initial stage. A Bayesian network starts from the experts’ rule set to reach a certain level of model performance. This approach helps organizations gather a basic data set to train the model even when they don’t yet have enough real data to do so.
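To illustrate the idea, here is a minimal sketch of bootstrapping labels from an expert rule set. The metric names and thresholds are hypothetical placeholders, not the values our team actually used:

```python
# Sketch: generating initial anomaly labels from an expert rule set.
# All metric names and thresholds below are illustrative assumptions.

def pre_evaluate(metrics: dict) -> int:
    """Return 1 (anomalous) or 0 (normal) based on expert thresholds."""
    rules = [
        metrics.get("cpu_usage_pct", 0) > 90,       # CPU saturation
        metrics.get("memory_usage_pct", 0) > 85,    # memory pressure
        metrics.get("fs_usage_pct", 0) > 80,        # file system nearly full
        metrics.get("network_drop_rate", 0) > 0.01, # packet drops
    ]
    return 1 if any(rules) else 0

sample = {"cpu_usage_pct": 95, "memory_usage_pct": 40}
label = pre_evaluate(sample)  # 1: flagged by the CPU rule
```

Each 30-second metric snapshot gets a label this way, so the model has something to learn from before any real outage has been observed.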
Further, our team aimed to monitor Kubernetes nodes, since standard open source software such as Prometheus, Node Exporter and cAdvisor can be installed to generate data sets for evaluating Kubernetes resource anomalies.
We chose Kafka, together with a Prometheus-Kafka adapter, as the metric pipeline for receiving metric feeds from Prometheus. To consume data from Kafka topics and produce learning data sets, our team developed a metric evaluation engine that pre-evaluates metrics using rule bases from system experts.
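The evaluation engine’s consume loop can be sketched as below. To keep the sketch self-contained, the Kafka consumer is stood in by any iterable of JSON messages, and the data mart by a simple list; in production these would be a real Kafka consumer on the adapter’s topic and a database write. Function and field names are illustrative:

```python
# Sketch of the metric-evaluation engine's consume loop. The Kafka
# consumer and data-mart sink are stubbed out so the flow can be shown
# without a running broker; names are illustrative assumptions.
import json

def run_engine(consumer, evaluate, sink):
    """Consume metric batches, pre-evaluate them, persist the results."""
    for message in consumer:      # in production: a Kafka consumer
        batch = json.loads(message)
        result = evaluate(batch)  # expert rule set from the guidelines
        sink.append(result)       # in production: insert into the data mart

messages = ['{"cpu": 95}', '{"cpu": 10}']
results = []
run_engine(messages, lambda batch: batch["cpu"] > 90, results)
# results == [True, False]
```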
The pre-evaluation results from the engine are stored in a data mart for the machine learning pipeline. The machine learning pipeline is configured on Kubeflow, a machine learning pipeline platform that runs on Kubernetes. TensorFlow was chosen as the machine learning engine for the anomaly forecasting model, and the evaluation results are stored in MariaDB.
The figure below shows the overall solution architecture to depict the entire process from the metric feed to saving the evaluation result.
Metrics should be processed within 30 seconds, from collection by Prometheus to evaluation by the machine learning model. The default Prometheus collection interval is 30 seconds, and an interval of 30 seconds to one minute is widely accepted as a best practice for system monitoring, so the full pipeline from Prometheus to the anomaly forecast should complete within one collection interval.
In a Kubernetes cluster, Node Exporter, cAdvisor and Kubernetes itself produce about 5,000 metrics per minute. However, the set of must-have metrics for forecasting node and pod anomalies is much smaller: about 40 to 50 metrics per Kubernetes resource are enough. The process therefore needs the ability to filter for the must-have metrics to minimize data processing time.
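A filtering step of this kind amounts to an allowlist applied to the metric stream before evaluation. The sketch below shows the idea; the metric names are real Node Exporter series, but the exact allowlist our team used is an assumption:

```python
# Sketch: reducing the high-volume metric stream to a must-have
# allowlist before evaluation. The allowlist contents are illustrative.

MUST_HAVE = {
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_filesystem_avail_bytes",
    "node_network_receive_drop_total",
}

def filter_metrics(samples):
    """Keep only samples whose metric name is on the allowlist."""
    return [s for s in samples if s["name"] in MUST_HAVE]

feed = [
    {"name": "node_cpu_seconds_total", "value": 1234.5},
    {"name": "node_entropy_available_bits", "value": 256},
]
kept = filter_metrics(feed)  # only the CPU sample survives
```

Dropping metrics this early keeps downstream stages well under the 30-second budget, since they only ever see the 40-50 series that matter per resource.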
The metric evaluator pre-evaluates the target nodes using metrics related to CPU, memory, file system and network, applying preset rules from the experts’ guidelines, and saves the evaluation results every 30 seconds. The implemented pipeline can process the metrics from the cluster, but it sometimes takes more than 30 seconds to complete.
The machine learning pipeline reads the pre-evaluation results every 30 seconds and feeds them to the model, both for training and for evaluating system anomalies, so the information can be used for IT operations. The saved evaluation results can be used to mute alert noise and manage system outages proactively.
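Turning the stream of 30-second pre-evaluation results into forecasting training data can be done with a sliding window: the last few evaluations become the features, and the next evaluation becomes the label to predict. This is a generic sketch, not our team’s exact feature engineering, and the window length is an assumed parameter:

```python
# Sketch: building (window, next-label) training pairs from the stream
# of 30-second pre-evaluation results. Window size is an assumption.

def make_training_pairs(labels, window=4):
    """Each sample: the last `window` evaluations predict the next one."""
    pairs = []
    for i in range(len(labels) - window):
        pairs.append((labels[i:i + window], labels[i + window]))
    return pairs

history = [0, 0, 0, 1, 1, 0]  # two minutes of 30-second evaluations
pairs = make_training_pairs(history, window=4)
# pairs[0] == ([0, 0, 0, 1], 1)
```

Pairs shaped like this can be fed to the TensorFlow model as supervised examples, so the same stream that drives operations also keeps the model training.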
Implications from the Implementation
After deploying the pipeline and the system, it was difficult to train the machine learning models because there were no outages or cluster issues for several days. Our team had to induce failure situations so the pipeline would produce pre-evaluation results to train the models.
End-to-end processing may take more than 30 seconds if you monitor more than two Kubernetes clusters, unless the pipeline is horizontally scaled. We considered adjusting the metric filter logic to reduce the target metrics and shorten processing time, but our team settled on a one-minute processing time to avoid losing business context.
The evaluation results are largely explained by key metrics such as CPU usage, memory usage, storage, high network traffic and dropped packets; the remaining metrics’ influence on the anomaly was quite low. Building more nuanced machine learning models might require a longer training period.
The team is still discussing when we can turn off the rule-based pre-evaluation and rely solely on the trained models; we haven’t turned it off yet.
Automating and monitoring the pipeline are key success factors since the entire process is quite complex.
Overall, the feedback process will be crucial. The IT operations team should be able to provide feedback on successful and failed anomaly detections, and that feedback should be fed back into the machine learning model.
Correlation between Kubernetes resources also needs to be considered as a model input. Pod and volume anomalies might be the cause of a node failure, and the machine learning model could be improved to accommodate that correlation.
In addition to metrics, inputs such as Kubernetes events and application logs would help improve the model’s performance.
To hear more about cloud native topics, join the Cloud Native Computing Foundation and the cloud native community at KubeCon + CloudNativeCon North America 2022 in Detroit (and virtual) from Oct. 24-28.