Reduce Alert Fatigue and Improve Your Kubernetes Monitoring
Alert fatigue is a state of exhaustion caused by receiving too many alerts. It sets in when alerts are not actionable, irrelevant or too frequent.
Misconfigured alerts, alerts built on the wrong assumptions and alerts that lack service-level objectives (SLOs) have a dual impact: They lead to alert fatigue and, more alarmingly, raise the risk of overlooking critical alerts.
We spoke with more than 200 teams using Prometheus Alertmanager. Many face alert fatigue from trivial, nonactionable alerts.
Setting up monitoring for your entire infrastructure is no longer difficult. The harder questions are how to combat alert fatigue so that critical alerts aren’t missed, and how to make informed choices about metrics and thresholds.
Let’s dive into Prometheus Alertmanager. We’ll outline the ideal metrics and guide you on establishing the appropriate thresholds.
What Is Prometheus Alertmanager?
Prometheus, an open source monitoring system, is equipped with a dynamic query language, a highly efficient time series database and a cutting-edge approach to alerting.
Its companion application, Alertmanager, receives alerts sent by client applications such as Prometheus and takes care of deduplication, grouping and routing. Alertmanager then delivers alerts to their specified recipients through integrations like email, Slack, Zenduty or PagerDuty.
Together, Prometheus and Alertmanager provide a powerful and modern monitoring solution that helps you improve incident response, reduce alert fatigue and ensure the system is reliable.
Alertmanager offers versatile features that let you filter, group, route, silence and inhibit alerts.
Filter and group alerts using criteria like labels and expressions to concentrate on critical matters, and then send them to suitable destinations like email, Slack, Zenduty or PagerDuty to ensure the relevant people are notified.
Additionally, you can temporarily silence alerts to prevent excessive notifications during crucial incidents and inhibit alerts based on specific criteria to prevent redundancy and nonessential notifications.
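To make deduplication and grouping concrete, here is a minimal Python sketch of the idea. It is not Alertmanager's implementation; the function name `group_alerts`, the alert dictionaries and the label values are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # Two alerts with exactly the same labels are treated as duplicates.
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Alerts that share the group_by label values land in one notification group.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: the third one duplicates the first.
alerts = [
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "pod": "api-1"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "pod": "api-2"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "pod": "api-1"}},
    {"labels": {"alertname": "PodRestart", "cluster": "staging", "pod": "web-1"}},
]
grouped = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in grouped.items():
    print(key, len(members))
```

In this sketch, four incoming alerts collapse into two groups, so the on-call engineer would see two notifications instead of four.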
Now that we understand the capabilities of Prometheus Alertmanager, let’s dive into defining effective Prometheus metrics.
What the Right Prometheus Metrics Should Be
Prometheus Alertmanager is a powerful tool, but only if you use it right. Imagine not setting up any alerts for your Kubernetes cluster.
That would be a huge mistake. But setting too few alerts or missing critical metrics is just as bad. And overloading yourself with too many wrongly labeled or unnecessary alerts is a recipe for alert fatigue.
Nailing those alerts with precise thresholds is the secret to reliability and seamless operations.
But the question is: What should a well-configured Prometheus Alertmanager look like?
Here are some characteristics you should consider:
- Well-defined: The metrics should have a clear and concise definition. This helps the team understand what each metric measures and how to use it.
- Actionable: Being woken up by an alert can be unsettling, especially when you’re not sure how to respond or if it’s something beyond your control. That’s why it’s crucial to have actionable metrics. When you receive an alert, you should have a clear understanding of what steps to take to address the underlying issue and resolve it effectively.
- Informative: The metrics should provide valuable information about the system or application being monitored. These details can be used to identify and resolve problems, improve performance and ensure the overall health and reliability of the system.
- Impactful: Engineers don’t want to wake up in the middle of the night for something that won’t affect the business. Alerts should relate to something that could affect your business. If you’re not sure whether an alert is important, lean toward not alerting on it.
Every organization should keep an eye on specific Prometheus Alertmanager metrics and set up alerts for them.
Some basics to cover:
- Watch the count of 4xx and 5xx responses within a minute. If over 60% of all requests return a 4xx, trigger notifications. Distinguishing between 5xx and 4xx errors is also vital: Set an alert when 5xx errors are detected.
- Create an alert to send a notification when your Horizontal Pod Autoscaler (HPA) is approaching its maximum capacity.
- Establish alerts for container CPU usage with thresholds that align with your benchmarks and expected response times. This ensures timely notifications for any abnormal resource consumption.
- Ensure you’ve configured an out-of-memory alarm that triggers when pods face memory issues and risk termination. This helps prevent critical failures due to memory constraints.
- Detecting when too many requests return a 5xx can help you correlate system or code changes with dropped requests.
In addition to the metrics above, there are several other essential metrics that we recommend organizations consider:
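The first two checks in the list above can be sketched in a few lines of Python. This is an illustration of the logic only, not a Prometheus rule: the function names, the alert names and the 90% HPA headroom fraction are assumptions, while the 60% ratio for 4xx responses comes from the list.

```python
def evaluate_http_alerts(status_counts, ratio_4xx_threshold=0.60):
    """Return the names of alerts that should fire for one minute of traffic.

    status_counts maps an HTTP status code to the number of responses seen.
    """
    total = sum(status_counts.values())
    if total == 0:
        return []
    count_4xx = sum(n for code, n in status_counts.items() if 400 <= code < 500)
    count_5xx = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    fired = []
    if count_4xx / total > ratio_4xx_threshold:
        fired.append("High4xxRatio")  # hypothetical alert name
    if count_5xx > 0:
        fired.append("Http5xxDetected")  # hypothetical alert name
    return fired


def hpa_near_max(current_replicas, max_replicas, fraction=0.9):
    """True when the HPA is at or above the given fraction of its maximum (0.9 is an assumed default)."""
    return current_replicas / max_replicas >= fraction


print(evaluate_http_alerts({200: 30, 404: 70}))
print(evaluate_http_alerts({200: 99, 500: 1}))
print(hpa_near_max(current_replicas=10, max_replicas=10))
```

Keeping the 4xx ratio and the 5xx check separate mirrors the point above: client errors matter in bulk, while server errors matter as soon as they appear.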
- Keep an eye on the number of node context switches occurring in a 5-minute timeframe. When this count exceeds 5,000, trigger notifications.
Consistently high context switching indicates the need to switch to a memory-optimized (RAM) instance instead of sticking with the current configuration for too long. This metric is typically watched during the R&D phase, when benchmarks are still being established.
Not monitoring this metric can leave us in the dark about performance issues. If our performance consistently matches our usual benchmarks, we can reduce the frequency of monitoring to every 30 minutes instead of every five minutes to reduce unnecessary alerts.
- Set up an alarm to notify the team when the number of pods decreases to below a certain threshold.
For product teams with setups that might face physical pod shutdowns, this alert can be a fundamental lifeline, notifying the team of such failures.
Be aware, though, that this alarm fires whenever the pods hit the minimum threshold, so it will be a constant source of noise for products that are well scaled and expected to run with low resource consumption.
- If you don’t know something has gone south, how do you find out what went south?
Sometimes we rely too much on automation and forget that we need to track auto-restarts. A basic alert that is often missed is one for pod restarts. This alert can be a valuable tool for connecting the dots between other service modifications and potential delays.
- Attaching an unsupported node to your cluster can cause unexpected behavior and make it difficult to troubleshoot problems. To prevent this, set up an alert when an unsupported node is attached.
- Monitoring what Prometheus is scraping is highly recommended. If Prometheus runs out of memory, the instance can become unstable or restart frequently, causing delays in alerting.
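The context-switch check from the list above depends on how much a counter grew over a five-minute window. The Python sketch below shows one simple way to compute that from raw samples. It is an assumption-laden illustration: the function name and the sample data are hypothetical, and Prometheus itself computes increases differently (it also extrapolates to the window boundaries).

```python
def counter_increase(samples, window_seconds=300):
    """Total increase of a monotonically growing counter over the trailing window.

    samples is a time-sorted list of (unix_timestamp, counter_value) pairs.
    A drop in value is treated as a counter reset (for example after a node reboot).
    """
    if not samples:
        return 0.0
    cutoff = samples[-1][0] - window_seconds
    recent = [s for s in samples if s[0] >= cutoff]
    increase = 0.0
    for (_, prev), (_, cur) in zip(recent, recent[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        increase += (cur - prev) if cur >= prev else cur
    return increase


CONTEXT_SWITCH_THRESHOLD = 5000  # the five-minute threshold suggested in the list above

# Hypothetical samples taken every 60 seconds, with a counter reset at t=180.
samples = [(0, 1000), (60, 2500), (120, 4000), (180, 200), (240, 3000), (300, 5200)]
increase = counter_increase(samples)
should_alert = increase > CONTEXT_SWITCH_THRESHOLD
print(increase, should_alert)
```

Handling the reset explicitly matters here: comparing only the first and last sample would understate the real increase and could hide a genuine breach.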
Getting the Right Metrics Is Not Enough
Alertmanager metrics are crucial, but they’re just one part of the equation. The other half is configuring the thresholds correctly.
Setting thresholds too low results in an avalanche of alerts for minor metric changes, causing alert fatigue. Conversely, if thresholds are too high, vital alerts may slip through the cracks.
Remember: The ideal threshold varies based on your infrastructure and business needs.
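One way to see the trade-off before committing to a threshold is to replay recent metric history against a few candidate values and count how often each would have fired. The Python sketch below assumes a hypothetical list of CPU utilization samples and hypothetical candidate thresholds; it is an aid for choosing a starting point, not a substitute for knowing your own benchmarks.

```python
def breaches_per_threshold(history, candidates):
    """Count how many samples in history would have exceeded each candidate threshold."""
    return {t: sum(1 for value in history if value > t) for t in candidates}


# Hypothetical CPU utilization samples, in percent.
cpu_history = [42, 55, 61, 48, 93, 58, 66, 97, 52, 60]

counts = breaches_per_threshold(cpu_history, candidates=[50, 80, 99])
for threshold, fired in counts.items():
    print(f"threshold {threshold}%: would have fired {fired} times")
```

With this sample data, a threshold of 50% would have fired on most samples, 99% would never have fired, and 80% would have caught only the two spikes.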
Setting the Right Thresholds for Alertmanager to Reduce Alert Fatigue
- When configuring Alertmanager metrics, review and adjust the rate limit settings and equations. Take a moment to understand the intended behavior and consider how you’re scraping the metrics, as this approach significantly affects the setup process.
- It’s essential to review who receives notifications for alerts. Ensure that the right people are being notified. Proper segregation of alerts is crucial here. Overwhelming your engineers with unnecessary alerts can have a negative impact on their performance and overall productivity.
- Understand the purpose behind setting an alert. Sometimes an alert on a specific metric is not needed at all and only adds noise. Before configuring alerts, ask yourself: What is the alert intended to indicate? This clarity will help ensure that alerts are meaningful and valuable.
- When configuring alert equations, it’s essential to conduct a thorough analysis to identify potential weaknesses in the metrics.
Perform statistical analysis regularly to understand how metrics interact and affect system performance. It’s crucial to anticipate which systems might be affected by the alert. This proactive approach allows you to address potential issues before they escalate into full-blown incidents, ensuring smooth operations and minimizing disruptions.
- Recognize that certain alerts are expected and should not be considered unusual. To prevent alert fatigue, consider silencing notifications for these expected alerts. This strategic approach ensures that your team remains focused on critical issues while reducing unnecessary noise and distractions.
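Silencing expected alerts, as the last point suggests, comes down to two questions: Do the alert's labels match, and is the current time inside the silence window? The Python sketch below illustrates that check. The function name, the dictionary layout and the example labels are hypothetical, and only exact label matches are handled, whereas real Alertmanager silences also support regular-expression and negative matchers.

```python
from datetime import datetime, timezone


def is_silenced(alert_labels, now, silences):
    """True if any silence is active at `now` and all of its matchers equal the alert's labels."""
    for silence in silences:
        in_window = silence["starts_at"] <= now < silence["ends_at"]
        labels_match = all(
            alert_labels.get(name) == value
            for name, value in silence["matchers"].items()
        )
        if in_window and labels_match:
            return True
    return False


# Hypothetical silence covering a planned node drain in staging.
silences = [
    {
        "matchers": {"alertname": "NodeDrain", "cluster": "staging"},
        "starts_at": datetime(2023, 11, 6, 2, 0, tzinfo=timezone.utc),
        "ends_at": datetime(2023, 11, 6, 4, 0, tzinfo=timezone.utc),
    }
]

during = datetime(2023, 11, 6, 2, 30, tzinfo=timezone.utc)
after = datetime(2023, 11, 6, 5, 0, tzinfo=timezone.utc)
staging_alert = {"alertname": "NodeDrain", "cluster": "staging", "node": "n1"}
prod_alert = {"alertname": "NodeDrain", "cluster": "prod", "node": "n7"}

print(is_silenced(staging_alert, during, silences))
print(is_silenced(prod_alert, during, silences))
print(is_silenced(staging_alert, after, silences))
```

Scoping the silence by both labels and time keeps the same alert audible for production and lets it resume automatically once the maintenance window ends.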
At Zenduty, we offer integrations with 150+ application and monitoring tools. However, one of the most commonly used yet often misconfigured integrations is Prometheus Alertmanager. One of our early customers once said, “Prometheus Alertmanager works too well for its own good.”
While that might be true, it’s all we have for now.
We believe these strategies will help your team combat alert fatigue and enable engineers to establish accurate thresholds and alerts in Prometheus Alertmanager.
Which one of these worked for you? Did we miss something important? We’re eager to learn about your approaches to configuring metrics and thresholds to combat alert fatigue. Share your insights with us at email@example.com.
Meet us at KubeCon 2023, and drop by booth M35. We’d love to talk about site reliability and share some war room stories with you!