Engineering teams today face a growing imperative to keep systems running smoothly. That’s because the boundary between businesses and the digital customer experience is shrinking — and when the application is the business, anything short of a perfect user experience means a hit to the top line.
Against this backdrop, a sound alerts strategy is now mission-critical for nearly every engineering team. According to a recent survey, 58 percent of DevOps professionals report relying on five or more observability tools to identify the root cause of performance issues, which means for every problem they sort through thousands of alerts across multiple locations in order to find the answer. Since it’s impossible to monitor every aspect of a system’s health at all times, a smart alerts strategy can help teams operate more efficiently and focus their attention where it’s needed most, keeping the business running smoothly.
But setting alerts is not always as simple as it seems. To do it right, teams must not only contend with the sprawling complexity that exists below the surface but also combat “alert fatigue,” or the tendency to ignore notifications when they become too frequent. The key in both cases is striking the right balance between cadence and urgency. To help your team build a better alerts strategy (and sleep better at night), here’s an introduction to how we approach the challenge at Scalyr:
First, a word about structure. At the highest level, alerts are designed around three key pillars: metrics (the performance measures you want to monitor), comparison statements (i.e. “greater than,” “less than,” “equals”), and thresholds (the key values you want to compare metrics against).
For example, to monitor for unexpected spikes in CPU usage, you might set the following alert: if average CPU usage over the past 15 minutes is greater than the past 24-hour average, trigger alert.
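As a rough sketch of how the three pieces fit together, the Python below models an alert as a metric name, a comparison statement, and a threshold. The `Alert` class, the `COMPARISONS` table, and the sample numbers are assumptions made for illustration; they are not Scalyr's alert syntax.

```python
import operator
from dataclasses import dataclass

# Hypothetical mapping from comparison statements to Python operators.
COMPARISONS = {
    "greater than": operator.gt,
    "less than": operator.lt,
    "equals": operator.eq,
}


@dataclass
class Alert:
    metric: str       # name of the performance measure being watched
    comparison: str   # one of the keys in COMPARISONS
    threshold: float  # value the metric is compared against

    def should_trigger(self, current_value: float) -> bool:
        """Return True when the current value satisfies the comparison."""
        return COMPARISONS[self.comparison](current_value, self.threshold)


# CPU example from the text: compare the 15-minute average
# against the average over the past 24 hours (illustrative numbers).
past_24h_average = 40.0
cpu_alert = Alert("cpu_usage_15m_avg", "greater than", past_24h_average)
print(cpu_alert.should_trigger(55.0))
```

Keeping the comparison as data rather than hard-coding `>` makes it easy to reuse the same structure for "less than" checks such as free disk space.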
When creating alerts, it often helps to start by establishing thresholds. There are three types that can be applied to alerts: fixed, state-based, and historical.
Fixed thresholds, as the name suggests, trigger alerts when performance levels cross some static, predetermined level. These are the easiest thresholds to set and work best when measuring performance metrics with hard limits, such as CPU usage or hard disk space.
State-based thresholds are also simple and straightforward. They identify changes in the state of a system — for instance, “running” or “stopped” — and are mainly useful for monitoring application processes for unexpected downtime.
Historical thresholds are slightly more complex. Also referred to as “sliding windows,” these thresholds compare a metric’s current value with its past values. In other words, they monitor a specific period of time (say, a 15-minute window) that continues to move as the clock ticks. For example, an alert based on a historical threshold might read: If 15-minute average CPU usage is greater than 120 percent of 15-minute average CPU usage 24 hours ago, trigger alert. The idea here is to identify sudden spikes in a given metric while filtering out noise.
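The historical-threshold example above can be sketched in a few lines of Python. This is a minimal illustration that assumes one CPU reading per minute, stored oldest first; the function name and parameters are hypothetical.

```python
from statistics import mean


def exceeds_historical_threshold(samples, window=15, lookback=24 * 60, ratio=1.2):
    """Compare the latest window against the same-sized window `lookback` samples earlier.

    samples: per-minute readings, oldest first (an assumption of this sketch).
    Returns True when the current average exceeds `ratio` times the past average.
    """
    if len(samples) < window + lookback:
        return False  # not enough history to make the comparison

    current = mean(samples[-window:])
    end = len(samples) - lookback          # index where the past window ends
    baseline = mean(samples[end - window:end])
    return current > ratio * baseline


# 24 hours of steady 50% CPU followed by 15 minutes at 70%.
history = [50.0] * (24 * 60) + [70.0] * 15
print(exceeds_historical_threshold(history))
```

Because both windows slide forward together as new samples arrive, the comparison adapts to daily patterns instead of relying on a single static number.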
On this theme, it is also useful to include “grace periods” when setting alerts. In other words, add a rule that prevents the alert from firing unless its condition holds for a sustained period of time. This helps filter out small aberrations and identify longer-term activity.
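One simple way to picture a grace period is to count consecutive checks that breach the threshold and fire only once the count is high enough. The class below is an illustrative sketch; its name and the choice of counting checks rather than measuring elapsed time are assumptions.

```python
class GracePeriodAlert:
    """Fires only after the threshold has been breached on several consecutive checks."""

    def __init__(self, threshold, grace_period_checks):
        self.threshold = threshold
        self.grace_period_checks = grace_period_checks
        self._consecutive_breaches = 0

    def observe(self, value):
        """Record one reading and return True if the alert should fire."""
        if value > self.threshold:
            self._consecutive_breaches += 1
        else:
            self._consecutive_breaches = 0  # any normal reading resets the count
        return self._consecutive_breaches >= self.grace_period_checks


alert = GracePeriodAlert(threshold=90, grace_period_checks=3)
readings = [95, 96, 50, 95, 96, 97, 98]
print([alert.observe(r) for r in readings])
```

In this run the short two-reading spike at the start is ignored, and the alert fires only once three readings in a row stay above the threshold.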
Once thresholds are mapped out, it’s time to identify the metrics that underpin your alerts. There are five key categories of metrics that every alert strategy needs to cover: capacity, bandwidth, state, rate, and event parameter metrics.
Capacity metrics monitor system components where there is a fixed and known capacity, such as disk space and free memory. If something can “fill up” or “run out,” it’s generally measured with a capacity metric. When capacity is reached, the system breaks, which means the threshold should be set somewhere below maximum capacity. When the consequences of hitting capacity are critically disruptive, the threshold should be set especially low.
Bandwidth metrics measure flows within a system, such as network utilization, CPU usage, and disk throughput. This type of system activity is often highly variable by nature, so consider basing the alert on a moving average. For instance, instead of monitoring network usage by looking at only the latest measurements, you might measure average usage over, say, 30-minute windows.
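A moving average of this kind can be sketched with a fixed-size buffer. The `MovingAverage` class and the sample readings are hypothetical; the window here holds a fixed number of samples, which corresponds to a time window only if readings arrive at a steady interval.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window_size` readings."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest reading automatically
        self.samples = deque(maxlen=window_size)

    def add(self, value):
        """Record a reading and return the average over the current window."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


network_usage = MovingAverage(window_size=3)
for reading in [10, 20, 30, 100]:
    average = network_usage.add(reading)
print(average)
```

A single reading of 100 pulls the three-sample average up to 50, so an alert threshold of, say, 80 would not fire on that one spike alone.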
State metrics have distinct values that are used to monitor changes in system state, such as “running” or “stopped.” These system-status metrics are generally pegged to state-based thresholds.
Rate metrics measure the rate at which certain events take place, that is, the number of events that happen within a certain period of time. Examples include “requests per second,” “errors per second,” and “login attempts per minute.” For predictable metrics, historical thresholds are best. For less-predictable ones, a fixed threshold based on historical highs or lows is often a sound approach.
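To make the idea of a rate metric concrete, the sketch below counts events inside a trailing time window and divides by the window length. The function name, the use of plain numeric timestamps in seconds, and the sample data are assumptions for illustration.

```python
def events_per_second(timestamps, now, window_seconds=60):
    """Return the average event rate over the trailing window.

    timestamps: event times in seconds (any consistent clock).
    Events are counted if they fall in the interval (now - window_seconds, now].
    """
    recent = [t for t in timestamps if now - window_seconds < t <= now]
    return len(recent) / window_seconds


error_times = [0, 10, 50, 55, 59, 60]
print(events_per_second(error_times, now=60, window_seconds=10))
```

The resulting rate can then be compared against either a historical baseline or a fixed limit, depending on how predictable the metric is.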
Event parameter metrics are based on some measurable attribute of an event, such as response time or request size. The metric is typically used to monitor and report an average value for all relevant events over some period of time. The thresholds you set for event parameter metrics are similar to those for rate metrics: use a historical threshold for predictable metrics, and a fixed threshold for less-predictable ones.
With the right alert thresholds and metrics in place, the next step is building a system for receiving notifications when issues crop up. This step is critical. You can have the best alert parameters in the world, but they won’t do any good if the alerts don’t reach you.
Modern-day DevOps environments present a variety of options. For instance, alerts can now be stored in databases and delivered via email in daily batches, emailed immediately, sent via SMS or phone call, or some combination thereof. The trick is figuring out what interaction model works best for your team.
Generally speaking, urgent alerts should be interruptive by design. They need to catch your attention immediately and convey the urgency of the problem. For this reason, it also often makes sense to “stack” alerts on the same metric, at different thresholds and with different notification methods. The idea here is to escalate alerts as an issue becomes more urgent.
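Stacked alerts can be pictured as an ordered list of thresholds, each paired with a notification method. The levels and percentages below are made-up values for a hypothetical disk-usage metric, not recommendations.

```python
# Hypothetical escalation ladder, ordered from most to least urgent.
ESCALATION_LEVELS = [
    (95.0, "phone call"),
    (90.0, "SMS"),
    (80.0, "daily email digest"),
]


def notification_for(value):
    """Return the most urgent notification method whose threshold is met, or None."""
    for threshold, method in ESCALATION_LEVELS:
        if value >= threshold:
            return method
    return None


for usage in [70.0, 85.0, 92.0, 99.0]:
    print(usage, notification_for(usage))
```

Ordering the list from the highest threshold down means the first match is always the most interruptive method that applies, so a worsening problem naturally moves from email to SMS to a phone call.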
There’s far more to alerting than meets the eye. Setting a sound alert strategy requires thoughtfulness across three key pillars: thresholds, metrics, and notifications. As application uptime becomes mission-critical for every business, taking the time now to figure out the right combinations for your application environment and team can make all the difference.