Why Disaster Happens at the Edges: An Introduction to Queue Theory
Technologists love speeds and feeds: hefty caches, massively parallel processors, gigahertz, gigabits, petabytes and teraflops.
But in real-world applications, the user experience depends less on raw power than we often assume. Latency spikes, spinning wheels, website timeouts — often these problems are the result not of insufficient resources, but badly managed queuing.
What We Talk about When We Talk about Performance
To understand why, we have to focus on the right metrics. In particular, we have to go beyond the mean.
When it comes to IT performance, amateurs look at averages. Professionals look at distributions.
The natural world is full of distributions. Storms are more or less severe. Temperatures vary from year to year. Nothing happens exactly the same way every time.
Every distribution can be described as a curve. The width of this curve is known as its variance. More common results cluster near the peak. Less common ones sit further out on the tails. The steeper the curve, the lower the variance, and vice-versa.
It’s tempting to focus on the peak of the curve. That’s where most of the results are. But the edges are where the action is. Events out on the tails may happen less frequently, but they still happen. In digital systems, where billions of events take place in a matter of seconds, one-in-a-million occurrences happen all the time. And they have an outsize impact on user experience.
When it comes to IT performance, amateurs look at averages. Professionals look at distributions.
A moment’s reflection explains why. Average results are, by definition, unremarkable. They’re the ones people are used to — the ones they don’t notice. What people do notice are the outliers, especially negative outliers.
Just as the strength of a chain is determined by its weakest link, the efficiency of a system is defined by its worst-case scenarios. If you have a database with a speedy average latency of 1ms, users won’t remember all the times it returns results between 0.5 ms and 1.5 ms. What they remember is the experience they had when latency spiked to five seconds or more.
The greater the variance in a system, the more of those outlier experiences there will be — and the more expensive it is to handle or avoid them.
The Cost of Variance
Imagine a train that travels between London and Paris. Suppose that point-to-point, the journey takes 120 minutes, and that the average trip is delayed 15 minutes. Now suppose that in the distribution of all trips, the longest 5% are delayed 60 minutes or more. We call this a “P95” of 60. And suppose the longest 1% of trips are delayed 120 minutes or more, giving us a P99 of 120. In that case, we can expect one out of every hundred journeys to take almost two full hours longer than average.
Now imagine that the train company has a contract that stipulates the train cannot arrive late more than 1 in 100 trips. To fulfill the terms of the contract, the train will need to leave two hours early every time. That means we can expect that on 95 out of 100 trips, the train will leave at least one hour earlier than it needed to. That’s a lot of trouble and inconvenience for a problem that happens only 1% of the time.
Now to put this in IT terms, substitute system processes for the train, substitute SLA for the contract and substitute latency for the delay. Just as the train sometimes arrives later than usual, processes will sometimes take longer than normal to complete. The greater the variance in performance, the greater the chance of delay, and the more likely it is that resources will be tied up when a job arrives. We call this metric utilization, calculated as arrival rate (how often jobs arrive) divided by service rate (how often jobs are served). The higher the utilization, the longer the wait for jobs to be processed.
This phenomenon especially affects group-work processes like MapReduce or fork-join or anything that parallelizes work initially and then has to wait for stragglers at the end. With a MapReduce job involving 1,000 tasks, if even one of those tasks is delayed, the entire job is delayed by the same amount of time. In this case, the metric that captures the overall completion time of the job — and thus customer satisfaction — is not the average, but the P999. And this case is not uncommon! In our 1,000-task job, there is a 99.9% chance of at least one task hitting p99 levels of performance and a 63.2% of hitting the P999.
This is why it’s so important to look beyond the mean, and why it’s often said that “disaster happens at the edges.” It’s not the average latency that gets you. It’s the tail latency.
So the question is, how do we manage it? The answers come from a study of queue theory.
Queuing and Latency
Queue theory emerged in the early 1900s, from research to determine the most efficient operations of switchboard telephone services.
Queues are everywhere in digital systems: executors, sockets, locks. Any process that operates asynchronously probably depends on a queue.
A queue is essentially a worker (called a service center in queueing theory parlance), which takes jobs from a buffer where jobs await.
A queue’s performance depends on several factors including:
- Arrival rate: how many jobs arrive at the queue in a certain amount of time
- Service rate: how many jobs can be served in a certain amount of time
- Service time: how long it takes to process each job
- Service discipline: how jobs are prioritized (FIFO/LIFO/Priority)
We define latency as the time a job waits in a queue plus the time it takes to process it. One of the notable findings of queue theory is that latency approaches infinity as utilization approaches 100%. At first, latency rises slowly, but then in the higher percentiles it increases dramatically, as we can see from Fig. 1
W = Wait time
𝜏 = Mean service time
ρ = utilization
This suggests an important rule of thumb: for decent quality of performance, keep utilization below 75%. This means provisioning, not for typical loads but extreme ones. Without overcapacity, queues will form and latency will increase.
We can also see from Fig. 1 that latency is proportionate not just to utilization but to service time. It is possible to achieve reasonable latency with high utilization if your service is especially fast. But you need to be careful. A 10% fluctuation in utilization at 𝜌 = 0.5 will hardly affect latency (~ 1.1X), while 10% fluctuation at 𝜌 = 0.9 will crush performance (~ 10X).
It follows that when operating at high utilization, variance must be limited as much as possible. In systems with multiple parts, slower processes should be kept running with lower utilization, whereas faster processes can run at higher utilization.
“When we talk about changes in utilization we refer to “slow changes.” After all, utilization represents an average over some time window. But it’s also important to bear in mind that fast changes are important as well. The variance of the arrival and service rates also has a significant impact on latency — even if the utilization average is the same. Figure 2 below illustrates the effect each variable has on latency.
The first thing to note is that services with lower variation (those with lower c values), can counterbalance higher levels of utilization — but only up to a point. At the highest level of utilization (p 0.9), relatively small shifts in service speed cause dramatic increases in latency.
The same is true of service speed. Latency swings up toward infinity as utilization approaches 1.0. But slower services are far more vulnerable to latency spikes as utilization increases.
The lesson: whatever its source, variance is the enemy of performance.
As we’ve seen, processes run faster at times and slower at others. Similarly, jobs arrive at the queue at different rates. Over time, delays accumulate and the queue backs up. This raises another important consideration: queues are almost always empty or full. In digital systems, they fill up so fast that it’s rare to see one in an intermediate state.
At a minimum, large queues result in higher latency. They also create the risk of stale results. By the time a 10-second job finishes, the client might not want the results anymore. Perhaps the user has already navigated to another web page.
Given how quickly they can grow, unlimited queues are especially dangerous. Even a slight overload can quickly snowball out of control, resulting in memory pressure and, ultimately, out-of-memory conditions. Unfortunately, unlimited queues are quite common. Java executors, for instance, have unlimited queues by default.
So one simple precaution we can take is to cap the size of the queue. The second is to put a time-to-live (TTL) on work items, and have the process throw out any jobs that exceed the limit.
So far we’ve talked about the internal workings of the queue. But in real-world systems, you have multiple services that generate workloads for one another, each with its own queue and variable range of performance. Ideally, services will avoid creating excess workloads for other services downstream from them.
The way this is done is known as “backpressure.” This is the process whereby downstream services notify upstream systems when their queues are full. The upstream service then notifies the services upstream from it, and so forth.
The TCP protocol, for instance, generates backpressure with code 503. Thread pools can implement backpressure by simply limiting the size of the pool. Blocking code has backpressure by default on a per-thread basis, which back-propagates naturally through the blocking chain.
But systems like Kafka don’t have backpressure by default. Luckily there are several open source projects that implement backpressure in Kafka, so that the webserver knows when there’s an overload in the system.
The question then is what to do about it. It’s basically a trade-off between latency and error rate. If we don’t return errors, or find an alternative way to shed the excess load, then latency will inevitably increase.
One approach, as we’ve seen, is to cap queue size and shed any job over the limit. This can be done by implementing a queue at the webserver and having the server return an HTTP 503 response for any request over the cap. Alternatively, the server could present a cached version of the results or send the request to a fallback service. However it’s handled, the goal is to remove excess requests from the overloaded queue.
Another closely related approach is to throttle the arrival rate. Instead of regulating the absolute number of jobs in the queue, we can calculate how much work the queue can handle in a given amount of time and start shedding load when the arrival rate exceeds the target.
To sum up: Variance is the enemy of performance and the source of much of the latency we encounter when using software.
To keep latency to a minimum:
- As a rule of thumb, target utilization below 75%,
- Steer slower workloads to paths with lower utilization,
- Limit variance as much as possible when utilization is high,
- Implement backpressure in systems where it is not built-in,
- Use throttling and load shedding to reduce pressure on downstream queues.
It follows that developers should aim to design and implement software that delivers not just high average performance, but consistent performance, i,e, with low variance. Even an incremental reduction in variance can improve the user experience more than an equivalent increase in raw average speed.
The mean isn’t meaningless. It’s just deceptive. To transform user experience, the place to put your effort isn’t in the middle of the curve but on the ends. It’s the rare events — the outliers — that tell you what you need to know.