Well, wasn’t that fun? On June 8, 2021, many internet users went to their usual sites, such as Amazon, Reddit, CNN, or the New York Times, and found nothing but an “Error 503 service unavailable” and an ominous “connection failure” note. So, what happened? The Fastly Content Delivery Network (CDN) had gone down, hard.
How did a service failure by a company known only to ISP administrators cause such havoc? Simple.
When the internet began as ARPANet, its job was to deliver fault tolerance and robust connectivity even in the event of a nuclear war. When it became commercial in 1993, thanks to the web and the Commercial Internet Exchange (CIX), other features became important. In particular, everyone started demanding faster performance and lower latency.
The solution? CDNs. These companies, which besides Fastly include Cloudflare and market leader Akamai, all use the same basic techniques to speed up the net. They take the data from popular sites and place it in distributed caches in points of presence (PoPs) close to consumers.
If that sounds familiar to you, even if you’re a cloud native developer and not a network administrator, there’s a good reason. CDNs were one of the first businesses to rely on an edge computing model.
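To make the caching idea concrete, here is a minimal Python sketch of what a single PoP does: serve a stored copy while it is fresh, and go back to the origin site only on a miss. It is a toy under stated assumptions, not how Fastly or any other CDN is actually built. The names `EdgeCache`, `fetch_from_origin`, and `ttl_seconds` are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy model of one CDN point of presence (PoP). All names are hypothetical."""

    def __init__(
        self,
        fetch_from_origin: Callable[[str], bytes],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_origin   # slow path: ask the real site
        self._ttl = ttl_seconds           # how long a cached copy stays fresh
        self._clock = clock               # injectable so the sketch is testable
        self._store: Dict[str, Tuple[float, bytes]] = {}  # path -> (expiry, body)
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            # Fresh copy held at the edge: no trip to the origin needed.
            self.hits += 1
            return entry[1]
        # Missing or expired: fetch from the origin and cache the result.
        self.misses += 1
        body = self._fetch(path)
        self._store[path] = (now + self._ttl, body)
        return body
```

In this sketch, a popular page is fetched from the origin once per TTL window and every other request is answered from the PoP, which is where the speedup comes from. It is also why a failure at the PoP layer hides a site that is otherwise perfectly healthy.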
This usually works well. Indeed, over half of the internet’s traffic today flows through CDNs. When you visit a website now, chances are you’re not getting the data directly from the site but from the closest supported CDN PoP. The technology is mature and well understood.
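As a rough illustration of “closest supported PoP,” the Python sketch below picks the nearest healthy PoP by great-circle distance. This is a simplification: real CDNs generally steer traffic with DNS or anycast routing rather than a distance calculation, and the function names, PoP labels, and coordinates here are hypothetical.

```python
import math
from typing import Dict, Optional, Set, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_pop(client: Coord, pops: Dict[str, Coord],
                healthy: Optional[Set[str]] = None) -> str:
    """Return the name of the closest PoP, skipping any not marked healthy."""
    candidates = {
        name: loc for name, loc in pops.items()
        if healthy is None or name in healthy
    }
    if not candidates:
        # Every PoP is down: there is nowhere left to send the request.
        raise LookupError("no healthy PoP available")
    return min(candidates, key=lambda name: haversine_km(client, candidates[name]))


# Hypothetical PoP locations, approximate city coordinates.
POPS = {
    "london": (51.5, -0.1),
    "new_york": (40.7, -74.0),
    "tokyo": (35.7, 139.7),
}
```

A client near Paris, at roughly `(48.9, 2.4)`, would be sent to `"london"` here. If London were marked unhealthy, the same client would fall back to `"new_york"`, with a longer round trip. If no PoP is healthy, the lookup fails outright.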
So, what went wrong with Fastly?
Fastly Senior Vice President of Engineering and Infrastructure Nick Rockwell explained, “On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.” Then, “Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.”
The company got its services back with an emergency fix in less than an hour. But, for hundreds of millions of users, those were still way too many minutes of dead air. Later that day, Fastly started deploying a permanent bug fix.
Fastly, as it continues to deploy its fix across its network of PoPs, isn’t saying exactly what the problem was. We know, however, that the company is “conducting a complete post mortem of the processes and practices we followed during this incident,” working out “why we didn’t detect the bug during our software quality assurance and testing processes,” and looking for “ways to improve our remediation time.”
Looking ahead, Fastly will make fundamental security improvements to its underlying infrastructure. It will do this by using the “isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up.” How exactly? Rockwell promised, “We’ll continue to update our community as we make progress toward this goal.”
This isn’t the first time a CDN has gone down and taken many sites with it. In 2020, Cloudflare saw a half-hour outage that covered most of Europe and the Americas. That outage happened because of a bad one-line fix to a physical problem. The result was a cascading failure that eventually knocked out almost 20 PoPs globally.
It won’t be the last time. CDNs have quietly become essential to the modern internet. And, as these two most recent cases have shown, they’re fragile.