CDN Outages: Exploring Ways to Increase Resilience
Issues at Content Delivery Network (CDN) providers have caused several high-profile outages over the past few years: a Cloudflare outage last month impacted a large proportion of its customers; a Fastly outage a year ago knocked out websites ranging from Amazon to CNN; and another Cloudflare outage in 2020, this time affecting its 1.1.1.1 DNS service, had a similarly broad impact.
In each case, the specific trigger has been different, though the common denominator has been human error. Cloudflare’s recent outage was caused by an error in a configuration change, Fastly’s 2021 outage was caused by a bug in newly deployed software, and Cloudflare’s 2020 outage was caused by a misconfiguration that overloaded a router in Atlanta.
Responding to the Unexpected
Those issues point to a key challenge in a constantly changing online environment: as Fastly chief product and strategy officer Lakshmi Sharma told The New Stack, no amount of testing or simulations can anticipate every possible contingency. “The internet and the large edge and cloud infrastructures that support it are highly resilient, but, as with any complex system with interdependencies on many fixed and variable pieces, the unexpected can happen — resulting in outages,” she said.
Fastly’s answer, Sharma said, is to be as direct and forthcoming as possible if and when issues do come up. “The best we can do is to be transparent with customers, be honest about what happened and share key learnings to not only help resolve the issue in the moment but help reduce recovery time when any outage occurs in the future,” she said.
In part, that means encouraging customers to consider a diversified strategy for increased resilience. “We want our customers to be successful, so we also recommend changes they can make in their own infrastructure, including implementing a multi-CDN or multicloud strategy for business continuity where strategically appropriate,” Sharma said.
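As a rough illustration of the multi-CDN idea Sharma describes, the Python sketch below picks the first healthy provider from a priority-ordered list and builds asset URLs against it. The `CdnEndpoint` type, the `pick_endpoint` and `asset_url` helpers, and the example hostnames are all hypothetical and not taken from Fastly or any other provider; in practice this kind of failover is often handled at the DNS or load-balancer layer rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CdnEndpoint:
    name: str      # hypothetical label, e.g. "primary"
    base_url: str  # hypothetical hostname fronting the same origin content


def pick_endpoint(
    endpoints: Sequence[CdnEndpoint],
    is_healthy: Callable[[CdnEndpoint], bool],
) -> Optional[CdnEndpoint]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, DNS failure) counts as unhealthy.
            continue
    return None  # every provider failed its check


def asset_url(endpoint: CdnEndpoint, path: str) -> str:
    """Build the URL for an asset served through the chosen CDN."""
    return endpoint.base_url.rstrip("/") + "/" + path.lstrip("/")


# Usage with a stubbed health check: the primary provider is marked as down,
# so requests fall through to the secondary one.
endpoints = [
    CdnEndpoint("primary", "https://cdn-a.example.com"),
    CdnEndpoint("secondary", "https://cdn-b.example.com"),
]
currently_down = {"primary"}
chosen = pick_endpoint(endpoints, lambda e: e.name not in currently_down)
if chosen is not None:
    print(asset_url(chosen, "/img/logo.png"))  # https://cdn-b.example.com/img/logo.png
```

The health check is passed in as a function so that a real probe (an HTTP request with a short timeout, for example) can replace the stub without changing the selection logic.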
Cloudflare declined to comment for this article beyond the information in the blog posts linked above, but they’ve been similarly transparent about the causes of these events and the actions they’re taking in response to them.
Still, when simple configuration mistakes can have this powerful an impact, is there a broader lesson to learn?
Considering Smaller Providers
Mark Boost, CEO of cloud service provider Civo, told The New Stack that outages like these suggest it’s never a good idea to put all your eggs in one basket: a single error at a single provider shouldn’t be able to take down a large swath of the internet. “Over-reliance on one or two providers is what companies should be looking at in de-risking themselves,” he said.
For a larger company with a global customer base, Boost said, it makes sense to count on a CDN to distribute content worldwide — but that may not be the case for many smaller companies, which would likely benefit from considering other options. “A lot of them probably aren’t using cached content in terms of images and things like that, and there are other solutions that can help with security,” he said. “There’s appliances you could buy, and there’s various smaller providers that you can use.”
And while many larger companies have free offerings that make it easy to get started with them, Boost said, it’s important to do so with an awareness of the risks. He pointed to the 2020 AWS outage that disabled iRobot vacuums as an example of the potential downsides of relying on a larger provider, which can face outages due both to human error and to cyber attacks. “I don’t know if people are really thinking about the security risks of targeted attacks against some of these huge companies,” he said.
With vast scale, Boost said, often comes vast complexity — which can lead both to issues on the provider’s end and to misconfiguration by the user. “If you think of the 150 services that Amazon offers, all with different options, there [are] lots of ways you can accidentally misconfigure things,” he said. “You may think you’ve got a high availability setup, but it turns out you haven’t configured it in the right way, and that could lead to a security risk.”
A Diversified Landscape
Still, there’s clear resistance to switching from larger providers to smaller ones: a recent Civo survey [PDF] of 1,000 developers found that 51% see smaller cloud providers as less secure, and 47% think they suffer more outages. “Sometimes people are unwilling to give them a chance, these small providers — but in reality, just because they’re small doesn’t mean they’re not very good at what they do,” Boost said.
All of that could, of course, be seen as a sales pitch for Boost’s own company, but he stressed that he’s more interested in persuading potential customers to broaden their range of options than simply selling them on his own offerings. “There [are] niche providers out there that might focus on certain areas, even more so than AWS — like a real high-security, lockdown, zero trust type environment,” he said. “There [are] lots of other people that are available.”
And the potential benefit of considering a range of providers is clear, whether it’s prompted by the kind of multi-CDN or multicloud strategy advised by Fastly’s Sharma, or by a desire to bring smaller providers into the mix as Boost suggests — the two aren’t mutually exclusive. “If we share some of that overall load that is currently with these hyperscalers with some of these small and medium-sized providers, it would mean the world is not locked into very few people — which is a dangerous place to be from a security perspective,” Boost said.
In that sense, Sharma’s and Boost’s suggestions, coming from a larger provider and a smaller one, are ultimately very similar: it makes sense to turn to multiple providers to increase resilience where appropriate; and regardless of the provider or providers you consider, transparency is key. “There’s not going to be a world where we’re not going to have any outages,” Boost said. “There’s always going to be something that goes wrong — it’s how you deal with it.”