What We Can Learn from Twitter’s Outages
“All of these systems are fundamentally people and machines. And there’s no getting away from that for a very long time, hopefully a very, very long time. And so, if all we look at is how the technology failed, we’re missing a huge part of both how it fails and how it works,” Courtney Nash, research analyst at Verica, told The New Stack.
Last Monday, many Twitter users received an unusual API error message after a configuration change, allegedly made by a single engineer, broke much of the site. Photos wouldn’t load, the Twitter-owned TweetDeck service was down, and links didn’t open when clicked, among other unusual behavior. This was all a result of changes made as Twitter looks to move from an open API culture to a premium-only API service, even controversially charging for two-factor authentication, an industry security standard.
The outage only lasted a few hours but it offers a lot of valuable lessons for the tech industry as a whole around the impact of change management and tech layoffs on complex sociotechnical systems.
What Went Wrong?
We don’t and probably won’t know what really occurred. “It’s impossible for those outside the company to know exactly what happened to cause the most recent API outage at Twitter, but in their efforts to build a new, paid API for developers some changes pushed to production had unintended consequences,” Kate Holterhoff, an analyst at RedMonk, told The New Stack.
What we do know is that it could happen to any company, especially an enterprise that’s reached the technical complexity, scale and global distribution of Twitter.
“It sounds like a garden variety configuration change,” offered Nash, based on what she’s read in tweets from CEO Elon Musk and Twitter Support. “And, depending on who you ask, it was a single engineer who pushed an incorrect configuration change that nuked their API situation.”
This may have made headlines because of the politics of the Twitter transfer of power, and because of the user base of 450 million that could’ve been affected, but it’s not a shocker. As a self-proclaimed internet incident librarian who built VOID, Verica’s Open Incident Database, Nash found that 400 of the 10,000 incident reports from 590 organizations currently listed on VOID involve some sort of configuration change.
These are, she says, “notorious in these kinds of cascading failures,” referring to the microservices hazard where a failure in one part of a system of interconnected parts can lead to the failure of other parts or even the whole system. Microsoft, Amazon Web Services, Facebook and Cloudflare are among the companies that’ve also had config incidents.
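To make the idea of a cascading failure concrete, here is a minimal Python sketch. The service names and the dependency map are hypothetical, invented for illustration, and say nothing about Twitter’s actual architecture. It simply walks a dependency graph to find everything that goes down when one service fails.

```python
from collections import deque

# Hypothetical service graph: each key depends on the services in its list.
DEPENDS_ON = {
    "web": ["api"],
    "tweetdeck": ["api"],
    "api": ["rate_limiter", "media"],
    "media": [],
    "rate_limiter": ["config"],
    "config": [],
}


def blast_radius(failed, depends_on):
    """Return every service that fails, directly or transitively, when `failed` goes down."""
    # Invert the graph so we can ask "who depends on this service?"
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the failed service.
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


print(sorted(blast_radius("config", DEPENDS_ON)))
```

In this toy graph, losing `config` takes out every service except `media`, while losing `web` affects only `web` itself. The point is that the damage depends on where a failure sits in the graph, not on how small the change looked.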
Historically, she noted, Twitter hasn’t publicly reported its incidents. VOID was created to encourage sharing of outages so the tech industry can learn among peers.
While we don’t have an incident report or postmortem from this week’s Twitter API outage, there are some guesses. This week, Musk, who became the Twitter CEO when he bought it for $44 billion last October, publicly called the Twitter code stack “extremely brittle for no good reason.”
Again, not unexpected. “I’ve yet to see an enterprise system that wasn’t brittle, because they’re built out so fast and not always given the sustainment love they need. So I don’t doubt that Twitter has a brittle system. It’s been built over the years by many different waves of developers,” Kin Lane, chief evangelist at Postman, told The New Stack. That history, he said, leaves Twitter’s systems too interrelated to scale down or shut off without a full refactoring and a purposeful roadmap to get there.
“What could have prevented this was not pulling the plug on an entire data center and taking out three-fourths of your microservices and firing all of the people that are responsible for those services.” — Courtney Nash, Verica
Yet Musk has endeavored to shut off what is likely a majority of Twitter’s microservices. And last month, following his data center cost-cutting efforts, Twitter experienced one of its most significant outages, which left users unable to like, tweet or direct message. All of which raises the question: Why all the outages lately?
“So is it brittle because it’s older and legacy [architecture] and hasn’t been rewritten in a long time and doesn’t evolve? Is it brittle because a lot of microservices are being shut off? Is it brittle because things aren’t maintained: databases, indexes, gateways, things that run the API? I would say D, all the above,” Lane continued. This is why, like all things brittle, the Twitter system should be handled with care.
Technology Is Only Part of the Problem
Like all things in the tech industry, this is really a sociotechnical concern, not just a technical one. Amplifying it, of course, is the most aggressive of the recent tech layoffs, with up to 75% of the Twitter staff cut in just four months. That’s not just slicing overhead; that’s cutting out swaths of experience and expertise. There are unconfirmed rumors that Twitter is down to only one or two site reliability engineers, who are the stewards of the uptime and often the security of a service. That likely means no one was applying best practices like chaos engineering to try to figure out ahead of time “What would happen if…”
“I think the issue is, there was probably not the institutional knowledge to know how to recover from that well, or to know the extent of what might happen there. But I don’t think that it’s shocking that something like this could happen either,” Nash said.
This shouldn’t be about pointing fingers at one engineer, but acknowledging that “It was one person in a very complex sociotechnical system. One person didn’t take down Twitter,” she continued. We are “blaming the humans in this system where, most of the time, the humans are the reason the system works 90-whatever-percent of the time — but we never talk about that.” In a way, she says, Twitter has shifted this conversation from finding blame to talking about how important institutional knowledge is.
This also wasn’t the first time a singular decision has taken down parts of the social media app. Last month, another outage occurred (the one where folks got “You are over the daily limit” notifications even if they hadn’t tweeted yet that day) when an employee accidentally deleted data from Twitter’s internal service on rate limits, likely in service to another premium feature. According to The Verge, no one was left on the team that created that service, so the fix was notably more challenging without the necessary domain expertise.
This all reflects what Manuel Pais, co-creator of Team Topologies, told The New Stack — a lack of a team-first approach. There’s a big risk when tech companies keep letting go of longer-term employees just because they are paid more.
“Don’t let go of the people who are the experts, those are the ones who are going to be powerful if they embrace this idea of teaching and helping others,” he said.
What Could Prevent Future Outages?
That’s not to say there aren’t ways to improve the code — Twitter is the origin of the fail whale meme, after all. But technical changes must be driven by people and processes.
“Insiders often complain of Twitter’s hairball of legacy code — which, to be fair, is nearly always the case with large, complex systems that have evolved over years,” Holterhoff said, echoing the refrain across interviews. “One of the issues with legacy code and architectures is that they can trip up efforts to implement more cutting edge processes like progressive delivery.”
Progressive delivery is an umbrella term coined by her RedMonk colleague James Governor to describe the practice of routing traffic to subsets of users in an effort to reduce the blast radius of any release. This includes canary testing, feature flagging, A/B testing, blue-green deployments, service mesh, and more. It’s often aligned with a broader strategy of chaos engineering and observability to not only minimize impact when failures occur but to experiment with and interrogate systems, increasing reliability and reducing incidents.
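One common building block behind canary releases and feature flags is deterministic bucketing of users. The sketch below is an illustration under stated assumptions, not a description of Twitter’s tooling: the function name `in_rollout`, the flag name `new_paid_api` and the user IDs are all hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user inside or outside a rollout of `percent` percent."""
    # Hash the feature and user together so each feature gets its own split.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0-99
    return bucket < percent


# Hypothetical usage: expose a change to roughly 5% of users first.
users = [f"user-{n}" for n in range(10_000)]
exposed = sum(in_rollout(u, "new_paid_api", 5) for u in users)
print(f"{exposed} of {len(users)} users see the change")
```

Because the bucket comes from a hash rather than a random draw, the same user always gets the same answer, and raising the percentage only adds users without removing any who already had the change. That keeps the blast radius of a bad release limited to the small group exposed first.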
Reflecting on this week’s Twitter API outage, Holterhoff continued: “It sounds like whatever policies the platform team has put in place to facilitate progressive delivery failed to alert them of issues that, if working properly, would have enabled them to avoid deploying these changes universally across the platform.”
In the last few months, the Twitter company culture has undergone a major mindset shift. “Twitter was notoriously slow — until Musk took over — about how they did things and how they changed things,” Nash remarked.
Now, in contrast, there seems to be no technical leadership or top-down strategy to do anything other than move fast and break things.
“The fallout from this failure — the most recent of a series of service outages this year — is the continued scrutiny of Twitter’s reliability and its unwillingness to take full responsibility,” Holterhoff said. “Beyond Musk’s theatrics — particularly those relating to a severely reduced workforce — these high-profile outages signal both a lack of resources and willpower in the org to adequately test and vet the code being shipped.”
Again, cascading failures are a matter of not if but when for teams running complex systems, but industry best practices like those around progressive delivery can help teams prepare better and recover more quickly. Of course, you need educated, experienced or trained teammates to practice them.
“It’s hard to always know what the blast radius of those kinds of changes are until you experience them. So what could have prevented this would probably have been careful investigation of other incidents on those systems,” Nash said, adding that those would ideally be smaller incidents. This includes a transparent culture around postmortems and documentation that helps support staff changes.
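As a rough illustration of how progressive delivery can speed recovery, here is a minimal sketch of a staged rollout that stops and rolls back when errors spike. The stage percentages, the 1% error threshold and the `error_rate_at` callback are assumptions made for the example; a real pipeline would read error rates from its own monitoring.

```python
def staged_rollout(stages, error_rate_at, max_error_rate=0.01):
    """Widen a release stage by stage; roll back once errors exceed the threshold.

    stages: increasing percentages of traffic, for example [1, 5, 25, 100]
    error_rate_at: callable returning the observed error rate at a given percentage
    """
    completed = []
    for percent in stages:
        rate = error_rate_at(percent)
        if rate > max_error_rate:
            # Stop widening and report how far the change got before failing.
            return {"status": "rolled_back", "failed_at": percent, "completed": completed}
        completed.append(percent)
    return {"status": "released", "failed_at": None, "completed": completed}


# Hypothetical changes: one breaks 30% of requests, one is healthy.
def bad_change(percent):
    return 0.30


def good_change(percent):
    return 0.001


print(staged_rollout([1, 5, 25, 100], bad_change))
print(staged_rollout([1, 5, 25, 100], good_change))
```

In this sketch the bad change is caught at the first stage, when only 1% of traffic has seen it, instead of being deployed universally. The gate is only as good as the signal feeding it, which is where observability and experienced people come in.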
The Loss of an Open API
One of Musk’s many attempts to monetize Twitter was the recent announcement that, after 17 years, Twitter API integration partners had a couple of weeks to either pay up or ship out. That shift is likely what prompted the config change that broke the site earlier this week, but it’s also a dramatic change for all of Twitter, which was founded on an open API ecosystem.
“Twitter, for me, is the most important API out there,” Lane, who is also known as the API Evangelist, said. “It’s the one I love, and the one I hate the most out there. It shaped the Obama presidency. It shaped the Trump presidency. It shaped the Arab Spring. It’s helped university researchers.” He pointed to the Twitter API team reaching out to universities to understand how they leverage the open API to do COVID-19 research.
“Twitter is very much the nerve center system of the world.” — Kin Lane, Postman
Perhaps an unpopular opinion, but Lane would like to see Twitter regulated like a utility.
Lane is not advocating against teams charging for API usage, but, he emphasizes, Twitter was built on an open API ecosystem and provides a public service. When Twitter launched back in 2006, he recalls, there was very little to it, which is why, for the first five years, it relied heavily on the Twitter open API community and hackathons to collectively extend it.
Lane argues that, unlike the Meta social app suite, Twitter has become much more like a public square or town hall. More recently, he has collaborated on a COVID-19 tracking app, where U.S. counties published news via Twitter handles, which points to the open API’s public use. He has also collaborated to route people to shelters during Hurricane Sandy and other natural disasters, which points to the API’s enablement of social good. The Twitter API also plays an undeniable role in the political world, including the significant number of bots that use it.
Yes, eventually closing an open API to squeeze revenue out of it is what Lane calls the “API Playbook 101 for Businesses.” But Twitter had already created a premium business tier back in 2016, while still keeping its openly accessible API ecosystem.
“He [Musk] thinks it’s a strategic decision. Saving money, maybe making new revenue because you’re charging people rather than giving it away for free. So it’s cost savings, a new revenue stream, but then he’s missing out on the multitude of other ways it’s hurting their business,” Lane said.
As these decisions are always at the intersection of technology, business and politics, he believes creating unreliable APIs can be a business strategy for Twitter, “because you don’t want everybody using the same APIs. You don’t want people having reliability and stability and comfort when they’re competing with you,” so you stop making sure there are people caring for the APIs, or the people who run them, or their users.
“It takes human beings with empathy to run APIs properly, and I don’t think that exists today at Twitter,” Lane laments. But, for now, he says, “I’m not giving up on Twitter.”