
6 Scary Outage Stories from CTOs

29 Oct 2020 6:45am, by Adam LaGreca
Adam LaGreca is the founder of 10KMedia, a boutique PR agency for B2B DevOps. Previously, he was Director of Communications at DigitalOcean, Datadog, and Gremlin.

You’re sound asleep when the alarms go off. It’s 3 a.m. You wipe your eyes, check your phone. You know something is wrong. Very wrong.

The website is down. Your application is broken. The only light in the room is coming from your computer monitor. The Gremlin in the system can be hiding anywhere, and it’s your team’s job to find it.

And fix what’s broken, fast.

As someone who runs PR for various DevOps startups, I’ve seen this story play out over and over. The reputation cost alone of a major outage is enough to instill fear in even the most seasoned engineer!

But the truth is, every company experiences system failures. And we’re still a way off from online systems behaving like utilities, where you flip a switch and it just works. So sharing stories and normalizing failure (e.g. transparent and blameless postmortems) is a positive trend for the industry; it makes everyone feel less scared and alone.

I’m not going to cite generic numbers about the cost of downtime. For Amazon it may be millions per hour; for your company, it may be confined to a frustrating customer experience, if dealt with swiftly. But ultimately these kinds of situations lose businesses money, hurt reputations, drain engineering resources and fuel interest in the competition.

So in the spirit of Halloween, and more importantly in the spirit of sharing experiences so we can better prevent these failures in the future, let’s take a look at six scary outage stories, as told by CTOs themselves.

Charity Majors, CTO of Honeycomb


“Push notifications are down!”

“No, they aren’t.”

“No really, people are complaining — push is down.”

“Push can’t possibly be down. Our pushes are in a queue, and I am receiving pushes.”

“It’s been five days, and push is STILL down. People are filing all kinds of tasks.”

… so I reluctantly started poking around. All our push metrics looked relatively normal, every test push I sent was promptly delivered. Yet the support team was right — people had been steadily complaining for five full days about pushes not succeeding. What on earth could it be?

These were Android push notifications, and Android devices needed to hold a socket open to the server to subscribe to push notifications. We had tens of millions of Android devices, so we ran the push notification service in an autoscaling group. To load-balance connections across the group, we used round-robin DNS, and to increase capacity we would simply increase the size of the ASG [auto-scaling group]. Eventually, we figured out that the complaints had begun right around the last time we increased the size of the ASG, so that was a good clue. Another clue was that all the people complaining seemed to be in Eastern Europe. We asked a few of them to run a verbose trace, and that’s when we learned that the DNS record was coming back as … missing?

Turns out that when we increased the size of the ASG, the round-robin DNS record exceeded the UDP packet size. Normally this is no big deal; the protocol says the client should fall back to TCP in that case. And it did, for almost everyone. Except for users behind one major router in Romania. We delegated DNS for that record from Route 53 to a small local Python DNS server that let us return a random subset of four Android push notification servers, and everything was fine again. 💀
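The failure mode is easy to reason about with back-of-the-envelope math: a classic DNS-over-UDP response tops out at 512 bytes, and every A record in a round-robin answer adds a fixed cost, so growing the ASG eventually overflows the packet. A minimal sketch (the byte constants are rough approximations, not the team's actual numbers):

```python
# Rough sketch: estimate when a round-robin DNS answer overflows the
# classic 512-byte DNS-over-UDP limit (RFC 1035), forcing the truncated
# (TC) bit and a TCP retry that some middleboxes silently break.

HEADER_BYTES = 12     # fixed DNS header
QUESTION_BYTES = 30   # question section (name + type + class); depends on the name
A_RECORD_BYTES = 16   # compressed name ptr (2) + type/class (4) + TTL (4)
                      # + rdlength (2) + IPv4 rdata (4)
UDP_LIMIT = 512       # classic DNS-over-UDP payload limit

def response_size(num_a_records: int) -> int:
    """Approximate size of a UDP DNS response carrying N A records."""
    return HEADER_BYTES + QUESTION_BYTES + num_a_records * A_RECORD_BYTES

def fits_in_udp(num_a_records: int) -> bool:
    """True if the answer still fits without truncation."""
    return response_size(num_a_records) <= UDP_LIMIT

if __name__ == "__main__":
    # Growing the auto-scaling group grows the answer until it no longer fits.
    for n in (10, 29, 30):
        print(n, response_size(n), fits_in_udp(n))
```

With these numbers the cliff sits around 29 records, which is why returning a random subset of four servers, as the team ultimately did, keeps the answer comfortably inside the UDP limit for every client.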

Matthew Fornaciari, CTO of Gremlin


The outage occurred on a Friday afternoon, just as we were about to head out to Halloween Happy Hour. The page came in that we were serving exclusively 500s — a bad, bad experience for customers. After some digging, we realized that our hosts had filled up their disks, and we started failing because we couldn’t write logs (also scary because we were flying blind).

We ended up refreshing the hosts, implementing log rotation to prevent a recurrence, and creating an alarm to warn us if we were ever getting close again. But the most interesting thing we did was have one of our engineers write a new Gremlin for our platform: a disk Gremlin, so we could proactively exercise the fixes and make sure we never failed that way again. We then automated that test, and it still runs randomly in our production environment to this very day. 😱
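The "alarm before the disk fills" idea is simple to sketch. This is a hypothetical illustration, not Gremlin's actual tooling; the 80% threshold is an assumption:

```python
# Hypothetical sketch of a disk-usage guard: alert well before 100%,
# so logging (and with it, visibility) never silently disappears.
import shutil

def disk_usage_percent(path: str = "/") -> float:
    """Return used-disk percentage for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def should_alarm(used_percent: float, threshold: float = 80.0) -> bool:
    """Fire early: the threshold here (80%) is an illustrative choice."""
    return used_percent >= threshold

if __name__ == "__main__":
    pct = disk_usage_percent("/")
    if should_alarm(pct):
        print(f"ALERT: disk {pct:.1f}% full")
```

The complementary half of the fix — proactively filling the disk in production to prove the alarm and log rotation actually work — is exactly what the story's "disk Gremlin" does.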

Liran Haimovitch, CTO of Rookout


Remember that urban legend about a server going down every day, at the same specific hour? And after weeks of investigation, someone looked at the security camera footage… and found out that the maid was disconnecting the server to plug in the vacuum cleaner! Well, we all know that the Gremlin in the closet isn’t always as scary or mysterious as we initially think :)

Recently, we experienced something similar.

Several times a week, we’d see the backend’s latency metrics go through the roof. Each time we investigated, we noticed one of the tables getting locked and queries timing out all over. We wondered: Is one of our customers redeploying their application non-stop? The main suspect was a complex query that fetches the list of all our customers’ servers’ information, so they can choose which of them they’d like to debug. We started optimizing that query and saw huge improvements, yet those latency spikes kept happening.

Then a couple of weeks ago, while attending the weekly “Customer Success Briefing,” the latency spike was happening again and it hit me like a brick. I noticed a query that we barely used, from our application’s back office, that was really slow because we never prioritized fixing it (it was scarcely used). Apparently, our customer success manager had been collecting the data for the meeting, and every time the query didn’t return fast enough, he just kept hitting refresh and retrying. That rarely used query was locking up our database and challenging our customer success manager’s sanity! Looking back at the data, we confirmed that all of the latency peaks were aligned with Customer Success briefings. Eventually, after about 20 minutes of optimizing that query, everything returned to normal. 🎃

Daniel “Spoons” Spoonhower, CTO of Lightstep


It was a clear, sunny day in San Francisco. I was working at a small internet company, when suddenly our app stopped loading for me. Not just one view, but the whole app. Hard reload, but no luck. I looked around and my teammates were also confused; the app wasn’t working for them either. Our users weren’t complaining (yet?) but we started digging in anyway. No deployments had happened yet that day, no infrastructure had changed; yet it was broken consistently across OS types and browsers. What could have changed?

We found some errors in a critical (but boring-and-hadn’t-changed-in-forever) API call, without which the app wouldn’t load. But why were the errors only happening for people that worked at the company? And why now? It turned out that for internal users, the API returned some extra data…extra data that had been slowly growing over the last few weeks, until it had finally exceeded the request’s maximum payload size that afternoon. 👻

Lee Liu, CTO of LogDNA

The AddTrust Root Certificate Authority (CA) we relied on expired at roughly 4 a.m. Pacific Time on Saturday morning, May 30, 2020.

At the time, we were transitioning some of our infrastructure to Let’s Encrypt, a nonprofit certificate authority, as part of our move to Kubernetes. Legacy Syslog clients required AddTrust/UserTrust/Comodo. We run our own SaaS environment in addition to a number of worldwide environments for a major cloud partner. In our SaaS environment, a single certificate chain is used everywhere, including our ingestion endpoint, Syslog endpoint, and web app. We thought we were ready for this root certificate expiry… we were not.

Quick primer on certificate chains: All certificate-based security relies on chains of trust. Browsers and operating systems ship with these trust stores of root certificates.

LogDNA Chain: AddTrust Root CA (expired May 30) -> UserTrust CA -> Sectigo -> *.logdna.com

Modern browsers allow: UserTrust CA -> Sectigo -> *.logdna.com

UserTrust CA itself is also part of root trust stores of many browsers, so even if AddTrust is expired, it’s ignored since the chain leading up to the UserTrust CA is still valid.

Or so we thought.

Turns out, old legacy systems see only the LogDNA chain, which they consider invalid if any of the four certificates is expired. They also don’t recognize UserTrust as a trusted root certificate.

All of the support tickets we received mentioned that our v1 agent was no longer sending logs to our ingestion endpoints, but our v2 agent and other modern implementations of REST API-based clients were all working fine.

We erroneously started working on an update to our v1 agent. Ironically, our CI/CD provider also had an outage of their own due to the same AddTrust Root CA expiration, which further complicated our rollout of that agent. Once we realized that the issue was with the actual certificate chain and how older legacy systems behaved with that chain, we quickly rectified it by switching in a new certificate chain based on Let’s Encrypt. 🧟

Tina Huang, CTO of Transposit


Full-on site outages are horrible — but they don’t make your skin crawl the same way that random, unpredictable failures really can. I was working on the mobile web version of Twitter, and we got reports that, for some random unlucky campers, a scary error page appeared whenever they visited the site. For everyone else, the sky was blue and the birds were chirping. But now and again, someone else would get hit. And, once they were hit, they were stuck in a pit of despair, unable to read any tweets from their phone.

Slowly, as the number of these tarnished accounts increased, the 500s started creeping up to critical levels. We were able to see that the new library we were using failed to parse session cookies with a specific character. So every time you logged back in, you were rolling the dice on getting bit by this pesky bug, and you couldn’t be cured without the wizardly powers to reset your cookies on a phone. Eventually, we fixed the bug in the library, and everyone was able to go back to reading their tweets… which, as we know, can be a very scary thing on its own! 🕸️
