6 Lessons Learned from Netflix’s New Year’s Eve Outage
I recently had the opportunity to sit down with Jeremy Edberg, MinOps CEO and previously a site reliability engineer (SRE) for Netflix and Reddit, as well as Liran Haimovitch and John Egan, founders of Rookout and Kintaba respectively, to talk about incident management best practices.
In the tech industry, we feel like we are on the cutting edge, but in reality we are often playing catch-up to other industries. For example, the aviation industry has already learned that trying to reduce your incident count is counterproductive when trying to become more resilient. In reality, what you want is a culture that embraces incidents. A culture that files problems early and often, distributes the learnings, and in turn drastically reduces the chances for SEV0 or SEV1 disasters.
To open up the discussion, Jeremy talked about a Netflix outage he experienced in 2012. It was one minute past midnight on New Year’s Eve when he received an alert that user signups were broken. His gut was telling him that the issue must be time-related, but he was repeatedly assured that couldn’t be the case, as everything ran on Greenwich Mean Time (GMT) and anything time-related would have broken eight hours earlier.
After three hours of troubleshooting, they found the problem: The user signup flow required a new database table to be created once a year, because it logged signups in Pacific Time rather than GMT. No one had created the table before midnight, so the system broke when it couldn’t find the table it was looking for.
Jeremy was tempted to say “I told you so,” but as we all know, that isn’t productive during a retrospective. So in the interest of productivity, I’ve put together six takeaways about modern incident management from the conversation:
1) Trust your gut. Often it’s the fear of being wrong that prevents us from taking action. No one wants to posit an incorrect theory, let alone set off a fire alarm that pages everyone. But creating a positive incident culture means everyone should feel empowered to be open and speak up.
2) Declare early and often. Creating a positive incident culture also means that issues are filed early and often. This is also known as the “big red button”: since the 1950s, factories have had big red buttons that can be pressed at any time by anyone. The insight here is that you actually want your incident count to go up, by making it easier to declare incidents, because addressing them early will prevent them from snowballing into SEV0/SEV1 disasters.
3) Involve the entire organization. We have a tendency in the tech industry to silo the responsibility for resilience with SREs. But the truth is that problems come from everywhere, so any employee should be able to press that big red button. Yes, oftentimes a problem is identified inside a Datadog dashboard or PagerDuty alert. But it can also be flagged inside a support ticket or a customer complaint. Giving everyone the keys to declare an incident means that problems will be surfaced, and resolved, much faster. Moreover, once an incident is declared, it shouldn’t be the job of SREs to handle everything, including looping in customer reps, legal and PR as the incident unfolds. Adopting modern tooling should help orchestrate a lot of that process.
4) Developers should be on the hook for reliability. Bad code deployments are a leading cause of SEV0 or SEV1 incidents, according to Gremlin’s “State of Chaos Engineering Report.” Long gone are the days when developers write code, then throw it over the wall to operations. They need to have skin in the game and be on the hook for that code being reliable. This means adopting modern tooling like observability and live debugging for more effective troubleshooting and root cause analysis.
5) Automate what you can. As the Netflix story demonstrates, if creating a new table before midnight is a necessary, repeatable task, why not automate it? Jeremy explained in the conversation that because it was a simple task, everyone just assumed that someone else would do it. In an ideal world, predictable and repeatable tasks should be automated, saving the manual work of incident management for the truly unique, black-swan, unpredictable events.
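To make the point concrete, here is a minimal sketch of what automating that yearly task could look like. It is not Netflix's actual fix: the table name `signup_log_<year>`, the columns, and the use of SQLite are all assumptions chosen to keep the example self-contained. The idea is simply a job that can run every day and always makes sure next year's table already exists.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def pacific_now() -> datetime:
    """Current time in Pacific Time, the zone the tables are assumed to follow."""
    return datetime.now(ZoneInfo("America/Los_Angeles"))


def table_name(year: int) -> str:
    # Hypothetical naming scheme: one signup log table per calendar year.
    return f"signup_log_{year}"


def ensure_yearly_tables(conn: sqlite3.Connection, now: datetime) -> list[str]:
    """Create this year's and next year's tables if they do not exist yet."""
    names = [table_name(now.year), table_name(now.year + 1)]
    for name in names:
        # IF NOT EXISTS makes the job idempotent, so a scheduler can run it daily.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} ("
            "id INTEGER PRIMARY KEY, user_id TEXT NOT NULL, created_at TEXT NOT NULL)"
        )
    conn.commit()
    return names


# Demo with a fixed timestamp (UTC-8) so the result does not depend on today's date.
conn = sqlite3.connect(":memory:")
one_hour_before_midnight = datetime(
    2012, 12, 31, 23, 0, tzinfo=timezone(timedelta(hours=-8))
)
names = ensure_yearly_tables(conn, one_hour_before_midnight)
print(names)
```

In a real scheduled job you would pass `pacific_now()` instead of the fixed timestamp. Creating next year's table well ahead of time, rather than at midnight, means nobody has to remember the task at all.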
6) Read more postmortems. After every incident, there should be an effort to document, in at least one sentence, what happened and one takeaway action for preventing it from happening again. Too often these conversations happen around water coolers or stay in one engineer’s head, but especially in a remote world, it’s important that these learnings are documented and distributed. One big myth about modern incident management is that postmortems need to be long, complicated documents with multiple data fields. But the truth is, that often serves as a deterrent to reading the postmortem, or to writing it in the first place. Getting something down, even if it’s just a couple of sentences, from the person who was there when the incident happened is crucial to improving resilience. NASA is known for actually reading other companies’ postmortems, because the agency is hungry for more learnings that aren’t being generated internally. (Check out postmortem.io if you want to build up this habit yourself.)
As you can see, resilience is not just a companywide effort, but an industrywide one too. Reading about Netflix’s New Year’s Eve outage may very well convince some of you to make sure that a new database table is automatically created before midnight! These are the kinds of learnings we can share with one another, simply by being more open and transparent.
Listen to the full story of Netflix’s New Year’s Eve outage here: