How Bad Are System Failures and Security Incidents?
Really, really bad, according to the Second Annual Verica Open Incident Database (VOID) Report.
We all know things are getting bad out there when it comes to system failures and security incidents. Just ask anyone who depended on Azure or any of Microsoft’s dozens of services on Jan. 25. Recently, Verica, a company that uses Chaos Engineering to help companies make their systems more stable and secure, announced the results of its Second Annual Verica Open Incident Database (VOID) Report, which looks at the current state of incidents.
VOID is based on nearly 10,000 incidents from just under 600 companies, ranging from the FAANG and Fortune 100s to startups. Rather than simply collecting the usual corporate post-mortems and status updates, the VOID researchers dig into software incident reports scattered across the internet, from scrolling status pages to reports sequestered in obfuscated corners of company websites.
Verica does this and shares its findings because it can “keep us ahead of the potentially catastrophic consequences of a software-driven world.” They’re not wrong. Every day sees another problem, another failure. For every one like SolarWinds or Log4j that can’t be ignored, dozens of other, quieter failures nonetheless demand attention.
In this latest report, they found:
- No company is immune from incidents. Incidents happen in organizations of all sizes, from startups to the Fortune 10. Software is mission-critical in every possible industry, including banking, travel, agriculture, commerce, and more.
- Incident duration isn’t as cut and dried as it appears: there are many insightful metrics to measure in an incident. The duration of incidents conveys little meaning about the incidents themselves, in part because it can be very tricky to pin down when incidents start or stop.
- SREs and others in similar roles should retire Mean Time to Repair (MTTR) as a key metric. Why? Because MTTR isn’t a viable metric for the reliability of complex software systems for many reasons, particularly because averages of duration data lie.
- Organizations are moving away from shortsighted approaches like Root Cause Analysis (RCA). RCA appears to be on the decline everywhere as organizations move toward more meaningful metrics and analysis.
- Businesses should take the time and effort to invest in analyzing and writing up incidents. This practice helps organizations better understand their systems and how to make them less troublesome in the future.
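To see why an average of duration data can mislead, here is a minimal Python sketch. The durations are invented for illustration and do not come from the VOID report, and the helper name `summarize` is hypothetical. The assumption is simply that incident durations are skewed, with most incidents short and a few very long.

```python
import statistics

# Hypothetical incident durations in minutes (made up for illustration,
# not taken from the VOID report): nine short incidents and one long one.
durations = [12, 15, 18, 20, 22, 25, 30, 35, 40, 720]


def summarize(minutes):
    """Return the mean, the median, and how many incidents beat the mean."""
    mean = statistics.mean(minutes)
    median = statistics.median(minutes)
    shorter_than_mean = sum(1 for m in minutes if m < mean)
    return {
        "mean": mean,
        "median": median,
        "shorter_than_mean": shorter_than_mean,
    }


summary = summarize(durations)
print(f"MTTR (mean): {summary['mean']:.1f} min")
print(f"Median:      {summary['median']:.1f} min")
print(f"{summary['shorter_than_mean']} of {len(durations)} incidents "
      "were shorter than the mean")
```

In this made-up sample, a single 12-hour incident pulls the mean far above what a typical incident looked like, so a rising or falling MTTR may say more about one outlier than about how the team usually performs.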
Study the Detail
Some of this is surprising. But then, no one has studied so many incidents in such detail before. When you look closely at real data, you discover more about what’s really going on in any subject.
As Courtney Nash, lead research analyst at Verica and the VOID’s creator, said, “We were surprised to find no relationship between the length of an incident and how ‘bad’ it was. We have heard from many people who suspected that longer incidents were perhaps somehow worse/harder to resolve. Conversely, some people thought that for really severe incidents, a company would have all hands on deck and resolve such incidents more quickly. Companies can have long or short incidents that are very minor or quite serious, and every combination in between. Not only can duration not tell a team how reliable or effective they are, but it also doesn’t convey anything useful about the impact of the event or the effort required to deal with it.”
Want to know more? Download the full report. You may just learn something that will help you when — not if — your next problem erupts.