
Cyber Monday: Do You Know the Cost of Your System’s Downtime?

26 Nov 2018 10:15am, by

As Black Friday and Cyber Monday loom over eCommerce, threatening to take down your website with legions of bargain shoppers, chaos engineering firm Gremlin has calculated the exact cost of not preparing for this four-day shopping cart battle. Gremlin’s platform, created by former engineers at Amazon and Netflix, helps companies run chaos engineering experiments to avoid downtime and outages. The company has now created a nerve-wracking eCommerce cost-of-downtime calculator that tallies the mounting cost of downtime at the top 25 U.S. online retailers for as long as you stay on the page.

If Amazon were down for a whole ten minutes, it’d lose about $2 million. If the website of Walmart, America’s favorite brick-and-mortar retailer, were down that long, it’d be out $400,000 in online sales.

Why such a terrifying message at a time of seasonal joy? Well, if you work on site reliability engineering, technical support or QA on an eCommerce site, you’re going to be on-call this time of year anyway.

According to Deloitte, online sales will increase between 17 and 22 percent this holiday season. That’s a lot of extra traffic straining your servers.

Downtime, especially for an eCommerce site during peak holiday traffic, equates directly to lost revenue.

According to IDC, for the Fortune 1000, the average total cost of unplanned application downtime is somewhere between $1.25 billion and $2.5 billion annually. That puts the average cost of an infrastructure failure at about $100,000 per hour, while a critical application failure costs about $500,000 to $1 million per hour.

Gremlin explained that “Enterprise commerce businesses typically rely on a complex microservices architecture, from fulfillment to website security, ability to scale with holiday traffic, and payment processing.”

This leads to a lot that can go wrong, which will have customers quickly clicking somewhere else to find another sale that loads faster or doesn’t keep crashing. This is why downtime is so, so very costly for online stores. Evidently, for Amazon, even a second of downtime costs more than all the other top 25 retailers combined.

Systems resiliency can really help you decrease your shopping cart abandonment issues.

What Are the Basics of Systems Resiliency?

As software culture moves toward releasing faster and faster, and toward allowing almost anyone to build and release, systems are made up of many more small, disparate, distributed pieces of code. While that distributes the risk, it also complicates reliability.

The movement away from massive legacy beasts in favor of building with microservices and miniservices has led to the rise of whole careers that go past reactive customer support and one-way QA. Site reliability engineering and chaos engineering have become necessities in all enterprises, not just in banking, where they originated. These are whole jobs dedicated to preventing downtime and security breaches, another risk that seems to peak around the holidays.

Change of some kind is inevitably the cause of most outages. Site reliability engineering, or SRE, was created as a way to fuse the speed of IT with the deliberate stability of operations and to limit the liability of change. SRE is a job dedicated to thinking about the whole lifecycle of a piece of software, from design to decommission, and all the risks to its stability.

Chaos engineering, an offshoot of SRE, is the art of breaking things intentionally by throwing a poo-filled kitchen sink at your systems before life does. Principal SRE at Gremlin Tammy Bütow has already told The New Stack why it’s important to break your systems to understand what needs to be fixed.

But SRE and chaos won’t stabilize your systems alone. It’s important to have a really good incident management system in place and broadcast visibly around your company, contributing to improved communication and a shared understanding of your software. You also need really good monitoring in place, and even observability. You need your on-call rotation set up. You also need to decide which guinea pig service will be the first hit by rampant chaos. And, finally, to understand the value of these tools and processes, you need to have a really good idea of the business impact of your downtime.

How Do You Measure Downtime?

It’s important to consider the cost of your downtime early and often. With these numbers, developer teams can better persuade and “sell” the need for these tools and practices to the business side, which in turn will push DevOps, CI/CD and the faster, more stable release goals of the whole business.

Start by bringing development, business and customer-facing support or sales into a room together. Note how much your service costs your users: the higher the price, the greater the assumption of reliability and uptime. Next, what is your customer acquisition cost? Marketing can probably help with this one. Factor in that it’s five times harder to get a new customer than to keep a current one, making customer retention the key to long-term success. Plus, with bad reviews online, you won’t be attracting new customers for long anyway, so don’t forget to also account for loss of reputation.

Next, factor in the cost of productivity lost and of on-call engineers focusing on fixing things already built instead of making improvements. Some enterprise surveys have put that productivity loss at about two-thirds of the total cost.

If your downtime could be penalized because it breaks a privacy or security regulation, factor that fine in too.

Most important, you need to understand what you would be losing. For eCommerce, it’s fairly easy to calculate based on average sales per minute.
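As a rough illustration of the sales-per-minute approach, here is a minimal Python sketch. The function names and the sample store figures are hypothetical, and the calculation assumes sales are spread evenly across the year, which understates the loss during a holiday peak.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def sales_per_minute(annual_online_sales: float) -> float:
    """Average online sales per minute, assuming sales are flat all year."""
    if annual_online_sales < 0:
        raise ValueError("annual_online_sales must be non-negative")
    return annual_online_sales / MINUTES_PER_YEAR


def lost_revenue(avg_sales_per_minute: float, downtime_minutes: float) -> float:
    """Revenue lost while the store is down, at the average sales rate."""
    if avg_sales_per_minute < 0 or downtime_minutes < 0:
        raise ValueError("inputs must be non-negative")
    return avg_sales_per_minute * downtime_minutes


# Hypothetical store with $52,560,000 in annual online sales.
rate = sales_per_minute(52_560_000)   # 100.0 dollars per minute
outage_loss = lost_revenue(rate, 10)  # 1000.0 dollars for a ten-minute outage
```

For a holiday estimate, you would likely swap the flat yearly average for the average sales per minute you actually see during the peak period.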

Calculating the cost of your downtime isn’t precise. Gremlin offers this equation as a starting point:

R (lost revenue) + E (lost productivity) + C (what you owe to customers, like SLA) = COD (Cost of Downtime)
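The equation can be sketched in a few lines of Python. The class name, field names and dollar amounts below are illustrative assumptions, not part of Gremlin’s calculator; only the sum R + E + C comes from the equation itself.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Inputs to the R + E + C = COD starting-point equation."""
    lost_revenue: float       # R: sales that did not happen during the outage
    lost_productivity: float  # E: staff time spent on the outage instead of new work
    customer_credits: float   # C: what you owe customers, such as SLA credits

    def total(self) -> float:
        """COD: the sum of the three components."""
        return self.lost_revenue + self.lost_productivity + self.customer_credits


# Hypothetical ten-minute outage at a small online store.
estimate = DowntimeCost(
    lost_revenue=1000.0,
    lost_productivity=500.0,
    customer_credits=250.0,
)
cod = estimate.total()  # 1750.0
```

Harder-to-measure costs mentioned earlier, such as regulatory fines or loss of reputation, are not part of this equation and would need to be estimated separately.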

No matter how you calculate it, the discussion of what that downtime could cost and the importance of preventative measures like chaos engineering and site reliability engineering is essential to the future of your business — if for nothing else than breaking down silos between business and IT and starting a conversation.

