Tuesday afternoon’s outage of Amazon Web Services’ S3 cloud storage service in its US-EAST-1 region, which may have lasted as long as four hours, has yet to be diagnosed. By Wednesday afternoon, one of several threads on Amazon’s support forum asking for a statement from the company remained officially “not answered.” One customer did leave this remark, though: “Being silent for a day after this huge outage is not cool at all. You need time, say it. Most of us will understand. We’re the ones who have our own post-mortems to complete after we hear what yours is.” (Editor’s note: AWS has since posted a summary of the root cause of the outage.)
But one AWS customer, one whose servers run in the affected region, does not have a post-mortem to conduct. Employee engagement service provider Spire was impacted by the outage, but Rob Scott, Spire’s vice president for software architecture, reported Wednesday that the company’s Kubernetes-driven deployment mitigated that impact almost instantly, moving its active nodes to other EC2 availability zones while his team sat back and watched with delight.
“Realistically this kind of failure on a day like Tuesday could have resulted in hours of downtime,” wrote Scott in a Medium blog post Wednesday. “With Kubernetes, there was never a moment of panic, just a sense of awe watching the automatic mitigation as it happened.”
Scott would have had a longer story to tell had there been more to it. As it turned out, by his account, Spire suffered zero downtime, thanks to its infrastructure’s state of readiness.
He credits the use of kops, Kubernetes’ command line tool for automating the deployment of entire clusters — including on EC2 — as critical to the resilience of Spire’s services during S3’s downtime.
“Kops makes it remarkably straightforward to provision multiple availability zones,” Scott told The New Stack. “As part of the create cluster command, you can simply choose the availability zones you’d like your instances provisioned in.”
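Scott did not publish the exact command Spire used, but a multi-zone cluster creation with kops looks roughly like the sketch below. The cluster name, state store bucket, zones, and node count here are all illustrative assumptions, not Spire’s actual configuration:

```shell
# Illustrative only: the name, state store, zones, and counts are assumptions.
kops create cluster \
  --name k8s.example.com \
  --state s3://example-kops-state-store \
  --zones us-east-1a,us-east-1b,us-east-1c \
  --node-count 3
```

Passing multiple `--zones` values tells kops to spread the cluster’s instances across those availability zones, which is the apportioning Scott describes.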
In his blog post, Scott advised readers to use kops to apportion nodes across multiple availability zones and to ensure that each node can fail over to another. The danger in not provisioning enough failover capacity, he said, is that utilization rates could cross Spire’s 60 percent tolerance level, conceivably rising to an eyebrow-raising 80 percent.
We asked Scott, what would the dangers have been for Spire had utilization crossed that threshold?
“When our utilization levels get past 80 percent, we start to trigger alarms because that just doesn’t leave us much room to grow,” the Spire VP responded. “Although 80 percent utilization likely won’t be very noticeable, 100 percent definitely would, and when you start to get close to that you’ve got to watch out.
“If you’re close to 100 percent utilization, a spike in usage on any system could mean other systems slow down or possibly even crash,” he continued. “By definition, your utilization will be somewhat varied, so it’s important to leave some room for spikes in usage. This will also depend on the size of your nodes. If you’re using something like a t2.medium instance on Amazon with 4 GB of RAM, 80 percent utilization means you really wouldn’t have much RAM left for any kind of spike. On the other hand, if you’re using an m4.2xlarge with 32 GB of RAM, 80 percent utilization would still leave you with over 6 GB of RAM. I guess that’s all to say, 80 percent is not some kind of perfect number for everyone; it’s about allowing some extra space for varied utilization.”
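Scott’s arithmetic is easy to check. The instance sizes are the ones he names; the headroom calculation below is simply the unused fraction of RAM:

```python
def ram_headroom_gb(total_ram_gb: float, utilization: float) -> float:
    """Return the RAM (in GB) left free at a given utilization fraction."""
    # Rounded to avoid floating-point noise in the displayed figures.
    return round(total_ram_gb * (1.0 - utilization), 2)

# A t2.medium (4 GB RAM) at 80 percent utilization leaves under 1 GB of headroom...
print(ram_headroom_gb(4, 0.8))    # 0.8
# ...while an m4.2xlarge (32 GB RAM) at the same utilization leaves 6.4 GB.
print(ram_headroom_gb(32, 0.8))   # 6.4
```

The same 80 percent threshold thus means very different absolute safety margins depending on node size, which is Scott’s point.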
Scott’s blog post also advises the use of readiness probes, another native Kubernetes feature that serves as a heartbeat mechanism. The orchestrator stops routing traffic to any pod whose readiness probe fails, and resumes once the probe responds again; the related liveness probe goes a step further, restarting a container that has stopped responding.
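Readiness probes are declared per container in the pod spec. A minimal sketch follows, assuming an HTTP health endpoint; the pod name, image, `/healthz` path, port, and timings are illustrative, not Spire’s actual settings:

```yaml
# Illustrative pod spec; names, image, path, and timings are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: example/app:1.0
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5   # wait before the first probe
      periodSeconds: 10        # probe interval
      failureThreshold: 3      # consecutive failures before marking unready
```

Until the probe succeeds, the pod is excluded from its Service’s endpoints, so no traffic reaches it.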
Would Spire have benefited from the use of a more comprehensive monitoring system, such as Prometheus? Or do readiness probes provide all the preventative measures that Spire needs?
“Even if a monitoring tool was able to redistribute nodes,” responded Scott, “I would be hesitant to give that kind of control to a tool like that. On the other hand, there definitely is significant value in powerful monitoring solutions like Prometheus. Kubernetes is not perfect, and there will certainly be times that a third party tool will pick up on anomalies that Kubernetes might miss. With all things in Ops, redundancy is key. Relying solely on Prometheus or readiness probes would be short-sighted. The best approach will always be one that still works if one of the pieces fails (whether that’s a server or a monitoring tool).”
A sizable chunk of the response to the S3 outage on social media boiled down to, “Chill out, people, these things happen.” Rob Scott’s story serves as some much-needed evidence that these things don’t happen — at least, not to organizations that have resilience plans in place.
[This article was edited to reflect clarifications made by Rob Scott Thursday afternoon.]
Feature image: The Survivor Tree, which survived the bombing of the Murrah Federal Building in Oklahoma City in 1995, by Dual Freq, licensed under Creative Commons.