What You Can Learn from the AWS Tokyo Outage
In the movies, it seems like Tokyo is constantly facing disasters: natural ones in the form of earthquakes and tsunamis, and unnatural ones like giant kaiju and oversized robots. On the morning of Sept. 1, the mechanized behemoth was Amazon Web Services.
At around 7:30 am JST, AWS began experiencing networking issues in its AP-Northeast-1 region, based in Tokyo. The outage affected businesses across all sectors, from financial services to retail stores, travel systems and telecommunications. Despite being unable to access money, purchase goods, travel or call each other, the Japanese people demonstrated resilience, proving that at least some things from the movies are true. However, the financial losses due to the outage are expected to be huge.
After the six-hour outage, AWS explained the issue, noting: “This event was due to a problem with multiple core network devices used to connect network traffic using Direct Connect to all Availability Zones within the AP-Northeast-1 region.” AWS Direct Connect is a service that creates a dedicated network connection between AWS’s infrastructure and its customers’ own on-premises systems. But the details surrounding these “core network devices” and what problems they faced are still unclear.
Implications for Everyone
AWS’s outage was just another in a growing number of incidents caused by underlying networking issues, including Akamai’s outage in July and Fastly’s incident in June. When we consider these incidents and how massive their effects on businesses are, a paradox surfaces: Our systems are simultaneously becoming more robust and less reliable.
When we consider our computing resources now compared to what they were several years ago, it’s clear that they are more powerful and generally more reliable. Software and hardware have vastly improved. But the paradox is this: Our applications are more complex and now rely on more external services and systems, and this increases our overall susceptibility to failure. Whereas the individual pieces that we control may be robust, the dozens or hundreds of dependencies beyond our control have introduced more risk.
So how do you mitigate this risk?
The common solution is replication: add more availability zones, replicate into additional regions or use multiple cloud providers. Replication at the right scale is useful, and in this incident, replication to a region outside of AP-Northeast-1 may have been enough to mitigate the issues that many companies experienced. But replication also introduces additional complexity.
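As a rough illustration of what region-level replication asks of the calling code, here is a minimal Python sketch of client-side failover. The function name, the `request` callable and the second region identifier are assumptions made for the example; a real deployment would also have to handle data consistency between regions, which is where much of the added complexity comes from.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each region in order and return the first successful response.

    `request` is a hypothetical callable that talks to the replica of a
    service in the given region and raises ConnectionError when that
    region is unreachable.
    """
    last_error: Optional[Exception] = None
    for region in regions:
        try:
            return request(region)
        except ConnectionError as exc:
            # Remember the failure and move on to the next replica.
            last_error = exc
    raise RuntimeError("all regions failed") from last_error


# Illustrative stand-in for a real client: the primary region is "down".
def fake_request(region: str) -> str:
    if region == "ap-northeast-1":
        raise ConnectionError("region unreachable")
    return f"served from {region}"


response = call_with_failover(["ap-northeast-1", "ap-northeast-3"], fake_request)
```

The order of the list encodes the preference: the primary region is tried first, and the replica is only contacted when the primary cannot be reached.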
A more pragmatic approach is to use chaos engineering to proactively simulate network failures, get a better understanding of the ramifications and determine what response or replication efforts are necessary to mitigate the risk and impact of these failures.
Understanding the Chaos of Complexity
The term chaos engineering has come to encompass a long-standing practice of using failure injection to identify and test risks in technical systems. In fact, some of the earliest adopters of chaos engineering (before it was even given that name) were engineers at Amazon.
In its current practice, chaos engineering involves five steps that mirror the scientific method:
- Observe the system to identify potential risks and collect baseline data.
- Create a hypothesis of how the system will behave in response to a specific failure.
- Introduce the failure and collect data.
- Analyze the data.
- Share the results so the experiment can be replicated and iterated upon.
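The five steps can be sketched as a small experiment harness. This is a minimal Python sketch under stated assumptions, not how any particular chaos engineering tool works: the `Experiment` fields, the `run_experiment` function and the simulated `state` dictionary are all invented for illustration, and the "failure" is a flag flip rather than a real network fault.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict


@dataclass
class Experiment:
    hypothesis: str                 # step 2: what we expect to happen
    measure: Callable[[], float]    # returns one metric sample, e.g. error rate
    inject: Callable[[], None]      # step 3: introduce the failure
    rollback: Callable[[], None]    # undo the failure
    tolerance: float                # largest acceptable rise over the baseline
    samples: int = 5


def run_experiment(exp: Experiment) -> Dict[str, object]:
    # Step 1: observe the system and collect baseline data.
    baseline = mean(exp.measure() for _ in range(exp.samples))
    # Step 3: introduce the failure and collect data, always rolling back.
    exp.inject()
    try:
        during = mean(exp.measure() for _ in range(exp.samples))
    finally:
        exp.rollback()
    # Step 4: analyze the data against the hypothesis.
    held = (during - baseline) <= exp.tolerance
    # Step 5: return a report that can be shared and re-run later.
    return {
        "hypothesis": exp.hypothesis,
        "baseline": baseline,
        "during_failure": during,
        "hypothesis_held": held,
    }


# Simulated system (an assumption for the example): error rate depends on a link flag.
state = {"link_up": True}


def error_rate() -> float:
    return 0.01 if state["link_up"] else 0.40


experiment = Experiment(
    hypothesis="Error rate rises by no more than 0.05 when the link drops",
    measure=error_rate,
    inject=lambda: state.update(link_up=False),
    rollback=lambda: state.update(link_up=True),
    tolerance=0.05,
)
report = run_experiment(experiment)
```

In this simulated run the hypothesis does not hold, which is the useful outcome: the report records a gap between what was expected and what happened.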
The 2021 “State of Chaos Engineering” report published by Gremlin (with contributions from Dynatrace, Epsagon, Grafana Labs, LaunchDarkly and PagerDuty) notes that dependency issues were one of the most common causes of incidents. The researchers also found that there was strong uptake in simulating network outages to test for these issues.
Following the chaos engineering process to address this latest AWS outage would start with an observation of your current systems. Where are your on-prem services using AWS Direct Connect or other AWS services, and what risks would those services pose to your business or customers if they were affected by an AWS outage?
Crafting a hypothesis can be simple: What do you expect to happen when the outage occurs? However, as you document your expectations, it’s best to think more broadly about your systems and include monitoring, alerting, incident response actions, and business continuity/disaster recovery plans (BC/DRP) in your hypothesis.
There are many ways to inject failure: disconnecting Ethernet cables, adjusting networking configurations on servers or network devices, or using dedicated chaos engineering tools like Gremlin. The important consideration is to choose a method that allows you to quickly restore service if necessary. You don’t want to create an outage while trying to prepare for one.
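One way to keep restoration quick is to pair every injected fault with an abort condition. The following Python sketch is an assumption-laden illustration rather than a real tool's interface: `run_with_abort`, `apply_fault`, `revert_fault` and `is_healthy` are hypothetical names, and the fault here only appends to a list instead of touching a network.

```python
import time
from typing import Callable


def run_with_abort(
    apply_fault: Callable[[], None],
    revert_fault: Callable[[], None],
    is_healthy: Callable[[], bool],
    checks: int = 10,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Inject a fault, watch a health signal, and always revert.

    Returns True if the experiment ran for all checks, or False if it
    was stopped early because the health check failed.
    """
    apply_fault()
    try:
        for _ in range(checks):
            if not is_healthy():
                return False  # impact is larger than planned: stop now
            sleep(interval_s)
        return True
    finally:
        # Runs on normal completion, early abort, and unexpected errors.
        revert_fault()


# Illustrative run: the health signal degrades on the third check.
events = []
health = iter([True, True, False, True])
completed = run_with_abort(
    apply_fault=lambda: events.append("apply"),
    revert_fault=lambda: events.append("revert"),
    is_healthy=lambda: next(health),
    checks=4,
    interval_s=0,
    sleep=lambda seconds: None,  # skip real waiting in the example
)
```

The `finally` clause is the design choice that matters: whatever happens during the experiment, the revert step still runs.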
In my experience with chaos engineering, analyzing the data has come easily. Problems that arise from failures are apparent and typically fall into two categories:
- Things we can fix, such as configuring automatic failover or restarts, adding queuing and caching systems to further decouple systems, or improving retries, timeouts and circuit breakers.
- Things we can’t fix, but can create processes around. This includes improving monitoring and alerting to spot issues faster, updating playbooks and response plans, and improving communications for responders.
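Of the fixes in the first category, the circuit breaker is perhaps the least familiar, so here is a minimal Python sketch of the pattern. The class, its parameters and its thresholds are assumptions chosen for illustration; production libraries typically add thread safety, metrics and finer-grained half-open handling.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of waiting on a dead dependency.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to an external dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a cached or degraded response.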
One of the challenges with our complex systems is that they’re also dynamic. New code, new deployments, and new services are constantly changing the environment. So the final piece of the chaos engineering process is to share and iterate — not only so that we can keep improving, but also because we know our systems will be different tomorrow.
AWS’s outage in Tokyo is just one example of a major incident, but because modern applications are so reliant on third-party services, it’s important that we prepare for these and similar outages. It’s unlikely that we’ll ever reduce the complexity as our systems become even more interconnected, but by practicing chaos engineering, we can take proactive steps to better understand and mitigate the risks.