The Rise of Continuous Resilience
I sometimes ask a CIO whether they have a backup data center. Most will say yes, as it’s a normal part of a business continuity plan for disaster recovery. In some industries, like financial services, it’s a regulated requirement and there’s an annual visit from an auditor to make sure it’s in place.
When I ask how often they test their failover process, people start to look uncomfortable. Some admit that they have never tested failover, or that it’s too much work and too disruptive to plan and implement the test. When I ask what the failover test looks like, it’s normally one application at a time, in a planned migration from the primary data center to the backup. It’s rare for people to test an entire data center by cutting its power or network connections at an arbitrary time. I did hear once from a bank that had two data centers and switched between them every weekend so that one data center was primary on even-numbered weeks and the other was primary on odd-numbered weeks. If they ever had a problem mid-week, they knew what to do and that they could rely on it working smoothly. I’ve asked this question a lot over several years, and have had only a handful of good answers.
Disasters don’t happen very often, but when a data center loses power, drops network connectivity, loses its cooling system, catches fire or is flooded, the whole data center goes offline, usually at an inconvenient time and with little or no warning. During Hurricane Sandy, the storm surge flooded basements in Jersey City and Manhattan. It turns out that computers don’t work underwater, and they still don’t work once the water recedes, because they are then full of mud and other debris. Even if the data center isn’t in the basement, sometimes the backup generators are, or the fuel tanks for the generators, or some critical network equipment. There are lots of examples of disasters like this in the press, so why are so many companies that have a business continuity plan in the news when disaster strikes?
The short answer is that it’s hard to get a disaster recovery data center implemented, and too much work and too risky to test it frequently. Each installation is a very complex, fully customized “snowflake.” The configuration of the two data centers drifts apart, so that when the failover is needed the failover process itself fails and the applications don’t work. Even such basic things as backups for data need to be tested regularly, by attempting the restore process. I’ve heard of some embarrassing data loss situations, where the backups failed and this wasn’t discovered until months later when a database failed and a restore was needed.
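As a rough illustration of testing a backup by actually restoring it, here is a minimal Python sketch. It assumes the backup is a single file and that a checksum of the source data was recorded when the backup was taken. The function names `sha256_of` and `restore_and_verify` are hypothetical, and the file copy is only a stand-in for whatever restore procedure a real system uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup: Path, expected_sha256: str) -> bool:
    """Restore a backup into a scratch directory and check its contents.

    Assumption: copying the file stands in for the real restore step
    (for example, loading a database dump into a scratch instance).
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup.name
        shutil.copyfile(backup, restored)
        return sha256_of(restored) == expected_sha256
```

Run on a schedule, a check like this turns a silent backup failure into an alert within days rather than a surprise months later.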
I call this “availability theater” — everyone is going through the motions as if they had a real disaster recovery plan, but it’s all play-acting.
So how can we make this better? There are two technology trends coming together to create a more productized solution, one that is tested frequently enough to be sure it’s going to work when it’s needed. The combination of cloud computing and chaos engineering is leading to “continuous resilience.”
The last big project I worked on as Cloud Architect at Netflix, in 2013, was their active-active multiregion architecture. At that time we tested region failover about once a quarter. In subsequent years, Netflix found that they needed to test more often to catch problems earlier. They ended up doing region evacuation testing about once every two weeks.
In these tests, they drain all the traffic from a region and show that they can still run Netflix on the two remaining regions, and no one notices! (Netflix operates from AWS regions in Virginia, Oregon and Dublin.) The only way to get to this way of operating is to set up the failover testing first, then test that every application deployed into the environment can cope: a “Chaos first” policy.
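To make the idea of a region evacuation concrete, here is a small Python sketch of the capacity question behind it: if one region is drained, can the others absorb its traffic? This is an illustrative model only, not how Netflix implements evacuation. The traffic and capacity numbers are made up, the function names are hypothetical, and it assumes drained traffic is spread across the remaining regions in proportion to their capacity.

```python
from typing import Dict


def evacuate(traffic: Dict[str, float], capacity: Dict[str, float],
             drained: str) -> Dict[str, float]:
    """Return per-region traffic after draining one region.

    Assumption: the drained region's load is shared out in proportion
    to the capacity of each remaining region.
    """
    remaining = [region for region in traffic if region != drained]
    if not remaining:
        raise ValueError("cannot evacuate the only region")
    total_capacity = sum(capacity[region] for region in remaining)
    moved = traffic[drained]
    after = {drained: 0.0}
    for region in remaining:
        share = capacity[region] / total_capacity
        after[region] = traffic[region] + moved * share
    return after


def survives_evacuation(traffic: Dict[str, float],
                        capacity: Dict[str, float], drained: str) -> bool:
    """True if no remaining region exceeds its capacity after the drain."""
    after = evacuate(traffic, capacity, drained)
    return all(after[r] <= capacity[r] for r in after if r != drained)


if __name__ == "__main__":
    # Hypothetical load, in arbitrary units, for the three regions named above.
    traffic = {"us-east-1": 40.0, "us-west-2": 30.0, "eu-west-1": 30.0}
    roomy = {region: 60.0 for region in traffic}
    tight = {region: 45.0 for region in traffic}
    print(survives_evacuation(traffic, roomy, "us-east-1"))  # True
    print(survives_evacuation(traffic, tight, "us-east-1"))  # False
```

A model like this only says whether the headroom exists on paper. The point of the regular evacuation tests is to confirm it in production.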
The reason Netflix was able to implement this is that cloud regions are different to data centers in two critical ways. Firstly, they are completely API-driven, and the entire state of an AWS account can be extracted and compared across regions. Secondly, the versions and behaviors of the AWS services in each region don’t drift apart over time the way data centers do.
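The comparison step can be sketched in a few lines of Python. This example assumes the state of each region has already been extracted into a plain dictionary that maps a resource name to its configuration. The snapshot format, the resource names and the function `diff_region_state` are all hypothetical, and the sketch makes no AWS API calls.

```python
from typing import Any, Dict, List, Tuple

# Assumed snapshot shape: resource name -> configuration settings.
Snapshot = Dict[str, Dict[str, Any]]


def diff_region_state(primary: Snapshot, secondary: Snapshot) -> Dict[str, list]:
    """Report resources and settings that differ between two regions."""
    missing_in_secondary = sorted(primary.keys() - secondary.keys())
    missing_in_primary = sorted(secondary.keys() - primary.keys())
    mismatched: List[Tuple[str, str, Any, Any]] = []
    for name in sorted(primary.keys() & secondary.keys()):
        a, b = primary[name], secondary[name]
        for key in sorted(a.keys() | b.keys()):
            if a.get(key) != b.get(key):
                mismatched.append((name, key, a.get(key), b.get(key)))
    return {
        "missing_in_secondary": missing_in_secondary,
        "missing_in_primary": missing_in_primary,
        "mismatched": mismatched,
    }


if __name__ == "__main__":
    # Made-up snapshots for illustration.
    primary = {
        "api-lb": {"type": "load_balancer", "idle_timeout": 60},
        "orders-db": {"engine": "postgres", "version": "11"},
    }
    secondary = {
        "api-lb": {"type": "load_balancer", "idle_timeout": 30},
        "cache": {"engine": "redis"},
    }
    print(diff_region_state(primary, secondary))
```

An empty report from a check like this is the cloud equivalent of knowing that the two data centers have not drifted apart.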
Most customers will have a mixture of multiregion workloads. Some customer-facing services, like a mobile back-end that needs to be online all the time, can be built active-active, with traffic spread across multiple regions. Workloads like a marketplace, which needs a consistent view of trades, are more likely to operate in a primary region, with failover to a secondary after a hopefully short outage.
Netflix is also a leader in chaos engineering. It has a team that runs experiments to see what happens when things go wrong, and to make sure that the system has enough resilience to absorb failures without causing customer-visible problems. Nowadays more companies are setting up chaos engineering teams, hardening their systems, running game day exercises, and using some of the chaos engineering tools and services that are developing as the market matures.
AWS has been investing in our services to provide support for multiregion applications, for both active-active operation and primary-secondary failover use cases. In the last few years, we’ve added cross-region data replication and global table support to Amazon S3, DynamoDB, Aurora MySQL and Aurora PostgreSQL. AWS also acquired a disaster recovery service called CloudEndure, which continuously replicates application instances and databases across regions. We’ve also extended Amazon CloudWatch to support cross-account and multiregion dashboards.
As usual, we are listening to what our customers ask for, and our partners and Professional Services teams are working with customers as they migrate their business continuity plans to AWS.
Readers might also like the “Architecting resilient systems on AWS” session from AWS re:Invent 2019.