How to Supercharge Your Disaster Recovery Plan
It’s every engineer’s worst nightmare: Your cloud provider has a sudden outage, causing your system to fail, your product to malfunction and angry customers to tweet up a storm that your service is down. Such disruptions can seriously damage your credibility as a business and call the reliability of your product into question.
This scenario is on the mind of every engineer, and major cloud providers are taking note. In fact, AWS CTO Werner Vogels has talked about design-for-failure architecture, stating that a data center might be interrupted one day, no matter how good you or your cloud vendor are at data center operations. And of course, his predictions have come to fruition.
We’ve seen even the largest and most successful companies fall victim to outages; the recent failures at Facebook, Slack and AWS are some of the most prominent examples. While not all outages can be attributed to the cloud, the recent AWS example shows that having viable, proactive business continuity (BCP) and disaster recovery (DR) plans, along with runbooks for each, can make all the difference.
While BCP and DR are often grouped together, business continuity tends to be more common and less labor-intensive than disaster recovery. BCP generally refers to your typical cloud outage, while DR refers to a situation in which all your data is completely destroyed due to malicious actors or other destructive events. For BCP plans, it’s usually adequate to have more than one copy of your data and servers, whereas DR plans require you to have more backups and protocols in place.
Two other important factors to determine are your recovery point objective (RPO) and recovery time objective (RTO). RTO is the amount of time your business can afford to be offline while recovering from a disaster, whereas RPO is the amount of data, measured in time (for example, the last 24 hours), that you can lose to a disaster without damaging your business’s reputation or breaching your service-level agreement (SLA).
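To make the two objectives concrete, here is a minimal Python sketch. It assumes a hypothetical service that is backed up every six hours and takes 90 minutes to restore; the function names and all figures are illustrative assumptions, not taken from any particular tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """If a disaster hits just before the next backup runs, everything written
    since the previous backup is lost, so the worst case equals the interval."""
    return backup_interval


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a given backup schedule and restore time."""
    rpo_ok = worst_case_data_loss(backup_interval) <= rpo
    rto_ok = restore_duration <= rto
    return rpo_ok, rto_ok


# Hypothetical figures: backups every 6 hours, restore takes 90 minutes,
# the business tolerates 24 hours of lost data and 1 hour of downtime.
result = meets_objectives(
    backup_interval=timedelta(hours=6),
    restore_duration=timedelta(minutes=90),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=1),
)
print(result)  # (True, False): data loss fits the RPO, the restore misses the RTO
```

In this example the backup schedule is more than adequate, but the restore process would need to be faster (or the RTO renegotiated) before the plan could be considered sound.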
So now that these important factors have been established, how can you safeguard your organization in the event of another cloud outage or any other issue that may arise? Here are some steps you can take to prepare and restore your service once the worst-case scenario occurs.
Create a Multiavailability Zone Deployment
The easiest and most common architecture for BCP is to use at least two availability zones (AZs) within the same region. For example, on AWS, each region is built out of at least three AZs, which are located relatively close to one another and are connected via dedicated, low-latency fiber. This allows you to keep your service afloat so you can continue to serve your customers when one AZ fails.
Cloud providers tend to spread their managed services across multiple AZs (for example Amazon S3, Amazon DynamoDB, Google Cloud Spanner, etc.). Hence, those services are built to handle AZ failure by design.
Take into consideration that such architecture may involve inter-AZ costs that need to be accounted for during the design phase.
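As a rough illustration of why spreading across AZs keeps a service afloat, the sketch below places replicas round-robin across three zones and computes what is left when one zone fails. The zone names and replica count are hypothetical, and real schedulers use richer placement logic than this.

```python
from collections import Counter
from itertools import cycle


def spread_replicas(replica_count, zones):
    """Assign replicas to zones round-robin so that no zone holds more than
    one replica above any other zone."""
    zone_cycle = cycle(zones)
    return Counter(next(zone_cycle) for _ in range(replica_count))


def surviving_capacity(placement, failed_zone):
    """Number of replicas still running after a single zone fails."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


zones = ["az-a", "az-b", "az-c"]  # hypothetical zone names
placement = spread_replicas(6, zones)
print(dict(placement))                        # {'az-a': 2, 'az-b': 2, 'az-c': 2}
print(surviving_capacity(placement, "az-a"))  # 4
```

With six replicas over three zones, losing any single zone still leaves four replicas, so sizing the deployment to carry the full load on two-thirds of its capacity would let it ride out an AZ failure.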
Use a Single AZ in a Multiregion Deployment
In this scenario, you implement your application and databases in a single AZ in each of two different regions. This keeps your service available when one region experiences downtime.
These two deployments can be operated in one of two ways:
- Each region will service 50% of the workload using a load balancer or DNS (Domain Name System) routing.
- The main region will serve most or all of the traffic, and the second region will be there to serve the users in case of a failure. If you choose to go this route, you may want to automate this failover task.
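The two modes above can be sketched as a single weighted, health-aware routing function. This is a simplified model written for illustration: the region names, the weights and the `pick_region` helper are assumptions, and a real deployment would rely on the routing policies of its load balancer or DNS provider.

```python
import random


def pick_region(weights, healthy, rng=random):
    """Pick a region by weight, considering only healthy regions.

    weights: dict mapping region name to its relative share of traffic.
    healthy: set of region names currently passing health checks.
    """
    candidates = {r: w for r, w in weights.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    # If only zero-weight (standby) regions remain, send traffic to them anyway.
    if sum(candidates.values()) == 0:
        candidates = {r: 1 for r in candidates}
    regions = list(candidates)
    return rng.choices(regions, weights=[candidates[r] for r in regions], k=1)[0]


ACTIVE_ACTIVE = {"region-1": 50, "region-2": 50}   # each region serves 50%
ACTIVE_PASSIVE = {"region-1": 100, "region-2": 0}  # region-2 is standby only

# Normal operation: the passive region receives no traffic.
print(pick_region(ACTIVE_PASSIVE, healthy={"region-1", "region-2"}))  # region-1
# Main region down: traffic fails over to the standby automatically.
print(pick_region(ACTIVE_PASSIVE, healthy={"region-2"}))              # region-2
```

Treating failover as "drop unhealthy regions, then re-weight" means the same code path handles both modes, which is one way to automate the failover task rather than relying on a manual switch.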
Use a Multi-AZ and Multiregion Deployment
The most recent AWS outage took place in the Northern Virginia (US-East-1) region and affected the entire region because of a networking failure. If all your essential workloads were running in that region, your services were inevitably going to be affected. This means all your services would be out until the AWS services in that region were restored. Talk about putting all your eggs in one basket! This is a rare situation in which a network failure affected more than one AZ.
The best protection for such a scenario is to run different workloads and backups in various locations so if one region goes down, you can continue to serve your customers from a different region.
Of course, the more regions you’re running across, the more complexity is added to your environment and the more expensive your cloud bill can become. So be strategic about where and how many regions you want to build on, taking into consideration how critical your product actually is. Is it lifesaving? If so, you’ll need to go the extra mile to ensure your product functions in all situations and therefore warrants diversifying regions as much as possible.
Run on a Multicloud or Hybrid Deployment
Running on more than one cloud provider has become increasingly popular. But the reality is that it’s quite hard to maintain more than one cloud environment, and this challenge becomes even more difficult when you use managed services. For example, if you are using Amazon DynamoDB, other cloud providers do not offer a drop-in equivalent, so your application would need to be adapted for each cloud.
As a result, the industry trend is to run each workload on one specific cloud (on one multi-AZ or multiregion architecture), but enable different workloads to run on various other clouds — meaning you split some of the service between two different cloud providers. In such a scenario, not all the systems are down when an outage occurs.
Alternatively, another common practice is to have a DR site on premises. When customers migrate their workloads to the cloud, they tend to keep their on-premises environment as a DR site, so if something goes wrong, they will be able to run some of the critical services locally. This is a good practice when your company’s cloud maturity is in the initial stages and you are not using managed services.
While all the solutions above are ideal for BCP, they won’t cover you in a situation where your data is being encrypted or deleted, which would require a solid DR strategy. So in addition to having more than one implementation for your service, you may need to have a solid backup plan that will meet the company’s RPO and RTO requirements.
There are many third-party backup solutions and cloud backup services available, for example, AWS Backup, that can assist you with automating backups and saving them in a separate account and region. You should guard the backup account like you’re protecting a vault so it can withstand a situation in which you’re being attacked and your data is encrypted.
If such an attack occurs, the data on the backup account will be used to restore your services so you can continue servicing customers.
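One way to reason about whether a backup would actually help after such an attack is to check two properties: isolation and freshness. The sketch below is a simplified model with hypothetical account names, region names and types; it does not call any real backup service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Location:
    account: str
    region: str


@dataclass(frozen=True)
class BackupCopy:
    location: Location
    taken_at: datetime


def usable_for_dr(copy, source, now, rpo):
    """A copy counts toward DR only if it lives in a different account AND a
    different region than the source, and is recent enough to satisfy the RPO."""
    isolated = (copy.location.account != source.account
                and copy.location.region != source.region)
    fresh = now - copy.taken_at <= rpo
    return isolated and fresh


# Hypothetical setup: production in one account, backups copied to a vault account.
source = Location(account="prod-account", region="region-1")
now = datetime(2022, 1, 10, 12, 0)
rpo = timedelta(hours=24)

same_account = BackupCopy(Location("prod-account", "region-2"), now - timedelta(hours=2))
vault_copy = BackupCopy(Location("backup-account", "region-2"), now - timedelta(hours=6))
stale_copy = BackupCopy(Location("backup-account", "region-2"), now - timedelta(days=3))

print(usable_for_dr(same_account, source, now, rpo))  # False: attacker in prod could reach it
print(usable_for_dr(vault_copy, source, now, rpo))    # True
print(usable_for_dr(stale_copy, source, now, rpo))    # False: older than the RPO allows
```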
Now that we’ve covered the most common methods for constructing a BCP and DR plan for cloud environments, let’s discuss the tools that can be used to implement these methods:
Leverage Infrastructure as Code
Infrastructure as code (IaC) enables the automated configuration of your environment. Once you configure the parameters you want to use, they are saved into a master file, otherwise known as a manifest. From there, your environment can be automatically recreated for testing, disaster recovery or a variety of other situations.
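The core idea, that a manifest describes the desired state and tooling works out how to get there, can be sketched in a few lines of Python. The manifest layout and the `plan` function below are hypothetical and much simpler than what real IaC tools do.

```python
def plan(manifest, existing):
    """Compare the desired resources (manifest) with what currently exists
    and return the list of actions needed to converge on the manifest."""
    actions = []
    for name, spec in manifest.items():
        if name not in existing:
            actions.append(("create", name))
        elif existing[name] != spec:
            actions.append(("update", name))
    for name in existing:
        if name not in manifest:
            actions.append(("delete", name))
    return actions


# Hypothetical manifest describing a small environment.
manifest = {
    "web": {"type": "container_service", "replicas": 3},
    "db": {"type": "database", "multi_az": True},
}

# Recovering into an empty region: everything is recreated from the manifest.
print(plan(manifest, existing={}))
# [('create', 'web'), ('create', 'db')]

# An environment that already matches the manifest needs no changes.
print(plan(manifest, existing=manifest))
# []
```

Because the same manifest drives both cases, rebuilding in a new region after a disaster is the same operation as a routine deployment, just starting from an empty environment.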
Use Scaling Rules for Containers
If you’re using containers, implementing scaling rules based on various metrics can be tremendously helpful. You can scale up, which adds resources such as CPU and memory to existing instances, or scale out, which adds duplicate instances. By implementing scaling rules, you can quickly replace capacity that is lost in an outage so your important workloads keep running. Ideally, you’d need to scale both up and out on the container and instance level for this to be most effective.
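A metric-based scale-out rule can be sketched as follows. The proportional formula mirrors the one documented for the Kubernetes Horizontal Pod Autoscaler; the CPU target and the replica bounds are hypothetical values chosen for the example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """Scale the replica count in proportion to how far the observed metric
    is from its target, then clamp the result to the allowed range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Average CPU is 90% against a 60% target: scale out from 4 to 6 replicas.
print(desired_replicas(4, current_metric=90, target_metric=60))   # 6
# Load drops to 15%: scale in, but never below the configured minimum.
print(desired_replicas(4, current_metric=15, target_metric=60))   # 2
# A large spike is capped at the configured maximum.
print(desired_replicas(8, current_metric=120, target_metric=60))  # 10
```

The minimum matters for resilience as much as the maximum matters for cost: keeping at least two replicas means one can fail without taking the service down.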
Reroute DNS Requests
If servers in one location are down, you can reroute all requests to various other locations where you’re running your services. DNS providers such as Cloudflare can be configured to detect when a system is down and automatically perform geolocation-based rerouting.
Likewise, you can set up container orchestrators to define and automate rules for rerouting requests. We recommend using Amazon ECS to implement request rerouting, and implementing autoscaling as well to ensure availability.
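The detect-and-reroute behavior described in this section can be modeled with a small health tracker and a per-geography preference list. This is a conceptual sketch only: the class, the location names and the threshold of three failed checks are assumptions, not the configuration of any specific DNS provider.

```python
class HealthTracker:
    """Marks a location unhealthy after N consecutive failed health checks."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = {}

    def record(self, location, ok):
        # A successful check resets the counter; a failure increments it.
        self.failures[location] = 0 if ok else self.failures.get(location, 0) + 1

    def is_healthy(self, location):
        return self.failures.get(location, 0) < self.failure_threshold


def resolve(user_geo, preferences, tracker):
    """Return the first healthy location in the user's preference order."""
    for location in preferences[user_geo]:
        if tracker.is_healthy(location):
            return location
    raise RuntimeError("all locations are down")


# Hypothetical locations, ordered by proximity for each user geography.
preferences = {"US": ["us-site", "eu-site"], "EU": ["eu-site", "us-site"]}
tracker = HealthTracker(failure_threshold=3)

print(resolve("US", preferences, tracker))  # us-site

for _ in range(3):                          # three failed checks in a row
    tracker.record("us-site", ok=False)

print(resolve("US", preferences, tracker))  # eu-site
```

Requiring several consecutive failures before rerouting is a common way to avoid flapping on a single dropped health check.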
Set Up a Pilot Light to Run in Multiple Locations
Another recommended business continuity strategy is to run a pilot light, which is essentially a replicated version of your workload that’s running on standby in a different region. If a disaster occurs, all your data will be sitting there, ready to be set up. Simply deploy your infrastructure and scale your resources after an incident, and your product should be up and running without too much delay.
If you need to keep downtime to an absolute minimum, you may want to consider running warm standby instead. According to AWS, “The warm standby approach involves ensuring that there is a scaled-down, but fully functional, copy of your production environment in another Region. This approach extends the pilot light concept and decreases the time to recovery because your workload is always-on in another Region.”
While significantly more expensive, warm standby will be up and running faster than a pilot light, as there is no infrastructure setup needed.
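The difference between the two strategies comes down to which recovery steps remain after an incident. The sketch below adds up those steps using made-up durations; the numbers are purely illustrative and should be replaced with timings measured in your own DR drills.

```python
from datetime import timedelta

# Hypothetical step durations, for illustration only.
STEP_DURATIONS = {
    "deploy_infrastructure": timedelta(minutes=30),
    "scale_to_production": timedelta(minutes=10),
    "switch_traffic": timedelta(minutes=5),
}

# Pilot light must deploy infrastructure first; warm standby already runs it.
RECOVERY_STEPS = {
    "pilot_light": ["deploy_infrastructure", "scale_to_production", "switch_traffic"],
    "warm_standby": ["scale_to_production", "switch_traffic"],
}


def estimated_recovery_time(strategy):
    """Sum the durations of the steps a strategy still has to perform."""
    return sum((STEP_DURATIONS[step] for step in RECOVERY_STEPS[strategy]),
               timedelta())


for strategy in RECOVERY_STEPS:
    print(strategy, estimated_recovery_time(strategy))
# pilot_light 0:45:00
# warm_standby 0:15:00
```

Comparing these estimates against your RTO is a quick way to decide whether the extra cost of warm standby is justified.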
“Hope for the best and prepare for the worst” is one of the most important rules to live by when it comes to disaster recovery. There is no such thing as being too prepared, as cloud outages can happen to the best of us with no notice whatsoever.
By following the above strategies, you can ensure your service is available in multiple locations, can easily divert traffic to unaffected regions, and is backed up and ready for action should a disaster strike. Best of all, you can breathe easier knowing you won’t have to wake up in the middle of the night for any emergency configurations. Your business’s functionality and credibility are maintained to the highest standards.