The Architect’s Guide to Multicloud Business Continuity
Modern infrastructure is all about availability.
- Consider an online mortgage business. A provider has 30 seconds to provide a quote to a prospective buyer or another provider will win the customer’s business.
- For online retailers, the weeks between Thanksgiving and New Year’s account for up to 25% of their annual revenue.
- The SLAs that customers expect from an enterprise SaaS provider are typically much higher than those guaranteed by the cloud infrastructure providers.
What all three have in common is that an infrastructure outage can cost millions of dollars in lost revenue and lost customers, and can even mean the difference between ending the year in the black or in the red.
Data Centers Will Go Down
This universal truth is independent of location. It holds whether the solution is running on-premises, in a colocation center or in the cloud.
On-premises solutions have dedicated hardware, but maintaining redundancy with sufficient geographical separation is too expensive. Public clouds mitigate this to some extent by providing multizone availability, but because they rely on shared services, they still go down, on average a couple of times a year.
What our customers continue to learn, often painfully, is that if your solution depends on several services or components, and most do, it takes only one of them going down for your application to be unavailable. Your availability is only as good as the weakest link.
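To put a number on the weakest-link effect, here is a minimal sketch that multiplies the availability of each dependency to estimate the availability of the whole. It assumes the dependencies fail independently and that the application needs all of them at once. The five availability figures are hypothetical, not measurements from any provider.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a system that needs every dependency up at the same time.

    Assumes independent failures, which is a simplification.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability, hours_per_year=8760):
    """Expected unavailable hours in a (non-leap) year."""
    return (1.0 - availability) * hours_per_year


# Hypothetical dependencies: compute, network, DNS, identity, data store.
dependencies = [0.999, 0.999, 0.9995, 0.999, 0.9999]

combined = serial_availability(dependencies)
print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {downtime_hours_per_year(combined):.1f} hours/year")
```

Under these assumptions the combined figure is always lower than the availability of the least reliable dependency, so every service added to the chain lowers the ceiling.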
Backups Are Not Enough
Backups are a solution for disaster recovery. However, the goal is to avoid the disaster in the first place.
With the amount of data generated on an ongoing basis, the sheer time it would take to recover a full data set in any mid-size to large company causes nightmares for the operations staff. Recovering from a backup is absolutely the last resort.
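A back-of-the-envelope calculation shows why. The sketch below estimates how long it takes to pull a full data set back over a network link. The data set size, link speed and the 70% sustained-efficiency factor are all hypothetical inputs, and the estimate ignores any time spent validating or re-indexing the restored data.

```python
def restore_hours(dataset_tb, link_gbps, efficiency=0.7):
    """Estimate hours needed to transfer a full data set over a network link.

    dataset_tb -- data set size in terabytes (10**12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- fraction of the nominal speed actually sustained (assumed)
    """
    total_bits = dataset_tb * 1e12 * 8
    sustained_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / sustained_bits_per_second / 3600


# Hypothetical example: 500 TB restored over a 10 Gbps link.
hours = restore_hours(dataset_tb=500, link_gbps=10)
print(f"{hours:.0f} hours, or about {hours / 24:.1f} days")
```

With these inputs the transfer alone takes roughly 159 hours, close to a week during which the application is unavailable.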
Today’s world is about business continuity. Applications need to be available all the time with little to no disruption to operations.
The hard requirement is to be able to fail over to another provider — and this could be another public cloud vendor or another on-premises destination.
The Data Store Makes the Difference
So, now it comes to choosing a data store. With so many out there, how do you make the decision?
The following list outlines our recommendations based on hundreds of discussions with customers, each with a slightly different set of objectives, resources and capabilities:
Avoid Application Lock-In
Vendor lock-in limits your options by definition and constrains the architect to backup-and-restore strategies. Avoid it, if not at all costs, then at nearly all costs. Vendor lock-in requires you to keep your application on a single vendor platform (public or private) and eliminates the option of cross-cloud failover. Your application workload needs to be portable. In other words, make sure it is built on de facto industry standard APIs. It is easy to get lulled into complacency when things are going well, but that is the time to invest in solutions for when they go wrong.
History is filled with such examples.
Avoid Cloud Lock-In
As challenging as it is to move from one vendor to another, it is even more challenging to move to another data center, whether it be a private or public cloud. It is equally challenging even if you don’t want to switch but only want to add another cloud.
Now is the time to revisit your application to prepare for the inevitable. This may require the ability to run on a competitive cloud or run that same workload on-premises. Either way, the initial building blocks are S3 and Kubernetes. While S3 needs to be your starting point, Kubernetes will make it seamless. Develop those muscles.
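One way to start building that muscle is to keep every provider-specific detail out of the application code. The sketch below reads the object store location from the environment, so the same container image can point at a public cloud or an on-premises S3-compatible store without a rebuild. The environment variable names and the example internal hostname are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectStoreConfig:
    endpoint_url: str
    bucket: str
    region: str


def load_config(env=os.environ):
    """Build the object store settings from environment variables.

    The variable names are illustrative, not a standard. Only the bucket
    is mandatory; the other two fall back to AWS defaults.
    """
    return ObjectStoreConfig(
        endpoint_url=env.get("OBJECT_STORE_ENDPOINT", "https://s3.amazonaws.com"),
        bucket=env["OBJECT_STORE_BUCKET"],
        region=env.get("OBJECT_STORE_REGION", "us-east-1"),
    )


# The same code, two hypothetical deployments.
cloud = load_config({"OBJECT_STORE_BUCKET": "orders"})
on_prem = load_config({
    "OBJECT_STORE_BUCKET": "orders",
    "OBJECT_STORE_ENDPOINT": "https://objects.example.internal:9000",
})
print(cloud.endpoint_url)
print(on_prem.endpoint_url)
```

S3 client libraries generally accept a custom endpoint (boto3, for example, takes an `endpoint_url` argument), so the resulting values can be passed straight to the client. In Kubernetes, the variables would typically come from a ConfigMap or Secret.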
Promote Hardware Flexibility
Business continuity that takes lock-in into account invariably creates hardware heterogeneity. This means selecting a data store that can operate seamlessly across infrastructure, whether at a service provider, in a colocation facility or in an on-premises data center. It also requires the ability to move workloads that depend on a diverse collection of drive types, including NVMe, SSD and HDD. Here again, object storage has an advantage beyond commodity hardware. Erasure coding reduces the number of drives by at least 50% compared with traditional stores like HDFS, without sacrificing any of the availability.
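The drive-count claim follows from simple arithmetic on storage overhead. The sketch below compares three-way replication, the HDFS default, with an erasure-coded layout. The 8 data plus 4 parity layout, the 1,000 TB capacity target and the 16 TB drive size are assumptions chosen for illustration; actual parity settings vary by deployment.

```python
import math


def replication_overhead(copies):
    """Raw bytes stored per usable byte when keeping full copies."""
    return float(copies)


def erasure_overhead(data_shards, parity_shards):
    """Raw bytes stored per usable byte with erasure coding."""
    return (data_shards + parity_shards) / data_shards


def drives_needed(usable_tb, drive_tb, overhead):
    """Whole drives required to hold usable_tb at the given overhead."""
    return math.ceil(usable_tb * overhead / drive_tb)


usable_tb, drive_tb = 1000, 16  # hypothetical capacity target and drive size

replicated = drives_needed(usable_tb, drive_tb, replication_overhead(3))
erasure_coded = drives_needed(usable_tb, drive_tb, erasure_overhead(8, 4))

print(f"3x replication: {replicated} drives")
print(f"8+4 erasure coding: {erasure_coded} drives")
print(f"reduction: {1 - erasure_coded / replicated:.0%}")
```

With these inputs, replication needs 188 drives and erasure coding needs 94, a 50% reduction. Layouts with a lower parity ratio reduce the count further, at the cost of tolerating fewer simultaneous drive failures.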
Build for Holistic Portability
When an application moves, the data store is not the only item in the stack that needs to move.
Application portability starts with containerization. Your workload should run entirely inside the container. Yes, you need persistence, but by having the full application stack, storage included, inside the container, portability becomes infinitely easier to manage.
Choose Lightweight and Fast
Portable applications are lightweight by definition. MinIO, for example, is around 100MB, half of which is the graphical console. That means it can fit into user space like Redis or Elastic. Lightweight is also fast. Not always, but in the case of minimalist designs it is. With fewer compromises and CPU-level optimization, you inherit the ability to run more workloads in, you guessed it, more places.
Design for Scalability
Your data is going to grow. It is important that the performance of your storage scales accordingly. Object storage’s scalability attributes are well known and come without any performance degradation. The ability to recover to another location, even at petabyte scale, is the benchmark of a well-designed implementation.
Strive for Operational Efficiency
Having a great development experience is only about 10% of the battle. Managing and operating the production environment is the rest.
Consumer applications are popular because they are simple to use. So why are enterprise applications so challenging? Because making something simple is actually quite complex.
Most DevOps teams spend a majority of their time managing the data store. What they really want is:
- Something simple that they can mostly set and forget.
- Something that has a command-line interface (CLI) for power users and enables automation.
- Something that has support for standard APIs so that the application can interact naturally without special connectors.
- Something that also has a graphical user interface that allows for ad hoc tasks, and enables new and junior resources to ramp up quickly.
- Something that has monitoring capabilities out of the box.
- Something that is easy to integrate with corporate identity and access management and with existing monitoring tools.
- Something that has notifications with out-of-the-box integrations for popular endpoints but can easily be extended.
If you want business continuity in the cloud, architect for operational simplicity.
Ensure Business Continuity With Multicloud Replication
The data needs to be available across clouds. This is the heart of the matter when it comes to availability and continuity.
Let’s start with the simple stuff. You have to be able to replicate buckets easily. This is table stakes. However, you should be able to replicate to any compatible object store. This is where a standard interface makes all the difference. For the simple stuff, you should not be locked in.
To truly enable business continuity, however, you need site or cluster replication. You should be able to set up active-active or active-passive configurations. And you should not be limited to having two copies. You should be able to replicate to as many sites as your business demands. And it should be the same process — simple and consistent.
You should also be able to set up a replication strategy that is tied to your business requirements. You should be able to choose whether the priority is real-time availability across the clusters or write performance. The former is synchronous: the write succeeds only if the object is persisted on ALL the clusters as a single operation, and otherwise it returns a failure. The latter is asynchronous: the object is guaranteed to be persisted to the primary data store, and replication to all the other clusters is queued.
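The difference between the two priorities is easier to see in code. The following is a toy, in-memory sketch and not how any particular object store implements replication. The `Cluster` and `Replicator` classes and their methods are hypothetical names invented for this illustration.

```python
from collections import deque


class Cluster:
    """Stand-in for one site; a dict plays the role of the object store."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.objects = {}

    def put(self, key, data):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.objects[key] = data

    def delete(self, key):
        self.objects.pop(key, None)


class Replicator:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.queue = deque()  # pending (replica, key, data) operations

    def put_sync(self, key, data):
        """Succeed only if every cluster persists the object."""
        written = []
        try:
            for cluster in [self.primary, *self.replicas]:
                cluster.put(key, data)
                written.append(cluster)
        except ConnectionError:
            # Simplified rollback: remove the key wherever it was written.
            for cluster in written:
                cluster.delete(key)
            return False
        return True

    def put_async(self, key, data):
        """Persist to the primary, then queue replication to the rest."""
        self.primary.put(key, data)
        for replica in self.replicas:
            self.queue.append((replica, key, data))
        return True

    def drain(self):
        """Apply queued operations; keep the ones that still fail."""
        still_pending = deque()
        while self.queue:
            replica, key, data = self.queue.popleft()
            try:
                replica.put(key, data)
            except ConnectionError:
                still_pending.append((replica, key, data))
        self.queue = still_pending
```

With one replica unreachable, `put_sync` returns `False` and leaves no copy behind, while `put_async` returns `True` immediately and the unreachable site catches up on a later `drain()` once it is healthy again. That is the trade: the synchronous path gives every site the same view at the cost of write latency and write availability, and the asynchronous path keeps writes fast while the other sites briefly lag.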
The modern business does not have a day off. True data availability ensures that it does not have to.