It’s been a rough couple of months for cloud service providers and the businesses that depend on them. Outages at Salesforce, LinkedIn and Twitter have disrupted sales and marketing operations. Stripe’s recent outage impacted millions of dollars in sales. Then there have been the Google and CloudFlare outages, impacting the many businesses that run some or all of their critical business software in the cloud.
Even if your business isn’t yet running in the cloud, the vendors and providers you work with can still go down and bring your business processes down with them.
Accepting the Inevitable: Everything Eventually Breaks
It’s not just that there are outages with cloud service providers, it’s how you find out about them. You may learn that something is broken from your customers, who are suddenly no longer able to access your service. Since status pages are not always up to date, you find out what’s broken from Twitter (if it’s up). Refreshing a Twitter search page while your business is losing money isn’t anyone’s idea of a good situation.
Ideally, cloud services just wouldn’t go down — that’s what you’re paying for, right? Unfortunately, no amount of money can buy 100% uptime, and even getting very close costs far more than most businesses are willing to pay.
Once we accept that cloud services will go down, we can start building both technical and business solutions to minimize that downtime. We use cloud service providers to offload work that isn’t core to our company’s value proposition and to leverage their technical expertise. Even if a cloud service means you don’t have to run a very complex distributed system yourself, your team still needs to build the observability that allows you to understand how well the provider is doing its job. If they’re not doing their job well, and your service or business is hurting as a result, then it’s critically important that you have a hard conversation with them.
Accepting that things will break is not the same as being O.K. with poor service.
Observability: The Best Imperfect Solution
In order to understand how healthy a cloud service is, you need to be able to measure:
- Where you’re asking the cloud service to do something for you.
- How long it took to do that thing.
- Whether it was successful in doing it.
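The three measurements above can be captured with a thin wrapper around each cloud-service call. Here is a minimal sketch in Python; the `put_object` function and the `"object_store.put"` operation name are hypothetical stand-ins for whatever provider calls your software makes:

```python
import time

# Collected measurements: one record per call to the cloud service.
measurements = []

def instrumented(operation, fn, *args, **kwargs):
    """Run a cloud-service call, recording what was asked, how long it took,
    and whether it succeeded."""
    start = time.monotonic()
    ok = False
    try:
        result = fn(*args, **kwargs)
        ok = True
        return result
    finally:
        measurements.append({
            "operation": operation,                   # what you asked the service to do
            "duration_s": time.monotonic() - start,   # how long it took
            "success": ok,                            # whether it succeeded
        })

# Hypothetical object-store call, used only for illustration.
def put_object(bucket, key, data):
    return f"{bucket}/{key}"

instrumented("object_store.put", put_object, "cat-pics", "whiskers.jpg", b"...")
```

In practice these records would go to your metrics system rather than an in-memory list, but the shape of the data — operation, duration, success — is the same.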
For example, if you keep pictures of cats from your pet store customers in a cloud object store, you’d want to know where specifically they’re being stored, when they’re retrieved, how long each of those actions took, and whether there was an error along the way. Be warned, though: the results of these measurements can be surprising. Cloud services regularly get a little slow or produce small bursts of errors. Perhaps you’ve previously written these off as unexplained bugs in your own programs, or as random complaints from customers, but really they are “small outages” or, more accurately, routine failures that you have to learn to plan for just like the bigger outages.
While basic observability will transform the way you look at your cloud providers, it’s often not enough to understand the impact on your business. You could see that 1% of the requests to a cloud service are failing, but without business context you wouldn’t see that that 1% impacts your largest and most important customers. This is why it’s important to look at expanding your observability to include “distributed tracing,” a tool that allows you to carry business context through your software’s usage of cloud services. With the context of a trace, your team will be able to see whether that 1% of failures is slightly annoying or a critical business problem.
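Production systems typically use a tracing library such as OpenTelemetry for this; purely as an illustration of the idea, here is a toy span recorder where each unit of work carries business context (the `customer_id` and `plan` attributes are hypothetical examples):

```python
import contextlib
import time

spans = []  # finished spans, newest last

@contextlib.contextmanager
def span(name, **attributes):
    """Record a named unit of work along with business context attributes."""
    record = {"name": name, "attributes": attributes, "error": None}
    start = time.monotonic()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.monotonic() - start
        spans.append(record)

# With business context attached to each span, a 1% failure rate can be
# broken down by customer or plan instead of staying an anonymous number.
with span("object_store.get", customer_id="acme-corp", plan="enterprise"):
    pass  # the actual cloud-service call would go here

enterprise_spans = [s for s in spans
                    if s["attributes"].get("plan") == "enterprise"]
```

The payoff is in the last line: once failures carry attributes like `plan`, “1% of requests failed” becomes “1% of requests failed, and they were all enterprise customers.”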
Adding Resiliency to Business Processes
Another way to improve your business’ resilience to cloud service failure is to create responses to failure in all of your business processes. These responses can cover a wide range of actions:
- “I tried to use this service and it failed, I’m going to wait a second and try again”
- “This is a critical transaction for my business so I’m going to store it in three different places, always cross check them, and immediately page someone if they don’t match”
- “If our provider is down more than 15 minutes, we will change our web page and ask people to call us to let them continue to do business with us.”
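The first response on that list, waiting and trying again, is usually implemented as a retry loop with exponentially growing delays so that a struggling provider isn’t hammered harder. A minimal sketch, where `flaky_call` is a stand-in for any provider call:

```python
import time

def with_retries(fn, attempts=3, base_delay_s=1.0):
    """Call fn; on failure, wait and retry with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; let the failure surface
            time.sleep(base_delay_s * (2 ** attempt))

# Stand-in for a provider call that fails twice, then succeeds.
calls = {"count": 0}
def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("provider unavailable")
    return "ok"

result = with_retries(flaky_call, attempts=3, base_delay_s=0.01)
```

The other two responses are business-process decisions more than code, which is exactly the point: not every failure response lives in software.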
Of course, the specific right steps for your business depend on what providers you use and your business processes.
Once you’ve created responses to failure, like anything else, they need to be tested. Why? Because an untested plan is unlikely to succeed. Complications happen, and people might not have the information or access needed to execute the plan. Practicing responses to failure in ways that are as realistic as possible will improve the resilience of your business and your team. You’ll gain confidence in your ability to handle diverse situations with grace. (See: Chaos Engineering)
The end result of implementing observability and failure testing for your cloud providers is confidence that the business impact of the next major outage will be minimal. By using distributed tracing, you’ll immediately be able to understand the business context of each failure, so you can make informed and measured choices that increase customers’ confidence in your business. And when the next cloud outage hits, you’ll be ready to manage — and learn from — the process.