Cloud Dependencies Need to Stop F—ing Us When They Go Down
We are building software faster and with more functionality than ever, thanks to an abundance of third-party cloud infrastructure offerings, APIs and SaaS tools. They are allowing software developers like us to soar.
But if these cloud dependencies go down, we go down with them. And because most vendors refuse to provide visibility into their platforms, we’re left scrambling and asking ourselves, “Is it me or them?” In short, we’re f—ed.
That’s why when we talk about the promises of the powerful cloud services at our fingertips, we also need to talk about their problems — including how vendors can stop screwing us over when they go down — and how we can mitigate their lack of visibility in the meantime.
Cloud Dependencies Are Awesome … Until They’re Not
Upstream cloud dependencies — that is, software such as Amazon Web Services, Auth0, GitHub, Twilio, etc. — are becoming increasingly popular and important. That’s because building on and with third-party cloud dependencies makes our software better. So we are increasingly turning to third-party cloud apps to power our products and run our businesses.
For example, a typical digital product might rely directly on 50 cloud products, which represent just a portion of the 130 cloud products the average digital business uses to power its entire business.
However, there’s an important problem we need to address with this innovation: Our reliability is greatly affected by the reliability of our dependencies. Let’s look at a common example of how this plays out.
Has This Ever Happened to You?
You’re on call and PagerDuty starts going off — something is clearly wrong with the core functionality of your product. You assemble an eight-person team and get to work.
Immediately, someone suspects it’s a specific third party that you have a hard reliance on. You check the status page of this service, and it says everything is fine. You have to keep searching.
Ten minutes pass, and support tickets are pouring in. All the metrics point to a dependency outage, but the status page is still green. Twenty-five minutes in and the incident risks becoming a Service Level Agreement (SLA)-violating event with financial consequences. With no obvious solution and an “all-clear” vendor status page, the team debates various ideas but doesn’t take action to remediate the issue.
Twenty-nine minutes in, the cloud vendor’s status page updates: Your colleague was right! But as usual, the status page update lacks details about the exact problem. Frustrated by the delayed and insufficient update, you initiate a failover plan that you wish you could have executed with confidence sooner.
Once everything returns to normal, you dismiss the team, feeling pissed off and gaslighted by the upstream cloud dependency. If you’re so dependent on these services, why can’t you have visibility into them like you do your own software?
More Cloud Dependencies = Less Reliability
You’d think that if you rely on multiple services with 99.99% uptime you’d have a product with 99.99% uptime. But that’s not the case. In fact, when you add more services, the uptime you can safely offer actually goes down.
That’s because each product you introduce brings its own unreliability, however small, into your product’s reliability. The math depends on a few factors, since the result is a composite of each dependency’s availability. But in the simplest terms, if you add a hard dependency with 99.99% uptime, you need to subtract about 0.01% from the best possible uptime that your app can achieve.
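As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes every dependency is a hard dependency and that failures are independent, so the availabilities simply multiply. Real systems are messier, and the function names are ours, not from any library.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(availabilities):
    """Best-case availability when every listed dependency is a hard dependency.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Downtime per year implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Every dependency is assumed to offer 99.99% uptime.
    for count in (1, 5, 10, 50):
        ceiling = composite_availability([0.9999] * count)
        print(f"{count:>2} hard dependencies: {ceiling:.4%} ceiling, "
              f"about {downtime_minutes_per_year(ceiling):.0f} minutes of downtime per year")
```

Under these assumptions, ten such dependencies cap you at roughly 99.9%, and 50 of them at roughly 99.5%, before your own code has had a chance to fail.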
And while 0.01% might seem insignificant, it adds up with every service you use. According to research by the Uptime Institute, 70% of all major SaaS outages are connected to upstream cloud dependency issues. In fact, depending on the product or application, engineering teams may find that between 25% and 70% of all alert-able incidents come from third-party cloud dependency issues.
No Visibility = We’re F—ed During Outages
So it would seem that, given our reliance on cloud vendors, visibility into them would be a high priority. Unfortunately, right now that visibility is limited at best.
What’s more, current observability tools focus internally — on first- and second-party signals — forcing us to infer cloud-service health. As a result, we can’t answer the “us vs. them” question in an efficient, timely manner.
Vendor status pages are updated manually, if they are updated at all. Consequently, status pages are updated on average 29 minutes after issues start, and they are only updated for the most serious of issues. Most of us have to turn to Twitter and Hacker News for up-to-date information about our vendors’ reliability.
This reality results in:
- Unnecessary downtime due to prolonged MTTD, MTTI and MTTR
- Inefficient, ill-informed incident response
- Missed opportunities to avoid incidents
And we haven’t even gotten to holding vendors accountable to their SLAs.
Vendor SLA Accountability? Never Heard of Her.
Another consequence of having no visibility into upstream service health is that those vendors hold all the power around reliability data and SLA compliance. Let’s think about this:
- They define SLAs to their own specs.
- They communicate about issues if/when they want.
- They don’t share metrics outside of status pages that are fraught with trust issues.
- They require customers to carry the burden of proof when things do go wrong.
And if we don’t have the data to know whether they’re even staying reliable by their own definition of reliability (which could be different from ours), we have no way to hold cloud vendors accountable.
It’s time this changed.
Software Makers Deserve Better
What do we need to achieve an appropriate level of visibility and avoid the pitfalls of dependency unreliability?
- Timely, detailed cloud dependency health metrics. We deserve a similar level of visibility into our third-party dependencies as we have for our first- and second-party software. Cloud vendors are just as critical as the software we write and operate, if not more so. We wouldn’t dream of operating our own services without proper monitoring, so why should we do any less for our upstream dependencies? Without cloud dependency health metrics, we are stuck trying to answer the “is it us or is it them” question without full context, despite cloud dependencies accounting for at least 25% of all incidents.
- Service status specific to us and our use of a product. Outages rarely take down an entire product, for all customers, at the same time. It’s not enough to say “Some customers are experiencing elevated error rates.” Instead, we deserve to know what functionality is affected, where, for whom and, most importantly, if our account/resources are affected. Don’t make us scramble an incident response with a vague status page update when it’s not necessary.
- A single place for all third-party service-health information. Time is critical during incident response, and we can’t waste time going back and forth between 15 different status pages, Twitter, Hacker News and our internal dashboards. We deserve a real-time and complete view of every third-party service that we depend on, in a single place, alongside our app metrics, logs and traces.
- Control over what, where and how we get visibility. Vendors have all the power, and they force us to come to them and rely on an often sh–ty status-page experience. In addition to having access to service health metrics, rather than just status-page updates, we should also have access to cloud dependency metrics in the same ways we access and control our own application metrics. We deserve first-class Slack integrations, native Datadog plugins, PagerDuty integrations, webhooks and more.
And how can we hold vendors accountable for their promised reliability?
- An independent SLA authority. Vendors hold all the power with SLAs, defining them how they want, reporting violations if they want, but forcing customers to prove violations if they want a refund. This is the fox guarding the henhouse, as the saying goes. With annual global SaaS spending topping $146 billion, we deserve an independent arbiter of truth for SLA compliance, written into our contracts.
- A vendor/customer partnership built on data. Too many times, we’ve sat in QBRs where cloud vendors and their customers disagree on reliability and each uses their own data to support their belief. Most cloud vendors want to do right by their customers, and most customers want to help their vendors increase reliability. Yet everyone holds their data close to their chest, and we don’t treat each other as trusted partners. Sharing reliability data should become the norm, building a bridge toward better reliability for all.
- Community benchmarks and baselines. Do you know if your traffic to a cloud dependency is being treated the same as every other customer’s? You probably don’t. Do you know how many outages to expect from a vendor over the year, or how fast they typically resolve them? Don’t bother looking at the status page for that information; it will mislead you. We don’t just deserve metrics about our own experience with a cloud service; we deserve to know how those metrics compare to other customers’ and to what is expected.
- Proactive incident communication. During incidents, vendors often know they have a problem but wait to update their status page until they have clear details of the incident, or worse, approval from a marketing executive. In these cases, a simple “We think we might have some sort of problem. Stay tuned for details” would go a long way, but it rarely happens. Similarly, vendors should admit to SLA violations themselves rather than wait for customers to catch them.
- A W3C standard for status pages. It is hard to imagine a world where status pages don’t play a part in creating greater and better visibility into cloud dependency health. As we turn to status pages, we deserve a consistent experience so we can find the data we need, clearly and quickly. Leveraging cloud dependencies to build software is as common as using an internet browser to access them, and status pages are a fundamental aspect of nearly every cloud product out there. A W3C standard for status pages can move the industry forward and make life better for both status-page publishers and users.
A Cloud Customer Bill of Rights would have a profound effect on our software reliability practices, but this reality is a long way off. Is it possible to get this kind of visibility today?
Filling the Void Ourselves
Visibility into third-party reliability should be a critical component of our own reliability efforts. The “is it us or is it them” question can and should be answered quickly and clearly. But until our cloud vendors give us what we deserve, there are a few things we can do ourselves to increase visibility and decrease risk.
We should be treating our third-party dependencies just like we treat our internal dependencies. Know exactly what dependencies you have, include them in your service catalogs and have runbooks for them. From there, we can begin to map out the actual availability we experience in order to understand which services introduce the most risk to our own reliability.
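One way to picture that catalog is the Python sketch below. It is only an illustration: the `Dependency` fields, the entry names, the runbook paths and the check counts are all hypothetical, and a real service catalog would hold far more.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str              # catalog entry (hypothetical names below)
    hard: bool             # True if our product is down when this is down
    runbook: str           # where responders should look first (hypothetical path)
    checks_total: int = 0  # health checks we ran ourselves
    checks_failed: int = 0

    def observed_availability(self):
        """Fraction of our own checks that succeeded; None until we have data."""
        if self.checks_total == 0:
            return None
        return 1 - self.checks_failed / self.checks_total


def riskiest_first(catalog):
    """Hard dependencies with data, ordered from lowest observed availability."""
    measured = [d for d in catalog if d.hard and d.observed_availability() is not None]
    return sorted(measured, key=lambda d: d.observed_availability())


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    catalog = [
        Dependency("auth-provider", True, "runbooks/auth.md", 10_000, 12),
        Dependency("sms-gateway", True, "runbooks/sms.md", 10_000, 3),
        Dependency("analytics", False, "runbooks/analytics.md", 10_000, 250),
    ]
    for dep in riskiest_first(catalog):
        print(dep.name, f"{dep.observed_availability():.2%}")
```

Soft dependencies are left out of the ranking on purpose: a flaky analytics vendor is annoying, but it does not cap the uptime we can promise.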
To get this visibility, you may need to employ a canary testing strategy, scripting continuous checks against critical third-party endpoints. Simple API pings won’t tell the full story of functionality, but multistep synthetic monitoring from an observability provider like Datadog can. Or, use a dedicated cloud dependency-monitoring solution such as Metrist or APImetrics. Regardless of the approach, it is our responsibility to collect the metrics we need so that we can respond with good information and partner with our vendors to improve their reliability.
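If you script the canary yourself, a multistep check might look something like this Python sketch. It uses only the standard library. The step names and `vendor.example` URLs are placeholders we made up, and it is not how Datadog, Metrist or APImetrics implement their checks.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    detail: str


def http_get_status(url, timeout=5.0):
    """Return the HTTP status of a GET request (raises on network errors and on 4xx/5xx)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_check(steps, fetch=http_get_status):
    """Run steps in order and stop at the first failure, like a user flow would.

    `steps` is a list of (name, url, expected_status) tuples.
    """
    results = []
    for name, url, expected in steps:
        started = time.monotonic()
        try:
            status = fetch(url)
            ok, detail = status == expected, f"HTTP {status}"
        except Exception as exc:  # timeouts, DNS failures, HTTP errors
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(StepResult(name, ok, time.monotonic() - started, detail))
        if not ok:
            break
    return results


# Hypothetical flow against a made-up vendor API.
STEPS = [
    ("list messages", "https://api.vendor.example/v1/messages", 200),
    ("fetch account", "https://api.vendor.example/v1/account", 200),
]
```

Calling `run_check(STEPS)` on a schedule and storing the results gives you your own record of what the vendor's service actually did, with timestamps you control. The `fetch` parameter exists so the flow can be exercised with a fake in tests.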
Speaking of partnering with our vendors, now is the time to build your relationship with them. Just as we should treat third-party services like internal services, we should treat our vendors as an extension of our team. Most cloud vendors want to do right by their customers. Establishing rapport can result in hands-on support during an outage and proactive collaboration before issues happen. Knowing which public cloud provider and region your dependencies rely on, and ensuring they don’t host their status page in the same place, can help you identify and reduce the risk you take on.
With expanded visibility and strong relationships with our vendors, we can:
- Evaluate and select vendors whose reliability meets our needs.
- Manage vendors toward better reliability and hold them accountable to SLAs.
- Avoid impact from third-party outages with warnings and automation.
- Reduce MTTR through a better-informed incident response.
- Reduce MTTD with direct monitoring and alerting on cloud-dependency service health.
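To make the warnings-and-automation idea concrete, here is a small Python sketch that turns recent canary results into a "healthy", "warn" or "failover" signal. The window size and thresholds are arbitrary assumptions for illustration; tune them to your own traffic and tolerance for false alarms.

```python
from collections import deque


class DependencyHealth:
    """Tracks recent canary results and decides when to warn or fail over."""

    def __init__(self, window=10, warn_failures=2, failover_failures=5):
        self.results = deque(maxlen=window)  # only the most recent checks count
        self.warn_failures = warn_failures
        self.failover_failures = failover_failures

    def record(self, ok):
        """Store one check result (True = passed) and return the new state."""
        self.results.append(ok)
        return self.state()

    def state(self):
        failures = sum(1 for ok in self.results if not ok)
        if failures >= self.failover_failures:
            return "failover"
        if failures >= self.warn_failures:
            return "warn"
        return "healthy"
```

A "warn" state could page the on-call engineer with the "it's them" evidence already attached, while "failover" could trigger the plan you would otherwise spend 29 minutes debating.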
Standing Up to Vendors
Our software is increasingly dependent on cloud vendors to function. We are in an interconnected world, and vendors need to act like it.
We pay handsomely for these services, which become a critical part of our infrastructure, and for the personnel who configure and troubleshoot them, but the transparency of these services has not kept pace with the advancements in observability tooling and reliability culture.
We should of course give these vendors some benefit of the doubt. None of them want to affect their customers’ reliability. Often, the software engineers who build the products we rely on have their hands tied. They want to programmatically update status pages, communicate quickly and share more details about their software’s performance, but they can’t because of their own observability limitations or a strict process around customer-facing reliability communication.
So we’re left asking friends and strangers on social media, “Is it me or them?” as our primary source of third-party visibility. It’s time for our cloud vendors to join us on this reliability journey so that we aren’t f—ed when they go down.