Paris Is Drowning: GCP’s Region Failure in Age of Operational Resilience

The time is coming, and maybe sooner than we think, when regulators will require a standardized approach to resilience in the name of public good.
Apr 27th, 2023 1:15pm

Google Cloud Platform’s europe-west9 region outage is precisely the type of service failure that keeps the world’s government officials up at night. Their deepest concern is the potentially catastrophic impact a major cloud provider failure could have on financial institutions — and the very real-world problems and pain this would cause for their economies.

This concern is increasingly turning to action as different countries begin proposing technical requirements aimed at ensuring operational resilience for their financial institutions, and eventually other critical services like utilities, healthcare and transportation.

Trouble in Paris

The first Google Cloud Platform incident message advising trouble in GCP’s europe-west9 region went out on April 25 at 19:00 PDT: “We are investigating an issue affecting multiple Cloud services in the europe-west9-a zone…Customers may be unable to access Cloud resources in europe-west9-a.”

Initially, GCP advised customers to fail over to other zones within its Paris-based europe-west9 region while its engineering team investigated the issue. Over the next few hours, updates indicated that “water intrusion” in europe-west9-a led to an emergency shutdown of some hardware in that zone, and continued advising failover to other zones in the europe-west9 region.

That is, until 23:05 PDT: “A multi-cluster failure has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region. There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage. Customers are advised to failover to other regions if they are impacted.”

Cloud provider service failures are not all that uncommon. They are also generally limited in scope and over with before most users even notice. Yesterday’s GCP event, though, is a textbook-definition worst-case scenario: It was not just a quickie zone outage, but an entire cloud provider region failing.

The failure of an entire region for GCP (or AWS, or Azure) means that all of the data centers in a cloud service provider’s particular geographic region, in all of its availability zones, have gone offline. (Basically, regions and zones are cloud service provider terminology for the underlying physical resources provided in one or more physical data centers.)

When a full region failure occurs, all of the services hosted in that region become unavailable to users, including that cloud provider’s own platform services. Businesses may find their applications, services and websites are unavailable to their customers, leading to lost revenue and grumpy customers. Businesses that rely on a cloud provider for critical infrastructure services, such as data processing, machine learning or storage, face disruptions to their operations that can delay projects and decrease productivity, at the very least.
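GCP’s own advisory — “Customers are advised to failover to other regions” — assumes applications can actually do that. One common pattern is client-side failover across an ordered list of regional endpoints. The sketch below is illustrative only: the endpoint URLs and the `RegionDown` exception are hypothetical stand-ins, not GCP APIs.

```python
# Hypothetical sketch of client-side regional failover.
# Endpoint URLs below are made up for illustration.

REGION_ENDPOINTS = [
    "https://api.europe-west9.example.com",  # primary (Paris)
    "https://api.europe-west1.example.com",  # fallback
    "https://api.europe-west4.example.com",  # fallback
]

class RegionDown(Exception):
    """Raised when a regional endpoint is unavailable."""

def call_with_failover(request_fn, endpoints=REGION_ENDPOINTS):
    """Invoke request_fn against each endpoint in priority order,
    moving to the next region when one is down."""
    last_error = None
    for endpoint in endpoints:
        try:
            return request_fn(endpoint)
        except RegionDown as err:
            last_error = err  # this region is down; try the next one
    raise RuntimeError("all regions unavailable") from last_error
```

The design point is that the fallback order is decided before the outage, not during it — exactly the “thought ahead of time” preparation that the next section argues is so often missing.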

The Danger Is in the Data

“When cloud regions fail, like we saw with the europe-west9 region, if you haven’t thought ahead of time about how you’re replicating your data, it’s so easy to end up in a situation where your application is hard-down,” said Jordan Lewis, senior director of engineering at Cockroach Labs. “In that scenario, there’s really nothing you can do besides wait for the cloud provider to do their best to pick up the pieces.”

Beyond the initial crisis of downtime, though, there is also the long-tail potential damage to data integrity. A whole region going dark can lead to data loss or corruption, particularly when appropriate backup and recovery processes are not in place. This happens because data is usually replicated across multiple zones (remember that zones are logical representations of physical data centers within a region). So if all the data centers in a region fail simultaneously, there may not be a backup available to restore the data.
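Avoiding the “no backup available” case means replicating data to a second region, not just to another zone in the same one. A minimal illustration of the idea, with in-memory dicts standing in for regional storage buckets (no real cloud API is used):

```python
# Hypothetical sketch: write every object to a primary and a
# secondary region, so a full primary-region failure loses no data.
# The dicts stand in for regional object-storage buckets.

primary_region = {}    # e.g. europe-west9
secondary_region = {}  # e.g. europe-west1

def replicated_put(key, value):
    """Write to both regions before treating the write as durable."""
    primary_region[key] = value
    secondary_region[key] = value

def resilient_get(key):
    """Read from the primary, falling back to the replica."""
    if key in primary_region:
        return primary_region[key]
    return secondary_region[key]

replicated_put("invoice-42", b"payload")
primary_region.clear()  # simulate a full region failure
```

Real systems must also weigh the write-latency cost of synchronous cross-region replication against the data-loss window of asynchronous replication; the sketch shows only the structural point.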

In addition, if the outage results in data loss or corruption, which might not be immediately recognized, businesses can face the risk of legal liability, data breaches and compliance violations, to name but a few potential negative consequences. And of course, any of these could result in significant financial penalties or damages.

Operational Resilience as Mandate

The GCP europe-west9 region outage is precisely the type of service failure that is increasingly turning concern into action, as different countries begin proposing technical requirements aimed at ensuring operational resilience for their financial institutions.

The UK is leading the way in holding financial firms responsible and accountable for their operational resiliency.

“Financial market infrastructure firms are becoming increasingly dependent on third-party technology providers for services that could impact the financial stability of the UK if they were to fail or experience disruption,” said UK Deputy Governor for Financial Stability John Cunliffe in a joint announcement made by Bank of England, Prudential Regulation Authority (PRA) and Financial Conduct Authority (FCA) describing potential resilience measures for critical third-party services.

One of the keystone requirements: Regulators have instructed financial firms to meet operational resilience requirements, inserting governmental oversight into what used to be internal decision-making. So long as the results meet the required minimum level of operational resilience, CIOs are able to choose from scenarios that best suit their needs. Hybrid cloud (operating an additional physical data center to supplement their primary cloud infrastructure) and multicloud (running on multiple cloud provider platforms) are two of the options for satisfying these requirements.

Similarly, the European Union’s Digital Operational Resilience Act (DORA) seeks to establish technical requirements that ensure operational resilience in the face of critical service failures. As such, it is expected to apply to all digital service providers, including cloud service providers, search engines, e-commerce platforms and online marketplaces, regardless of whether they are based within or outside the EU. DORA entered into force in January 2023; with an implementation period of two years, financial entities will be expected to be compliant with the regulation by early 2025.

Ultimately, the full impact of GCP’s europe-west9 region outage will depend on the severity of the outage, its duration and the impact on critical services and data. Time will tell. But no matter what the fallout from this particular region failure, it is vivid validation of the reality that, while serious cloud provider outages are uncommon, they are also basically inevitable.

Operational resilience used to be viewed as part of business continuity planning, to be handled privately by individual companies. The time is coming, and maybe sooner than we think, when legislators and regulators will act to standardize the way individual companies approach operational resilience, all in the name of public good. Organizations need to start re-evaluating their tech infrastructure to ensure operational resilience is hardwired into their application architecture. If the countless cloud outages that have occurred over the years have taught us anything, it’s that this should no longer be a consideration, but a requirement.

Postscript: 36 hours after the initial outage and incident report, details are beginning to emerge. Apparently, a cooling system water pump failure caused water to accumulate and leak. That in turn is said to have flooded the data center’s battery room and caused a fire. It’s not immediately clear whether it was the data center’s fire suppression system or the actions of firefighters containing the blaze that caused the “water incident” that took the entire europe-west9 region offline. It looks like GCP now has two out of three zones back up, but europe-west9-a is out for the foreseeable future.

TNS owner Insight Partners is an investor in: Pragma.