What Is Operational Resilience?
The digital domain was once considered a bastion of reliability, with businesses and organizations trusting cloud service providers to keep their operations humming without interruption. However, the narrative is evolving. A series of recent incidents underscores the vulnerabilities and the far-reaching consequences of major outages.
Google Cloud Platform’s Europe-west9 outage: In April, Google Cloud’s Europe-west9 region experienced a full-scale outage that persisted for an entire day. This disruption, triggered by a fire and subsequent water damage in a Paris co-location data center, sent shockwaves through GCP, with zones and services gradually recovering over several days.
AWS’s US-east-1 outage: In June, Amazon Web Services faced a crippling outage in its US-east-1 region. An internal DNS and monitoring systems failure, caused by traffic congestion following an automatic network scaling operation gone awry, set off a chain of connection errors and retries. The impact was immediate and widespread, affecting millions of users and businesses relying on this critical region.
The scope of these incidents cannot be overstated. Businesses, schools, hospitals, government agencies and countless others suddenly found themselves in the midst of operational chaos, raising a fundamental question: How can we ensure business continuity in an environment where cloud service outages are becoming increasingly commonplace?
The answer lies in the concept of operational resilience, a strategy that empowers organizations to adapt and respond to disruptions while maintaining continuous operations, ensuring customers experience minimal to no disruption — even when the world around them is in turmoil.
As cloud service provider outages continue to rise, the need for operational resilience has never been more critical. Here’s an explanation of the intricacies of operational resilience, its importance and strategies for achieving it.
Operational Resilience: A Promise to Customers
Operational resilience revolves around the principle of continuity: the business and its core functions persist despite the challenges they face. It’s a promise to customers that their experiences will be uninterrupted, no matter the disruptions lurking in the background.
The significance of operational resilience extends beyond merely keeping the lights on; it’s about delivering products and services with unwavering reliability, even during the most trying circumstances.
Risks and Challenges
Operational resilience faces many challenges, from the mundane to the extraordinary, each capable of causing disruptions. Among the risks:
Technical failures. These may include hardware malfunctions, software glitches, or issues with the infrastructure. Such failures can have a cascading effect on an organization’s ability to provide consistent services.
Cyberattacks. Cyber threats are becoming increasingly sophisticated. Attacks like distributed denial of service (DDoS) or data breaches can compromise data integrity, accessibility and overall service reliability.
Natural disasters. Earthquakes, floods, fires and other natural disasters can disrupt data centers and infrastructure, leading to prolonged service outages.
Supply chain disruptions. Organizations rely on complex supply chains. Any disruption within this web, whether due to geopolitical events or logistical challenges, can lead to service interruptions and economic losses.
Operational resilience matters especially to financial institutions, as their operations are intrinsically linked to the global economy. A region-wide outage could have a catastrophic impact on financial stability.
If a major cloud service provider failure were to paralyze a significant bank for an extended period, it could halt millions of transactions, with an impact on consumers and businesses alike. The economic ramifications of such an event could be profound, underscoring the critical need for operational resilience in the financial sector and beyond.
Operational Resilience vs. Business Continuity
Operational resilience and business continuity are closely related concepts, but they are not synonymous. To illustrate their difference, consider a familiar analogy: a video game.
Operational Resilience: Seamless Gameplay
Imagine you’re playing a video game where you’re in the midst of an intense boss battle. Suddenly, the game crashes. In an operationally resilient setup, the game has been designed to handle such disruptions seamlessly. You press a button, and you’re right back in the action, almost as if nothing happened.
The player, in this case, represents the end users of your services. They experience minimal to no disruption in their interaction with your organization, even when challenges arise.
Business Continuity: Loading from a Save Point
Now, think of business continuity as a video game that focuses on ensuring you can pick up where you left off after a disruption. When the game crashes, you need to reload from a saved point, potentially losing some progress.
In the context of business, this means that critical functions may pause briefly during an event, but there are mechanisms in place to recover and continue operations as soon as possible. Business continuity is a subset of operational resilience, primarily concerned with maintaining essential functions during and after a disruptive event.
In essence, operational resilience aims to prevent end-user disruption during unforeseen challenges, making it feel as if nothing went wrong from the user’s perspective.
Business continuity, on the other hand, acknowledges that disruptions may occur but focuses on minimizing downtime and ensuring the rapid recovery of essential functions. Both concepts are essential in their own right, contributing to an organization’s ability to navigate adversity in the digital age effectively.
Planning for Resilience
Operational resilience is a concept that extends far beyond mere customer satisfaction, transcending into the realms of economic stability and global impact. It’s a linchpin that holds the intricate machinery of modern society together.
Rare as they may be, major cloud service provider outages are inevitable — and likely to get more frequent due to climate change and other factors. Even the most reliable providers are not immune to disruption.
Hence, operational resilience calls for robust planning and proactive measures to ensure that an organization can withstand the storm when it arrives. Waiting for such rare events to occur is not an option; preparedness is the key to minimizing their impact.
Hardwiring Resilience into Application Architecture
Operational resilience cannot be achieved through words alone; it must be embedded into the very architecture of an organization’s applications. This means that businesses must make it a fundamental part of their design and strategy.
To truly ensure operational resilience, it’s vital to acknowledge the limitations of relying on a single cloud provider and the difficulties of switching providers.
Integrating resilience. Operational resilience should be integrated into the architecture of every application. Systems must be designed with resilience as a core principle. Waiting until a disruption occurs is too late; proactive preparation is the key.
Limitations of single cloud providers. Many organizations have traditionally relied on a single cloud provider for their needs. This approach has been popular due to its simplicity and cost-effectiveness.
However, the downside is that it inherently lacks the robustness necessary for operational resilience. A single cloud provider cannot offer the cross-provider redundancy and failover that come with multicloud or cloud-agnostic strategies.
Challenges of switching providers. Migrating from one cloud provider to another is not as straightforward as it may seem. The assumption that an application can easily lift and shift from one provider to another can be deceptive. Different cloud providers have proprietary interfaces and architectures, making the transition complex and time-consuming.
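To make the redundancy-and-failover idea concrete, here is a minimal sketch of a client that tries providers in order and falls back when one is unavailable. Every name in it (`FailoverClient`, `ProviderUnavailable`, `provider_a`, `provider_b`) is hypothetical, and the two handlers are in-process stand-ins rather than real cloud SDK calls. A production setup would also need health checks, timeouts and data replication, none of which are shown.

```python
from typing import Callable, List, Tuple


class ProviderUnavailable(Exception):
    """Signals that a provider could not serve the request (hypothetical)."""


class FailoverClient:
    """Tries each backend in order; the first entry acts as the primary."""

    def __init__(self, backends: List[Tuple[str, Callable[[str], str]]]) -> None:
        self.backends = backends
        self.attempts: List[str] = []  # records which providers were tried

    def call(self, request: str) -> str:
        failures = []
        for name, handler in self.backends:
            self.attempts.append(name)
            try:
                return handler(request)
            except ProviderUnavailable as exc:
                # Remember the failure and move on to the next provider.
                failures.append(f"{name}: {exc}")
        raise ProviderUnavailable("all providers failed (" + "; ".join(failures) + ")")


def provider_a(request: str) -> str:
    # Stand-in for a provider that is experiencing a region-wide outage.
    raise ProviderUnavailable("region outage")


def provider_b(request: str) -> str:
    # Stand-in for a healthy secondary provider.
    return f"provider_b handled {request}"


client = FailoverClient([("provider_a", provider_a), ("provider_b", provider_b)])
result = client.call("payment-123")
```

In this sketch the caller never sees the failure of the first provider; it only sees the result from the second. With a single backend in the list, the same outage would surface as an error, which is the limitation described above.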
Benefits of Cloud-Agnostic Architecture
In the face of these challenges, the concept of cloud-agnostic application architecture emerges as a compelling solution. It means ensuring every component of an application is platform-agnostic.
Cloud-agnostic architecture offers a trifecta of advantages: scalability, flexibility and operational resilience. This design facilitates easy scaling to meet specific business needs, allowing for the dynamic allocation of resources. Its inherent flexibility enables the addition or replacement of various services and platforms without necessitating major code overhauls.
Perhaps most crucially, cloud-agnostic architecture inherently bolsters operational resilience by ensuring interoperability across diverse cloud service providers. Every component of an application is rendered platform-agnostic, from databases to compute resources and data storage, functioning seamlessly across various providers.
This approach not only mitigates vendor lock-in concerns but also aligns perfectly with anticipated future regulatory requirements in the ever-evolving landscape of operational resilience.
In a world where resilience is a non-negotiable asset, transitioning to cloud-agnostic architecture transcends strategic choice — it becomes a necessity.
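One way to picture a platform-agnostic component is an application that depends only on a small, provider-neutral interface, with one adapter per provider behind it. The sketch below assumes a hypothetical `BlobStore` interface and two in-memory classes standing in for provider adapters; real adapters would wrap each provider's own storage SDK, which is not shown here.

```python
from typing import Dict, List, Protocol, Tuple


class BlobStore(Protocol):
    """Hypothetical provider-neutral storage interface."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class StoreA:
    """Stand-in for an adapter around one provider's object storage."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class StoreB:
    """Stand-in for a second provider with a different internal layout."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, bytes]] = []

    def put(self, key: str, data: bytes) -> None:
        self._rows.append((key, data))

    def get(self, key: str) -> bytes:
        # Scan newest-first so a later put overrides an earlier one.
        for stored_key, data in reversed(self._rows):
            if stored_key == key:
                return data
        raise KeyError(key)


def save_report(store: BlobStore, report_id: str, body: str) -> str:
    # Application code: knows only the interface, not the provider.
    key = f"reports/{report_id}.txt"
    store.put(key, body.encode("utf-8"))
    return key


def load_report(store: BlobStore, key: str) -> str:
    return store.get(key).decode("utf-8")


# The same application code runs unchanged against either backend.
loaded = []
for store in (StoreA(), StoreB()):
    key = save_report(store, "q3", "Q3 totals")
    loaded.append(load_report(store, key))
```

Because `save_report` and `load_report` never reference a concrete provider, replacing or adding a backend means writing a new adapter rather than changing application code. The trade-off is that the interface can expose only the features the providers have in common.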
Operational Resilience Regulations
As the need for operational resilience intensifies in an increasingly interconnected world, governments around the globe are responding by introducing regulations to ensure that critical services, especially in the financial sector, can withstand disruptions.
These regulations aim to provide a safety net, protecting the economy and essential services from the fallout of major service failures.
Some recent examples:
The U.K. The United Kingdom is at the forefront of operational resilience regulations. Through the Operational Resilience Framework, which came online in 2022, U.K. authorities have instructed financial firms to meet specific operational resilience requirements by March 31, 2025. These measures overlay governmental oversight onto organizations’ own internal strategies.
By meeting the minimum operational resilience standards, chief information officers have the flexibility to choose strategies that best suit their organization’s needs, like operating hybrid cloud infrastructure or running on multiple cloud provider platforms.
The European Union. The European Union has adopted a significant initiative called the Digital Operational Resilience Act (DORA). DORA (not to be confused with Google’s DevOps Research and Assessment metrics, also called DORA) aims to ensure that all digital service providers, including cloud service providers, search engines, e-commerce platforms and online marketplaces, regardless of their location within or outside the E.U., have effective strategies and capabilities in place to manage operational resilience.
DORA entered into force in January 2023; financial entities are expected to comply with the regulation by January 17, 2025.
The U.S. In March 2021, the Board of Governors of the Federal Reserve System, the Office of the Comptroller of the Currency and the Federal Deposit Insurance Corporation issued guidance on operational resilience. In May of that year, the Biden Administration issued an executive order on cybersecurity that included rules related to operational resilience.
Regulations around operational resilience are expanding beyond financial services companies. The push for new regulations reflects a growing awareness of the interconnectedness of modern services.
The regulatory scope may soon encompass sectors such as utilities, transportation and healthcare, which are considered essential services because of their critical role in daily life. Regulatory authorities recognize that the resilience of these services is vital for public welfare.