Recent high-profile system outages have brought the importance of system resilience and redundant architecture to the forefront of industry discussion. While business continuity considerations are not new, the technology employed has evolved. Where enterprises once primarily provisioned backup data centers, they now have a variety of layers to consider when architecting resilient applications and infrastructure.
When applications and websites are unavailable, revenue and reputation suffer. But our increasing reliance on digital systems has extended the definition of resilience beyond outages and service disruption, to also include performance and application delivery — which are equally important. End users today expect the applications and services they use to be responsive. A lag of even seconds is too long.
Enterprises looking to build and maintain resilient applications and infrastructure should consider these seven recommendations.
Diversify Infrastructure Across Multiple Providers
While some may be tempted to go “all in” with a single cloud or CDN provider, this approach can result in costly downtime if the provider goes offline or suffers degraded performance. Companies that diversify infrastructure by using two or more providers with distributed footprints can significantly reduce latency by bringing content and processing closer to users. And if one provider experiences problems due to network congestion, geographical restrictions, resource availability or other issues, automated failover systems can keep the impact on users to a minimum.
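As a rough illustration of the automated failover described above, the sketch below tries a list of providers in priority order and returns the first successful response. The provider functions and the names `fetch_with_failover`, `primary_cdn` and `secondary_cdn` are hypothetical stand-ins, not any vendor's API; a real setup would more likely fail over at the DNS or load-balancer layer.

```python
from __future__ import annotations

from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails to serve the request."""


def fetch_with_failover(
    path: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, response)."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(path)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two providers; real ones would make network calls.
def primary_cdn(path: str) -> str:
    raise ConnectionError("primary CDN unreachable")


def secondary_cdn(path: str) -> str:
    return f"content of {path} from secondary"


if __name__ == "__main__":
    providers = [("primary", primary_cdn), ("secondary", secondary_cdn)]
    print(fetch_with_failover("/index.html", providers))
```

Keeping the provider list as plain data means a third provider can be added, or the priority order changed, without touching the failover logic.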
Consider Implementing Microservices
Newer technologies such as microservices and containers push resilience to the forefront for application developers. As enterprises move away from monolithic applications run in physical data centers toward widely distributed microservices, they must address early on how these services interact with one another, which means redundancy is typically built in during the design phase. This is why enterprises already undergoing digital transformation, or working toward upgrading their systems, should consider a microservices approach.
As organizations grow, different parts of their systems come under stress at different times. Microservices and other non-monolithic architectures let them scale those specific components independently. With microservices, a failing component may cause a partial failure, but complete outages are rare.
Build Redundancy Into the Code Base
Enterprises can address resilience from a software development standpoint by building redundancy into their code. A global streaming provider uses this approach so that if one of its cloud providers fails, its home-built system is activated to keep it online. Similar strategies are often employed by e-commerce companies, where even minutes of downtime can result in significant lost revenue. Chaos engineering experts at Gremlin estimate that 10 minutes of downtime for Amazon would cost the e-commerce giant $2 million in revenue. As a result, many e-commerce companies write their code so that applications can also run in data centers as part of their backup/redundancy strategy. The shopping cart application may run slower in this environment, but a slow shopping cart is better than no shopping cart.
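To put the Gremlin figure in perspective, the short calculation below converts it to a per-minute rate and scales it to other outage lengths. The only input from the article is the $2 million per 10 minutes estimate; treating the cost as linear in time is a simplifying assumption, and the function name `downtime_cost` is made up for this sketch.

```python
# Figures quoted in the article (Gremlin's estimate for Amazon).
REFERENCE_COST_USD = 2_000_000
REFERENCE_MINUTES = 10


def downtime_cost(minutes: float) -> float:
    """Estimate lost revenue, assuming cost scales linearly with outage length."""
    cost_per_minute = REFERENCE_COST_USD / REFERENCE_MINUTES
    return cost_per_minute * minutes


if __name__ == "__main__":
    for minutes in (1, 10, 30):
        print(f"{minutes:>2} min of downtime: about ${downtime_cost(minutes):,.0f}")
```

At that rate, each minute of downtime costs roughly $200,000, which is why a slower fallback environment can still be worth maintaining.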
Introduce Chaos Engineering as a Practice
Chaos engineering, the practice of intentionally introducing problems to identify points of failure in systems, has become an important component in delivering high-performing, resilient enterprise applications. Intentionally injecting “chaos” into controlled production environments can reveal system weaknesses and enable engineering teams to better predict and proactively mitigate problems, before they present a significant business impact. Conducting planned chaos engineering experiments can provide the intelligence that enterprises need to make strategic investments in system resiliency.
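A chaos experiment can start very small. The sketch below wraps a function so that a configurable fraction of calls fails on purpose, then counts how many calls succeed. The names `inject_faults`, `InjectedFault` and `run_experiment` are invented for this illustration and are not part of Gremlin or any other chaos engineering product; real experiments usually inject faults at the infrastructure level (network, hosts, dependencies) rather than inside application code.

```python
import random
from functools import wraps


class InjectedFault(Exception):
    """Marks a failure that was introduced deliberately by the experiment."""


def inject_faults(failure_rate: float, rng: random.Random):
    """Decorator: fail roughly `failure_rate` of calls instead of running them."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"simulated failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def run_experiment(call, attempts: int) -> int:
    """Invoke `call` repeatedly and return how many attempts succeeded."""
    succeeded = 0
    for _ in range(attempts):
        try:
            call()
            succeeded += 1
        except InjectedFault:
            pass  # a real experiment would record latency, retries, alerts, etc.
    return succeeded


if __name__ == "__main__":
    # Seeded generator so the experiment is repeatable.
    @inject_faults(failure_rate=0.3, rng=random.Random(42))
    def lookup_price() -> float:
        return 19.99

    print(run_experiment(lookup_price, attempts=100), "of 100 calls succeeded")
```

Passing in a seeded `random.Random` keeps runs repeatable, which makes it easier to compare system behavior before and after a fix.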
Adjust Traffic Routing Policies
Companies can minimize the risk of downtime and latency by implementing traffic routing strategies that combine real-time data about network conditions and resource availability with real user measurement data. This enables IT teams to deploy new infrastructure and manage the use of resources to route around problems or accommodate unexpected traffic spikes. For example, enterprises can tie traffic steering capabilities to VPN access, to ensure users are always directed to a nearby VPN node with sufficient capacity. As a result, users are shielded from outages and localized network events that would otherwise interrupt business operations. Traffic steering can also rapidly spin up new cloud instances to increase capacity in strategic geographic locations where internet conditions are chronically slow or unpredictable. As a bonus, teams can set up controls to steer traffic to low-cost resources during a traffic spike, or cost-effectively balance workloads between resources during periods of sustained heavy usage.
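The VPN example above can be reduced to a simple policy: among healthy nodes with enough spare capacity, pick the one with the lowest measured latency. The sketch below assumes the latency, health and capacity figures are already being collected somewhere; the `Node` fields, the `choose_node` function and the 20% headroom threshold are all illustrative assumptions rather than any product's behavior.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    latency_ms: float  # e.g. derived from real user measurements
    healthy: bool      # result of the latest health check
    capacity: int      # maximum concurrent sessions
    active: int        # sessions currently in use

    @property
    def headroom(self) -> float:
        """Fraction of capacity still free (0.0 when capacity is unknown)."""
        if self.capacity <= 0:
            return 0.0
        return 1 - self.active / self.capacity


def choose_node(nodes: list[Node], min_headroom: float = 0.2) -> Node | None:
    """Return the lowest-latency healthy node with enough spare capacity."""
    candidates = [n for n in nodes if n.healthy and n.headroom >= min_headroom]
    if not candidates:
        return None  # caller decides: queue, shed load, or add capacity
    return min(candidates, key=lambda n: n.latency_ms)


if __name__ == "__main__":
    nodes = [
        Node("london", latency_ms=10, healthy=False, capacity=100, active=10),
        Node("frankfurt", latency_ms=20, healthy=True, capacity=100, active=95),
        Node("amsterdam", latency_ms=35, healthy=True, capacity=100, active=40),
    ]
    print(choose_node(nodes))
```

Here the closest node is down and the next closest is nearly full, so the policy steers the user to the third option. A `None` result is the signal that more capacity is needed.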
Define SLAs and Monitor System Performance Continuously
Enterprises should monitor their applications and systems to get ahead of performance fluctuations, outages or other problems. Monitoring the health and response times of each part of an application is a key aspect of system resilience. Measuring how long an application’s API call takes, or the response time of a core database, for example, can provide early warning of emerging problems and allow IT teams to address them before users are affected. This approach also includes creating service level agreements (SLAs) for different sub-applications and systems, and then monitoring those to ensure they continue to be met.
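One common way to express such an SLA is as a latency percentile, for example "95% of API calls complete within 300 ms." The sketch below checks a batch of response-time samples against a target of that form. The threshold, the sample values and the function names are made up for illustration, and the nearest-rank percentile used here is only one of several accepted definitions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples to evaluate")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_sla(samples_ms: list[float], threshold_ms: float, pct: float = 95):
    """Return (within_sla, observed_ms) for the given percentile target."""
    observed = percentile(samples_ms, pct)
    return observed <= threshold_ms, observed


if __name__ == "__main__":
    # Hypothetical response times for one API endpoint, in milliseconds.
    api_samples = [120, 130, 110, 900, 125, 140, 135, 115, 128, 132]
    within_sla, observed = check_sla(api_samples, threshold_ms=300)
    status = "OK" if within_sla else "BREACH"
    print(f"p95 = {observed} ms, SLA 300 ms: {status}")
```

A percentile target catches the slow outlier in this sample that an average would partly hide, which is usually the behavior end users notice first.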
Getting Started with New Systems and Applications
Enterprises looking to add resilience to their IT stack should start with new applications or services that have less direct impact on the business. While some may be tempted to add resiliency to a core service or application first, this approach can result in costly, and more damaging, downtime should things go awry. IT staff can learn from addressing resilience in new systems first. Perhaps an organization is launching a new support portal. Testing new approaches to resilience on this service carries less risk and leaves room for some hiccups. Later, IT teams can apply what they learned to business-critical systems and services.
As organizations take a closer look at their approach to resilience, they must weigh the costs and benefits of each strategy. These seven recommendations require investments in additional services and architecture, as well as time from IT teams, which companies should carefully consider before determining the best course of action. Regardless, they should prioritize resilience as a best practice to ensure high availability and optimal performance for their digital applications and services. This is imperative to keep business moving forward and maintain a competitive advantage.