
Failover Conf: Ensuring Resilience in the Face of Uncertainty

16 Apr 2020 8:20am, by

Gremlin contributed this post in anticipation of Failover Conf, next Tuesday, April 21.

Andre Newman
Andre is a technical writer for Gremlin where he writes about the benefits and applications of Chaos Engineering. Prior to joining Gremlin, he worked as a consultant for startups and SaaS providers where he wrote on DevOps, observability, SIEM, and microservices. He has been featured in DZone, StatusCode Weekly, and Next City.

Over the course of a single month, our lives have been completely transformed. Office buildings now sit empty, conference halls are dark, and spending time with co-workers means starting up a video call. The COVID-19 pandemic is a stark lesson in just how quickly the world can change, but it’s also an empowering reminder of how resilient and adaptable we are, especially in a world connected by technology.

At Gremlin, our mission is to make the internet more reliable through Chaos Engineering. We’re always thinking about how things can fail, and this applies to people-based systems just as much as tech-based systems. When municipalities started issuing stay-at-home orders, this had little impact on our ability to work. But many organizations are finding themselves in an entirely new and unfamiliar environment. There is a real risk of organizations failing if they can’t adapt to this new normal.

When we think about resilience, what we’re really talking about is the ability to recover quickly from adversity, change, or other problems. People and processes that are resilient can quickly adapt to new and unusual situations, and it’s this “real-world resilience” that can make or break a company in the face of adversity.

With that said, how can we create this level of resilience, especially since we’re already in the middle of a pandemic? Here are some tips:

Always think about failure. As a company obsessed with reliability, we’re always thinking about how our systems and processes can fail. This is how Failover Conf came about. When event cancellations started flooding our news feed, we realized that the developers, professionals, and enthusiasts who looked forward to these events now had nowhere to go. We quickly executed on a plan to create a virtual conference, and with the help of the Chaos Engineering community, brought it to life. It’s impossible to plan for every contingency, and even the best-laid plans can fail quickly, so it’s important to have a fallback strategy.

Cultivate resilient people. We often talk about Chaos Engineering as it applies to technology, but this is only part of the picture. Resilient systems are nothing without resilient people and processes. This is why we have things like FireDrills, where we cause incidents to see how our teams will respond. As long as our people are quick to adapt to new circumstances, we can feel much more confident in our ability to address adversity. COVID-19 is adversity on an unprecedented scale, but consider the resilience shown by the doctors and nurses working 24-hour+ shifts, or the companies pivoting their assembly lines to create PPE, or the SREs working to keep tech services up and running. Society depends on people who can make quick, strategic decisions about how to operate during unexpected events, which is why it’s important to have them on your team.

Focus on what’s important to your customers. Part of being resilient means recognizing the limitations that failure can impose, and working creatively within those limitations to continue providing the best possible customer experience. With Failover Conf, for example, our goal isn’t to recreate the in-person conference experience. We realize that what the community values in a conference is the ability to talk to other people, to learn from experts, and to collaborate. Fancy venues and swag are nice, but they’re not the point. Same goes for companies: if you can’t provide your full service, identify what’s important and focus on that.

Always, always, always test your systems. As we’ve seen this year, disaster can come at any time and in any form. We never know what might happen, so we need to ensure our systems can operate in even the most extreme circumstances. At Gremlin, we believe in eating our own dog food, so we use our own product to test our systems and make them more reliable. Even now, we’re using our Chaos Engineering tools to prepare our platforms for the surge in demand we expect to see as attendees log in, stream video, and run experiments throughout the day.

Learn from your experiences. Once the dust has settled, reflect on the event and your response. Consider what impact it had, how prepared you were, how you responded, and what you might have done differently. This is often easier said than done, especially for an event that’s still ongoing. But the best way for us to improve is by learning from past experiences and planning for them in the future, and sharing these experiences as a community helps us all grow faster.

Conclusion

We look to Chaos Engineering as a way to harden our applications and infrastructure against failure. But at the end of the day, the technology is only there to support people. If we aren’t resilient as individuals, teams, companies, and communities, then even the most resilient systems don’t have much value to us.

Failover Conf serves two major purposes: to provide a platform for the Chaos Engineering community to come together in the wake of conference cancellations, and to show how quickly we as a community can adapt to new circumstances and create something amazing.

Whether you’re an expert who’s been practicing Chaos Engineering for years or just starting out, we invite you to join us on April 21 from 8 a.m. to 5 p.m. PDT.
