Gremlin sponsored this post.
Modern computing systems are complex and in a constant state of change, a direct result of adopting cloud native technologies and distributed system designs. These technologies and designs save money and can add resilience through automated responses to changing system conditions, but they also introduce failure modes that are sometimes hard to predict. Gartner's recently released report, Innovation Insight for Chaos Engineering, highlights this, beginning by comparing today's systems to pinball machines. From there, the report moves into the practical aspects of implementation and usage, showing how to effectively identify systemic failure modes and improve reliability.
We know that not everyone has access to Gartner research. We’ve been given permission to pull out key quotes, starting with the pinball analogy, that help clarify concepts around Chaos Engineering and reliability (all text in italics is taken directly from the report throughout the article).
Chaos Engineering and Predictability
Pinball machines have existed since the 1800s, first as purely mechanical games, later coin-operated, and in the 1930s electrified. The pinball games themselves frequently have a hero character or theme, and frequently there are goals or scoring regions within the playfield that follow the character's motif. This all sounds simple, yet you will never have the exact same experience when you play two games in a row. In this way, we can easily consider the pinball machine a complex and deterministic chaotic system.
Why is it, then, that we expect digital systems with ownership and complexity far beyond a nostalgic game to provide such consistent experiences? We are working with deterministic (diagrammed and documented) and chaotic (unpredictable) systems, and we need to test them as such. Collaboration and teamwork will be key to the success of this endeavor. It is very likely that neither the development team nor the operations team will have total understanding of code constructs, application and system dependencies, resilient architectures, monitoring and automated remediation technologies. These are all inputs and parameters for consideration when crafting the attack plan.
Chaos Engineering is necessary because modern systems are chaotic and unpredictable. With services and nodes appearing and disappearing according to system load, we can never accurately state precisely what our architecture contains. We can guess and approximate, but not state with certainty.
The pinball analogy works because you never play the exact same game twice. You can learn how to use the flippers more effectively, aiming your shots and scoring bonuses, but even good pinball players will confirm that there are no guarantees. Sometimes the ball slips between the flippers and sometimes the entire machine lights up and makes fantastic noises, even when you thought you acted exactly the same way.
Likewise, our systems never exist in a perfectly identical state compared to last week, yesterday, or even a few minutes ago. Data may travel a different path for two different users, because the load balancer sent their requests to different processing nodes. How can we possibly test our systems in this chaotic atmosphere? Traditional unit tests don't help us. Neither does a stress test or even good integration testing.
We can only discover how our systems work by using Chaos Engineering, which safely recreates failure mode conditions, so that we can observe how the system responds. This prevents us from needing to learn these lessons by surprise, through spontaneous failures that tend to cascade into larger issues like outages. For example, to know how the system will respond to a database that is receiving a higher-than-usual number of requests per second, or how a load balancer will respond to a compute node that suddenly stops communicating due to excessive CPU load, our best option is to simulate that with Chaos Engineering.
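As a simplified illustration of the second scenario above, the following Python sketch models a load balancer routing around a node that an injected fault has made unresponsive. The `Node` and `LoadBalancer` classes and the failover policy are hypothetical stand-ins for illustration, not any real tool's API:

```python
class Node:
    """A toy compute node; 'healthy' stands in for real liveness."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise TimeoutError(f"{self.name} unresponsive")
        return f"{self.name} served {request}"

class LoadBalancer:
    """Tries nodes in order and skips any that time out (simple failover)."""
    def __init__(self, nodes):
        self.nodes = nodes

    def route(self, request):
        for node in self.nodes:
            try:
                return node.handle(request)
            except TimeoutError:
                continue  # node is down; fail over to the next one
        raise RuntimeError("no healthy nodes available")

def inject_node_failure(node):
    # Fault injection: make one node stop responding, as if its CPU
    # were pinned at 100% and health checks were timing out.
    node.healthy = False

nodes = [Node("node-a"), Node("node-b")]
lb = LoadBalancer(nodes)
inject_node_failure(nodes[0])   # the experiment
result = lb.route("req-1")      # observe: traffic fails over to node-b
```

A real experiment injects the fault into actual infrastructure and observes real load balancer behavior, but the shape is the same: inject a controlled failure, then verify the system responds the way its design promises.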
Testing your system with Chaos Engineering is how you find out whether your self-healing, autoscaling design works as intended to mitigate expected problems. It is also how you find failure modes you didn’t anticipate so that you can make your system resilient to those previously unknown issues. This is how we build reliability.
Chaos Engineering Requires Focus
Focus your teams’ efforts on understanding user journeys, their experiences and downtime drivers, to enhance system reliability and minimize friction.
Support your team with time and resources to evaluate the resilience and reliability of your system through proactively engaging in chaos engineering safely and with a test-first approach within your pre-production environment.
Implement fault tolerance by identifying components and systems that are central to failure, attacking and improving their safety mechanisms.
Chaos Engineering works best when it is used systematically. What we mean is that testing and chaos experimentation is done in a methodical manner. It is not itself chaotic, but seeks to rein in failure modes that result from system chaos and constant change.
We begin by thinking about our system. Where are the known weak spots? Which parts of the system cause the most user-facing issues? What can we do to make our customers happier with our system? How can we home in on those issues one by one and design experiments that help us discover precisely what is going wrong, but in a way that doesn't cause real problems?
What we do is design experiments with a limited blast radius, intentionally restricting which parts of our system can be impacted with the failure injection tests we are about to perform. We also limit the magnitude of the failure we are about to cause, keeping it at a level we think will be informative to us but below the level we think will cause any problems across the system.
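A limited first experiment might be described in code roughly like this. The field names and thresholds below are illustrative assumptions, not any vendor's schema; the point is that blast radius (which targets, how many) and magnitude (how severe, how long) are explicit, bounded parameters with abort conditions attached:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    target_tags: dict         # blast radius: which hosts may be affected
    percent_of_targets: int   # blast radius: fraction of matching hosts
    magnitude: dict           # severity and duration of the injected fault
    abort_conditions: list    # halt immediately if any of these trip

# Start small: one service in staging, 10% of its hosts, modest latency.
first_run = Experiment(
    target_tags={"service": "checkout", "env": "staging"},
    percent_of_targets=10,
    magnitude={"added_latency_ms": 100, "duration_s": 60},
    abort_conditions=["error_rate > 1%", "p99_latency > 2s"],
)
```

As confidence grows, later runs raise `percent_of_targets` or `magnitude` incrementally rather than all at once.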
As we gain experience, learn things about our system, fix and mitigate problems, and implement reliability-focused features, we can increase the scope of our testing beyond the original blast radius and magnitude, learning more about our system.
It is no longer adequate to design and believe we have proper system safety and reliability. We have to test it to be sure. As we confirm that things like autoscaling are working well, we will frequently find issues and failure modes serendipitously along the way. This added bonus makes Chaos Engineering even more valuable, as we can reduce or prevent failures that we never expected or anticipated. This is how we create fault tolerance.
Social Aspects Matter When Implementing Chaos Engineering
It is important to recognize the practice as one of social engineering, as well as reliability engineering. Do you question the knowledge silos within your team, or the extent or adequacy of your system documentation? These concerns can be explored using chaos engineering to either put the concerns in the past, or build plans that will satisfy the learning and knowledge needs of the future. From this perspective, chaos engineering works to build trust into teams, knowledge and systems.
The idea of Chaos Engineering as an aspect of social engineering and reliability engineering is solid. A good implementation of the practice will help find gaps in things like incident runbooks or playbooks, discover which people on teams know things that have not been documented and could cause single points of failure, and build cross-team and intra-team trust as knowledge is spread and absorbed. As explained by Lenny Sharpe, Director of IT Resiliency Engineering Enablement at Target, “doing Chaos Engineering with Gremlin has helped us break down knowledge monopolies and validate our runbooks, resulting in dramatic improvements to our incident response times and production environments.”
Everyone gains when everyone learns more about how the system works and what to check first when something goes wrong. It really isn’t enough to hand a new junior a pager and tell them they are on call for the next 24 hours when they have just started. Chaos Engineering can help you test your runbooks and is a great way to train newcomers.
Having an experienced engineer inject a failure similar to ones that have happened in the past, and having your junior engineer use a runbook to learn how to fix that failure, is a great way to build muscle memory while enhancing system understanding. Events like these are called FireDrills, because they teach us to stay calm in emergency situations and to act with wisdom and efficiency.
Similarly, running a GameDay helps us build confidence in our system, as we poke and prod and test using failure modes that simulate past events — specifically problems we think we have fixed and that should not be able to happen again. We can’t be certain our mitigation schemes are valid until we test them out, to see if the system is now reliable under the same conditions that caused it to fail before.
Trust Is Vital to Chaos Engineering Success
Trust remains the largest factor in the adoption of chaos engineering, as an existing foundation to build from, but also to be acknowledged and cared for as a cornerstone of company culture. This started with a chaos engineering vendor who not only provided a tool, but helped American Airlines establish trust in the practice. The vendor they selected was on-site for training, when they ran the first chaos attack, as well as one of their disaster recovery (DR) days (a day that the IT leadership sets aside for all work to be dedicated to DR improvements). Their vendor relationship has gone beyond buying the tool to helping build expertise.
Regardless of whether you use Gremlin or another Chaos Engineering solution, it’s important that you not only get a great tool, but also great service. Chaos Engineering is a practice and a discipline, and it’s not the best idea to go out and break things in production if you haven’t built up good habits and muscle memory.
We learned quickly after launching in 2017 that we needed to offer more guidance, which prompted us to start a community, provide expertise as part of our offering, and build more templates into our product (e.g. Scenarios).
The Gremlin tool has a clear web UI, a useful CLI, and a powerful API. You can run experiments on one node in your system, or multiple nodes. You can easily limit the blast radius as you select from a wide variety of possible fault injection attacks. All of this is combined with the ability to stop any experiment instantly using a big red Halt button or API call, making it safe and easy to roll back testing if a larger-than-expected problem appears. We’ve also announced Status Checks that will automatically ensure your systems are healthy and ready to be experimented on — these are all features that we believe help build trust with our users.
Finally, we have founders who have been doing Chaos Engineering for over a decade, from places like Netflix and Amazon. We’ve hired SREs and Chaos Engineers from places like Google, Dropbox and Uber. We are committed to running GameDays and FireDrills with our customers, to make sure they not only have the proper tools to do the job, but the proper know-how as well!
Feature image via Pixabay.