Thundra sponsored this post.
Chaos engineering is the act of introducing failures into systems to test their response and discover flaws before they cause downtime. It’s an evolving practice that allows engineering teams and developers to take ownership of their software’s reliability.
In practice, an application that isn’t tested effectively and routinely is more prone to downtime, which can lead to loss of customers. A 2020 survey by ITIC found that 87% of organizations require a minimum of 99.99% availability. Four in 10 enterprises also reported that a single hour of downtime can cost them between $1 million and over $5 million, excluding fines and legal fees.
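To put the 99.99% figure in perspective, an availability target can be converted into a yearly downtime budget. The short sketch below does that arithmetic; the function name is illustrative and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, so a single hour-long outage already exceeds it.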
Chaos engineering was introduced by Netflix in 2011, when the company’s streaming and online rental services ran on on-premises servers that suffered massive outages and server failures. In response to these problems, Netflix decided to migrate from physical on-premises infrastructure to a distributed cloud-based architecture running on AWS, to support increasingly resource-hungry and complex activities, such as expanding its customer base to 100 million users in over 190 countries.
Although the migration had advantages, it introduced new complexities and the need to develop fault-tolerant and reliable systems. At that point, the Netflix engineering team built a suite of open source tools called the Simian Army, for checking the resilience, reliability, and security of their AWS infrastructure against all kinds of failure.
Principles of Chaos Engineering
In chaos engineering, you run planned and thoughtful experiments that generate new knowledge about a system’s performance, properties and behaviors in the event of a failure.
The following points summarize some principles you need to follow when running chaos experiments:
Define Metrics for the Steady State of Your System
To successfully run chaos experiments, you need to define the metrics that indicate your application’s behavior in normal conditions. A system’s steady states depend on its use case and purpose. Hence, a good understanding of the steady states will enable you to track, monitor and properly understand how your system works when it encounters a bug.
When you define your system’s steady states, business metrics are more useful than purely technical metrics, because they reflect an application’s health as customers experience it. They’re also more suitable for measuring customer operations or experience. For example, Netflix uses “stream starts per second” to track how often its users press the play button on a streaming device. Other examples of business metrics are the number of declined transactions per minute, searches per hour, number of failed logins per minute, and the number of logins during a peak period.
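One way to make a steady state concrete is to pair a business metric with the range you consider normal for it. The following is a minimal sketch under that assumption; the `SteadyState` class, the `events_per_minute` helper, and the thresholds are all hypothetical and would come from your own monitoring data in practice.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteadyState:
    """A business metric plus the range considered normal for it."""

    metric_name: str
    lower_bound: float
    upper_bound: float

    def holds(self, observed_value: float) -> bool:
        """Return True if the observed value is inside the normal range."""
        return self.lower_bound <= observed_value <= self.upper_bound


def events_per_minute(event_timestamps: Sequence[float], window_seconds: float) -> float:
    """Convert the events seen in a time window into a per-minute rate."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(event_timestamps) * 60.0 / window_seconds


if __name__ == "__main__":
    # Hypothetical threshold: up to 5 failed logins per minute is considered normal.
    failed_logins = SteadyState("failed_logins_per_minute", 0.0, 5.0)
    rate = events_per_minute([1.0, 20.0, 45.0, 100.0], window_seconds=120)
    print(failed_logins.metric_name, rate, failed_logins.holds(rate))
```

Keeping the definition this explicit makes it straightforward to check the same condition before, during, and after an experiment.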
Minimizing the Blast Radius
When you run chaos experiments in production, you will likely experience unexpected system outages and negative customer impact. Because system failures are inevitable, you’ll need to ensure that the negative impacts of chaos experiments are contained and minimized.
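A common way to contain an experiment is to expose only a small, stable share of users to the injected failure. The sketch below illustrates one approach, assuming users are identified by a string ID; the function name and the hashing scheme are illustrative choices, not a specific tool’s API.

```python
import hashlib


def in_blast_radius(user_id: str, experiment_name: str, fraction: float) -> bool:
    """Decide deterministically whether a user is exposed to an experiment.

    The same user always gets the same answer for a given experiment, and
    roughly `fraction` of all users are selected.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform 64-bit integer
    threshold = int(fraction * 2**64)
    return bucket < threshold


if __name__ == "__main__":
    exposed = [u for u in (f"user-{i}" for i in range(20))
               if in_blast_radius(u, "cache-outage", 0.10)]
    print("users exposed to the experiment:", exposed)
```

Because selection compares a fixed hash against a threshold, raising the fraction later only adds users; anyone already exposed stays exposed, which keeps results comparable between runs.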
Running Experiments Continuously
One-off experiments are a great way to start, but a single run only tells you how the system behaved at that moment. Running chaos experiments continuously and automatically lets you catch new failures as the system changes, keeps building confidence in it over time, and frees you to spend more time implementing new services and features.
Scaling the Blast Radius
Chaos engineering isn’t about causing outages, but about learning how your system behaves under failure. Hence, you need to follow a granular approach when injecting failures. This means injecting a small failure, examining the system’s output and the impact of the failure, and noting your observations. If the system shows no adverse effects, increase the chaos and, consequently, the blast radius. By scaling the blast radius gradually, you can identify further failures that reflect real-life system behavior.
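The escalation described here can be sketched as a simple loop that widens the experiment step by step and stops as soon as the steady state is violated. This is an illustration only: `run_at_fraction` stands for whatever mechanism runs your experiment against a given share of traffic and reports an error rate, and the step sizes are arbitrary.

```python
from typing import Callable, Optional, Sequence


def escalate_blast_radius(
    run_at_fraction: Callable[[float], float],
    max_error_rate: float,
    steps: Sequence[float] = (0.01, 0.05, 0.10, 0.25),
) -> Optional[float]:
    """Run an experiment at growing traffic fractions.

    Returns the first fraction at which the observed error rate exceeds
    max_error_rate, or None if every step stayed within the limit.
    """
    for fraction in steps:
        observed_error_rate = run_at_fraction(fraction)
        if observed_error_rate > max_error_rate:
            return fraction  # stop here; do not widen the blast radius further
    return None


if __name__ == "__main__":
    # Hypothetical system that copes with small failures but degrades at 10% of traffic.
    def fake_experiment(fraction: float) -> float:
        return 0.001 if fraction < 0.10 else 0.08

    print("first failing fraction:", escalate_blast_radius(fake_experiment, max_error_rate=0.02))
```

Stopping at the first violation keeps the impact at the smallest radius that reveals the problem.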
Running Chaos Engineering Experiments
Production systems are bound to fail, but chaos engineering helps you develop applications that can cope with unexpected events and inevitable disasters.
Below are the steps to follow to effectively run a chaos engineering experiment.
- Formulate a hypothesis.
To successfully run chaos experiments, you need to make some realistic assumptions about how your system will behave when it encounters unexpected events or failures. The best way to develop your hypothesis is to discuss how the app should react to unexpected changes with all those involved in its development and operation.
You can kick off the brainstorming session by asking several “what if” questions and allowing everyone on the development, support engineering, and operations teams to come up with several scenarios that could affect your system’s steady state. By sitting with your team and whiteboarding your dependencies (external and internal), data stores and services, you can create a picture of what could go wrong in your system.
- Inject realistic failures and bugs.
Your chaos experiments should reflect likely and realistic scenarios. Injecting real failures during your experiments will help you get a good sense of what technologies and processes need an upgrade. For instance, you can proactively inject events that correspond to realistic software failures (like malformed messages and responses), hardware failures (like server crashes or scaling events), or non-failure events (like traffic spikes).
- Measure the impact.
To fully comprehend how your system behaves under stress or the changes in its steady state behavior when it encounters a bug, you need to analyze your experiment’s outcome on the system. You should measure the impact of the failures on key performance metrics that correlate to customer success. Examples would be requests per second, orders per minute, or stream starts per second.
- Verify or disprove your hypothesis.
After running chaos experiments, you’ll either discover a problem that needs to be fixed, or verify that your system is resilient to your injected failure. Both of these outcomes are good; they will increase your confidence in the entire system’s capabilities, or uncover problems that you need to remediate before they cause an outage in production.
Since chaos engineering is mostly about formulating a hypothesis and then verifying or disproving it, the more details you gather about your system, the better you can make predictions based on its known vulnerabilities.
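The four steps above can be sketched end to end in a few lines. Everything in this example is hypothetical: `fetch_recommendations` stands in for a real dependency, `handle_request` for a request handler with an optional fallback, and the hypothesis is that the share of successfully served requests drops by no more than `max_drop` when the dependency fails.

```python
import functools
import random
from dataclasses import dataclass
from typing import Callable, List


class InjectedFailure(RuntimeError):
    """Stands in for a real dependency error during an experiment."""


def with_injected_failures(func: Callable, failure_rate: float, rng: random.Random) -> Callable:
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id: int) -> List[str]:
    """Hypothetical dependency; always succeeds when no failure is injected."""
    return [f"title-{user_id}-{n}" for n in range(3)]


def handle_request(user_id: int, fetch: Callable, use_fallback: bool) -> bool:
    """Return True if the user was served a usable response."""
    try:
        items = fetch(user_id)
    except InjectedFailure:
        if not use_fallback:
            return False
        items = ["popular-1", "popular-2"]  # fallback: generic content
    return len(items) > 0


def success_rate(fetch: Callable, use_fallback: bool, requests: int = 1000) -> float:
    served = sum(handle_request(uid, fetch, use_fallback) for uid in range(requests))
    return served / requests


@dataclass
class ExperimentResult:
    baseline: float
    under_failure: float
    hypothesis_holds: bool


def run_experiment(failure_rate: float, max_drop: float, use_fallback: bool, seed: int = 0) -> ExperimentResult:
    # Step 1: the hypothesis is "success rate drops by at most max_drop".
    baseline = success_rate(fetch_recommendations, use_fallback)
    # Step 2: inject a realistic failure into the dependency call.
    faulty = with_injected_failures(fetch_recommendations, failure_rate, random.Random(seed))
    # Step 3: measure the same metric while the failure is active.
    under_failure = success_rate(faulty, use_fallback)
    # Step 4: verify or disprove the hypothesis.
    return ExperimentResult(baseline, under_failure, baseline - under_failure <= max_drop)


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.3, max_drop=0.01, use_fallback=True))
    print(run_experiment(failure_rate=0.3, max_drop=0.01, use_fallback=False))
```

With the fallback enabled the hypothesis holds; without it, the same injected failure disproves the hypothesis and points to the missing fallback as the thing to fix.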
Integrating Chaos Engineering into CI/CD
Even though automated CI/CD pipelines enable fast product iterations, provide standardized feedback loops for developers and reduce the chances of manual errors, they can’t predict all of an application’s failure modes. Therefore, organizations need solutions that help them discover an application’s vulnerabilities and understand how it performs when one or more components are affected at build time. This is where chaos engineering intersects with DevOps.
By integrating chaos engineering into CI/CD pipelines, you can build more antifragile applications and ensure that reliability is baked into every component of your system. When you break things on purpose and test how a system works under stress, you can detect application failures and fix them before they cause a costly outage. This also leads to fewer repeat incidents, shorter mean time to respond to high-severity incidents, improved system design, and more resilient systems.
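In a pipeline, this usually comes down to a gate: a step that runs the experiments and fails the build when a hypothesis does not hold. The sketch below shows only that final decision. The `chaos_gate` function and the experiment names are hypothetical, and how outcomes are collected depends on your tooling.

```python
from typing import Mapping


def chaos_gate(experiment_outcomes: Mapping[str, bool]) -> int:
    """Turn experiment outcomes into a process exit code for a CI/CD step.

    Each value is True if that experiment's hypothesis held. Returns 0 when
    every hypothesis held (or no experiments ran) and 1 otherwise.
    """
    failed = sorted(name for name, held in experiment_outcomes.items() if not held)
    for name in failed:
        print(f"chaos experiment failed: {name}")
    return 1 if failed else 0


if __name__ == "__main__":
    # Hypothetical outcomes; a real pipeline would collect these from experiment runs.
    outcomes = {"cache-outage": True, "slow-payment-api": False}
    print("exit code:", chaos_gate(outcomes))
```

A pipeline step could pass the returned value to `sys.exit` so that a disproved hypothesis blocks the deployment in the same way a failing unit test would.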
Netflix has already integrated chaos engineering into its CI/CD pipelines. The company developed ChAP (Chaos Automation Platform) to overcome the limitations of FIT (failure injection testing) and increase the pace, breadth and safety of its experimentation. FIT itself helps Netflix build more resilient systems by propagating failures across the entirety of its system in a controlled and consistent way.
At a high level, ChAP automates experiments: it interrogates the Netflix deployment pipeline for a user-specified service, launches both control and experimental groups of that service, and routes a small amount of traffic to each group.
If the results exceed a predetermined error budget or threshold, ChAP ends the automated experiment to prevent catastrophic damage. Netflix also integrated ChAP with Spinnaker, an open source CI/CD platform built by Netflix and supported by Oracle, Microsoft and Google. This allows engineering teams to run experiments continuously, using ChAP to identify mistuned retry policies, CPU-intensive fallbacks, and unexpected interactions between circuit breakers and load balancers.
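The control-versus-experiment comparison with an automatic stop can be illustrated with a small monitoring loop. This is not ChAP’s implementation; it is a simplified sketch in which each sample is a pair of error rates (control group, experimental group) for one time interval, and the experiment is aborted once the gap exceeds an assumed error budget.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Verdict:
    aborted: bool
    intervals_observed: int
    worst_gap: float


def monitor_experiment(samples: Iterable[Tuple[float, float]], error_budget: float) -> Verdict:
    """Compare experimental and control error rates interval by interval.

    Stops reading samples as soon as the experimental group's error rate
    exceeds the control group's by more than error_budget.
    """
    worst_gap = 0.0
    observed = 0
    for control_rate, experiment_rate in samples:
        observed += 1
        gap = experiment_rate - control_rate
        worst_gap = max(worst_gap, gap)
        if gap > error_budget:
            return Verdict(aborted=True, intervals_observed=observed, worst_gap=worst_gap)
    return Verdict(aborted=False, intervals_observed=observed, worst_gap=worst_gap)


if __name__ == "__main__":
    # Hypothetical per-interval error rates: (control, experiment).
    readings = [(0.010, 0.012), (0.010, 0.050), (0.010, 0.200)]
    print(monitor_experiment(readings, error_budget=0.02))
```

Comparing against a control group rather than a fixed number means a general traffic anomaly that hits both groups equally does not end the experiment by itself.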
Microsoft also uses automated fault injection techniques and chaos engineering principles to increase confidence and resilience in the applications they deliver to customers, the products they ship, and the services they make available to developers.
Ultimately, the need to integrate chaos engineering into CI/CD pipelines will only grow as customers rely increasingly on functional systems, threats become more sophisticated, and room for error shrinks. By using chaos engineering and fault injection, developers can measure, understand and improve application resilience. Architects can build confidence in their designs, and operations teams can also validate new data centers and hardware before they roll them out for customers.
Thundra Chaos Injection Feature
Using chaos engineering, developers and engineering teams can build distributed business-critical or high-availability systems.
Thundra uses chaos injection to incorporate chaos engineering into services. This feature lets you proactively inject failures into your applications and observe how they affect your system. Thundra gives you tools to run chaos engineering experiments on modern architectures and test your architecture’s resilience before any issue occurs. Thundra currently supports chaos injection in Python, Node.js, and Java.