
Why Chaos Engineering Isn’t Just for Operations

9 Feb 2022 3:00am

The days are largely gone when a developer creates code or an application, uploads it and then lets operations engineers take over for the rest.

With the massive adoption of highly distributed Kubernetes and microservices environments, the dynamics have shifted. Now, DevOps teams increasingly share tasks and participate in workflows previously relegated to operations or site reliability engineers (SREs).

The result is that much of a developer's work is no longer tied strictly to application development. Among other things, devs will often be directly involved with testing — and, increasingly, chaos engineering with operations and other teams across a continuous integration/continuous delivery (CI/CD) pipeline.

“My assessment during the past two to three years is that the dynamism of cloud native is forcing other personas, such as developers, to integrate chaos engineering in their workflows” along with operations and QA teams and SREs, Uma Mukkara, ChaosNative‘s CEO and maintainer of LitmusChaos, told The New Stack ahead of ChaosNative’s annual users’ conference, Chaos Carnival, in January.

Here, we explore why and how chaos engineering involves the entire production pipeline with developer support, and how it should properly be implemented and integrated into CI/CD.

What Is Chaos Engineering?

Chaos engineering can be described as finding and fixing weaknesses in distributed applications and their interactions with different components, such as microservices and APIs, when faults are purposely introduced as experiments.

By introducing “chaos at will” through experimentation, it is possible to help avoid failures and to be better prepared for their eventual outcomes, Mukkara said. Improved mean time to recovery (MTTR) following an outage is one example of the benefits chaos engineering offers.
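MTTR is simply the average time a service takes to recover across incidents. As a minimal sketch (the incident-log format here is an illustrative assumption, not any particular monitoring tool's output):

```python
from datetime import datetime, timedelta

def mean_time_to_recovery(incidents):
    """Average recovery time over (outage_start, service_restored)
    pairs -- the kind of data a monitoring system collects during
    chaos experiments. Input format is an assumption for this sketch."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations)

# Two recorded outages: one 4 minutes long, one 8 minutes long.
incidents = [
    (datetime(2022, 2, 9, 3, 0), datetime(2022, 2, 9, 3, 4)),
    (datetime(2022, 2, 9, 5, 0), datetime(2022, 2, 9, 5, 8)),
]
print(mean_time_to_recovery(incidents))  # 0:06:00
```

Running chaos experiments before and after a fix lets a team confirm that this number actually went down.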

A fault is injected into an application, service, network or even hardware in order to induce an application or service to malfunction in some way as the first step in a chaos experiment. “It’s an art of preventing losses at large,” Mukkara said.
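In its simplest form, fault injection means wrapping a service call so the experiment can add latency or errors on demand. A minimal sketch — the wrapper and the `checkout` service are hypothetical, not part of any real chaos tool:

```python
import random
import time

def inject_fault(func, delay_seconds=0.0, failure_rate=0.0):
    """Wrap a service call so a chaos experiment can inject
    latency and/or random connection failures before it runs."""
    def chaotic_call(*args, **kwargs):
        time.sleep(delay_seconds)            # simulated network delay
        if random.random() < failure_rate:   # simulated fault
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return chaotic_call

def checkout(order_id):
    """Stand-in for a business-critical service."""
    return f"order {order_id} confirmed"

# Inject latency only; with failure_rate=0.0 the call still succeeds.
slow_checkout = inject_fault(checkout, delay_seconds=0.01)
print(slow_checkout(42))  # order 42 confirmed
```

Real chaos tools such as LitmusChaos apply the same idea at the infrastructure level, killing pods or degrading networks rather than wrapping function calls.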

The second and most important part of the three-part process is steady-state hypothesis validation to see if a service works the way it should once faults are induced.

For example, the transactions a service offers should continue to maintain a certain completion rate even when network connections are operating at only 80% of their usual capacity. The experiment confirms, or refutes, this steady-state hypothesis.
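The validation itself reduces to comparing an observed rate against a threshold. A minimal sketch, with an illustrative 95% threshold (the function name and numbers are assumptions for this example):

```python
def steady_state_holds(completed, attempted, min_rate=0.95):
    """Steady-state hypothesis: even under the induced fault, the
    service should complete at least min_rate of its transactions.
    The threshold is an illustrative assumption."""
    return completed / attempted >= min_rate

# Under degraded network capacity, 97 of 100 transactions completed.
print(steady_state_holds(97, 100))  # True  -> hypothesis validated
print(steady_state_holds(80, 100))  # False -> weakness found
```

A failed check is the useful outcome: it surfaces a weakness before a real outage does.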

The third part of chaos engineering consists of observability. “This involves a lot of monitoring systems for business-critical services, and when you introduce chaos, you are able to see if there is sufficient recovery so that the service is maintained in a viable way,” Mukkara said.

During a Chaos Carnival conference talk, Henrik Rexed, senior staff engineer for Dynatrace, showed how observability plays a major role in chaos engineering. Using LitmusChaos and other tools, he showed how Prometheus and Dynatrace are useful for gathering the required metrics for Kubernetes clusters.

“You need the right level of observability for chaos engineering,” Rexed said.

The Developer’s Role in Chaos Engineering

A developer’s role in creating applications for Kubernetes and microservices environments can cover a number of tasks, involving both direct and indirect access to clusters. With a CI/CD workflow, for example, the developer might use Jenkins CI to regularly change and commit their code for peer review.

Once reviewed and approved, the code is merged in Git, and the developer updates the Kubernetes YAML file to reference the latest artifacts or images built with Jenkins.

The developer might undertake more testing, such as determining how the application will function in a highly interdependent environment with microservices. In case of failure, the application is handed back to the development team.

The application or new feature update might also fail in production, and Kubernetes lets you roll the application back to a previous version. In both cases, the developer will likely need to be able to troubleshoot and monitor logging and performance data for Kubernetes to fix the code.

The developer’s tests not only involve testing how the application performs in the stack, but how the entire stack with the new code interacts with other services in different environments, APIs and interfaces. This is where chaos engineering begins to become relevant.

Testing is typically limited to gauging the performance of a single component or service. Chaos engineering, on the other hand, extends beyond traditional testing. It involves the validation of a dependent component required to deliver a service, such as an app or a combination of microservices that run in a network, Mukkara said.

Chaos engineering also involves observing what happens when a fault is introduced in the network to see if the app or microservices continue to run as they should.

An example might involve inducing a failure in a Stripe payment API in a mock environment. The chaos experiment gauges whether the service continues to function 99.99% of the time, maintaining the necessary transaction-processing speed, by properly switching to an alternative API when the Stripe system fails.
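The fallback behavior such an experiment validates can be sketched as a simple routing function. Everything here is a stand-in — the gateway callables are hypothetical, not a real Stripe client:

```python
def charge(amount_cents, primary, fallback):
    """Route a payment through the primary gateway, switching to the
    fallback when the primary fails -- the behavior the chaos
    experiment is meant to validate."""
    try:
        return primary(amount_cents)
    except ConnectionError:
        return fallback(amount_cents)

def stripe_gateway(amount_cents):
    """Simulates the injected Stripe outage from the experiment."""
    raise ConnectionError("injected Stripe outage")

def backup_gateway(amount_cents):
    """Stand-in for the alternative payment API."""
    return f"charged {amount_cents} via backup"

print(charge(1999, stripe_gateway, backup_gateway))  # charged 1999 via backup
```

The experiment injects the outage and then checks that availability and transaction latency stay within their targets while the fallback path carries the traffic.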

“Chaos engineering helps to ensure that no matter what, there is no financial liability due to a network, microservices, API or another failure in the environment that could potentially interrupt a service,” Mukkara said.

The developer can integrate chaos engineering from the beginning of the development process, once executable code is packaged into a container image and deployed across different environments. This might be done, Mukkara suggested, by testing how the code withstands failures on Google Cloud Platform, Microsoft Azure or another network.

“Cloud native developers need to start thinking about chaos engineering at the beginning of the production cycle,” he said. “You write your code and test it before doing chaos testing to find if there are any weaknesses” once deployed in the different environments.

Chaos for All Stakeholders

While developers often have a lot to gain from chaos engineering, it should remain a team effort. Many DevOps teams might opt for developers and QA teams to conduct chaos experiments jointly, or an SRE might take the lead in the process.

Once services are deployed, operations teams will often see performance issues and solve them with chaos engineering. “It is actually difficult to define which teams must absolutely use chaos engineering,” Mukkara said. “But everybody can benefit.”

While implementing chaos engineering can seem intimidating from the outset, developers, as well as QA and operations teams, need to embrace the practice. Reduced to its essentials, chaos engineering ensures reliability.

“Chaos engineering can help DevOps teams have a feeling of safety when trying something new,” said Charlotte Mach, an engineering manager at the cloud native consulting company Container Solutions, during The New Stack’s pancake breakfast panel at Chaos Carnival.

“What happens when something breaks or we break something and do we actually get the outcome that we want to have or did something else happen?”

Chaos engineering, Mach said, “is kind of like a safety net you can give people in the beginning” of the production cycle or anywhere else during CI/CD.

Featured image by Hans-Peter Gauster on Unsplash.