Why Chaos Engineering for Jenkins Is Easier Than You Think
Chaos engineering can represent both fear and hope for the developer. For those who have not yet implemented chaos testing in their continuous integration/continuous delivery (CI/CD) pipelines, the fear might be that chaos engineering will introduce software-release delays.
Chaos experiments, which introduce failure into mock or real production environments to test resiliency, might be seen as just more testing that, at its worst, slows developers down. The experiments might also be seen as adding complexity to committing code to the Jenkins pipeline, when the Kubernetes environment behind it is already enormously complex.
However, developers also don’t want to test how their apps run in an isolated environment, only to then have them crash or not work properly in production environments after committing the code to Jenkins.
With the right tools and processes in place, chaos engineering can directly improve software deployment speeds, and can reduce failure rates and mean time to recover (MTTR) when applications crash in production.
During ChaosNative’s annual users’ conference Chaos Carnival 2022, Akram Riahi, a cloud builder for WeScale, a data-access platform provider, showed how chaos engineering works with a Jenkins pipeline, using open source LitmusChaos as the framework.
How to Inject Chaos into a DevOps Pipeline
Traditional tests at the developer stage of CI are used to “look for things we already know that we have — we also have the right to be surprised sometimes by problems we don’t know about and we don’t expect in order to improve the app resilience,” Riahi said. “For that reason, we have to enable developers to inject chaos in their DevOps pipeline as often as they want.”
Riahi showed how developers can “easily inject chaos” via a simple pull request on a Jenkins pipeline. As he described, it is not really that complicated to do.
“The question here really isn’t difficult: do we have enough knowledge to do chaos? How can you deal with it on a daily basis knowing you are pushing a lot of code that has to be resilient?” Riahi asked. “Can we make it easier for developers or [site reliability engineers] to do that? Well, the answer of course is, yes.”
The demo's Jenkins pipeline used LitmusChaos as the framework for injecting chaos via a pull request. The deployment environment consisted of Amazon Elastic Kubernetes Service (EKS), set up with Terraform.
Slack was used for alerts. As Riahi described, it is highly advisable to tell all DevOps team members that injecting chaos can cause performance problems, or even failures with a real blast radius, in a live production environment.
GitHub hooks were used to trigger the Jenkins pipeline, and Grafana and Prometheus made up the monitoring and observability stack.
During the demo, LitmusChaos came into play once the pipeline had updated the app with the new image. It then applied the workflow designed for the chaos experiment in the Kubernetes environment.
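The article does not show the workflow definition used in the demo. As a rough illustration only, the Python sketch below builds a LitmusChaos ChaosEngine manifest, one common way to point a Litmus experiment at a freshly deployed app. The function name, the app and namespace values, the service account name, the pod-delete experiment and the 30-second duration are all assumptions made for the example, not details from Riahi's setup.

```python
import json


def build_chaos_engine(app_name, namespace, experiment="pod-delete", duration_s=30):
    """Return a ChaosEngine manifest (as a dict) targeting one Deployment.

    All argument values are illustrative; the demo's real workflow is not
    described in the article.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": f"{app_name}-chaos", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which workload the experiment is allowed to touch.
            "appinfo": {
                "appns": namespace,
                "applabel": f"app={app_name}",
                "appkind": "deployment",
            },
            # Hypothetical service account name; it must exist in the cluster.
            "chaosServiceAccount": "litmus-admin",
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_s),
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # JSON is accepted by `kubectl apply -f`, so a pipeline stage could
    # write this output to a file and apply it after the image update.
    print(json.dumps(build_chaos_engine("shop", "demo"), indent=2))
```

A Jenkins stage that runs after the deploy stage could generate this file and apply it, which keeps the experiment definition under version control alongside the pull request that triggers it.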
The result of each experiment was either pass or fail. In the event of a failure, notifications were automatically sent via Slack, prompting the developers to go back and build more resiliency into the app.
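The pass-or-fail gate described here can be sketched in a few lines of Python. This is a minimal sketch, not the demo's code: it assumes the pipeline has already fetched the experiment's ChaosResult resource as a dict with the verdict under status.experimentStatus.verdict, and that Slack is reached through an incoming webhook. The function names, the app name and the message wording are hypothetical.

```python
import json
import urllib.request


def read_verdict(chaos_result):
    """Extract the verdict from a ChaosResult dict, defaulting to 'Awaited'."""
    status = chaos_result.get("status", {})
    return status.get("experimentStatus", {}).get("verdict", "Awaited")


def gate(chaos_result, app, experiment):
    """Return (exit_code, slack_payload).

    Exit code 0 and no payload on a pass; exit code 1 and a Slack message
    payload on anything else, so the pipeline stage fails loudly.
    """
    verdict = read_verdict(chaos_result)
    if verdict == "Pass":
        return 0, None
    text = (
        f"Chaos experiment '{experiment}' on '{app}' ended with verdict "
        f"{verdict}. The build is marked failed until resiliency is improved."
    )
    return 1, {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (URL supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    sample = {"status": {"experimentStatus": {"verdict": "Fail"}}}
    code, payload = gate(sample, "shop", "pod-delete")
    print(code, payload)
```

Treating any verdict other than a pass as a failure, including a result that has not arrived yet, is a deliberately conservative choice for this sketch. A real pipeline might instead wait and retry while the experiment is still running.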
Chaos experiments should not be feared but embraced by developers, especially when things fail. “With chaos engineering … we are going to get a lot of failures,” Riahi said. “And we don’t have to be afraid of them because they are instructive” to ultimately build more resilience into applications.