Chaos Engineering Is Not Just for Ops
When we think of chaos engineering, we first think of ops and site reliability engineers (SREs). Many also assume it is only for SREs who run operations at the scale of Google or Netflix. Though chaos engineering started as a way to find and fix unknown problems at scale, it has evolved in recent years into a much broader practice. Beyond ops, it is beginning to play a major role in continuous integration/continuous delivery (CI/CD) and as an aid that improves the developer experience. Chaos frameworks are beginning to feature in lists of must-have dev tools.
In this article, we discuss the role of chaos engineering in stepping up the cloud native developer experience.
In cloud native computing, applications are expected to be resilient, loosely coupled, scalable, manageable and observable. Because of containerization, microservices proliferate and ship quickly, which makes microservice environments more dynamic. In such an environment, making applications resilient means deploying them in a fault-tolerant manner, but it also means building each application to withstand faults in the services it depends on (upstream or downstream) and to take appropriate action when such a failure occurs. Similarly, quality assurance teams should cover these fault scenarios during the CI and CD process. Finally, ops teams must continue testing the service for resilience in production by practicing chaos engineering. Continuous verification can and should be practiced at all stages of the product life cycle.
Chaos Engineering for Developers
Cloud native developers follow Curly’s law while developing microservices: each service should do one thing. This enables modularization and faster shipping of applications, but it also requires well-defined conditions in the code to handle faults in other microservices, unexpected API responses and the behavior of the underlying platform, such as Kubernetes. While Kubernetes plays a major role in enabling microservices architecture, it also brings certain assumptions that developers should be aware of. For example, pods are evicted or moved between nodes several times more often than, say, VMs are moved within an ESX cluster. In a scaled environment, pod eviction can happen at any time, depending on load conditions and other environmental factors, but the service should continue to work just fine.
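One way to picture the "well-defined conditions" described above is a small retry-with-fallback wrapper around a call to a dependent service. The following is a minimal sketch, not a prescribed pattern: the function name call_with_retry, its parameters and the choice of ConnectionError as the transient fault are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `attempts` times, then use `fallback`.

    Hypothetical helper: ConnectionError stands in for any transient
    fault, such as the pod behind a dependent service being evicted.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Back off exponentially, but do not wait after the last try.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of crashing.
    return fallback()
```

A chaos experiment that deletes the dependent pod during development is a direct way to check that the fallback path is actually reached and that the service keeps answering.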
A host of chaos tests can be performed during development to build resilience against Kubernetes faults and common cloud native infrastructure and application faults. LitmusChaos, which is itself a cloud native application, provides the capabilities needed to practice end-to-end chaos engineering. Its chaos experiments are declarative, so cloud native developers can add or execute them in the same way they manage other Kubernetes resources. The core of Litmus is built on chaos CRDs (custom resource definitions), which makes practicing chaos feel natural for cloud native developers.
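To make the declarative, CRD-based approach concrete, the sketch below builds a ChaosEngine resource as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The field names follow the litmuschaos.io/v1alpha1 ChaosEngine schema as commonly documented, but they should be checked against the Litmus docs for the installed version. The resource name, namespace, label and service account are hypothetical values.

```python
import json


def pod_delete_engine(namespace: str, app_label: str, duration_s: int = 30) -> dict:
    """Build a ChaosEngine manifest that runs pod-delete against one Deployment.

    All concrete names below are placeholders for illustration.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "checkout-pod-delete", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which application the fault is injected into.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Service account with permission to run the experiment (assumed name).
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Tunables are passed as strings.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_delete_engine("shop", "app=checkout"), indent=2))
```

Because the experiment is just another Kubernetes resource, it can be versioned, reviewed and applied alongside the application's own manifests.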
Chaos Engineering in CI
Cloud native CI pipelines place additional requirements on QA teams. They need to test applications against the behavior of the cloud native platform, such as Kubernetes, which can present faults at the pod, node and service level. In a chaos-integrated CI pipeline, chaos experiments are designed to cover the fault scenarios of Kubernetes and other cloud native stack components in addition to application-specific faults.
Apart from being able to execute chaos experiments easily for different scenarios, quality assurance teams can measure the application's resilience metrics from build to build. Litmus is one such tool: it makes it easy to compare resilience metrics across chaos runs against different versions of the system under test.
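A build-to-build comparison can be sketched as follows. The scoring rule here, a weighted average of per-experiment success percentages, is an assumption of this example and not a statement of how Litmus computes its metrics. The ExperimentResult type, the weights and the tolerance are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExperimentResult:
    name: str
    weight: int         # relative importance, chosen by the team
    success_pct: float  # share of checks that passed, from 0 to 100


def resilience_score(results: Sequence[ExperimentResult]) -> float:
    """Weighted average of success percentages (assumed scoring rule)."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one experiment must carry weight")
    return sum(r.weight * r.success_pct for r in results) / total_weight


def regressed(previous: float, current: float, tolerance: float = 2.0) -> bool:
    """True when the score dropped by more than `tolerance` points."""
    return current < previous - tolerance


# Two hypothetical builds run through the same pair of experiments.
previous_build = [
    ExperimentResult("pod-delete", weight=10, success_pct=100.0),
    ExperimentResult("node-drain", weight=5, success_pct=100.0),
]
current_build = [
    ExperimentResult("pod-delete", weight=10, success_pct=100.0),
    ExperimentResult("node-drain", weight=5, success_pct=40.0),
]

if __name__ == "__main__":
    before = resilience_score(previous_build)
    after = resilience_score(current_build)
    print(before, after, regressed(before, after))
```

A CI job could fail the build when regressed returns True, which turns a drop in resilience into the same kind of signal as a failing unit test.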
A few examples of how Litmus chaos experiments can be easily invoked from a CI platform such as GitLab are provided here. Litmus users also invoke chaos experiments from other CI platforms, such as GitHub Actions and Jenkins.
Chaos Engineering in CD
Modern CD platforms are driven by the verification of service-level objectives (SLOs). SLOs are designed to validate whether the functionality, stability and resilience of the service are as expected. In a CD pipeline, chaos experiments are selected to inject simple and multilevel faults, and the SLOs are measured while the experiments run. Often the result is used as a gating mechanism for deployment to subsequent environments such as pre-prod and production. It is also common practice to execute chaos experiments immediately after applications are upgraded in the target environment to validate the service’s continued resilience. The Litmus project has integrations with a few CD platforms, such as Keptn, Spinnaker and GitLab. Litmus also has CD integrations through GitOps for ArgoCD and FluxCD.
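The gating idea can be illustrated with a short sketch: SLO values observed while a chaos experiment runs are compared against their thresholds, and promotion is allowed only if none is violated. The Slo type, the gate function, the metric names and the thresholds are all hypothetical. A real pipeline would pull the observed values from its monitoring system.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Slo:
    name: str
    threshold: float
    higher_is_better: bool

    def met_by(self, observed: float) -> bool:
        """Check one observed value against this objective."""
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


def gate(slos: Sequence[Slo], observed: Dict[str, float]) -> List[str]:
    """Return the names of violated SLOs; an empty list means promote.

    A metric missing from `observed` counts as a violation, so the
    gate fails closed when monitoring data is unavailable.
    """
    violations = []
    for slo in slos:
        value = observed.get(slo.name)
        if value is None or not slo.met_by(value):
            violations.append(slo.name)
    return violations


# Example objectives (assumed values) measured during the chaos window.
SLOS = [
    Slo("availability_pct", threshold=99.5, higher_is_better=True),
    Slo("p95_latency_ms", threshold=300.0, higher_is_better=False),
]

if __name__ == "__main__":
    measured = {"availability_pct": 99.7, "p95_latency_ms": 450.0}
    failed = gate(SLOS, measured)
    print("promote" if not failed else f"block: {failed}")
```

Failing closed on missing data is a design choice of this sketch: if the chaos run disrupts monitoring itself, the deployment is held back rather than promoted on incomplete evidence.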
In summary, chaos engineering and tools such as Litmus can be used in dev environments, CI pipelines and CD pipelines to continuously verify the resilience of an application, a set of applications or a service. This builds confidence in DevOps and prevents complex and expensive bugs from leaking into production.
Chaos Engineering Made Easy
LitmusChaos is simple to deploy on your Kubernetes cluster or namespace. Litmus is hosted at ChaosNative Litmus Cloud as SaaS and comes with a forever free plan. Sign up at litmuschaos.cloud for a super-quick start on chaos engineering, or follow the Litmus docs to install Litmus on your own Kubernetes cluster.
Chaos Engineering Community Conference
ChaosNative hosts an annual chaos engineering conference called ChaosCarnival where global experts, enthusiasts and practitioners share their experiences, best practices and success stories. Register for free to get updates on the conference.