CI Observability for Effective Change Management
Per the Theory of Constraints, every system has a limiting factor standing in the way of systematic progress. This is more profound than ever when thinking about software development pipelines. We can no longer see our development practices as part of separate domains. Instead, we need a more holistic approach. Every step in releasing software to production is interlinked, and this means an expanded range of constraints.
Moreover, with the iconic shift left that the industry is experiencing, the importance of development and deployment is increasing. Adding the rise of DevOps to the mix, it becomes clear that continuous integration/continuous deployment (CI/CD) especially is becoming more prominent in our discussions. This is because CI/CD is where we transition from dev to ops.
Hence, constraints at this stage are often highly disruptive to the entire development flow, affecting practices that would otherwise be considered independent of the CI/CD stage. One such practice is change management, a concept often overlooked or considered too much of an impediment to adopt, but one that proves highly effective in the development life cycle.
Therefore, this piece introduces change management and connects the CI/CD stage of development to an ever-left-shifting world of development. I will also introduce flaky tests, a major concern in the CI/CD stage, and explain how this constraint acts as a bottleneck undermining the change management process, which is already a bottleneck in the overall product’s development.
Change Management, Friend or Foe?
Change management is the process whereby a human-side change to a system is ideated, scheduled, verified and pushed to production. The goal of the process is mitigating potential disruptions while making systemic changes. In contrast, incident management aims to mitigate the impact of disruptions to production.
Generally, this process is centered around change requests, which capture the desired changes to a system. These changes are usually then scheduled and reviewed with the risk of the change assessed. Once the change request is approved, it is then deployed according to schedule. After deployment, the change’s impact is assessed, and rollbacks are initiated if necessary.
Organizations often overlook change management best practices. This is because change management is abstract and often believed to slow down deployment in production. As a result, much of change management is watered down to basic pull request approvals where the code itself is checked without considering the impact of the change holistically.
In fact, at the time of writing this article, major companies such as British Airways and HSBC experienced outages of their websites due to an erroneous config change by Akamai Technologies, which they were using for content delivery network services. With a more concrete and proactive change management practice, this could have been avoided.
However, it is understandable why companies are not too keen on full-fledged change management practices and why relying on basic PR approvals is considered enough. Traditionally, this practice is quite manual and requires considerable involvement from stakeholders in assessing whether a change should go ahead or not.
As a result, the value of change management is often dismissed in favor of increasing overall velocity, while leaning on incident management capabilities to deal with potential disruptions. However, when the system and its performance are considered holistically, this thought process is counterproductive to the overall goal.
There are new solutions coming into the market to solve some of the pain points, but these solutions are limited to the change management domain. To understand the full range of possible constraints, we must consider the entirety of the system, especially the CI/CD stage that feeds into the change request object for software changes.
CI/CD and Change Management
As discussed above, the most basic version of change management can be stripped down to approving pull requests. From this, we can see that CI/CD plays a vital role in the concept of change requests. There are solutions such as Atlassian’s JSM or ServiceNow that aim to empower the change management process. However, as seen from the current state of the industry, companies would rather do away with the entire process for the benefit of velocity.
Of course, it is not wise to completely dismiss change management, but maybe we can simplify the process and mitigate any adverse impact. This is where we must look toward CI/CD, inadvertently jumping on the shift-left train.
The ideal case is that developers working on the code base create a pull request that is then approved and added into the master code base, resulting in a change in the overall system. The question is then, what conditions does a PR need to satisfy for approval and execution?
Often, approvers are looking at code sanity and validating whether this change will result in disruption. This mainly concerns the impact on direct services and dependent services. We can also consider stylistic errors and programming formats.
Certainly, there are other aspects that approvers can consider, such as business impact within a certain timeframe, rollback effectiveness and others. However, the core checks of potential disruptions and code compliance can be incorporated into the CI phase through various testing and linting tools.
By ensuring rigorous unit tests, integration tests and end-to-end tests along with any linting checks as required, we can use the CI phase to automate much of the core of change management. This is something that we already see being executed when talking about GitOps.
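As a rough illustration of how those core checks could stand in for a manual approval, the sketch below treats a change as approvable only when every required CI check has run and passed. The check names, the `CheckResult` type and the `evaluate_change` function are hypothetical and not taken from any particular CI tool; a real pipeline would read these results from its own reporting format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str      # e.g. "unit", "integration", "e2e", "lint" (assumed names)
    passed: bool


# Hypothetical policy: every one of these checks must be present and green.
REQUIRED_CHECKS = ("unit", "integration", "e2e", "lint")


def evaluate_change(results: list[CheckResult]) -> tuple[bool, list[str]]:
    """Return (approved, reasons). Missing or failed checks block the change."""
    by_name = {r.name: r.passed for r in results}
    reasons = []
    for check in REQUIRED_CHECKS:
        if check not in by_name:
            reasons.append(f"{check}: not run")
        elif not by_name[check]:
            reasons.append(f"{check}: failed")
    return (not reasons, reasons)


if __name__ == "__main__":
    approved, reasons = evaluate_change(
        [CheckResult("unit", True), CheckResult("lint", False)]
    )
    print(approved, reasons)
```

Treating a check that never ran the same as a failed one is a deliberate choice here: a gate that silently passes when a test suite is skipped would undermine the confidence the gate is meant to provide.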
Admittedly, testing in the CI phase is not an easy task. One of the main barriers to overcome is the culture of writing tests. Over time, however, this culture can be incorporated into any organization. With the right set of tools, test culture within the CI phase can be facilitated. Another factor, however, requires more than cultural change: confidence in CI tests and builds. If there is a lack of confidence in what is created by the CI phase, then tedious manual change management becomes necessary.
This confidence is mostly affected by test coverage and the tests themselves. Test coverage is primarily related to culture. On the other hand, tests and their success or failure to communicate potential harm depend on those who write tests and how the CI tool executes tests, and this is where a major problem arises: the phenomenon of flaky tests.
If we no longer know why a test succeeds or fails, then confidence in the test wanes. As a result, in the desire for increased velocity, tests that actually fail may be dismissed as flaky tests. Overall, this would reduce confidence in the CI stage and pass responsibility over to change approvers. This then results in a tedious change management process, something that we were hoping to avoid in leveraging the CI stage.
The solution to this issue is therefore CI observability.
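To make the flaky-test problem concrete, here is a minimal sketch of one signal that run history can provide. It assumes a simple working definition: a test is flaky if the same commit has produced both a pass and a fail for it, whereas a test that only ever fails on some commit is treated as a genuine failure. The `RunRecord` type, the `classify_tests` function and the labels are hypothetical, and real tools may use richer heuristics.

```python
from collections import defaultdict
from typing import NamedTuple


class RunRecord(NamedTuple):
    test: str     # test identifier
    commit: str   # commit SHA the run executed against
    passed: bool


def classify_tests(runs):
    """Label each test 'stable', 'failing' or 'flaky' from its run history.

    Assumed definition: mixed outcomes on one commit mean 'flaky';
    only failures on some commit mean 'failing'; otherwise 'stable'.
    """
    outcomes = defaultdict(set)  # (test, commit) -> set of observed results
    for run in runs:
        outcomes[(run.test, run.commit)].add(run.passed)

    labels = {}
    for (test, _commit), seen in outcomes.items():
        if len(seen) == 2:
            labels[test] = "flaky"        # same code, different results
        elif seen == {False} and labels.get(test) != "flaky":
            labels[test] = "failing"      # consistent failure on this commit
        else:
            labels.setdefault(test, "stable")
    return labels


if __name__ == "__main__":
    history = [
        RunRecord("test_login", "abc", True),
        RunRecord("test_login", "abc", False),
        RunRecord("test_pay", "abc", False),
    ]
    print(classify_tests(history))
```

Separating the two labels matters for the argument above: once a consistent failure can be told apart from an inconsistent one, there is less temptation to dismiss a real failure as flakiness.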
CI Observability for Better Change Management
In an attempt to ditch conventional and strenuous change management practices, relying on CI tests is crucial, but as discussed above, we need to ensure confidence in the way we perform these tests. CI observability, a process not new but borrowed from traditional observability and monitoring of applications, aims to provide insights into the black box environment of how tests perform.
By tracking various metrics, such as quality and time-based metrics, and by leveraging metrics, traces and logs in testing and debugging scenarios, we can effectively do away with the woes of traditional CI.
With these metrics in place, the major benefits are:
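As a small example of what quality and time-based metrics might look like in practice, the sketch below summarizes failure rate and duration per CI job from a list of run records. The `JobRun` type, the `summarize` function and the chosen metric names are assumptions for illustration; an actual observability tool would typically gather this data from the CI provider and expose far more detail.

```python
from statistics import mean
from typing import NamedTuple


class JobRun(NamedTuple):
    job: str           # CI job name, e.g. "build" or "e2e" (assumed names)
    duration_s: float  # wall-clock duration in seconds
    passed: bool


def summarize(runs):
    """Return per-job quality (failure rate) and time-based (duration) metrics."""
    grouped = {}
    for run in runs:
        grouped.setdefault(run.job, []).append(run)

    summary = {}
    for job, items in grouped.items():
        durations = [r.duration_s for r in items]
        failures = sum(1 for r in items if not r.passed)
        summary[job] = {
            "runs": len(items),
            "failure_rate": failures / len(items),
            "mean_duration_s": mean(durations),
            "max_duration_s": max(durations),
        }
    return summary


if __name__ == "__main__":
    data = [
        JobRun("build", 100.0, True),
        JobRun("build", 140.0, False),
        JobRun("e2e", 600.0, True),
    ]
    for job, stats in summarize(data).items():
        print(job, stats)
```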
- Building trust in the CI/CD stage across teams with metrics that provide a ground reality status and understanding.
- Providing insights crucial to the resolution of failed and flaky tests.
- Reducing the risk of incidents and disruptions in production due to an added layer of debugging.
- Building resilience in the CI/CD stage and overall DevOps pipeline.
By incorporating CI observability practices, teams can restore the required confidence and reliability in their CI processes, allowing organizations to leverage much of the CI/CD stage for change management purposes and reducing the need for a separate change management process to validate software changes.
Change management is a crucial process within any organization to ensure better availability. However, traditional change management practices are often seen as impeding velocity and are therefore overlooked.
Yet if we consider the core of a change request associated with code or config changes, we note that a lot of what needs to be validated in the change management process can be done within the CI/CD stage. By writing an encompassing set of tests and setting a proactive testing culture, automated CI tests can take away much of the pain of validating change.
Nevertheless, issues such as flaky tests hinder the reliability of CI tests. This is where CI observability needs to be introduced as a solution.
To read more about CI observability check out Thundra CTO Serkan Ozal’s blog post on the subject.