Dynatrace sponsored this post.
“To release or not to release?” is the release manager’s daily business. This is the question that drives many who develop or operate software, and is directly tied to the risk of deploying new software versions. It’s something I personally deal with on a daily basis working in the field of release management, as a product owner setting up continuous delivery, implementing DevOps paradigms and enabling other companies to become more agile.
Release managers have the hard job of making decisions on often incomplete data. Depending on the organization, multiple roles (such as lead software engineers, infrastructure operators, and sometimes the company's legal authorities) take part in the decision to release or not, which can involve tens or even hundreds of people.
And these roles are all changing now, with continuous automation. There’s no time to manually gather all the facts necessary for decision making. Organizations that have transitioned into agile software development, DevOps, continuous delivery, or test automation, all enforce automation of any decision making — or at the very least require automation in providing all the facts about whether or not to push the release button. The automation of release processes enables and leads to increased release frequencies and progressive delivery strategies, with multiple versions running in parallel — both in pre-production and production.
As release cycles accelerate, organizations need to automate as much as possible to compete in this fast-changing, dynamic environment. According to a Cloud Native Computing Foundation survey released this year, the number of daily, weekly and ad hoc releases has increased dramatically.
For most organizations, the software product lifecycle involves toolchains that often exceed 10 different tools contributing to automation. It's not just tool spread and automation that have increased; so has the number of approvals and decision-making processes for each release. Manually collecting the data for these decisions becomes the bottleneck and Achilles' heel of release automation and automated software lifecycles.
Automating decisions based on release risk is not just a matter of an automation engineer implementing rule sets and doing some scripting. It also requires domain knowledge: what services the software provides, what legal service level agreements (SLAs) are in place, and how a software release could impact those SLAs.
The Three Steps of Software Release Awareness and Impact Analysis
Multiple questions arise when the risk of a new software release is evaluated and there are efforts to make this process more transparent and measurable. The following is a typical checklist I use as a Release Manager:
Step 1: Do We Have a New Release? Has It Passed Staging?
- What new versions do we currently have in the pipeline?
- How far are we with specific versions within our delivery process?
- What is the changelog for the new release?
- What known bugs can we expect?
- Are we safe in terms of test results and software quality, or do we have any blockers?
Step 2: What Is the Status of the Software Currently Operated in Production?
- How much availability do current versions provide in production?
- What is the performance of current versions in production?
- How many resources do current versions consume in production?
- Do we have any release and rollout in progress in production; i.e., any load currently redirected from a previous version to a new version?
- How does the new version behave with regards to availability, performance, and resource consumption?
Step 3: What Could Be the Impact of the Release?
The above questions need to be answered prior to enabling any impact analysis. Answers about new releases and current production status help inform the following:
- What impact will the new release have on resource consumption?
- What impact will the new release have on performance?
- What impact will the new release have on availability in general?
- From a marketing perspective, could this release negatively impact our brand?
Defining the impact of a release and attempting to quantify it to ensure fair and accurate decision making is a full-time job for the release manager. Often, the pressures to release result in the release manager making a “release/don’t release” decision based on incomplete data, or under the duress of business conditions outside their control.
Defining and Evaluating SLOs for Production Monitoring
Managing the risk of releasing new software versions is closely tied to the reliability of the versions currently in production. Many operational perspectives on software service reliability are covered by a site reliability engineering (SRE) resource published by Google, which describes service level indicators (SLIs), the metrics that measure a service, and service level objectives (SLOs), the target thresholds set on those metrics. SLIs and SLOs provide the basis for SLAs.
For the scenario of any software release, the questions arise:
- What is the risk of violating SLAs with any customers because of SLO failures?
- What is the current status of any SLOs in production?
- How many errors, failed requests, slow loading times or downtime minutes (or “error budget”) do I have left before I violate SLOs?
- How does a new software version behave regarding my SLIs / SLOs?
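The error budget question above comes down to simple arithmetic. The following is a minimal sketch, assuming an availability-style SLO measured as the share of successful requests over one evaluation window; the class, function names and sample numbers are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed for one SLO over its evaluation window."""
    slo_target: float      # assumed example: 0.995 means 99.5% of requests must succeed
    total_requests: int
    failed_requests: int


def allowed_failures(window: SloWindow) -> float:
    """Error budget: the number of failures the SLO tolerates for this traffic."""
    return (1.0 - window.slo_target) * window.total_requests


def remaining_budget_fraction(window: SloWindow) -> float:
    """Share of the error budget still unspent (negative once the SLO is violated)."""
    budget = allowed_failures(window)
    if budget == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if window.failed_requests == 0 else float("-inf")
    return (budget - window.failed_requests) / budget


# Hypothetical sample numbers for one evaluation window.
window = SloWindow(slo_target=0.995, total_requests=1_000_000, failed_requests=3_500)
print(f"allowed failures: {allowed_failures(window):.0f}")
print(f"budget remaining: {remaining_budget_fraction(window):.0%}")
```

With these sample numbers, the SLO tolerates about 5,000 failed requests, of which 3,500 are already spent, so roughly 30% of the budget remains for the next release.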
The evaluation of SLOs is not restricted to production and can also be applied for quality gates in pre-production or any rollout scenarios. Keptn, the open-source project for release automation, already provides automated quality gates based on SLO definitions — generating metrics used for SLOs and evaluated against any target objectives.
Evaluating SLOs on New Software in Production Before It Reaches Customers
Before deciding whether to release, it is essential to have every question about the SLOs in production answered, along with data on how the new software release behaves against the defined SLOs. The frequency of releases also matters, because each additional release increases the work needed to gather that data manually.
In the changing world of DevOps and continuous delivery, release management and the role of the Release Manager can become cumbersome and a bottleneck within software product lifecycles if automation is not also applied to data gathering and to providing answers and recommendations for decision making. Evaluating the risk of a new software release, as the basis for any release decision, involves combining the current status of SLOs in production with the potential impact of the release on SLO results. Knowing the status, content and progress of new software releases, the SLO status for production, and the SLO evaluation results for the new software versions, all without manual effort for each release, makes the job of the Release Manager easier.
For example, if there is no error budget left in production, an easy next step is for operations teams to define backouts of any further release, which of course can be automated. However, even with a bit of error budget left for production, a release should also be stopped automatically if the new version behaves worse in SLO evaluations (for example, with a relative decrease in performance compared to the version currently running in production, combined with minimal error budget left).
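The two rules just described can be sketched as a small automated gate. This is an illustration under stated assumptions, not a definitive implementation: the function name, the use of 95th percentile response time as the performance measure, and the thresholds `min_budget` and `max_regression` are all hypothetical choices.

```python
from enum import Enum


class Decision(Enum):
    RELEASE = "release"
    STOP = "stop"


def release_decision(
    budget_remaining: float,      # fraction of error budget left in production
    prod_p95_ms: float,           # response time of the version in production
    candidate_p95_ms: float,      # response time of the new version
    min_budget: float = 0.2,      # hypothetical: below this, budget counts as "minimal"
    max_regression: float = 0.10, # hypothetical: tolerated relative slowdown
) -> Decision:
    """Apply the two stop rules; anything else is allowed to proceed."""
    if prod_p95_ms <= 0:
        raise ValueError("production baseline must be positive")

    # Rule 1: no error budget left, so every further release is stopped.
    if budget_remaining <= 0:
        return Decision.STOP

    # Rule 2: minimal budget left and the new version is noticeably slower.
    regression = (candidate_p95_ms - prod_p95_ms) / prod_p95_ms
    if budget_remaining < min_budget and regression > max_regression:
        return Decision.STOP

    return Decision.RELEASE


print(release_decision(0.0, 400.0, 380.0))   # budget exhausted
print(release_decision(0.1, 400.0, 480.0))   # little budget, 20% slower
print(release_decision(0.6, 400.0, 480.0))   # plenty of budget left
```

In practice the thresholds would come from the team's own SLO definitions, and a real gate would likely look at more than one indicator.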
New software versions staged in a release process need to be evaluated together with the results of test runs that have been checked against the SLOs. SLO definitions given for production (for example, 95% of specific service requests need to return within 600 ms) can be evaluated during testing phases, where degradations in performance can be detected early.
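As a minimal sketch of such a check, the example SLO (95% of requests within 600 ms) can be evaluated directly against the response times recorded during a test run. The function names and the sample latencies are hypothetical.

```python
from typing import Sequence


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Share of requests that returned within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests recorded in this test run")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def meets_latency_slo(
    latencies_ms: Sequence[float],
    threshold_ms: float = 600.0,    # from the example SLO in the text
    target_fraction: float = 0.95,  # from the example SLO in the text
) -> bool:
    """True if enough requests in the test run stayed within the threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target_fraction


# Hypothetical test runs of 40 requests each.
good_run = [200.0] * 39 + [750.0]        # 1 of 40 requests too slow
bad_run = [200.0] * 36 + [750.0] * 4     # 4 of 40 requests too slow

print(meets_latency_slo(good_run))  # 97.5% within 600 ms
print(meets_latency_slo(bad_run))   # 90% within 600 ms
```

Running the same check in staging that is used in production means a degradation shows up as a failed gate rather than as spent error budget.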
Thus, SLO violations can be detected even before software reaches production and customers, providing SLO violation root-cause analysis for new software versions. Establishing monitoring definitions in pre-production enables a safe and sound way of moving releases from staging to production, with minimal risk of negative impact on any SLOs.
The Cloud Native Computing Foundation is a sponsor of The New Stack.