It’s no surprise continuous delivery has become a popular theme in software engineering and operations — that is, the ability to deliver software updates in hours rather than months. But actually making it happen? That’s another story.
Most teams today realize that manual approaches don’t cut it — the scale of modern software development is too large and the stakes are too high. So, they look to automate software delivery.
This is where most teams get stuck because they look to automate software delivery with scripting. That’s a mistake. They get trapped in a mindset where “DevOps” means “Build it all yourself, on your own, with no help from anyone.”
But that’s not actually what DevOps means. DevOps is about the speed of delivery — getting to high velocity and staying there — so you can get code to market faster than your competitors.
That makes scripting the wrong way to automate: it is too time-intensive and manual. Saying that scripting equals automation is like saying you drive a sports car when in reality you’re just power walking.
Scripts don’t scale because there’s way more to continuous delivery than spinning up a few compute nodes and copying files. Once you create artifacts from code, you have all of these other dependencies to manage, like secrets, environments, release strategies — not to mention verification and rollback if something fails. Throw in governance, audit trails, access control, approvals, reporting, and before you know it, there’s far too much complexity for scripts to handle reliably.
In today’s cloud-native landscape, the worst thing you can do is build a shaky house of cards when what’s needed is true automation.
At Harness, we find that unsupervised machine learning in highly structured environments is the best way to automate away the risks and complexities of continuous software delivery.
As an example, consider virtually any large brand-name bank in business today. These organizations are no longer simply financial institutions; they’re also software factories. If they have 300 developers creating code, they’re also likely to have 100 DevOps engineers who spend most of their day overseeing and supervising production releases. That’s 100 smart, expensive engineers who should instead be spending their time on more strategic initiatives. No bank wants to have so many people monitoring each release for performance spikes, ready and waiting to initiate a rollback — least of all when machine learning can automate these critical, yet tedious and error-prone tasks.
To illustrate how this approach to automation works, consider the process of verifying production deployments, something that typically takes a great deal of manual effort. First, you want to set up the environment with preferred data sources, connectors, and webhooks. In most cases, all of the data required to verify the deployment already exists in various toolsets:
- Application performance monitoring (APM): AppDynamics, New Relic, Dynatrace
- Infrastructure monitoring: Datadog, CloudWatch and Nagios
- Log monitoring: Splunk, Scalyr, ELK and Sumo Logic
- Artificial intelligence for IT operations (AIOps)/IT operations analytics (ITOA): Moogsoft and BigPanda
- Synthetics: Selenium
Next, you build connectors and webhooks to integrate with some combination of the above toolsets and observe the application data, metrics, and KPIs surrounding every deployment. To help illustrate, here’s an actual deployment pipeline in Harness:
Finally, it’s time to apply unsupervised machine learning to automate the process of analyzing time-series metrics and event data from these sources. Leveraging unsupervised machine learning is important: to truly perfect the continuous verification process, the algorithms must be constantly learning from the application’s baseline performance, both in terms of how the app is “supposed” to be running and what happens to performance after each release. There’s simply too much complexity involved for a human to manually input all of the variables. What you want is the algorithms learning on their own and getting better every time they do the work.
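To make the idea of learning a baseline concrete, here is a minimal sketch in Python. It is not how Harness implements this: it stands in a simple mean-and-standard-deviation check for real machine learning, and the class name `MetricBaseline`, the threshold of 3.0, and the sample response times are all assumptions made up for illustration.

```python
from statistics import mean, stdev


class MetricBaseline:
    """Learns what "normal" looks like for one metric from past observations.

    Hypothetical helper for illustration only; a plain z-score check
    stands in for a real unsupervised learning algorithm.
    """

    def __init__(self, threshold=3.0):
        self.samples = []
        self.threshold = threshold  # z-score above which a value is flagged

    def learn(self, values):
        """Fold newly observed healthy values into the baseline."""
        self.samples.extend(values)

    def is_anomalous(self, value):
        """Return True if value deviates strongly from the learned baseline."""
        if len(self.samples) < 2:
            return False  # not enough history to judge
        mu = mean(self.samples)
        sigma = stdev(self.samples)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold


# Made-up response times (ms) observed before the release.
baseline = MetricBaseline()
baseline.learn([120, 118, 125, 122, 119, 121, 123])

# Made-up response times (ms) observed after the release.
post_release = [121, 124, 310]
flags = [baseline.is_anomalous(v) for v in post_release]
print(flags)
```

Each healthy release can be fed back through `learn`, so the baseline keeps adjusting without anyone entering thresholds per metric by hand.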
Ideally, the algorithms soon have the ability to automatically verify deployments and quickly identify any regressions, anomalies, or failures that may have been introduced. Furthermore, if they identify unacceptable performance spikes occurring after a release, the algorithms can initiate a rollback and revert the application to the version that existed prior to the latest release.
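The verify-then-roll-back decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Harness implementation: the function names, the 25% regression tolerance, the version strings, and the metric values are all hypothetical.

```python
from statistics import mean


def verify_release(baseline, observed, max_regression=0.25):
    """Return the metrics whose post-release mean exceeds the baseline mean
    by more than max_regression (0.25 = 25%, an arbitrary tolerance)."""
    regressions = []
    for name, base_values in baseline.items():
        base = mean(base_values)
        now = mean(observed[name])
        if base > 0 and (now - base) / base > max_regression:
            regressions.append(name)
    return regressions


def deploy(new_version, previous_version, baseline, observed):
    """Keep the new version if verification passes; otherwise revert."""
    regressions = verify_release(baseline, observed)
    if regressions:
        return previous_version, regressions  # roll back
    return new_version, []


# Made-up metrics gathered before and after a release.
baseline = {"response_ms": [120, 122, 119], "error_rate": [0.010, 0.012, 0.011]}
observed = {"response_ms": [121, 123, 125], "error_rate": [0.050, 0.060, 0.055]}

active, failed = deploy("v2.4.0", "v2.3.9", baseline, observed)
print(active, failed)
```

In this example the response time is within tolerance but the error rate has roughly quintupled, so the sketch reverts to the previous version and reports which metric triggered the rollback.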
How Build.com Actually Did This
Build.com is a large-scale online retailer based in Chico, California, and the second-largest seller of home improvement products after Home Depot. While not operating at the scale of a large bank, its engineering group nonetheless used to devote a team of seven senior DevOps engineers to carefully oversee each production release. It was critical that they improve and enhance their app without disrupting their customers’ ability to shop online, since even one minute of downtime could potentially mean losing hundreds of thousands of dollars in sales.
The team embarked on a continuous delivery project that included production deployment verification and automated rollbacks. They used a monitoring tool (New Relic) to help their platform identify anomalies in time-series metrics, and a logging tool (Sumo Logic) to identify “custom events”: unwanted deviations from standard performance that could occur after each release.
The grey dots in the chart below represent “baseline events” or “clusters”: events that the machine learning algorithms have learned over time and classified as “normal” because they are observed frequently during deployments. The red dots represent unknown events or events that have an unexpected frequency. In other words, they are issues that can cause serious problems during a production deployment.
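The distinction between baseline events, unknown events, and events with an unexpected frequency can be sketched as below. This is a simplified stand-in, not Build.com’s or Harness’s clustering algorithm: it groups log lines by replacing numbers with a placeholder, and the function names, the `max_ratio` of 3.0, and the log lines themselves are invented for illustration.

```python
import re
from collections import Counter


def template(message):
    """Collapse variable parts (numbers) so similar log lines group together."""
    return re.sub(r"\d+", "<n>", message)


def classify_events(baseline_logs, release_logs, max_ratio=3.0):
    """Label each event type seen during a release.

    "baseline": seen before, at a comparable frequency
    "unexpected frequency": seen before, but more than max_ratio times as often
    "unknown": never seen in the baseline
    """
    known = Counter(template(m) for m in baseline_logs)
    seen = Counter(template(m) for m in release_logs)
    labels = {}
    for event, count in seen.items():
        if event not in known:
            labels[event] = "unknown"
        elif count > known[event] * max_ratio:
            labels[event] = "unexpected frequency"
        else:
            labels[event] = "baseline"
    return labels


# Made-up log lines from earlier healthy deployments.
baseline_logs = [
    "GET /cart 200 in 12ms",
    "GET /cart 200 in 15ms",
    "cache miss for key 42",
]

# Made-up log lines captured during the new release.
release_logs = [
    "GET /cart 200 in 13ms",
    "cache miss for key 7",
    "cache miss for key 8",
    "cache miss for key 9",
    "cache miss for key 10",
    "NullPointerException at line 88",
]

labels = classify_events(baseline_logs, release_logs)
for event, label in labels.items():
    print(f"{label}: {event}")
```

In the chart’s terms, events labelled "baseline" would be the grey dots, while "unknown" and "unexpected frequency" events would be the red ones.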
During the verification process, the algorithms start to analyze, compare and flag anomalies/regressions from the thousands of log entries and time-series metrics that both tools capture from their application. Within seconds of detecting these regressions (represented above by the red dots), Build.com is able to perform a “Smart Rollback,” taking the service/application back to the last working version (artifact & run-time configuration).
What kind of results is Build.com seeing from automating verification and rollbacks?
- Rollback time at Build.com has decreased from 32 minutes (the time it took using engineers and custom scripts) to 32 seconds.
- With confidence that the scary and unpredictable parts of the release process have been carefully automated, Build.com now devotes a single junior engineer to each release instead of seven senior ones.
- The company estimates that it saves 750 team-lead hours per year.
This is true automation. The guesswork is removed, the manual work of verifying production deployments is off the table, and the engineering team can save a mountain of people-hours a year.
With the pressure to produce more and ship faster, automation is a key goal for any DevOps team, and that means getting away from the mountain of scripts for verifying production deployments and rolling back failures. Scripts will always have their place, but only for the small, less critical application functions, not the ones that make the difference between a successful deployment and a late-night war room full of blood and tears.