The Rise of Progressive Delivery for Systems Resilience
In the complex world of distributed systems, what separates the elite performers from the rest? They deploy all the time, yet they don’t break. These are the Netflixes and Expedias of the world that successfully ship thousands of deploys a day without user disruption. What do they have in common?
Certain practices are shared by the few, the proud: the teams that move this fast yet still don’t stop working. Each company has its own mix of chaos experiments, canaries and carefully staged rollouts that keep continuous delivery from degrading the customer experience.
Today we dive into what progressive delivery means, who is already mastering it, and how to adapt it to your business.
Progressive Delivery: Delivering New Value in Progression
Progressive delivery is the next step after you’ve shifted testing left, automated load testing and deployment, and committed to DevOps and CI/CD (continuous integration and continuous delivery or deployment) — or, ideally, it’s a part of that journey.
RedMonk co-founder James Governor says CI/CD is the onramp to everything good in modern software development, but argues that some of the associated disciplines of the early pioneers haven’t gotten the attention they deserve. With sophisticated service routing, it becomes easier to adopt experimentation-first approaches like canarying, blue-green deployments and A/B testing, which slow the ripple effect of a new service rollout.
Progressive delivery routes a new version to a specific subset of users before deploying it more broadly, so that testing in production doesn’t have to be a massive risk.
For Governor, progressive delivery is really progressive experimentation that spreads until it reaches the entire user base without — or hopefully without — a degradation of user experience.
“Progressive delivery is continuous delivery with fine-grained control over the blast radius.” — James Governor, RedMonk
He continued that the building blocks of progressive delivery are:
- User segmentation
- Traffic management
Instead of rushing to deliver continuously, progressive delivery takes a small step back, releasing to a smaller portion of users in a way that increases the quality of the broader release.
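Those two building blocks can be sketched together in a few lines: a stable hash segments users into buckets, and the rollout percentage acts as the traffic dial. This is a minimal illustrative sketch, not any particular vendor’s implementation; the function name and the 5%/50% figures are invented.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Place a user in the first `percent` of a progressive rollout.

    Hashing the user id with the feature name gives each user a stable
    bucket in [0, 100), so the same user keeps seeing the same variant
    as the rollout widens from a small cohort toward 100%.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000 * 100  # stable value in [0, 100)
    return bucket < percent

# Widening the rollout keeps the early cohort included: every user in the
# 5% group is also in the 50% group, so nobody flips back and forth.
early = {u for u in map(str, range(1000)) if in_rollout(u, "new-ui", 5)}
wider = {u for u in map(str, range(1000)) if in_rollout(u, "new-ui", 50)}
assert early <= wider
```

Because the bucketing is deterministic, dialing the percentage up is a pure superset operation — which is exactly what makes the blast radius controllable.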
CloudBees’ Principal Software Engineer Carlos Sanchez describes progressive delivery as the next step after continuous delivery, “where new versions are deployed to a subset of users and are evaluated in terms of correctness and performance before rolling them to the totality of users” or rolling it back when it fails to meet key metrics.
A recent study by Subbu Allamaraju of Expedia showed that about two-thirds of their outages happen when something is changed. So how do you test in production with only a small number of people?
The following trends are under the umbrella of progressive delivery:
- Canary testing* — Release a change to a small subset of users, so that only a small percentage of your user base is affected if you release a bug.
- Blue-green deployments — You maintain two production environments that stay as identical as possible, but only one is live at a time. You roll a new or updated service out to the idle environment and then shift traffic from one to the other, allowing you to test things in production while being able to quickly roll back if something goes wrong.
- A/B testing — A high school science experiment meets marketing: you test a variant of the user interface on two different groups of users and see which performs better. This is good for testing the effect of a change on usability, or anything else that can affect your conversion rate.
- Feature toggling — Also called feature flags, feature toggles allow you to hide, enable or disable a smaller feature at run time, hidden (“toggled”) away from the user interface.
- Service meshing — A service mesh sits between containers and services and remembers what worked last, which is helpful for retrying a failed call or reverting to the last available response. It allows you to canary-route new functions to specific user bases and to perform failover, orchestrating services with advanced service routing and traffic shifting.
- Observability — Governor says observability combines tracing, logging and metrics, which allows developers to build new services with a strong view into how they will be managed in production. He offered Honeycomb as an example of a company demonstrating the state of the art in system polling, metrics and problem resolution.
- Chaos engineering — The art of embracing failure: you define what a normal state for your distributed system is and then throw the kitchen sink at it through empirical resilience testing.
Traffic shadowing can also be lumped into this set of practices, though this act of asynchronously copying production traffic into a non-production service for testing purposes is, by definition, not actually testing in production.
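To make the feature-toggle idea above concrete, here is a minimal sketch. The flag name, the in-memory store, the allowlist and the percentage are all invented for illustration; products like LaunchDarkly provide the managed version of this pattern.

```python
import zlib

# Hypothetical in-memory flag store; a real system would use a flag service.
FLAGS = {"new-checkout": {"enabled": True, "allowlist": {"qa-team"}, "percent": 10}}

def bucket(user_id: str) -> float:
    """Stable bucket in [0, 100) derived from the user id."""
    return zlib.crc32(user_id.encode()) % 10000 / 100

def flag_on(flag: str, user_id: str, segment: str = "") -> bool:
    cfg = FLAGS.get(flag)
    if cfg is None or not cfg["enabled"]:  # kill switch: flip off at run time
        return False
    if segment in cfg["allowlist"]:        # internal testers always see it
        return True
    return bucket(user_id) < cfg["percent"]

assert flag_on("new-checkout", "user-42", segment="qa-team")
FLAGS["new-checkout"]["enabled"] = False   # hide the feature with no redeploy
assert not flag_on("new-checkout", "user-42", segment="qa-team")
```

The key property is the kill switch: the code for the feature is deployed, but flipping one value in the flag store decides whether any user ever sees it — deployment and release are decoupled.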
What Is Progressive Delivery in the Real World?
All of these methods, which can be applied in varied combinations, share the understanding that deployment isn’t the same as release.
“Deploying a service is not the same as activating it for all users.” — James Governor, RedMonk
He offered the example of Comcast, which has 30,000 customer service agents. The company wouldn’t necessarily want to roll out a change to every agent at once, because that would then affect millions of customers. They’d probably prefer to provide some training while they roll the change out little by little.
In this sense, progressive delivery is about business and technology aligning.
“IT has been the anchor in our business preventing new services getting to our customers, but that’s starting to change. I grew up with the notion that IT is the brake on the business, it prevents the business from acting, but in high-performing organizations, now IT may no longer be the brake, the realities of business are,” Governor explained.
DevOps has us moving faster but we need to find a way to install new brakes or at least slow-downs into our processes, ideally tuned by the developers that are now in charge of deploying their own code.
“Most people don’t like application changes. If Google changes the Gmail interface, of course, we don’t like it, but they let us choose it at our own pace,” he said, probably referring to the end of Inbox reminders we’ve all received lately.
Slower rollouts with consistent reminders like this are more successful and more acceptable to a majority of users.
He reminded the audience that the idea of “continuous delivery is absolutely terrifying for businesses,” pointing out that the idea of “debugging in production sounds insane.”
Canary testing, A/B testing and blue-green deployment all still allow testing and debugging in production, while affecting a much smaller user base.
Governor pointed out that this is the logic behind Amazon Web Services having 42 availability zones in 16 autonomous regions worldwide. From the start, AWS built hard boundaries around these zones to encourage compartmentalization, so they can support high levels of availability, with zones acting as backups for each other.
He said you actually want firebreaks — this isolation allows everything to keep moving fast while staying more stable.
Which Users Make Up Your Experimental Group?
Different companies will decide who to roll which services out to first. Like all experimentation, it varies based on your own factors and users.
Bruno Kurtic, vice president at machine data analytics company Sumo Logic, chooses to canary to five percent of users first, and then leverages logs to understand how both the systems and the users behave in response to the change. The company also runs shadow testing in production.
Kurtic explained on the RedMonk blog:
We have a number of machine learning techniques we expose as capabilities to customers, for example. A customer might complain our pattern recognition isn’t working. But how do we know if we change the algorithm for other customers it won’t break their experience? We can silently spin up two clusters and test the performance of this algorithm. We do candidate testing of each service we roll out. How do we test it? We have a shadow copy of Sumo Logic we use for testing, industry regulations etcetera.
Web performance and security company Cloudflare applies a menagerie of dogs, canaries and pigs. Dogs are the loyal pet customers you want to treat well and never break things for first. Canaries go out in tech-friendly, medium-sized cities like Oslo and Munich. And all at-risk changes get dumped on the pigs, who have signed up for accounts using stolen credit cards.
Food delivery service GrubHub chooses to run canary deploys in smaller cities first, probably because these smaller cities have not only less traffic, but also fewer competitors vying to bring you your takeout.
Which user groups you experiment on will vary by company and tool. Just remember things like time zones and activity peaks when you are deciding who and when.
And of course, you have to decide what “broken” really means for your users. The most commonly applied red flags are Google’s four golden signals:
- Latency: the time it takes to service a request
- Traffic: including HTTP requests per second
- Errors: rate of failed requests
- Saturation: load measurement
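As an illustration of how those four signals can gate a canary release, here is a hedged sketch. The `Signals` type, the comparison against a stable baseline, and every threshold (20% latency headroom, a 2x-or-1% error budget, a 90% saturation cap) are invented for this example; real deployments tune such gates per service.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    latency_p99_ms: float  # latency: time it takes to service a request
    requests_per_s: float  # traffic: e.g. HTTP requests per second
    error_rate: float      # errors: fraction of failed requests
    saturation: float      # saturation: load measurement, 0.0 to 1.0

def canary_healthy(canary: Signals, baseline: Signals) -> bool:
    """Decide whether a canary may be promoted, judged against the
    stable baseline on the golden signals. Thresholds are illustrative."""
    return (
        canary.latency_p99_ms <= baseline.latency_p99_ms * 1.2   # <= 20% slower
        and canary.error_rate <= max(baseline.error_rate * 2, 0.01)
        and canary.saturation < 0.9                              # not overloaded
    )

baseline = Signals(latency_p99_ms=120, requests_per_s=500,
                   error_rate=0.002, saturation=0.55)
canary = Signals(latency_p99_ms=310, requests_per_s=25,
                 error_rate=0.04, saturation=0.60)
assert not canary_healthy(canary, baseline)  # degraded latency and errors: roll back
```

Traffic is collected here as context rather than gated on directly — a canary naturally receives a small share of requests, so its absolute traffic says little about its health.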
Governor mentioned the following tooling to help perform progressive delivery, with particular awareness to those golden signals:
- WeaveWorks: a Kubernetes operator that automates the promotion of canary deployments, using service routing for traffic shifting
- SwitchIO: helps people understand error patterns in clusters
- Petri: experimentation with A/B testing and feature toggles
- LaunchDarkly: for “feature management” with automated feature toggles depending on user personas
Governor says learning from your incident response is key to progressive delivery; it’s a necessary part of a business’s worldview.
“We are awful at understanding the psychology of change management and helping people understand the value of adopting an approach.” — James Governor, RedMonk
This is why progressive delivery is about aligning user experience with developer experience, and making sure the whole organization understands this is the best for business.
*Author’s Note: I think the term canary testing — named after the bright yellow bird that would die in mines to signal toxic gas — is an outdated and unnecessarily cruel term that makes it sound like some users are acceptable casualties. I propose we call it “Pollocking,” making an artistic metaphor to strategic splatter painting.