Progressive Delivery for Distributed Systems with Canarying, Service Meshes and Chaos Engineering
There’s an evolution in the role of the developer — from code monkey to creative worker. Making the job more creative and less tedious is a good way to motivate and retain people in some of the most in-demand roles. But this movement is also a response to our increasingly complicated and distributed systems: the tedious parts of these roles have to be automated so developers can focus on problem solving and systems resiliency.
In this episode of The New Stack Makers podcast, we talk to Jason Yee, developer advocate at monitoring and analytics service Datadog, about the role of progressive delivery in distributed software development.
One way to help developers think differently is to change how they release updates to their software. RedMonk’s James Governor dubbed this approach progressive delivery: it takes advantage of the different ways you can now route traffic to a specific subset of users before deploying a change more broadly, limiting the blast radius of testing in production.
One of the first progressive open source frameworks was Capistrano, which could run scripts on multiple servers. From there came rolling deployments: instead of releasing to every server at once, you run many replicas and update them one at a time (or in small batches), limiting downtime and risk. Then came blue-green deployments — although Yee says the color names don’t matter, and it’s sometimes called red-black — where you maintain two nearly identical production environments at once and, like a rudder, progressively shift traffic from one to the other, shifting back when things go awry.
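In Kubernetes, a rolling deployment of the kind described above can be expressed declaratively. A minimal sketch, with a hypothetical Deployment name and image:

```yaml
# Hypothetical Deployment illustrating a rolling update strategy:
# replicas are replaced a few at a time rather than all at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # take down at most one replica at a time
      maxSurge: 1         # allow one extra replica during the rollout
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:v2   # changing this image triggers the rolling update
```

Applying a new image tag then replaces pods gradually, so the service stays up throughout the rollout.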
What Yee says really advances this kind of segmented delivery is canary deployments. Canarying slows everything down and allows you to measure and change course based on what you learn. Later in the episode, he notes that in Kubernetes you have to take manual control of releases to canary them, which can be tedious.
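A minimal sketch of the manual canary approach Yee alludes to: run a small canary Deployment alongside the stable one behind the same Service, so the traffic split roughly follows the replica ratio. Names, labels and images here are hypothetical:

```yaml
# Stable and canary Deployments share the "app: web" label the Service
# selects on, so roughly 1 in 10 requests hits the canary (9 + 1 replicas).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-stable
spec:
  replicas: 9
  selector:
    matchLabels: {app: web, track: stable}
  template:
    metadata:
      labels: {app: web, track: stable}
    spec:
      containers:
      - name: web
        image: example/web:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-canary
spec:
  replicas: 1
  selector:
    matchLabels: {app: web, track: canary}
  template:
    metadata:
      labels: {app: web, track: canary}
    spec:
      containers:
      - name: web
        image: example/web:v2   # the new version under test
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web   # matches both tracks; the split follows the replica ratio
  ports:
  - port: 80
```

The tedium Yee mentions shows up here: shifting traffic means manually adjusting replica counts on both Deployments, and the split is only ever as fine-grained as the replica ratio allows.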
We also spoke with Yee about what Datadog measures. The company took a page from Google’s Golden Signals for monitoring distributed systems, covering latency, traffic, errors and saturation. This approach helps identify outliers and anomalies. He gave the example of traffic going up, which could be a sign of more users — or a red flag that duplicated code is putting extra strain on your systems.
Yee also noted that service meshes are an excellent way to gain more control and visibility over the network. Istio, the open source service mesh originally developed by Google, IBM and Lyft, allows you to set up automation for timeouts, encryption, retries, role-based access control, traffic routing and more. It also allows you to customize the system’s responses to certain behaviors.
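With a mesh in place, the traffic-routing, timeout and retry behaviors Yee mentions can be declared per route rather than coded into each service. A hedged sketch using a hypothetical Istio VirtualService (the `v1`/`v2` subsets would be defined in a companion DestinationRule):

```yaml
# Hypothetical VirtualService: send 90% of traffic to v1 and 10% to v2,
# with a per-request timeout and automatic retries handled by the mesh.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web
spec:
  hosts:
  - web
  http:
  - route:
    - destination:
        host: web
        subset: v1
      weight: 90
    - destination:
        host: web
        subset: v2
      weight: 10
    timeout: 5s          # fail the request if no response within 5 seconds
    retries:
      attempts: 3        # retry failed requests up to three times
      perTryTimeout: 2s
```

Because the weights are just fields in a manifest, a canary rollout becomes a matter of gradually shifting 10 → 50 → 100, independent of replica counts.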
Finally, we discuss chaos engineering, which Yee described as thinking about the resiliency of our systems — “If this breaks in a certain way, what will we understand about our systems?”
Datadog itself conducts regular game days where it kills a certain service or dependency to learn what threatens resiliency. These game days are partnerships between the people building whatever’s being tested — as they know it best and are initially on-call if it breaks — and a site reliability engineer. This lets the team test monitoring and alerting, ensure dashboards are in place along with runbooks and docs to follow, and confirm that the site reliability engineer is equipped to eventually take over.
In this Edition:
0:35: The role of technology evangelist.
6:18: Metrics tied to the role of the technology evangelist.
13:02: Blue-green testing.
24:17: How to conduct a canary test with Kubernetes.
28:08: Istio and service meshes.
32:45: Chaos engineering.
Feature image by Krzysztof Niewolny from Pixabay.