“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production,” the “Principles of Chaos Engineering” site explains.
That’s a pretty unnerving concept to those of us who grew up around the refrain: If it ain’t broke, don’t fix it. And nothing strikes fear into the hearts of security, compliance, and C-suites like “experimenting on a system.”
Chaos engineering is about increasing reliability in increasingly unreliable, unpredictable, distributed systems. We know it’s not about if something happens, it’s about when. So what can we learn from our systems ahead of time?
Still, you need to convincingly convey the value of chaos experiments to your engineers early on, since the experiments almost always take time away from their regular work.
The Psychology of Chaos Engineering
“Chaos engineering is about building a culture of resilience in the presence of unexpected system outcomes,” wrote Nora Jones, CEO and founder of the incident analysis software provider Jeli as well as co-author of the O’Reilly book on the subject.
For her, tools are just a means to an end to support this goal. Chaos engineering is really about forming a culture of experimentation.
“Chaos engineering isn’t about breaking things. It’s not about trying to figure out how we can make things fail. Or engineering being chaotic. Or processes unpredictable,” she said.
She says it always follows the same process: form a hypothesis around measurable outputs of steady-state behavior, then test whether that hypothesis holds through experimentation.
“Chaos engineering is an effective tool to learn more. Resilient organizations are always learning,” she said.
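The hypothesis-then-experiment loop described above can be sketched in a few lines of Python. This is only an illustration, not a tool mentioned by any of the speakers: the `Hypothesis` class, the `run_experiment` function, and the callables passed to it are all hypothetical names, and the steady-state metric is assumed to be a single number that can be read on demand.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Hypothesis:
    """A steady-state claim: the metric stays within [lower, upper]."""
    metric: str
    lower: float
    upper: float

    def holds(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def run_experiment(
    hypothesis: Hypothesis,
    measure: Callable[[], float],       # reads the metric (hypothetical hook)
    inject_fault: Callable[[], None],   # introduces the failure (hypothetical hook)
    rollback: Callable[[], None],       # undoes the failure (hypothetical hook)
) -> Dict[str, object]:
    # Confirm the system is in its steady state before touching anything.
    baseline = measure()
    if not hypothesis.holds(baseline):
        return {"status": "skipped", "baseline": baseline}

    inject_fault()
    try:
        observed = measure()
    finally:
        # Always roll back, even if measuring raises an error.
        rollback()

    status = "held" if hypothesis.holds(observed) else "refuted"
    return {"status": status, "baseline": baseline, "observed": observed}
```

Either outcome teaches something: a hypothesis that holds builds confidence, and one that is refuted points at a weakness found on your own schedule rather than during an incident.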
One of the myths surrounding chaos engineering is that chaos experiments have to be in production, but, as Gunderson says, not all systems are set up for that. Staging will never be the same as production so, yes, testing in prod is always better, but it’s risky.
And as Gunderson said, referring to the goal of five nines, or 99.999% uptime, there are still humans among the 0.001% of users affected. Testing in production at the start is also a difficult sell to many higher-ups.
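To put the five-nines figure in perspective, here is a small back-of-the-envelope calculation in Python. The function names and the one-million-user figure are illustrative assumptions, not numbers from the talk, and a year is taken as 365 days.

```python
def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime in a 365-day year at a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (100 - availability_pct) / 100


def affected_users(total_users: int, availability_pct: float) -> int:
    """Users falling in the unavailable fraction, rounded to a whole person."""
    return round(total_users * (100 - availability_pct) / 100)


if __name__ == "__main__":
    # Hypothetical user base of one million, purely for illustration.
    print(f"{downtime_minutes_per_year(99.999):.2f} minutes per year")
    print(f"{affected_users(1_000_000, 99.999)} users")
```

At 99.999%, the budget works out to a little over five minutes a year, and on a hypothetical base of a million users, the remaining 0.001% is still 10 people.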
“You need to design your experiments and have a hypothesis and you also have to think of the organizational culture, but at some point, you just have to jump in — whether that’s at staging or in production,” she said.
Gunderson offered the example of how Microsoft releases software, creating a concentric ring blast radius:
- People who worked on product
- Other teams in Microsoft
- Then all of Microsoft
- Only then consumers
“As they learn more, that software is being released and they are able to expand that blast radius,” she explained.
As you start to introduce chaos in production, perhaps test on only 1% of your users.
“We have to remember we have users who are using our product,” she said. “Someone is out there having a bad day and can have an even worse day.”
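One common way to hold an experiment to a small, stable slice of users is to hash each user ID into a bucket. The sketch below is an assumption about how that might look, not a description of how Microsoft or PagerDuty do it; the function name `in_blast_radius` and the experiment label are hypothetical.

```python
import hashlib


def in_blast_radius(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user falls inside the experiment's blast radius.

    The user is hashed into one of 10,000 buckets, so `percent` can be set
    in steps of 0.01. Hashing the experiment name together with the user ID
    means different experiments hit different slices of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    hit = sum(in_blast_radius(u, "latency-test", 1.0) for u in users)
    print(f"{hit} of {len(users)} users are in the 1% blast radius")
```

Because the bucket for a given user never changes, raising `percent` from 1 to 5 keeps the original users in the experiment and adds new ones, which matches the idea of widening the ring as you learn more.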
Gunderson says that even after seven years of running chaos tests, PagerDuty still works to incentivize people to participate in and embrace chaos engineering.
To make the best of the time, the testing team holds a one-hour meeting to plan which failures will be introduced and, to mitigate risk, makes sure everyone on the team is notified ahead of time. It’s important to be cognizant that your coworkers are trying to do their jobs while you’re injecting change.
“This isn’t about getting people used to stressful situations — be transparent — your incident response team needs to know experiments are running,” Gunderson said.
Chaos and the Day Job
In many organizations, there is still a struggle with how to balance chaos engineering with feature development, said Jason Yee, a developer advocate for chaos platform provider Gremlin, in another talk.
“Someone complained that every week for two hours his team was not doing work, was stopping and planning and doing a game day and analyzing results. Another complained that they had so many Jira tickets from the first game day, that it could fill an entire sprint,” Yee said.
He grouped the barriers to chaos experimentation into three areas:
- Lack of time — You want everyone in your organization to care about reliability, but two hours every week from everyone is a lot of time.
- Lack of process — Yee realized they need well-defined processes with runbook automation — like FireHydrant, Blameless, Rundeck — and docs to speed up everything.
- Lack of priority — Game days usually result in high-priority tickets. How do you fit this new unplanned work into your existing sprint cycles without derailing the planned work and without affecting delivery dates?
Yee’s team decided on a process that should make game day time as effective as possible.
They decided to rotate engineers, using the “water-cooler” Donut Slack app. Gremlin adapted this to auto-create groups of three engineers at a time, ensuring every engineer has an opportunity to do a mini-game day. Donut notes the blocks of time when all three are free and automatically sets up a Zoom call for them.
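For teams without Donut, the grouping step alone is easy to approximate. The sketch below is a hypothetical stand-in and does not use or imitate Donut’s API: it shuffles a list of names into groups of three and, as an assumption about how to handle leftovers, folds any incomplete final group into the others. Calendar matching and call scheduling are out of scope.

```python
import random
from typing import List, Optional, Sequence


def make_game_day_groups(
    engineers: Sequence[str], size: int = 3, seed: Optional[int] = None
) -> List[List[str]]:
    """Randomly split engineers into groups of `size` for mini game days."""
    rng = random.Random(seed)  # pass a seed for a reproducible rotation
    shuffled = list(engineers)
    rng.shuffle(shuffled)

    groups = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]

    # Assumption: rather than leave one or two people on their own,
    # spread an incomplete last group across the full groups.
    if len(groups) > 1 and len(groups[-1]) < size:
        leftovers = groups.pop()
        for i, person in enumerate(leftovers):
            groups[i % len(groups)].append(person)
    return groups


if __name__ == "__main__":
    team = ["Ana", "Ben", "Chi", "Dev", "Eli", "Fay", "Gus"]
    for group in make_game_day_groups(team):
        print(group)
```

Running it again each cycle without a fixed seed produces a fresh mix, so over time every engineer ends up taking part.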
They also created a Google form that lists all the steps for the chaos experiments, which engineers fill out as they go. At the end, there’s a section for a suggested experiment that the next team can use.
Finally, they had to make a choice for what to prioritize after game days without risking product timetables while also leaving engineers with a sense of being impactful. They ask: What’s the one thing that you can do that would make the biggest impact on reliability? The small chaos team can usually accomplish that in the current sprint.
Yee said that when you roll out chaos engineering, it’s never going to be perfect and you will come across some resistance. To get more widespread engineering buy-in and to keep costs down, it’s important to keep the chaos concise. Then you can start automating it. Clear processes and automation make execution and reporting much easier.
If there are SRE superheroes, does that make Chaos Engineers anti-heroes? We break things & cause pain, but ultimately do good. 🤔
I kinda want to be SRE Venom! 😂
— Jason Yee, but ✨actually a purple-haired goose✨ (@gitbisect) February 16, 2021
Chaos Comes with Conviction — and Illustrations
Chesser’s “Chaos with Care” talk focused on how to initiate and then grow a chaos practice, pairing it with other engineering practices and with safety in your systems.
Chesser says most engineers have the habit of wanting to start anything by showing off the shiny new tools. Instead, he says you want to start with the why and what you hope to achieve. You’re trying to build confidence and understanding in your systems. And you’re committed to learning about your systems and your teams.
He says you have to realize early on how many people need to buy into your chaos experiments. It’s very important to explain that why and then present a plan that shows you’ve really thought this out. If you make changes, you don’t want them to seem random.
Chesser continued that if you aren’t sure how they will react, have a planned experiment prepared with only certain teams that may be affected — and invite anyone else who wants to try.
Chesser says to start by asking engineers what they are really concerned about. Most software nowadays is in constant evolution, migrating to new technology or introducing something new into your systems. These become your known unknowns. Your team probably has these in mind already:
- What do you want to understand about that system?
- What are the clear concerns and dependencies that you worry already exist?
- What parts of our systems do we know least about?
Planning out your experiments and broadcasting those plans can help identify necessary prerequisites. These can range from other people who can help you get very effective and quick feedback to knowing how to get measurements like telemetry from a database.
You can’t know how a system is functioning if you don’t know how it can be easily measured. This is where observability tooling is essential. You want to fill in any telemetry gaps so that you aren’t wasting any time not actually learning from your experiments.
By amplifying the lessons learned about your systems, you’re also helping everyone — even those that aren’t throwing proverbial poo — to get to know your systems in a new way to catch things sooner. You learn to better interpret your telemetry data too — if you look here, you also should look at this as a signal.
Chesser reminds the audience that complex compliance comes into play in a big way in a production environment. Subject-matter experts on compliance are people you want to pull into designing and running these experiments — otherwise, they become an excuse to never run tests in prod.
Also, be open to the idea that an experiment may become too big. Roll it back. Write down what you learned and then break it into smaller experiments. Chesser reminded us that when you introduce a practice, don’t assume you won’t have follow-ups. Just plan to be surprised.
And then when things go wrong in real time, you can always point to chaos experimentation as the reason you were able to recover so quickly. Then leadership starts to buy in too.
Gremlin, MayaData and PagerDuty are sponsors of The New Stack.