Safe to Fail: Reinforce Distributed Systems with Chaos Engineering
Liz Fong-Jones of Honeycomb.io spoke at this year’s ChaosConf, hosted by chaos engineering service provider Gremlin, about how observability (and the goal it serves, site reliability engineering) is still an emerging and evolving field. Chaos engineering, often associated with massive enterprises, is also a logical approach for a startup to find system weaknesses, fix them and test for them again. As you’ll learn in this piece, chaos engineering is not really chaotic at all. It involves measuring your systems, understanding their weaknesses and thresholds, and then systematically attacking one small piece at a time. All in the name of making the system stronger.
Investing in this continuity is what enables velocity in product development. Specifically, it means working to make stateful systems reliable and consistent.
Fong-Jones says consistent and stateful services “are some of the scariest things to work with in terms of having a bunch of hidden dependencies and assumptions that might break and things that are harder to test with chaos engineering techniques.”
She says it gets even scarier when you have services that are storing data. But she has found that chaos engineering is one of the best ways for Honeycomb to maintain state, stability and sanity.
Chaos Engineering Step 1: Quantify Reliability
Any time you are trying to take down your systems, well, you may very well take down your systems. That means you need to know how far you can push them before it interrupts your engineers or your customers.
Fong-Jones explained, “We are trying to achieve a certain level of reliability that meets customer expectations while still allowing us to innovate.”
Your chaotic journey begins with defining what “reliability” means for your systems: what “good enough” is, what “too broken” is, and how you measure your quality of service.
“If the system is operating very, very reliably, and is exceeding our reliability targets, at what point do we need to spend the time and invest that error budget or amount of downtime we are allowed to have in order to explore the potential failure cases?” — Liz Fong-Jones, Honeycomb.io
She pinpoints the service level agreement, or SLA, as the common language between customers, product managers and engineers. It lays out what your customers are trying to achieve and how you are measuring it. From there, you map all your planning toward the levels put forth in the SLA.
Fong-Jones says you start by measuring “achievable” reliability inside of a data point. For Honeycomb that’s not 100 percent reliability.
As observability is all about the customer’s data, Fong-Jones said the key metric for Honeycomb is clear.
“We only get one shot to store incoming telemetry so our customers are screaming data at us all the time and we’ve decided that we want to ingest 99.99 percent of events,” she said.
In addition, fewer than 0.1 percent of visitors to the Honeycomb homepage should experience more than a second of load time, and the team aims to have queries of any kind fail less than one percent of the time.
Fong-Jones says reliability targets must be directly tied to customer expectations.
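To make the arithmetic behind those targets concrete, here is a minimal Python sketch of an error budget calculation. The three targets come from the figures quoted above; the `Slo` class, its method names and the event counts are hypothetical illustrations and are not Honeycomb’s actual tooling.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Slo:
    """A reliability target: the fraction of events that must succeed."""
    name: str
    target: Fraction

    def allowed_failures(self, total_events: int) -> Fraction:
        # The error budget, expressed as a number of events allowed to fail.
        return total_events * (1 - self.target)

    def budget_remaining(self, total_events: int, failed_events: int) -> Fraction:
        # Fraction of the budget still unspent; negative means the target was missed.
        allowed = self.allowed_failures(total_events)
        if allowed == 0:
            return Fraction(0)
        return 1 - Fraction(failed_events) / allowed


# Targets taken from the article; Fraction keeps the arithmetic exact.
ingest = Slo("event ingest", Fraction("0.9999"))
homepage = Slo("homepage loads under one second", Fraction("0.999"))
queries = Slo("query success", Fraction("0.99"))

# Hypothetical traffic: 1,000,000 events received, 25 of them dropped.
print(ingest.allowed_failures(1_000_000))
print(ingest.budget_remaining(1_000_000, 25))
```

With these made-up numbers, a 99.99 percent target over one million events leaves room for 100 failures, so 25 dropped events spend a quarter of the budget. Whatever remains is what a team could choose to invest in experiments.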
And it seems for sure the theme definition of Chaos Engineering is at the heart of Agility: Creating environments that are Safe. To Fail.
— Jennifer Riggins 🖊 (@jkriggins) October 6, 2020
Chaos Engineering Step 2: Design Experiments
Engineering is a science and chaos engineering leverages the scientific method we all learned in primary school:
- Observe: This is happening.
- Ask a question: What happens if this happens?
- Create a hypothesis: If this happens, then this will happen.
- Conduct the experiment.
- Observe some more.
- Make a decision.
- Try again until you reach the desired outcome.
Your experiments should factor all these steps in.
“We design experiments to prove the risk to verify our assumptions and verify that the resiliency techniques we are using are going to hold up in production,” Fong-Jones explained.
However, it’s not that simple. She pegs data persistence as “really, really tricky” to get successfully stored, particularly if it’s in a system that has a lot of moving parts.
To persist data, Honeycomb uses a mix of Kafka, Zookeeper and Retriever, its homegrown storage engine. The company makes regular changes to the rest of its systems, but it changes its data stack less than once a month.
“The changes are very, very infrequent and a lot harder to test because they are not exercised every single day,” Fong-Jones said.
About a year ago, they started asking questions about these long-running processes, like:
- What happens if one of them unexpectedly restarts?
- Even if it does restart correctly, how do you verify the data integrity and consistency?
- What happens if one machine restarts?
- Where are their single points of failure?
“Spoiler Alert: We discovered some problems, but because of the error budget left over, we were able to perform production experiments,” she said.
The team started an experiment by only restarting one Zookeeper at a time in one environment. Then they watched for changes. What’s happening? Are there any changes from the steady state?
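A minimal Python sketch of that one-at-a-time approach, under stated assumptions: `restart` and `steady_state_ok` are hypothetical callables you would wire up to your own tooling and dashboards, and the node names are made up. It is not Honeycomb’s or Gremlin’s implementation.

```python
from typing import Callable, Iterable, Optional


def restart_one_at_a_time(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    steady_state_ok: Callable[[], bool],
) -> Optional[str]:
    """Restart nodes one by one. Return the first node whose restart left
    the system out of steady state, or None if every restart went well."""
    if not steady_state_ok():
        raise RuntimeError("not in steady state; refusing to start the experiment")
    for node in nodes:
        restart(node)
        if not steady_state_ok():
            return node  # stop here so the blast radius stays at one node
    return None


# Simulated cluster, for illustration only: restarting "zk-2" breaks it.
restarted = []
healthy = {"ok": True}


def fake_restart(node: str) -> None:
    restarted.append(node)
    if node == "zk-2":
        healthy["ok"] = False


culprit = restart_one_at_a_time(
    ["zk-1", "zk-2", "zk-3"], fake_restart, lambda: healthy["ok"]
)
print(culprit, restarted)
```

The function checks the steady state before touching anything and stops at the first deviation, so the third node in the simulation is never restarted.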
Chaos Engineering Step 3: Fix What You Found
This is when you close the feedback loop by addressing any risks and solving any problems you discovered. Then perform the experiments again.
Once they found problems, they used Honeycomb-powered observability to debug them.
“Observability really empowers you to understand things that you didn’t anticipate happening, which is by the very nature what we are hoping to validate and verify with our chaos engineering experiments.” — Liz Fong-Jones, Honeycomb.io
They found, for example, that restarting Zookeeper stopped alerts from being sent to any internal customers for five to ten minutes. They observed that their processes relied on leader election to run a single copy of those Honeycomb alerts.
The Honeycomb team then ran another experiment to make sure their telemetry, meaning everything they were measuring, was reporting correctly. If something fails, then the telemetry should agree it is failing.
Fong-Jones explained, “Otherwise, your telemetry might tell us everything is fine when it isn’t.”
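One way to picture that check is a small Python sketch that injects a known failure and confirms the error count moves. Everything here is an assumption made for illustration: `telemetry_agrees`, `FakeService` and the counter are hypothetical, and a real check would also need to wait for telemetry to flow through the pipeline before reading it.

```python
from typing import Callable


def telemetry_agrees(
    inject_failure: Callable[[], None],
    read_error_count: Callable[[], int],
) -> bool:
    """Inject one known failure and report whether the error count rose."""
    before = read_error_count()
    inject_failure()
    after = read_error_count()
    return after > before


class FakeService:
    """Stand-in for a real service; it may or may not record its failures."""

    def __init__(self, records_errors: bool) -> None:
        self.records_errors = records_errors
        self.error_count = 0

    def fail_once(self) -> None:
        if self.records_errors:
            self.error_count += 1


instrumented = FakeService(records_errors=True)
blind = FakeService(records_errors=False)

print(telemetry_agrees(instrumented.fail_once, lambda: instrumented.error_count))
print(telemetry_agrees(blind.fail_once, lambda: blind.error_count))
```

The second service fails silently, which is exactly the case the experiment is meant to catch: the failure happened, but the measurement never noticed.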
Chaos Engineering Step 3 and 1: Repeat
So you fix something and you’re done, right? Well, that fix could trigger something else to break. Or maybe it wasn’t fixed at all. Chaos engineering dictates that you should repeat the experiment to make sure everything works in the expected way.
The Honeycomb engineers made sure to fix the problem where their nodes were only talking to the first Zookeeper instance. But when they re-ran the experiment, they discovered another issue that also stopped Zookeeper from restarting properly. This time a teammate had pushed a broken Chef change.
Repeating experiments is what takes teams to the next level toward surviving outages.
Once you’ve run the experiment a few times, you can automate it with a chaos-as-a-service toolset like Gremlin. Or you can continue manually.
Via Gremlin, Honeycomb now automatically restarts at least one node every week to verify that the Kafka replication is working correctly and that their systems can recover from individual Kafka node failures.
“The more you perform chaos engineering, the better confidence you have that your system is going to be able to survive things like your nodes going away on you,” Fong-Jones said.
As a bonus, the Honeycomb team discovered through chaos engineering a direct correlation between system flexibility and money saved. Now that they know their nodes can survive being restarted on demand, they can adopt preemptible instances, which Fong-Jones says can be half the cost. Plus, they don’t need to perform active chaos engineering on those instances because they already know how they’ll perform.
Chaos Engineering Rule: It’s a Daytime Team Effort
Some aspects of the tech culture phenomenon known as progressive delivery have dev teams doing progressive rollouts at odd hours in order to affect smaller or less important customer populations. Chaos engineering is not one of them.
Don’t run your chaos experiments on the weekend, Fong-Jones said. “The idea is bugs are shallow with more eyes, so let’s make sure all hands are on deck and let’s make sure that we have the ability to revert the experiment as quickly as possible.”
The more people involved, the easier it is to find the causes behind the system flaws. It’s also about minimizing the impact. More people helping find bugs means less dire consequences.
“The other thing that’s really key to chaos engineering for us is to really limit blast radius, to develop a hypothesis of what we think is going to happen, and restart only one server or service at a time,” Fong-Jones said.
Then the Honeycomb team can continue to measure against their SLA and see whether they have enough error budget left to expand their chaos engineering, which in turn should increase their systems’ resiliency.
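That decision can be sketched as a simple gate in Python. The thresholds and the `next_step` function are hypothetical, chosen only to illustrate the idea of letting the remaining error budget decide whether to pause, hold steady or expand. They are not values Honeycomb has published.

```python
def next_step(
    budget_remaining: float,
    floor: float = 0.25,      # hypothetical: below this, stop experimenting
    expand_at: float = 0.75,  # hypothetical: at or above this, widen the scope
) -> str:
    """Pick the next move from the unspent fraction of the error budget.

    Every option still restarts only one server or service at a time;
    "expand" means adding experiments, not widening a single blast radius.
    """
    if not 0.0 <= floor <= expand_at <= 1.0:
        raise ValueError("expected 0 <= floor <= expand_at <= 1")
    if budget_remaining < floor:
        return "pause"
    if budget_remaining >= expand_at:
        return "expand"
    return "hold"


for remaining in (0.9, 0.5, 0.1):
    print(remaining, next_step(remaining))
```

A budget that has already been overspent shows up as a negative value and falls through to "pause", which keeps the experiments from adding to a bad month.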
Gremlin and Honeycomb.io are sponsors of The New Stack.