As chaos engineering becomes a more mainstream way of proactively seeking out your system’s weaknesses, we see it applied to increasingly complicated circumstances and with teams of all sizes.
One such area is serverless. After all, serverless computing is the language-agnostic, pay-as-you-go way to access backend services. This makes it multitenant, stateless, highly distributed, and heavily reliant on third parties. A heck of a lot can go wrong with so much out of your control.
From higher granularity to an expanding attack surface to new failure types, serverless has so many potential points of failure, noted Thundra’s Product Vice President Emrah Samdan at ChaosConf, hosted by Gremlin. Chaos engineering is one method of finding out where these potential failures are before they cripple your operations.
What Chaos Engineering Isn’t
If there was an underlying theme of this year’s ChaosConf, it’d be defining just what chaos engineering is. Because, even among expert fire starters, explaining the concept is as much art as it is science.
For Samdan, it’s not about being a glutton for punishment, breaking your system because you feel like it. And it’s not about placing blame.
For him, chaos engineering is all about asking: “What if?”
Samdan said, “You need to ask your system: What if your databases become unreachable? What if your whole region goes down? What if my downstream Lambda times out? Any type of failure can happen in your systems. Chaos engineering answers these questions.”
He says you need to answer these questions to establish the acceptable limits of your system. He likened it to a vaccine, injecting a little more resiliency and confidence into your system every time.
“Chaos isn’t a pit. Chaos is a ladder.” — Emrah Samdan, Thundra
How to Get Started with Chaos Engineering
Echoing another message from ChaosConf, Samdan reminds us chaos engineering also isn’t just for giant streaming companies. Anyone can do it and you can get started small. He even recommends avoiding doing it in production at the start.
“You can just start when you are staging. Start small. Start injecting into a relatively new service, but put your tools in and just grow stronger with chaos experiments,” he recommended.
Start by measuring your steady state, meaning the normal ups and downs of your system. He recommends using an observability tool to accomplish this.
The typical system-level metrics include:
- Memory usage
- 99th percentile latency
- CPU usage
- Time to restore service
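As a rough illustration of how a tail-latency figure like the one in this list can be derived from raw samples, here is a minimal Python sketch. It assumes the nearest-rank percentile method; observability tools may use other interpolation schemes, and the function name and sample data are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latency samples: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
p99_ms = percentile_nearest_rank(latencies_ms, 99)
```

With these made-up samples, 99 of the 100 requests finished in `p99_ms` milliseconds or less, which is the kind of number you would record as part of the steady state.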
Samdan says typical business-level metrics include:
- Apdex score, which, according to New Relic, is a ratio value of the number of satisfied and tolerating requests to the total requests made. Each satisfied request counts as one request, while each tolerating request counts as half a satisfied request.
- Number of transactions, successful or otherwise.
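The Apdex description above can be turned into a short calculation. The Python sketch below is illustrative only: it assumes the conventional Apdex buckets, where a request is satisfied at or below a threshold T and tolerating between T and 4T. That 4T cutoff is not stated in the talk, and the threshold and sample values are hypothetical.

```python
def apdex(response_times_s, threshold_s):
    """Apdex = (satisfied + tolerating / 2) / total requests."""
    if not response_times_s:
        raise ValueError("need at least one request")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    # Assumption: tolerating means slower than T but no slower than 4T.
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical response times in seconds, scored against T = 0.5 s.
times_s = [0.1, 0.2, 0.3, 0.6, 1.0, 2.5]
score = apdex(times_s, threshold_s=0.5)
```

Here three requests are satisfied, two are tolerating and one is frustrated, so the score is (3 + 2/2) / 6, or about 0.67.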
Set acceptable limits for each of these metrics. Then develop a hypothesis by asking what happens when a specific failure occurs. Some examples:
- What if I inject latency of 300 milliseconds on average into every Lambda function in my architecture? SLA promise: My responses will still be in the acceptable latency range.
- What if my DynamoDB table becomes unreachable? SLA promise: My system will continue performing graceful service degradation.
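One way to make such a hypothesis checkable is to write the acceptable limits down as data and compare them against what you observe during the experiment. The following Python sketch assumes that approach; the `Limit` class, the metric names and all numbers are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Acceptable range for one steady-state metric."""
    metric: str
    low: float = float("-inf")
    high: float = float("inf")


def breached(observed, limits):
    """Return the names of the metrics that fell outside their acceptable range."""
    return [
        lim.metric
        for lim in limits
        if not lim.low <= observed[lim.metric] <= lim.high
    ]


# Hypothetical limits: p99 latency may not exceed 800 ms, Apdex may not drop below 0.85.
limits = [Limit("p99_latency_ms", high=800), Limit("apdex", low=0.85)]

# Hypothetical measurements taken while latency was being injected.
observed = {"p99_latency_ms": 950, "apdex": 0.9}
failed = breached(observed, limits)
```

An empty result would mean the hypothesis held. In this made-up run the latency limit is breached, so the promise about staying in the acceptable latency range would not have been kept.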
You can ask big questions, but start experimenting only on the small parts. Samdan reminds you to inject failure only into a controlled piece of your system, such as injecting latency into just one function rather than the entire architecture. You want to keep the blast radius small.
That’s also why you only run one experiment at a time. Then you can continue, injecting latency into two, three, four functions. He says you keep going until something breaks.
“You should stop when something goes wrong, even if you are not running it in production. You should stop just to understand how you are going to roll back when such things happen,” Samdan said.
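The idea of a small blast radius with a quick way to stop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling Samdan described: the `LatencyExperiment` class, the `inject_latency` decorator and the handler names are all hypothetical, and the delay is drawn from an exponential distribution so that it averages out to the configured mean.

```python
import functools
import random
import time


class LatencyExperiment:
    """Hypothetical switchboard for a single latency experiment."""

    def __init__(self, targets, mean_delay_s):
        if mean_delay_s <= 0:
            raise ValueError("mean_delay_s must be positive")
        self.targets = set(targets)  # names of the functions inside the blast radius
        self.mean_delay_s = mean_delay_s
        self.enabled = True

    def stop(self):
        """Roll back: every wrapped handler behaves normally again."""
        self.enabled = False


def inject_latency(experiment, sleep=time.sleep):
    """Delay a handler only while it is a target of a running experiment."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            if experiment.enabled and handler.__name__ in experiment.targets:
                # Exponentially distributed delay whose mean is mean_delay_s.
                sleep(random.expovariate(1 / experiment.mean_delay_s))
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Demo: sleep is replaced by a recorder so the example runs instantly.
delays = []
experiment = LatencyExperiment(targets={"checkout"}, mean_delay_s=0.3)


@inject_latency(experiment, sleep=delays.append)
def checkout(event, context=None):
    return {"statusCode": 200}


@inject_latency(experiment, sleep=delays.append)
def search(event, context=None):
    return {"statusCode": 200}


checkout({})       # delayed: inside the blast radius
search({})         # untouched: outside the blast radius
experiment.stop()  # something looked wrong, so roll back
checkout({})       # no longer delayed
```

Widening the experiment to two, three or four functions would then be a matter of adding names to `experiment.targets`, while `stop()` gives you one place to roll everything back.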
He echoed what Liz Fong-Jones said in her ChaosConf talk: you should deliberately schedule your chaos experiments and let everyone know ahead of time.
“You don’t need to surprise other people. You don’t need to surprise other departments. And, most importantly, in production, your customers should know about it,” he said.
So if something goes terribly wrong, they aren’t worried, because you told them ahead of time and had already shared your rollback plan with them.
How Chaos Engineering Works Differently in Serverless
Chaos gets way more complicated in serverless environments, which are highly distributed and event-driven. Risks with serverless tend to come from the services you don’t have insight into or control over. Essentially, serverless is chaotic at its heart.
With serverless you inherit a whole new set of failures across its many resources, which can include:
- A function can run out of memory, as it only has limited memory.
- A function can just stop.
- A function can run too slowly and then time out.
- You can hit the concurrent execution limit.
- Misconfigured access permissions can prevent your function from running or can give someone more access than they should have.
- Poisonous events can happen, such as poisonous messages that make your Lambda function retry hundreds of times until that data expires.
Samdan says these are ticking time bombs if you are just communicating with the other system synchronously, waiting for a response.
In serverless, there are also failures you tend to create, like:
- Synchronous communication with unresponsive downstream services.
- Errors cascading up to users from cloud services or other third parties.
- Bad coding practices like unbounded recursion, which can repeat over and over until your application times out.
- Errors that roll over from previous developers.
But all of these flaws are easy for you to interrogate with chaos experiments, like:
- You can inject failures into your Lambda functions or other service interactions.
- You can inject latencies into those same functions or interactions.
- You can play around with the concurrencies of Lambda functions.
- You can play around with IAM permissions.
- You can see what happens when downstream services become unreachable.
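To make the failure-injection idea in this list concrete, here is a minimal Python sketch of a downstream call that can be made to fail on purpose, paired with a fallback that degrades gracefully. Everything in it is an assumption for illustration: `InjectedFailure`, `call_with_chaos` and `get_recommendations` are hypothetical names, and a real experiment would more likely rely on a chaos tool than on hand-written wrappers.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real downstream error during an experiment."""


def call_with_chaos(call, failure_rate, rng=random.random):
    """Run call(), but fail deliberately for roughly failure_rate of invocations."""
    if rng() < failure_rate:
        raise InjectedFailure("simulated unreachable downstream service")
    return call()


def get_recommendations(fetch, failure_rate=0.0, rng=random.random):
    """Return recommendations, or an empty list if the downstream call fails."""
    try:
        return call_with_chaos(fetch, failure_rate, rng)
    except (InjectedFailure, ConnectionError):
        # Graceful degradation: the page still renders, just without this section.
        return []


def fetch_from_downstream():
    # Stand-in for a real network call to another service.
    return ["item-1", "item-2"]


healthy = get_recommendations(fetch_from_downstream, failure_rate=0.0)
degraded = get_recommendations(fetch_from_downstream, failure_rate=1.0)
```

With a failure rate of 0 the real result comes back, and with a rate of 1 every call fails and the caller falls back to an empty list. The question the experiment answers is whether that fallback actually exists everywhere you assumed it did.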
Everything follows the same pattern:
Samdan says latency is the most important serverless metric to experiment against because, in serverless, if the response is late, that’s often a sign the service is broken.
He says a common fix for serverless issues is to aim for asynchronous communication whenever possible and then properly tune synchronous timeouts.
Other serverless fixes include putting circuit breakers in place and using exponential backoff to space out retries at an acceptable rate.
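Both fixes can be sketched briefly in Python. The code below is an illustration under assumptions rather than a production implementation: the function and class names, the default thresholds and the "full jitter" choice (sleeping a random amount up to the exponential ceiling) are all hypothetical choices, and a real service would usually narrow the caught exception types.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0,
                       sleep=time.sleep, jitter=random.uniform):
    """Retry call(), doubling the delay ceiling after every failed attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(jitter(0, ceiling))  # full jitter spreads retries out


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping downstream call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

A chaos experiment that makes a downstream service unreachable is a good way to check that the breaker really opens after the configured number of failures and that retries slow down instead of piling onto a struggling dependency.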
Samdan says chaos engineering is about learning exactly how your system is supposed to behave when something happens. And it allows you to make a plan for how you respond to issues as a team:
- Alert systems and who gets pinged.
- Impact measuring.
- Your response.
- How and when to notify users.
This systematic, continuous experimentation doesn’t just improve your system. Samdan reminds us that it also improves team communication.
“You need to just make your system ready with chaos engineering because, if it is a serverless system, you should never stop running experiments.” — Emrah Samdan, Thundra
“You should never make it harder for your teams. You should never stop to hug the ops. Incidents can happen. We are here to improve ourselves, not hurt others,” he said.
Gremlin, New Relic and Thundra are sponsors of The New Stack.