
Gremlin Launches ‘Health Checks’ for Chaos Engineering with Greater Control

30 Jun 2020 10:01am

Chaos engineering is one of the truest signs that engineering is a science. Like the scientific method, this resiliency practice has engineers hypothesizing, experimenting, measuring, pivoting (or not), and repeating. This conditional game of “If this, then that” pushes a system to its limits to build confidence in its ability to stand up against unexpected roughness in production.

But you don’t want to administer a vaccine to an already weakened immune system. The nurse checks your temperature and asks how you’re feeling before the jab. Similarly, you don’t want to administer chaos when your system is already dealing with a huge traffic spike.

“If the system is unhealthy this is not the time to introduce chaos,” said Ana Margarita Medina, chaos engineer at Gremlin.

With this in mind, Gremlin, which provides a chaos engineering platform, this June released a new feature called “health checks” that automatically verifies your system is healthy before unleashing the chaos, every time. It integrates with other IT monitoring software, such as the tools offered by New Relic, Datadog and PagerDuty.

Medina says it’s about transitioning from a reactive to a proactive world, making sure you can build resiliency into systems without causing unnecessary harm. “The point of chaos engineering is not to add unnecessary chaos. You want to control the chaos in your system,” Medina continued.

Gremlin is built on three values — simplicity, safety and security. This feature emphasizes the safety goal.

Gremlin health checks connect via API with your existing monitoring system to verify whether your system is actually healthy. Gremlin checks for things like: Are there any incidents open for this system? How’s the traffic load? Does the website return a 200 OK response status?

If all is smooth, then Gremlin runs a chaos engineering experiment on your systems, followed by another health check.
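The flow described above (check, attack, check again) can be sketched roughly as follows. This is a minimal illustration in Python, not Gremlin’s implementation: the `HealthSignals` fields, the `max_rps` traffic threshold and the `run_guarded_experiment` helper are all hypothetical names, and the signals are assumed to have already been fetched from whatever monitoring tool is in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthSignals:
    """Hypothetical snapshot of what a monitoring tool might report."""
    open_incidents: int      # incidents currently open for this system
    requests_per_sec: float  # current traffic load
    http_status: int         # status code returned by the website


def is_healthy(signals: HealthSignals, max_rps: float = 1000.0) -> bool:
    """Return True only if every signal looks normal.

    max_rps is an assumed threshold, chosen here purely for illustration.
    """
    return (
        signals.open_incidents == 0
        and signals.requests_per_sec <= max_rps
        and signals.http_status == 200
    )


def run_guarded_experiment(
    fetch_signals: Callable[[], HealthSignals],
    run_attack: Callable[[], None],
) -> str:
    """Check health, run the attack only if healthy, then check health again."""
    if not is_healthy(fetch_signals()):
        return "skipped"   # unhealthy system: do not add chaos
    run_attack()
    if not is_healthy(fetch_signals()):
        return "degraded"  # the attack exposed a weakness worth investigating
    return "passed"
```

Passing the signal fetcher and the attack in as plain callables keeps the sketch independent of any particular vendor API.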

Why not simply pile chaos on top of chaos? Because this practice is all about controlled chaos, so you can learn more about your infrastructure and pinpoint weaknesses.

Medina said that these guardrails were requested by Gremlin’s Fortune 500 customers, who wanted to feel more confident about taking their chaos experiments into production.

“You want to automate but you don’t want to upset customers. Big companies don’t have time to do it manually but are afraid to automate it,” Medina said.

Allowing for these automatic checks before, during and after an experiment closes the chaos engineering feedback loop.

This plays well alongside an original feature within Gremlin — a big red button at the top of the screen that yells in all caps: “Halt all attacks.”

Screenshot of Gremlin UI including a big red button that shouts HALT ALL ATTACKS in upper right corner

Medina explained that “if you’re manually running an attack and it’s actually impacting the systems in a way you don’t want,” the system will automatically return to the state it was in before the attack as soon as that red button is hit.

She likens these health checks to a “proactive Halt button.”
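One way to picture a “proactive Halt button” is a loop that re-checks health while the attack is in progress and halts by itself when a check fails. The sketch below is only an illustration under stated assumptions: `run_with_watchdog`, `check_health` and `halt` are hypothetical names, the attack is modeled as a list of steps, and nothing here reflects how Gremlin implements the feature.

```python
from typing import Callable, Iterable


def run_with_watchdog(
    attack_steps: Iterable[Callable[[], None]],
    check_health: Callable[[], bool],
    halt: Callable[[], None],
) -> bool:
    """Run attack steps one at a time, halting as soon as a health check fails.

    Returns True if every step completed, False if the attack was halted.
    """
    for step in attack_steps:
        step()
        if not check_health():
            halt()  # hypothetical rollback hook, standing in for the halt button
            return False
    return True
```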

Medina says it’s not just about systems either. It’s about people. “We’re seeing a lot more folks are realizing that they don’t want to always be fire fighting. This leads to engineer burnout and too many un-actionable pages” for on-call engineers.

She goes further, suggesting that these automated checks and the chaos experiments that follow them should run nine to five, Monday through Friday, so that chaos never triggers an out-of-hours page at all. If there are open incidents on Monday at 9 a.m., the chaos won’t happen until they are fixed. The check will run again on Tuesday. If the incident is fixed, then the chaos will rain down. If it’s not, the chaos will pause pending Wednesday’s health check.
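The schedule Medina describes boils down to two conditions: it must be a weekday during working hours, and no incidents may be open. A minimal sketch, assuming a 9:00 to 17:00 window in local time and a hypothetical `open_incidents` count supplied by a monitoring tool:

```python
from datetime import datetime


def in_business_hours(now: datetime) -> bool:
    """True on Monday through Friday between 09:00 and 17:00 (assumed local time)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def should_run_chaos(now: datetime, open_incidents: int) -> bool:
    """Run only during business hours and only when no incidents are open."""
    return in_business_hours(now) and open_incidents == 0
```

With this gate, an incident open on Monday morning simply defers the experiment; the same check on Tuesday decides again from scratch.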

Just as you run the health checks continuously, Medina also suggests running the chaos continuously.

“Avoid the drift into failure. You know the cloud is very dynamic and things shift around. Run the attacks all the time. Make sure you’re still resilient to the same failure. Automate it in a safe way,” she said.

Feature image by bdyczewski from Pixabay.
