Not many companies want to fail. Yet we have a startup culture that’s built on learning from mistakes. Following this learn-fast habit, what if we could create a sense of stability by injecting failure continuously, breaking the system before it breaks you (or your bank)? That’s the idea behind chaos engineering, a growing segment of site reliability engineering that has folks throwing the virtual kitchen sink at virtual machines and, now, containers.
Chaos engineering is the good habit of safely wreaking havoc on your systems, just like a monkey would on your office. Known for its uptime as much as its binging, streaming service Netflix kicked off this now-emerging best practice years ago by open sourcing Chaos Monkey, software that automates periodic outages on a distributed system “in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Chaos engineering is all about acknowledging, addressing, and embracing the complexity and unpredictability of both your systems and the people building and using them. By focusing first on critical, customer-facing services, this intentional chaos helps companies limit the business and financial impact of outages and incidents, which in turn limits broken service level agreements and lost customers.
“We like to use the vaccine analogy: injecting small amounts of harm can build immunity that proactively avoids disasters,” said Matthew Fornaciari, chief technology officer and co-founder at Gremlin.
We’ve already written about Gremlin’s Chaos as a Service platform. While Chaos Monkey randomly terminates a single system at a time, Gremlin can unleash many kinds of chaos simultaneously.
Now, the two-year-old Failure as a Service platform has been updated to manage mischief not only at the virtual machine level but also at the container level.
“With today’s updates to the Gremlin platform, DevOps teams will be able to drastically improve the reliability of Docker in production,” Fornaciari said.
Who could this potentially benefit? Gremlin added this feature in response to customer demand, to serve anyone using or moving to a containerized infrastructure, whether it’s orchestrated through Kubernetes or another orchestrator. It is typically used at companies with a product or set of products established enough to warrant at least one site reliability engineer, but it could be used by any service owner or platform team that runs its own container service within an orchestrator.
Gremlin’s Failure as a Service for Docker now allows users to simulate dropping a host, making sure the underlying infrastructure is stable. Already in beta at select customers like Under Armour, this update allows organizations to automatically discover Docker containers within the Gremlin user interface and then run chaos engineering experiments on them. Gremlin already offered a type of service discovery, in which the tool probes your networks and figures out where services live and how to access them, but it now does the same at the container level.
“We offer everything that we offer for infrastructure failure as we do for containers,” Fornaciari said, noting that there are some open source software solutions that target exclusively Kubernetes or Docker, but not everything together like Gremlin.
Because Gremlin treats containers as first-class citizens, organizations are able to run attacks that overload resources like CPU and memory, simulate latency and DNS problems, and even randomly shut down containers. All of this lets teams see how their architecture reacts when things go wrong and how to fix any bad reactions.
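To make the random-shutdown idea concrete, here is a minimal sketch in Python. It does not use Gremlin’s API or touch real containers; the `Container` class and the `shutdown_random` and `service_available` helpers are hypothetical names invented for illustration, and the fleet is just an in-memory list.

```python
import random
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical stand-in for a running container (not a real Docker object)."""
    name: str
    running: bool = True


def shutdown_random(containers, rng, count=1):
    """Stop `count` randomly chosen running containers and return their names."""
    running = [c for c in containers if c.running]
    victims = rng.sample(running, min(count, len(running)))
    for c in victims:
        c.running = False
    return [c.name for c in victims]


def service_available(containers, min_healthy=1):
    """Treat the service as available while enough replicas are still running."""
    return sum(c.running for c in containers) >= min_healthy


fleet = [Container(f"web-{i}") for i in range(3)]
rng = random.Random(42)  # seeded so the experiment can be repeated
stopped = shutdown_random(fleet, rng, count=1)
print("stopped:", stopped, "available:", service_available(fleet, min_healthy=2))
```

The point of the experiment is the final check: with three replicas and a requirement of two, losing one container at random should leave the service available. If it does not, the team has found a weakness before production did.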
Fornaciari told The New Stack that when teams are starting to migrate to a containerized architecture, they don’t always understand the behavior, capabilities, and pitfalls. Chaos engineering allows you to test things at the container level of granularity ahead of time.
“A lot of customers push to treat containers like hosts. Gremlin is giving them the confidence of Docker in production,” he said.
Gremlin’s new feature works alongside auto-healing functionality, in which a cluster of virtual machines or containers that all belong to the same group is monitored with health checks. If a container or VM becomes unhealthy and is flagged by Gremlin, it goes into auto-healing: it drops out of production use and another instance is pulled in. Usually done within Amazon Web Services, this auto-healing occurs when the instance either pulls itself out of the load balancer or replaces itself with a healthy instance.
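The auto-healing loop described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Gremlin or Amazon Web Services implements it: `Instance`, `Pool`, `launch`, and `heal` are hypothetical names, and the health flag is set by hand to stand in for a failed health check.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Instance:
    """Hypothetical VM or container with a simple health flag."""
    name: str
    healthy: bool = True


@dataclass
class Pool:
    """Hypothetical load balancer pool that replaces unhealthy members."""
    instances: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self):
        """Add a fresh, healthy instance to the pool."""
        inst = Instance(f"inst-{next(self._ids)}")
        self.instances.append(inst)
        return inst

    def heal(self):
        """Pull unhealthy instances out of the pool and launch replacements."""
        unhealthy = [i for i in self.instances if not i.healthy]
        for inst in unhealthy:
            self.instances.remove(inst)
            self.launch()
        return [i.name for i in unhealthy]


pool = Pool()
for _ in range(3):
    pool.launch()

pool.instances[1].healthy = False  # simulate a failure injected by a chaos experiment
replaced = pool.heal()
print("replaced:", replaced, "pool:", [i.name for i in pool.instances])
```

After the injected failure, the pool is back to three healthy members, which is the behavior a chaos experiment is meant to confirm.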
While Gremlin originally performed virtualized chaos on the bare metal, it now also brings that chaos to the container-based abstraction layer on top of the host. It’s the first of its kind to treat containers like first-class citizens, instead of installing at the host level and then running attacks against the containers from there.
Gremlin’s product and company values are simplicity, safety, and security. Fornaciari says the first value ties heavily into this tooling update, which allows users to click through and start container tags with just four or five clicks.
“The simplicity of container discovery is that we populate potential targets for you. Instead of you having to remember tagging and identification information, you can just click a few simple boxes and go through the UI and then you’re good to go.”