Gremlin Applies Chaos Testing to Serverless

4 Oct 2018 3:19pm

Expanding upon its failure-as-a-service platform, Gremlin now offers the ability to induce failure in serverless services, with the aim, as in all chaos engineering, of giving administrators a better idea of the impact of real-life system failures.

The feature comes from the company’s new Application-Level Fault Injection (ALFI) technology, which provides a way to do failure testing at the application level by inserting breakpoints into the developer’s code itself.

ALFI is an advance in the state of the art for chaos testing tools, in that it “brings the failure injection up a level [in the] stack into the application,” said Matthew Fornaciari, Gremlin chief technology officer and co-founder, speaking in an upcoming episode of the TNS Context podcast. Existing chaos tools inject their disruption at the OS or infrastructure level, he explained.

This approach offers far more nuanced control over failures. Failures can be scoped to impact only particular attributes, such as customer IDs, locations, or device types. With this specificity, users can “really scope down to what is really getting impacted by the failure,” allowing them to quickly test and retest, Fornaciari said.
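To make the idea of attribute scoping concrete, here is a minimal sketch in Python. It is not Gremlin's API: the `ScopedExperiment` class, its `applies_to` method, and the attribute names are all hypothetical, chosen only to illustrate how a fault could be limited to requests that match a given customer ID and device type.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class ScopedExperiment:
    """Hypothetical experiment that applies only to matching requests."""

    name: str
    # Every key listed here must match the request for the fault to apply.
    # An empty scope matches every request.
    scope: Mapping[str, str] = field(default_factory=dict)

    def applies_to(self, request_attrs: Mapping[str, str]) -> bool:
        return all(request_attrs.get(k) == v for k, v in self.scope.items())


# Illustrative values only: limit the fault to one test customer on iOS.
experiment = ScopedExperiment(
    name="slow-checkout",
    scope={"customer_id": "test-1234", "device_type": "ios"},
)

in_scope = {"customer_id": "test-1234", "device_type": "ios", "location": "US"}
out_of_scope = {"customer_id": "real-9876", "device_type": "ios", "location": "US"}

print(experiment.applies_to(in_scope))      # True
print(experiment.applies_to(out_of_scope))  # False
```

Because only requests carrying the matching attributes see the fault, a test of this shape leaves all other traffic untouched.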

ALFI works with all major serverless platforms, including AWS Lambda, Azure Functions, and Google Cloud Functions. “You can use it with the serverless service like you would with any other infrastructure. It’s really not that different,” Fornaciari said. “You define that experiment and what the impact will be and every instance of that Lambda coming up will go fetch its experiments, and impose them.”

ALFI can also simulate the delay and failure of specific services. This feature can be especially valuable in a microservices environment, where one slow component can cause a great deal of congestion as the services that depend on it compensate through retries and time-outs. In the industry, this is frequently called a cascading failure.
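A rough back-of-the-envelope sketch shows why retries can turn one slow component into congestion. It assumes a simple chain of services in which every layer retries a fixed number of times and every attempt times out; the function name and the numbers are illustrative, not taken from Gremlin.

```python
def worst_case_calls(layers: int, retries_per_layer: int) -> int:
    """Calls reaching the slow dependency for one user request,
    assuming every attempt at every layer times out and is retried."""
    attempts_per_layer = 1 + retries_per_layer  # first try plus retries
    return attempts_per_layer ** layers


# With 2 retries per layer, load on the slow service multiplies per layer.
for layers in (1, 2, 3):
    print(layers, worst_case_calls(layers, retries_per_layer=2))
# prints:
# 1 3
# 2 9
# 3 27
```

Under these assumptions, a single user request passing through three layers of callers can become 27 calls to the struggling service.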

ALFI offers two types of fault that are useful for replicating such failures: delay and exception-throwing. With these tools, you can observe what a microservice does when the components around it fail. How resilient is the service if the identity service goes away? Or what if the recommendations are slow to load?
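The two fault types, delay and exception-throwing, can be sketched with a small Python decorator. This is an illustration under stated assumptions rather than Gremlin's implementation: `inject_fault`, `IdentityServiceDown`, and the service functions below are hypothetical names standing in for the identity and recommendation services mentioned above.

```python
import functools
import time
from typing import Callable, Optional, Type


def inject_fault(delay_seconds: float = 0.0,
                 exception: Optional[Type[Exception]] = None,
                 enabled: Callable[[], bool] = lambda: True):
    """Hypothetical decorator: delay a call, make it raise, or both."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if delay_seconds > 0:
                    time.sleep(delay_seconds)  # simulate a slow dependency
                if exception is not None:
                    # simulate a dependency that is down
                    raise exception(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


class IdentityServiceDown(Exception):
    """Stand-in error for an unavailable identity service."""


@inject_fault(exception=IdentityServiceDown)
def lookup_user(user_id: str) -> dict:
    return {"id": user_id, "name": "Ada"}


@inject_fault(delay_seconds=0.05)
def load_recommendations(user_id: str) -> list:
    return ["item-1", "item-2"]


def render_home_page(user_id: str) -> dict:
    # The behavior under test: fall back to a guest profile on failure.
    try:
        user = lookup_user(user_id)
    except IdentityServiceDown:
        user = {"id": user_id, "name": "guest"}
    return {"user": user, "recommendations": load_recommendations(user_id)}


page = render_home_page("u-42")
print(page["user"]["name"])  # guest
```

The `enabled` callback is where a real tool would decide, per request, whether an experiment is active; here it defaults to always on to keep the sketch short.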

Chaos engineering is the emerging discipline of deliberately introducing failures into a system to better anticipate larger failures. Netflix brought attention to the practice through its own use of it, building the Simian Army, a set of open source tools that bring disruption to cloud native systems.

In addition to ALFI, Gremlin’s platform itself can execute 11 infrastructure-based failure modes, such as packet loss, delay, and clock skew across servers, and show the impact they will have on the system as a whole. Gremlin is also the first software to package chaos-inducing tools that can be used in a container environment. In August the company expanded the toolset to include the ability to target Docker containers. Users can shut down a specific container, or starve it of CPU cycles or memory space. The software also has an auto-healing capability: when a container is flagged as unhealthy, it is replaced by a duplicate.

Gremlin recently announced that it had raised an $18 million Series B funding round led by Redpoint Ventures, money that will be used to push forward the scope of resilience engineering in the enterprise. With this influx of capital, the company plans to hire engineers to, among other things, investigate expanding failure injection to additional targets such as Kubernetes, Fornaciari said.

Feature image: Gremlin CEO Kolton Andrus.
