EBay Explores Chaos Fault Testing at the Application Level

Online shopping giant eBay has gotten religion on chaos engineering, but the company’s engineers are trying a novel approach: break applications, rather than the infrastructure, to see where things can go wrong.
EBay’s notification platform is the system that lets buyers know about prices, stock status, and payment status. It’s pretty important and also requires more than a few external dependencies to perform its core functionality. Identifying and troubleshooting points of failure was incredibly important for eBay’s engineers but breaking the system was not ideal to say the least. The blog post written by eBay engineer Wei Chen detailed how, by using a Java agent and creating a simulated environment, eBay was able to break everything without actually breaking anything.
EBay created a simulated testing environment to perform application-level fault testing. The testing covered three patterns: blocking method logic, changing a method parameter’s state, and replacing a method parameter’s value. The application-level fault injection was so successful that eBay plans to expand these practices further.
Infrastructure Level vs. Application Level Fault Injection
Fault injection is the process of deliberately injecting faults into a system to observe behavior and identify weaknesses.
Turning off the network or shutting down downstream services to introduce an HTTP disconnection or a timeout error is infrastructure-level fault injection. Similarly, filling a disk to create a disk-full error is infrastructure-level fault injection. This approach either disrupts every resource that dependent services share or increases costs by requiring dedicated resources.
But what if these faults were injected at the APIs rather than the infrastructure? This looks like adding latency to the HTTP client library or simulating a 500 response code for an internal service error. In the proper environment, this kind of fault injection provides an affordable, secure way of performing fault injections without causing harm to the underlying resources.
The Architecture Overview
EBay’s application-level fault injection took place in a simulated environment and was separated from the services themselves. Since eBay is a Java-based platform, the company’s engineers used a Java agent. Inside the Java agent, the class files of client libraries for dependent services were instrumented to introduce different faults at the API level.
The main goal was to force the invoked methods to experience failure. eBay’s project includes three instrumentation patterns:
Block or Interrupt Method Logic
This instrumentation logic is straightforward — the API can throw exceptions or sleep for a specific amount of time to simulate the error or timeout.
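As a rough illustration of this first pattern, the sketch below wraps a call so that it either sleeps or throws before the real logic runs. It is not eBay's code: the class `BlockingFault`, its `Mode` enum, and `InjectedFaultException` are hypothetical names, and a plain wrapper stands in for logic that the agent would inject into the client library's bytecode.

```java
import java.util.function.Supplier;

// Hypothetical exception type used to mark a simulated failure.
class InjectedFaultException extends RuntimeException {
    InjectedFaultException(String message) { super(message); }
}

// Hypothetical helper: runs an optional fault before the real call.
final class BlockingFault {
    enum Mode { NONE, DELAY, EXCEPTION }

    private final Mode mode;
    private final long delayMillis;

    BlockingFault(Mode mode, long delayMillis) {
        this.mode = mode;
        this.delayMillis = delayMillis;
    }

    <T> T invoke(Supplier<T> realCall) {
        switch (mode) {
            case DELAY:
                try {
                    Thread.sleep(delayMillis); // simulate a slow dependency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    throw new InjectedFaultException("interrupted during injected delay");
                }
                break;
            case EXCEPTION:
                // simulate a failing dependency; the real call never runs
                throw new InjectedFaultException("injected fault");
            default:
                break;
        }
        return realCall.get();
    }
}
```

With `Mode.DELAY` and a delay longer than the caller's timeout, the caller should see its own timeout handling fire; with `Mode.EXCEPTION`, its error handling is exercised instead.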
Change the Method Parameter’s State
This applies to circumstances where fault simulation depends on the state of input parameters. In the example below, if the response.getStatusCode() value doesn’t equal 200, failure logic will trigger. To simulate a fault with a failure code, eBay needed a way to change the value that response.getStatusCode() returns.
eBay added the code snippet below to achieve the desired results. Adding a try-catch block specifically caught the thrown exception and returned the code in the catch block. The execution path was changed and the desired results were returned.
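The original snippets are not reproduced here, so the following is only a minimal sketch of the idea under stated assumptions. `Response`, `FaultyResponse`, `InjectedFault`, and `NotificationSender` are hypothetical names, and a subclass stands in for what the agent would achieve by rewriting the method body: an exception is thrown on the normal path, and the catch block returns the failure code.

```java
// Hypothetical stand-in for a client library's response type.
class Response {
    private final int statusCode;
    Response(int statusCode) { this.statusCode = statusCode; }
    int getStatusCode() { return statusCode; }
}

// Hypothetical marker exception for the injected failure.
class InjectedFault extends RuntimeException {}

// Roughly what getStatusCode() could look like after instrumentation.
class FaultyResponse extends Response {
    private final int injectedCode; // 0 means "no fault configured"

    FaultyResponse(int realCode, int injectedCode) {
        super(realCode);
        this.injectedCode = injectedCode;
    }

    @Override
    int getStatusCode() {
        try {
            if (injectedCode > 0) {
                throw new InjectedFault(); // abort the normal path
            }
            return super.getStatusCode();
        } catch (InjectedFault e) {
            return injectedCode; // changed execution path returns the failure code
        }
    }
}

// Hypothetical caller whose failure logic depends on the response state.
class NotificationSender {
    static String handle(Response response) {
        if (response.getStatusCode() != 200) {
            return "retry";
        }
        return "delivered";
    }
}
```

Calling `NotificationSender.handle(new FaultyResponse(200, 500))` takes the failure branch even though the underlying call succeeded, which is the point of changing the parameter's state rather than the network.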
Replace the Method Parameter’s Value
There are also times when the parameter value determines the method logic, so the input parameter must be manipulated. This one is a bit tricky because the input value must be known before fault injection, but this value isn’t revealed until runtime. eBay leveraged Java reflection to get parameter names at runtime.
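For readers unfamiliar with this part of reflection, here is a small sketch using the standard `java.lang.reflect.Parameter` API. The `PushClient` class and `ParameterInspector` helper are hypothetical and are not taken from eBay's agent. Note that `Parameter.getName()` only returns the source-level names when the class was compiled with `javac -parameters`; otherwise it yields synthesized names such as `arg0`.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical target class standing in for a client library method.
class PushClient {
    void send(String endpoint, int timeoutMillis) {
        // real client logic would go here
    }
}

// Hypothetical helper that inspects method parameters at runtime.
class ParameterInspector {
    // Finds the first declared method with the given name.
    static Method find(Class<?> type, String name) {
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals(name)) {
                return m;
            }
        }
        throw new IllegalArgumentException("no method named " + name);
    }

    // Describes each parameter as "<simple type> <name>".
    static List<String> describe(Method method) {
        List<String> out = new ArrayList<>();
        for (Parameter p : method.getParameters()) {
            out.add(p.getType().getSimpleName() + " " + p.getName());
        }
        return out;
    }
}
```

Once the parameter names and types are known, an injector can decide which argument to overwrite before the original method body runs.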
Creating a Simulated Environment
The simulated testing environment exists inside of a Java agent. EBay engineers implemented a class loader to instrument the code of the methods used by the application code. They also created an annotation to indicate which method will be fault tested and to carry the logic to inject. The image below illustrates the annotations:
The above code snippet provides instrumentation logic for org.asynchttpclient.providers.netty.future.NettyResponseFuture.done(). A new method is created with the same signature and annotated with @Enforce, which is the user-defined Java annotation indicating instrumentation logic for fault injection. The annotation has two fields, value and type. The value field is the class name of the method. When the agent is loaded, the defined class loader will find all methods annotated with @Enforce and inject the logic.
The type field of @Enforce has two values, default and static. The above example injects Java code but some cases may require a string literal. That example is below.
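Since the original snippets are not reproduced here, the sketch below only approximates what such an annotation and its discovery step might look like. The field names `value` and `type` come from the description above; modelling `type` as a `String`, and the `PushFaults` and `EnforceScanner` classes, are assumptions made for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Assumed shape of the annotation; kept at runtime so it can be scanned.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Enforce {
    String value();                  // class name that owns the target method
    String type() default "default"; // "default" or "static" (String is an assumption)
}

// Hypothetical holder of fault logic; method names mirror the target methods.
class PushFaults {
    @Enforce("org.asynchttpclient.providers.netty.future.NettyResponseFuture")
    public void done() {
        // fault logic to inject into the target's done() would go here
    }

    public void helper() {
        // not annotated, so the scanner ignores it
    }
}

// Hypothetical scanner: lists every method annotated with @Enforce.
class EnforceScanner {
    static List<String> scan(Class<?> holder) {
        List<String> found = new ArrayList<>();
        for (Method m : holder.getDeclaredMethods()) {
            Enforce e = m.getAnnotation(Enforce.class);
            if (e != null) {
                found.add(e.value() + "#" + m.getName() + " [" + e.type() + "]");
            }
        }
        return found;
    }
}
```

The scanner's output gives the loader a list of target class and method pairs, which is the information it needs before rewriting any bytecode.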
Customized Class Loader
After the logic is in the Java agent, it still needs to get to the target methods of the client libraries. For that, a customized class loader is needed. The class loader leverages Javassist, a bytecode instrumentation library, to manipulate the Java bytecode and transform the class files of the target methods to include the defined faults.
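One common way for a Java agent to reach class files as they load is the `ClassFileTransformer` hook in `java.lang.instrument`. Whether eBay's customized class loader uses this hook is an assumption; the sketch below shows only the interception step, with the actual bytecode rewriting (done with Javassist in eBay's case) left as a pluggable function. `FaultTransformer`, `FaultAgent`, and the target class name are hypothetical.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Collections;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical transformer: rewrites only the classes named in `targets`.
class FaultTransformer implements ClassFileTransformer {
    private final Set<String> targets;            // internal names, e.g. "com/example/PushClient"
    private final UnaryOperator<byte[]> rewriter; // placeholder for bytecode rewriting

    FaultTransformer(Set<String> targets, UnaryOperator<byte[]> rewriter) {
        this.targets = targets;
        this.rewriter = rewriter;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (className == null || !targets.contains(className)) {
            return null; // null tells the JVM to leave the class unchanged
        }
        return rewriter.apply(classfileBuffer);
    }
}

// Hypothetical agent entry point; requires a Premain-Class manifest entry.
class FaultAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        Set<String> targets = Collections.singleton("com/example/PushClient");
        // Identity rewriter as a placeholder: a real agent would inject faults here.
        inst.addTransformer(new FaultTransformer(targets, bytes -> bytes));
    }
}
```

Returning `null` for every class outside the target set keeps the agent's footprint small, since only the client libraries under test are ever modified.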
The above implementation describes fault injection for the client libraries for the three resources below:
- Push Notification Endpoints:
  - Client lib: async-http-client 1.8.3
  - Fault types:
    - Timeout
    - Exception
    - Response status code
- Message Queue:
  - Client lib: kafka-client 2.5.1
  - Fault types:
    - Timeout
    - Exception
- Distributed Store (built in-house by eBay):
  - Client lib: monster-java-client 3.4.4.2-RELEASE
  - Fault types:
    - Timeout
    - Exception
Configuration Management
eBay implemented a configuration management tool in the Java agent to change the fault injection configuration dynamically at runtime. Instrumenting javax.servlet.http.HttpServlet.service(HttpServletRequest, HttpServletResponse) exposes the endpoints for the configuration management.
The endpoint renders a configuration management page that lets developers configure attributes of the fault injection at runtime. With this tool, a developer can globally enable or disable fault injection, as well as individual fault subtypes, such as a timeout for AsyncHttpClient.
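The state behind such a page could be as simple as a global switch plus per-fault flags that injected code consults on every call. The sketch below is an assumption about that shape, not eBay's implementation; the `FaultConfig` class and key names such as `"AsyncHttpClient.timeout"` are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical in-memory fault configuration, safe to update from a
// servlet thread while instrumented methods read it concurrently.
class FaultConfig {
    private final AtomicBoolean globalEnabled = new AtomicBoolean(false);
    private final Map<String, Boolean> faults = new ConcurrentHashMap<>();

    // Master switch: when off, no fault fires regardless of per-fault flags.
    void setGlobalEnabled(boolean enabled) {
        globalEnabled.set(enabled);
    }

    // Per-fault switch, keyed by a name such as "AsyncHttpClient.timeout".
    void setFault(String key, boolean enabled) {
        faults.put(key, enabled);
    }

    // Injected code would call this before simulating a fault.
    boolean isActive(String key) {
        return globalEnabled.get() && faults.getOrDefault(key, false);
    }
}
```

Because this state lives in a single JVM, a change made through one instance's page affects only that instance.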
What’s Next?
EBay plans to extend application-level fault injection to more client libraries and fault categories, diversifying the experiment scenarios for its services under different circumstances. Meanwhile, because fault settings changed through the configuration management console currently apply only at the instance level, the engineers will look for a better way to broadcast changes across the cluster.