Kairux: Identifying and Debugging Root Causes
Anyone who’s worked with distributed systems knows there are far too many opportunities for something to go wrong. Pinpointing the exact spot in the source code is like looking for a needle in a haystack.
Kairux is a fault localization tool for Java-based systems, one that uses unit testing and adaptive dynamic slicing to work backward from the failure execution to identify what went wrong and why.
By implementing the Inflection Point Hypothesis — that the root cause of any failure is also the step where a failure execution and successful execution diverge — Kairux can identify the bug and present the source code needed to understand why it happened.
About 77% of failures in distributed systems are caused by more than just one thing. There’s no perfect solution, and picking the “right” root cause is subjective. While Kairux can’t guarantee that the “right” root cause is selected, it can identify the last root case so by continuing to apply Kairux repeatedly, multiple bugs in an execution could be identified.
Kairux first reduces the potential number of successful threads to compare against, then returns the one with the longest common prefix to the failure thread.
This research project was highlighted this month by Usenix, though the authors of the project have not indicated whether they plan to open source the code, which works for Java code, or embark on a commercial implementation.
What’s the Inflection Point Hypothesis?
The Inflection Point Hypothesis is simple and straightforward. The inflection point is the root cause of an issue, the first instruction where the failure execution diverges from the non-failure execution. So, the best way to pinpoint the inflection point is to find the successful execution with the longest common prefix to the failure execution.
The inflection point holds significance as the last step in the failure execution where failure is still avoidable. By using this hypothesis as the guiding principle, fault localization becomes a principled search problem.
Kairux Automates the Inflection Point Hypothesis
Kairux is a set of algorithms for Java-based distributed systems that abstracts away all the heavy lifting. It takes the following inputs:
- Steps needed to reproduce the bug, usually packaged in a unit test.
- The failure symptom itself.
- The source code.
- All unit tests.
In order to output:
- The inflection point.
- The instruction sequence with the longest common prefix.
- The steps needed to reproduce the instruction sequence with the longest common prefix.
The “why” and “how” of the bug can be understood by comparing the failed execution’s source code with the longest successful source code. The “where” is more readily located by minimizing the number of potential sources. Kairux reduces the potentially infinite number of combinations with the following key concepts:
- Adaptive dynamic slicing: Only sequences and source code related to the bug are considered. All remaining target instruction sequences are separated into subsequences in different threads with the thread containing the bug processed first before adaptively extending the analysis to other threads.
- Unit test utilization: Kairux only considers successful sequences included in existing unit tests’ and that list is further minimized by prioritizing the code most similar to the failure execution first.
- Valid execution modification: To make the best use of the test-provided sequences, Kairux attempts to modify the test’s input parameters to reduce any divergences from the failure thread when they arise. When the updated parameters no longer lead to valid executions, the attempts end.
Greater detail about Kairux’s algorithms can be found in the technical paper.
Architecture and Implementation of Kairux
Kairux uses static analysis over dynamic analysis due to the high overhead costs associated with dynamic analysis. The static slice is a super-set of the instructions that belong in the dynamic slice of any failure execution.
Kairux builds the dynamic slice it needs by setting breakpoints, then reproducing the failure execution. Each breakpoint it’s able to hit is recorded and a trace is obtained, followed by a dependency analysis, and then annotating network communication libraries. From this, a dynamic slice is obtained.
Additional work includes the use of breakpoints to enforce different thread scheduling and assigning unique tags to differentiate different runtime instances of a source code object and track data flow.
The End Result
The short answer: it works. The root causes for failures were located 70% of the time. The number of sequences examined were reduced from a possible infinity to 0.2% in the process. Failures caused by “missing events,” something that should have happened but didn’t, and failures caused by anomalous events were both explained.
Kairux was evaluated on randomly sampled, real-world failures from HBase, HDFS and ZooKeeper.