Testing is often the most arduous part of development — and something programmers are more and more responsible for in the world of DevOps and individual code ownership. As pieces of code become increasingly smaller and more distributed, there’s also a greater need to invest in automation, particularly testing automation. If developers are tearing off their code monkey labels to finally be recognized as the creative workers they are, testing automation is a great way to get rid of some of the most mundane and repetitive tasks.
The problem is that testing doesn’t always scale. The more people who work on and release your code, the greater the risk that it fails. Now imagine the amplified bug risk of more than a million source control commands running per day or more than 100,000 commits made per week. Then factor in the nearly limitless clicks, swipes and other user behaviors that can occur on an app with more than two billion users. With numbers like these, Facebook is a perfect proving ground for scaled testing automation.
Sapienz is an application of search-based software engineering (SBSE) principles to automate testing at scale at Facebook (and beyond, once it’s open source). Sapienz applies search techniques to automatically discover test sequences and then notifies the developer of potential bugs. And now it’s moving toward suggesting the best fixes for the faults it finds.
It all raises the question: what would developers be able to accomplish if they didn’t have to worry about the boringly necessary that’s unnecessarily boring?
The Less-Than-Ideal State of Software Testing
“I have to admit that it’s a slow process and it’s a painful process,” Mark Harman, Facebook engineering manager, told the room at the Facebook TAV Symposium, speaking about current industry practices in software testing. “And if we’re really honest with ourselves, it’s really unusual that a software engineer will wake up in the morning and say ‘I can’t wait to get to work so I can test my software’.”
The problem, as Harman says, is that testing then gets regarded as “implicitly unimportant,” so it is often put off until an urgent bug or crash occurs. With complex, distributed codebases where everyone is empowered to contribute, this is risky behavior at best.
The current state of practice in the software industry is that engineers design test cases and machines then execute them. In the academic circles Harman hailed from before joining Facebook, the machines actually design the tests as well. The latter is appealing to application-based companies with massive user bases like Facebook because user behavior around app usage is so varied that there could be a limitless number of tests for each potential update.
Harman explained that “tests essentially are part of an enormous search base — far too large to enumerate. Therefore our task is to try to find intelligent computational search techniques that will find those test sequences that are more likely to reveal faults, if there are faults.”
This is search-based, end-to-end, system-level testing, which treats the whole system as the input and avoids a lot of false positives, yielding what Harman called “friction-free fault-finding.”
The team at Facebook has developed Sapienz for system-level testing that is supposed to fall somewhere between random fuzzing — introducing random or unexpected data into the software — and intelligent human design.
Sapienz now supports many of the Facebook apps running on Android, including Messenger, Facebook Lite, Instagram, and Workplace, and is running, at any given time, hundreds of emulators per app.
Fitness Functions in Search-Based Software Testing for Intelligent Design
When Sapienz tests Facebook apps, the inputs are anything a user can do — swiping, clicking, etc. — and the default is defined to follow the implicit guideline: the app shouldn’t crash.
Harman said that even without artificial intelligence added to this mix, this sort of random search is “surprisingly effective” in avoiding crashes, as humans naturally test a skewed part of the search space, often ignoring corner cases.
Remember there are 2.1 billion users of these Facebook apps, so even a tiny, improbable edge case is usually significant.
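To make the idea of random search against a "shouldn’t crash" rule concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the action names, the `ToyApp` class and its planted corner case are hypothetical, and a real tool would drive an emulator rather than a Python object.

```python
import random

# Hypothetical UI actions; a real tool would send events to an emulator.
ACTIONS = ["tap_feed", "open_menu", "swipe_left", "rotate", "tap_back"]


class ToyApp:
    """Stand-in for an app under test; it crashes on one rare corner case."""

    def __init__(self):
        self.menu_open = False
        self.rotated = False

    def perform(self, action):
        if action == "open_menu":
            self.menu_open = True
        elif action == "rotate":
            self.rotated = not self.rotated
        elif action == "tap_back":
            self.menu_open = False
        elif action == "swipe_left" and self.menu_open and self.rotated:
            # The corner case a human tester might never think to try.
            raise RuntimeError("crash: swipe with menu open while rotated")


def run_sequence(sequence):
    """Replay a sequence on a fresh app; return the crashing step index or None."""
    app = ToyApp()
    for step, action in enumerate(sequence):
        try:
            app.perform(action)
        except RuntimeError:
            return step
    return None


def random_search(trials, length, seed=0):
    """Generate random sequences and keep the ones that make the app crash."""
    rng = random.Random(seed)
    crashing = []
    for _ in range(trials):
        sequence = [rng.choice(ACTIONS) for _ in range(length)]
        if run_sequence(sequence) is not None:
            crashing.append(sequence)
    return crashing
```

Calling `random_search(trials=200, length=8)` returns the sequences that hit the planted crash. Nothing guides the search here; it only samples, which is the baseline the fitness-guided approach improves on.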
So is this randomness really intelligent? One of the key ingredients in search-based testing is that each test case is evaluated for fitness in order to intelligently guide the search.
API strategist Rob Zazueta explained that a fitness function is “written to compare the output of a genetic algorithm to the desired ideal and return some value that indicates how close it is. A smart fitness function is one that is either well written… or potentially one that can adapt as needs adapt.”
When you have one solution that’s found to be “fitter” than another, Harman says that Sapienz not only drives up test coverage, but also drives down the sequence length so that it’s easier for the developer to pinpoint the fault.
“If we can imbue our search with the domain knowledge and expertise with smart fitness functions and then use those with good algorithms for search-based optimization, then that will produce a process where things that will result will be well-designed,” he explained.
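One common way to compare test sequences on several goals at once is Pareto dominance, sketched below. This is an illustration under stated assumptions, not Facebook’s implementation: the three objectives, the `Fitness` fields and the sample numbers are all hypothetical, chosen to mirror the "more coverage, shorter sequences" trade-off described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fitness:
    """Illustrative objectives for one test sequence (field names are assumptions)."""
    coverage: float  # fraction of the app exercised; higher is better
    crashes: int     # distinct crashes revealed; higher is better
    length: int      # number of UI events; lower is better


def dominates(a: Fitness, b: Fitness) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    no_worse = (a.coverage >= b.coverage and a.crashes >= b.crashes
                and a.length <= b.length)
    better = (a.coverage > b.coverage or a.crashes > b.crashes
              or a.length < b.length)
    return no_worse and better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up scores for three test sequences.
long_test = Fitness(coverage=0.40, crashes=1, length=200)
short_test = Fitness(coverage=0.40, crashes=1, length=35)
broad_test = Fitness(coverage=0.55, crashes=0, length=120)

front = pareto_front([long_test, short_test, broad_test])
```

Here `short_test` dominates `long_test` because it finds the same crash with the same coverage in fewer steps, so only `short_test` and `broad_test` stay on the front. A search that keeps the front and discards the rest is pushed toward shorter, easier-to-debug sequences without giving up coverage.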
This process is also successful because human-written test cases are very brittle, breaking as soon as the GUI changes. With Sapienz automatically seeding good test sequences and subsequences, it’s far cheaper and there’s no need to create a store of test cases.
This also allows Facebook to test in production, only excluding the real users from the test cases, meaning the test is as accurate to reality as possible.
“In software, we have the one unique engineering material that allows us to do optimization on the material itself, not on a simulation of the model,” Harman said.
The test results are for the true artifact, not for a simulation of it, as is the case for aeronautical engineers and the like.
Communicating Bugs with Devs and, Eventually, Fixing Them
Of course, all of this is null and void unless you can notify the engineers about their bugs so that they can act on them. Facebook uses the framework FB Learner for its machine learning infrastructure and workflow automation. The recurring operator will grab the latest version of the application, build and run the test search, find crashes, apply a rule-based approach to identify the line in the DIFF and localize it, and finally report it in the review system to the developer. This is already running at full scale. Every single Facebook Android app DIFF submitted by developers, along with many from other apps, is tested by a selection of Sapienz test cases.
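The article does not say which rules are used to tie a crash to a line in the DIFF. As a hedged sketch of what a rule-based localizer could look like, the example below assumes one simple rule: report the topmost stack frame that lands on a line the DIFF changed. The function name, data shapes, file paths and line numbers are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Frame = Tuple[str, int]  # (file path, line number)


def localize_crash(stack_trace: List[Frame],
                   changed_lines: Dict[str, Set[int]]) -> Optional[Frame]:
    """Return the topmost stack frame that falls on a line the diff changed."""
    for path, line in stack_trace:  # frames ordered from the crash site outward
        if line in changed_lines.get(path, set()):
            return (path, line)
    return None  # the crash does not touch this diff, so nothing is reported


# Hypothetical crash trace and diff, for illustration only.
trace = [("lib/Json.java", 88), ("feed/StoryView.java", 142), ("app/Main.java", 17)]
diff = {"feed/StoryView.java": {140, 141, 142}, "feed/Story.java": {9}}

suspect = localize_crash(trace, diff)
```

With this input, `suspect` is `("feed/StoryView.java", 142)`: the crash surfaced in library code, but the first frame that overlaps the change is the one worth showing to the developer in review.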
Of course, the true dream is automatic fixes, where machine learning and search-based testing software not only find the bugs but fix them, too. That’s the next step of the Sapienz vision.
Harman asks, “If it’s hard to automatically design a test case to reveal a fault, how much harder would it be to search for small changes to the code that would fix that fault?”
In the academic space, this is called “automated fault repair.” This involves using search-based techniques to generate candidate solutions, on which you run your test cases again, to see which is the right fix. It even checks for regressions and finds near-neighbor solutions for your existing software.
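The generate-and-validate loop described here can be sketched in a few lines. This is a toy under stated assumptions: the buggy function, the two hand-written candidate patches and the two tests are hypothetical stand-ins, whereas a real repair system would generate candidates automatically and run a full test suite.

```python
def buggy_title_length(story):
    return len(story["title"])  # raises TypeError when the title is None


# Candidate patches; hand-written here, generated by search in a real system.
def candidate_a(story):
    return 0  # stops the crash but breaks existing behavior


def candidate_b(story):
    title = story["title"]
    return len(title) if title is not None else 0


def crash_test(fn):
    """The test that revealed the fault: a story with no title."""
    return fn({"title": None}) == 0


def regression_test(fn):
    """Existing behavior that any fix must preserve."""
    return fn({"title": "hello"}) == 5


def passes(fn, tests):
    """A candidate passes only if every test returns True without raising."""
    for test in tests:
        try:
            if not test(fn):
                return False
        except Exception:
            return False
    return True


def validate(candidates, tests):
    """Rerun all tests on each candidate and keep the ones that pass."""
    return [fn for fn in candidates if passes(fn, tests)]


tests = [crash_test, regression_test]
survivors = validate([buggy_title_length, candidate_a, candidate_b], tests)
```

Only `candidate_b` survives. The original code fails the crash test, and `candidate_a` silences the crash but fails the regression test, which is why the regression check matters as much as the crash check.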
Harman explained that “If you take an existing system and you try to tweak it and improve it, like fixing a bug, that’s like trying to evolve humans from apes” versus evolving them from amino acids. Starting from no code (or amino acids for human evolution) works in theory, but simply takes too long in practice, compared to starting with something much closer to what we want and need.
Recently downloaded Messenger, Instagram and other Facebook apps running on Android are now built with software that has been automatically repaired “using search on test cases using crashes that were automatically designed using search-based software testing,” he said. “An end-to-end process that was completely automated up to the point where the patch that was found was suggested to the developer, and then the developer is the final gatekeeper to say, ‘yes, that will go into the codebase.’”
This includes not only deleting bad stuff but doing partial and full repairs of DIFFs because “your testing is broken until that insta-crash is fixed.”
At Facebook, DIFFs are being released every couple of seconds, so doing a full revert of one DIFF can cause a chain reaction in others. There are also templates using Getafix, with the idea to learn from past human-designed fixes. When templates don’t fix the bug, mutations create candidate patches, which are retested through Sapienz before the winner is sent to the developer for approval.
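The ordering described above, templates first and mutations as a fallback, can be sketched as a small control loop. This is a sketch with assumptions throughout: patches are plain labels, `retest` is a hypothetical callback standing in for a rerun of the crashing test, and taking the first surviving patch as the "winner" is a simplification.

```python
from typing import Callable, Iterable, Optional

Patch = str  # a patch is just a label here; real patches would be code changes


def repair(template_patches: Iterable[Patch],
           mutation_patches: Iterable[Patch],
           retest: Callable[[Patch], bool]) -> Optional[Patch]:
    """Try template-derived patches first, then fall back to mutations."""
    for source in (template_patches, mutation_patches):
        for patch in source:
            if retest(patch):  # the patched build survives the crashing test
                return patch   # proposed to the developer, who has the final say
    return None                # no candidate survived, so nothing is proposed


# Hypothetical candidates and retest outcome, for illustration only.
templates = ["add_null_guard", "return_early_on_null"]
mutations = ["delete_statement", "swap_condition"]
passing = {"return_early_on_null", "swap_condition"}

winner = repair(templates, mutations, lambda patch: patch in passing)
```

Here `winner` is `"return_early_on_null"`, because a template fix passed before any mutation was tried. If no template had passed, the loop would have moved on to the mutation candidates.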
“The idea is that our tests have to keep on adapting to stop being brittle. Human tests can’t do this, but an automated tester can keep throwing away tests that no longer run because the GUI’s changed, and grow new ones.” — Mark Harman, Facebook.
This auto-fixing system isn’t a perfect tool yet, but more of an in-use prototype geared toward proving to developers that this can work and seeing how they respond to changes in their workflows and their loss of control.
It’s also not working at scale yet because they are only applying it to null-pointer crashes. These are certainly what Harman calls the worst automated testing criminal, but also the simplest to find and fix.
For this, the Sapienz team is reaching out to the scientific community and is publishing the automated fixing work in the peer-reviewed scientific literature at the International Conference on Software Engineering in Montreal in May 2019. The prototype version of the Sapienz test design system is also available as open source, and the goal is for the full product to eventually be open source.
The Facebook team wants to scale up the technology and to build an automated infrastructure for best practices of experimentation, which will then be open source as well.
Finally, an audience member of the TAV Symposium inquired if there is even a future for human testers.
“Automation is taking away the legwork soon,” Harman said. It’s “automating the tedious legwork that engineers do. I do think it’ll move humans up the abstraction chain just as high-level languages [did].”