Facebook Engineering Takes a Bite out of Concurrency with RacerD
Research scientists and engineers from Facebook’s Android team are giving the tech world a fun, and potentially powerful, new open source toy today: RacerD.
RacerD is being introduced as a new component of Facebook Infer, the open source static analyzer platform launched by the company in 2015. Dedicated to identifying source code bugs before mobile code is shipped, Infer works by scanning programs and identifying potential production glitches like null pointer exceptions and annotation reachability in Android and Java code, and memory leaks and unavailable API’s in C/C++/Objective C.
RacerD takes Infer’s static analysis one crucial step further by targeting concurrency — meaning that developers will now be able to automate static race detection. You read that right: a free, open source, production-tested tool that vastly speeds up concurrency bug detection.
The New Stack spoke with Sam Blackshear and Peter O’Hearn, the research scientists at Facebook who teamed up with the company’s Android News Feed engineering team to incorporate this theoretical breakthrough in automated concurrency testing into a program in production — on a platform serving two billion monthly active users.
O’Hearn, incidentally, was co-recipient (with Stephen Brookes) of the 2016 Gödel Prize, awarded for outstanding research in theoretical computer science, for inventing the field of Concurrent Separation Logic.
Could you explain concurrency to those of us who have not won any prizes in theoretical computing?
Peter O’Hearn: Concurrency is the hardest problem in computer science: trying to keep stuff consistent when things become inconsistent. When two computer programs are working on the same thing at the same time, the two concurrent activities speed to access the data. Usually one will get there first. But if they get there at the same time, if they collide, things can get into inconsistent state, where things one of the programs is doing mess up things the other program is trying to do. It’s really really hard for the computer programs to stop these things from happening at the same time, the main reason being that the number of different ways that programs can interact is many, many.
Suppose I have two concurrent activities with 50 programming statements each, and suppose I could look at, or process, one instruction every nanosecond. To consider all the different orders in which these lines of code can interact would take 3.4 million years. That is why concurrency has been such a problem for so long, starting in the early 1960s: to find the problems we can’t explore all the possibilities — brute force takes too long
RacerD, at the most basic level, tries to find these problems without exploring ALL the possibilities. And it can do this in 15 minutes instead of 3.4 million years.
What led to the creation of RacerD?
O’Hearn: Before joining Facebook, I worked on concurrency as a university computer science professor for 15 yrs. Then in 2016 I was a manager at Facebook, moving to engineer, and thought, what is the hardest problem I know? Concurrency. If we can take a bite out of this problem we can make a real difference for programmers.
Sam and I were working together on an early version of RacerD — starting from the theoretical, we had built early prototypes. Facebook scientists try to work on engineering and theory at the same time. From pen and paper to computer, then back again.
Then something really interesting happened — the engineers on the Android News Feed contacted us with a problem. They wanted to take the News Feed, this huge, constantly changing codebase, and speed it up by making it concurrent. Because of race conditions, this would be a very difficult task — and they asked, could RacerD weed out these problems before we put this concurrent code into production?
We weren’t quite ready. But thought it was a great opportunity to use theoretical computer science ideas to serve working engineers. We had Computer Science, on one hand, and engineering on the other, and we worked on both problems at the same time. This gave us a feedback loop between science and engineering that let us make both of them better than they would otherwise be.
Key idea from the theory was not exploring all the possibilities. We found some quantities to track in our algorithms that would give us info about when the concurrent activities would be in conflict. Information without doing all the exploration. Everything was still difficult, but that was the breakthrough.
What did the implementation of that breakthrough look actually like?
Blackshear: The specific problem for the Android News Feed engineers was that they had to work with a bunch of code they hadn’t written. They had ideas how to speed things up using concurrency, but how to do that on an existing production code base where one small code problem can cause an avalanche of errors throughout systems. Concurrency errors are extremely difficult to debug, or even to reproduce once they’ve been observed — making them very, very hard to fix. Previously, this would have to be done by brute force — going line by line through just an unimaginable amount of code.
RacerD makes the manual part go away. You point it at the code you want to introduce concurrency to, and it tells you where the problem spots might occur. You can then add a lock at those points, which makes migration much simpler. Using RacerD a tiny team of two engineers took six months to build concurrency into the existing Android News Feed code. Had they been required to it manually, most likely they simply would not have attempted concurrency at all.
The end result was we increased speed by five percent. Even less than one percent has a noticeable effect on user experience, so five percent is pretty significant. We were pretty happy about that.
RacerD has been running in production for 10 months on our Android codebase and has caught over 1000 multi-threading issues — which get fixed by our developers before the code ever reaches production.
So, what can RacerD do for the average developer?
O’Hearn: People are scared to make their code concurrent because the bugs can be so complicated to deal with. Now developers can be less intimidated and thus make better code.
Blackshear: As a developer of concurrent code, using RacerD you can move faster and with more confidence. Concurrent bugs are the nastiest and take the longest to work out, so we are going to find these bugs for you! If RacerD finds any issues, it will auto-comment on the line of the code change that seems unsafe, enabling engineers to find it and fix it quickly.
The current release is only for Android? What comes next?
O’Hearn: Correct, Android. But we are working on RacerD for C++ right now. And we want to broaden a great deal more. Concurrency is a longstanding problem. We have taken a bite out of it here, but there is still a lot to be done in terms of automated reasoning applied to concurrency. Program verification is an area we are definitely looking at.
RacerD was intended to be open source from the very start?
Blackshear: It is a component on the Facebook Infer platform, which is open source. So yes, RacerD was always intended to be open source, and we are happy about that.
O’Hearn: One of the main reasons we open source is to collaborate with the community. There are lots and lots of smart people in the world, and problems are easier when we all work together.