The old joke about predictive analytics is that it is great at predicting the past. But one Princeton graduate student recently explored whether data science can really predict the future. Using nothing but data from Yelp, Ph.D. candidate Michail Alifierakis claims to have developed an algorithm that correctly guesses whether a restaurant will close during the next four years — with 91 percent accuracy.
Are Stars Meaningless?
The first surprise? You can’t guess which restaurants will close from their average star ratings. The Yelp star rating distributions look very similar for open and closed restaurants. So Alifierakis decided to test some alternate predictors of success. For example:
- Is the restaurant part of a chain?
- How many other restaurants are nearby (within one mile)?
- How does it compare to nearby restaurants (based on price, but also its average rating — and the number of reviews)?
- How old is the restaurant? (Estimated by the date of its first Yelp review.)
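Counting nearby competitors, as in the second question above, comes down to a distance calculation. Here is a minimal Python sketch, assuming each restaurant is reduced to a latitude/longitude pair. The `count_nearby` helper and the coordinates are hypothetical, and the article does not say how distances were actually computed:

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def count_nearby(target, others, radius_miles=1.0):
    """Count the restaurants in `others` within `radius_miles` of `target`."""
    return sum(
        1 for lat, lon in others
        if haversine_miles(target[0], target[1], lat, lon) <= radius_miles
    )

# Hypothetical coordinates in the Phoenix area (not taken from the dataset).
target = (33.4484, -112.0740)
others = [
    (33.4584, -112.0740),  # roughly 0.7 miles north
    (33.4684, -112.0740),  # roughly 1.4 miles north
    (33.4490, -112.0800),  # a few blocks west
]
nearby = count_nearby(target, others)
```

With these made-up points, two of the three neighbors fall inside the one-mile radius.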
Yet as it turned out, building more complicated models on these factors didn’t improve the results, so instead Alifierakis tried optimizing the parameters that he already had. A linear logistic regression model posits a limited set of outcomes (say, open or closed) and tries to estimate their probability using related (or “independent”) variables.
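To make the idea concrete, here is a minimal Python sketch of how a logistic regression turns a weighted sum of variables into a probability. The feature names, weights and bias are hypothetical values chosen for illustration, not the ones Alifierakis fitted:

```python
import math

def open_probability(features, weights, bias):
    """Logistic regression: squash a weighted sum into a 0-to-1 probability."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights (assumptions, not values from the study).
weights = {"is_chain": 1.2, "reviews_vs_neighbors": 0.8, "claimed_page": 0.5}

# A hypothetical chain restaurant with 1.4x the reviews of its neighbors.
restaurant = {"is_chain": 1, "reviews_vs_neighbors": 1.4, "claimed_page": 1}

p_open = open_probability(restaurant, weights, bias=-1.0)
predicted_label = "open" if p_open >= 0.5 else "closed"
```

Training a real model means choosing the weights and bias so that these probabilities match the observed outcomes as closely as possible.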
Alifierakis did this by making a grid of the possible parameter combinations and carefully calibrating the weight given to each variable. He ultimately achieved what he claims is a precision of 91 percent when predicting open restaurants: of the restaurants the model predicted would stay open, 91 percent actually did.
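The article does not spell out what went into that grid, so the following is only a generic sketch of a grid search in Python: try every combination of candidate weights and keep the one that gives the best precision for "open" predictions. The toy data, the candidate values and the function names are all assumptions made for illustration:

```python
import itertools
import math

# Hypothetical toy data: (is_chain, reviews relative to neighbors, stayed_open)
DATA = [
    (1, 1.8, 1), (1, 0.9, 1), (0, 1.5, 1), (0, 1.2, 1),
    (0, 0.4, 0), (0, 0.7, 1), (1, 0.3, 0), (0, 0.2, 0),
]

def probability_open(is_chain, rel_reviews, w_chain, w_reviews, bias):
    """Logistic model with two features."""
    z = w_chain * is_chain + w_reviews * rel_reviews + bias
    return 1.0 / (1.0 + math.exp(-z))

def precision_open(params, threshold=0.5):
    """Share of restaurants predicted open that actually stayed open."""
    w_chain, w_reviews, bias = params
    labels_of_predicted_open = [
        label for is_chain, rel_reviews, label in DATA
        if probability_open(is_chain, rel_reviews, w_chain, w_reviews, bias) >= threshold
    ]
    if not labels_of_predicted_open:
        return 0.0  # no "open" predictions at all: treat as useless
    return sum(labels_of_predicted_open) / len(labels_of_predicted_open)

# The grid: every combination of candidate values for the three parameters.
grid = itertools.product([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], [-2.0, -1.0, 0.0])
best_params = max(grid, key=precision_open)
best_precision = precision_open(best_params)
```

A real search would score each combination on held-out data rather than the data used for fitting, and would also check how many restaurants end up predicted open, since a model can reach high precision by approving almost no one.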
For instance, a bank using the model to determine which loans to give out would potentially have a four-year default rate of 9 percent (the share of predicted-open restaurants that actually closed), while a bank that gave loans to all restaurants would have a four-year default rate of about 23 percent.
Interestingly, it’s much harder to predict which restaurants will close. “Among the restaurants that are predicted as closed, only 36 percent of them actually ended up closing in a four-year period,” he wrote. But Alifierakis argues that investors care more about which restaurants will stay open, and that’s the use case he’s optimizing for.
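The percentages quoted above are all precision figures, and a short Python sketch shows how they relate. The counts below are illustrative only: they were picked to be consistent with the reported 91 percent, 36 percent and 23 percent figures on a test set of roughly 665 restaurants, and are not the study's actual confusion matrix:

```python
def precision(correct, incorrect):
    """Fraction of predictions for one class that turned out to be right."""
    return correct / (correct + incorrect)

# Illustrative counts (assumptions, not the study's real results).
predicted_open_and_open = 291
predicted_open_but_closed = 29
predicted_closed_and_closed = 124
predicted_closed_but_open = 221

total = (predicted_open_and_open + predicted_open_but_closed
         + predicted_closed_and_closed + predicted_closed_but_open)

precision_open = precision(predicted_open_and_open, predicted_open_but_closed)
precision_closed = precision(predicted_closed_and_closed, predicted_closed_but_open)

# A lender funding only predicted-open restaurants sees the complement
# of the "open" precision as its closure (default) rate.
default_rate_with_model = 1 - precision_open

# A lender funding every restaurant sees the overall closure rate.
default_rate_lend_to_all = (
    (predicted_open_but_closed + predicted_closed_and_closed) / total
)
```

With these counts the "open" precision rounds to 0.91, the "closed" precision to 0.36, and the two default rates to 0.09 and 0.23.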
So What Determines Success?
One interesting wrinkle: In the end, Alifierakis did not use any information from Yelp reviews, only metadata, such as the number of reviews and date of the first review. So what does determine success for a restaurant?
Here are the top four features for predicting if a restaurant stays open:
- Is it part of a chain? (“This is not surprising as restaurant chains usually operate at a higher profit margin than individual restaurants,” he noted)
- Does it have more reviews than its nearby competitors? (“A large number of reviews is an indication of higher traffic in restaurants but it is also a reason to appear higher in Yelp search results, which by itself can drive more traffic.”)
- Has the restaurant’s owner “claimed” their page on Yelp?
- The number of reviews per week. This is actually more important than the total number of reviews.
Things like average star rating and price also show some correlation — but not as much as the factors above.
And what are the top predictors that a restaurant will close?
- The number of nearby restaurants competing with it. [Though interestingly, having several similar restaurants nearby makes a restaurant more likely to stay open.]
- Whether it’s getting more reviews per week than its competitors (relative to those competitors). Though strangely, more reviews per week (on average) make a restaurant more likely to close. Since this number is calculated by dividing total reviews by the number of weeks a restaurant has been open, it’s strongly affected by a restaurant’s age, which obviously correlates with a restaurant’s likelihood of remaining open.
- Reactions to reviews per week.
The standard deviation of its average star rating, as well as the median, also showed some correlation, but not as much as the factors above. It’s interesting to note that price seems more predictive of whether a restaurant will close than whether it remains open.
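The age effect in the reviews-per-week figure is easy to see in a small Python sketch. It assumes, as the article describes, that a restaurant's age is estimated from the date of its first Yelp review. The function name and the sample dates are hypothetical:

```python
from datetime import date

def reviews_per_week(total_reviews, first_review, as_of):
    """Total reviews divided by weeks since the first review (minimum one week)."""
    weeks_open = max((as_of - first_review).days / 7.0, 1.0)
    return total_reviews / weeks_open

as_of = date(2012, 3, 11)

# Two hypothetical restaurants with the same total number of reviews.
newcomer = reviews_per_week(30, first_review=date(2012, 1, 1), as_of=as_of)   # 10 weeks old
veteran = reviews_per_week(30, first_review=date(2011, 3, 13), as_of=as_of)   # 52 weeks old
```

The newcomer's rate is more than five times the veteran's even though both have 30 reviews, which is how a high weekly rate can end up flagging young restaurants.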
The Man Behind the Algorithm
Alifierakis is studying in Princeton’s chemical engineering program, though he’s also “a strong believer in the power of data to transform the world,” according to his personal web page. It adds that he has experience with both data science and software development, but more importantly, he’s also worked on a food delivery startup through Princeton’s incubator program. Called “Chow Fleet,” it delivers food to the Princeton campus from restaurants which otherwise don’t offer delivery.
But late last year, Alifierakis also partook in the Insight Data Science Fellow Program — an intensive seven-week post-doctoral training fellowship, where he spent three weeks building his data model.
“Yelp reviews text is very predictive of restaurant closure on short time scales,” he wrote, citing a 2014 study by an assistant professor at the Smith School of Business, which achieved 70 percent accuracy in its predictions of restaurant closures within the next 90 days. This inspired Alifierakis’s own research, which became a testament to both insight and dogged persistence.
He started with an old Yelp dataset offering thousands of restaurants in Phoenix, Arizona from 2013. Narrowing it down to only restaurants which were still open, he almost immediately ran into a problem. Yelp doesn’t provide the real business IDs — which you need when using the Yelp Business API to pull up data on individual restaurants. Fortunately, Alifierakis was able to obtain about two-thirds of those just by feeding each restaurant’s name and address into Yelp’s search API.
Unfortunately, that seemed to create a bias towards restaurants that were still open. So to create a more complete dataset with most of the restaurants, he used Google’s search tools on the yelp.com domain, checked the results closely, and then fed the resulting business IDs into Yelp’s Business API. In total, the final dataset contains 3,327 restaurants, about 23 percent of which are no longer in business.
80 percent of the data was used as a training set, while 20 percent was reserved for testing.
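An 80/20 split like this can be sketched in a few lines of Python. The article does not say whether the split was random or stratified by outcome, so the shuffled split and fixed seed below are assumptions, and the integer IDs stand in for the real restaurant records:

```python
import random

def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder IDs standing in for the 3,327 restaurants in the dataset.
restaurants = list(range(3327))
train_set, test_set = train_test_split(restaurants)
```

The model only ever sees the training set while its parameters are tuned, so scores on the test set give a fairer picture of how it would do on restaurants it has not seen.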
What Happens Next?
Alifierakis’s essay ends with a grand conclusion: “The results of this model are very promising and they indicate a significant improvement for lending purposes relative to a random model.”
He points out that the model could be improved by identifying more factors to look at, possibly even looking beyond Yelp for data. Ratings from health inspectors could foreshadow the likelihood of future closures, though he adds that this data is “not publicly available for Phoenix at the moment.” The typical rents for a region will also affect profitability. You could also look at data about population demographics or new venues opening nearby.
But there’s one more very important caveat. “Success of a restaurant is currently defined as the restaurant remaining open.
“A more accurate definition of success that would be more appropriate for lending purposes would be correlated to restaurant revenue,” he concluded.