A few years back, when Bit.ly needed to burn in its new Hadoop cluster, the company’s chief scientist at the time, Hilary Mason, and her team decided to use the cycles to analyze three years’ worth of Bit.ly traffic simply to determine whether cats or dogs were more popular.
Dogs won, it turns out. And Mason freely admitted that the project was “a massive waste of energy.” But it also offered a valuable lesson in what was possible.
“The idea that we had computation power so cheap that we could apply it to something so absolutely trivial really blew my mind,” she told the audience at the AnacondaCON conference, held earlier this year in Austin, Texas. Something that wasn’t even feasible a few years back is now so easy to carry out that it can be applied to even the most trivial uses.
And that is where we are with data applications today, Mason explained.
Thanks to abstractions and frameworks, development that used to be done in weeks by a team now can be done in a single day, by a single coder. There’s a lot of good work going on with natural language design, unstructured data parsing and other data analysis techniques. And by this point, most businesses have a glut of data merely as a side effect of doing business. Why not put all these elements together?
This year’s conference, run by Continuum Analytics, addressed how the company’s distributions for Python and the R statistical language are increasingly being used for this sort of data science work. And Mason was the perfect person to address this emerging market. After leaving Bit.ly, Mason founded Fast Forward Labs, a Brooklyn-based consultancy focused on helping organizations design data-driven apps.
Mason defined data products as any app or service that relies on data to produce value of some sort for the user. Perhaps the best example of a successful data product is Google Maps, which relies on real-time location data and a set of prediction algorithms. Google Maps has been successful chiefly because it is “boring,” Mason said. “Anyone in western society can look at this and they’d know how to read it.”
“There’s a lot of computation here but you don’t need to know anything about it to use this product,” she said. “And yet it could not exist without that computation.”
In general, a successful data app addresses some need of high value to the populace at large: It executes a task we would otherwise pay a lot of money to complete, and/or it executes a task that is important to do well. Google Maps excels at both.
Before you embark on building your own Google Maps, however, Mason advised bearing a few issues in mind (“and there are a lot of them, unfortunately,” she noted).
Here is one for those thinking of releasing an internal app for public distribution: The generic formulation of the problem you are trying to solve is 10 times more complicated than the specific problem you want to solve. Getting software to work in a specific use case, namely yours, is much easier when you have only one domain to contend with. Extending that software across all the possible domains where it could be useful is harder.
“It is so much harder that it doesn’t happen well. You end up with a lot of mediocre solutions and lost opportunities,” Mason said. “It’s a big problem in the data product world today.”
Sure, most businesses would rather buy software than build it from scratch. But the provider of that software will have a lot of edge cases to contend with, and these can be severe when they involve regulatory environments or touch the well-being of humans in some fundamental way.
A second factor to consider is that data science is not standard software development. “The process around the technology development is not accommodating with data science and its unique needs,” Mason said. The general approach for standard software development is to find the simplest possible algorithm for the task at hand, at whatever scale the task needs to be executed.
Any data science-based process, however, should be constantly updated. The world changes, data drifts. Models may need to be retrained. The organization needs to develop a quantitative test mechanism. Data from the production model needs to get back into the testing phase, and this cycle between testing and development should, ideally, be automated.
“There’s a lot of people doing this very well, but there is no standard way of doing it. There is no one set of best practices,” Mason said.
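As a rough illustration of the cycle described above, and not a method Mason presented, here is a minimal Python sketch: a quantitative test scores the live model on freshly labeled production data, and the model is retrained when the score falls below a cutoff. The toy `ThresholdModel`, the `min_accuracy` value of 0.9 and all of the sample numbers are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Example = Tuple[float, int]  # (feature value, label of 0 or 1)


@dataclass
class ThresholdModel:
    """Toy classifier (hypothetical): predicts 1 when the feature exceeds a cutoff."""
    cutoff: float

    def predict(self, x: float) -> int:
        return 1 if x > self.cutoff else 0


def train(examples: List[Example]) -> ThresholdModel:
    """Fit the cutoff as the midpoint between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return ThresholdModel(cutoff=(sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)


def accuracy(model: ThresholdModel, examples: List[Example]) -> float:
    """The quantitative test: fraction of examples the model labels correctly."""
    correct = sum(model.predict(x) == y for x, y in examples)
    return correct / len(examples)


def evaluate_and_maybe_retrain(
    model: ThresholdModel,
    production_batch: List[Example],
    min_accuracy: float = 0.9,  # assumed cutoff, chosen only for illustration
) -> Tuple[ThresholdModel, bool]:
    """Score the live model on production data and retrain it if the score is too low."""
    if accuracy(model, production_batch) < min_accuracy:
        return train(production_batch), True
    return model, False


# Made-up training data: low values are class 0, high values are class 1.
initial = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
model = train(initial)

# Made-up production data after the world has shifted upward.
drifted = [(6.0, 0), (7.0, 0), (13.0, 1), (14.0, 1)]
model, retrained = evaluate_and_maybe_retrain(model, drifted)
print(retrained, model.cutoff, accuracy(model, drifted))
```

In practice the check would run on a schedule against a real model and a metric suited to the problem; the point of the sketch is only that production data flows back into a test that can trigger retraining without a person in the loop.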
Another aspect to keep in mind is that working with data can be more complicated than might initially appear. At Bit.ly, Mason and her team tried to create a data-driven app that would estimate the number of calories of food pictured in a photograph. Turns out that there are many sources on the Internet that list caloric content of food, but none of them agree with one another. A burger could range from 300 to 2,500 calories. A cleaner data model was needed.
In the end, the Bit.ly team simplified the app to just letting the user know if the food in the photo is healthy or not. “The biggest trick ever for data scientists is that if you can’t solve the problem you want to solve, find an adjacent one that is much simpler that you can solve,” she said.
Adding artificial intelligence and machine learning complicates the matter further still. Training models to recognize objects in a photograph can be fraught with pitfalls. One demo that Mason built for a burger restaurant kept identifying french fries, when they appeared in a photo with a body of water in the background, as crabs. In another case, photos that Mason took herself of grubby New York City subway stations were identified as correctional institutions or prisons.
“We didn’t have the subway in our training set, at all. And this closest thing in the training set was this prison data,” Mason noted. “Because these models themselves aren’t interpretable. There is no easy way to go into edit and say ‘Don’t say the subway is a prison.’”
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Bit.