Divide and Conquer: Distributed Deep Learning Exponentially More Efficient in Search
For many, shopping online has become a preferred alternative to hopping from store to store, spending hours looking for a certain size or for the best price. After all, buying things online is easy and convenient: type in a search term, and voila! — page after page of possible product choices — all found without actually having to get out of the house.
But have you ever thought what goes on behind the scenes as you input your search term? And what kinds of computational resources might be needed to retrieve those search results, especially when companies like Amazon reportedly sell over 3 billion products worldwide? As one might imagine, training a neural network to understand the user’s semantic intent, and to then search, learn and produce relevant results from such a vast search space would require a lot of computing power — resulting in a system that would be difficult to scale up in the future, especially as more and more products are added, and as conventional silicon-based hardware struggles to keep apace.
In technical terms, experts call this an extreme classification problem due to the issue of having so many possible outcomes. To tackle this, researchers from Rice University have devised a simple but ingenious “divide-and-conquer” algorithm, yielding training times that are seven to 10 times faster and memory footprints that are two to four times smaller, in contrast to other large-scale deep learning techniques. These results were gathered from tests on an Amazon search dataset that contained approximately 70 million queries and over 49 million products.
“We have exploited a fundamental connection in extreme classification and compressed sensing [a signal processing technique] to come up with an exponentially cheaper training and inference algorithm,” said Anshumali Shrivastava, an assistant professor of computer science at Rice University and one of the paper’s authors. “The most exciting part is that the algorithm by nature is embarrassingly parallelizable. So despite being cheaper, it is also infinitely scalable: we can use the computing power most optimally. It is generally rare that an algorithmic improvement by nature is hardware-friendly. Usually, it is the other way around.”
“Exponential Reduction of Search Space”
Dubbed “Merged Average Classifiers via Hashing” or MACH, the algorithm’s superior efficiency comes from its use of a probabilistic data structure known as count-min sketch. Without getting too much into the mathematical details, the algorithm uses universal hashing to reduce the huge number of classes into a smaller number of parallel and independent classification tasks, dealing with fewer but constant classes.
“Our procedure is straightforward but very counter-intuitive until you see the math behind it,” Shrivastava told us. “The procedure involves combining objects (or products) randomly into “bins” and training a classifier to predict bins, instead of objects. During inference, the predicted bin probabilities are combined to identify the object.”
To understand how that might work in practical terms, the paper’s lead author, Tharun Medini suggested an interesting thought experiment using the example of online shopping: “Let us put a conservative estimate on the number of products as 100 million. Conventional methods project the query and all the 100 million products into a dense vector space. To retrieve the most relevant products, these methods perform a nearest neighbor search over all the product vectors. This step involves heavy computation and is often wasteful.”
Instead of training on 100 million products, MACH takes a different approach by randomly pooling all 100 million products into three bins. MACH then creates another ‘world,’ and again sorts the 100 million products randomly into three different bins. Most notably, the distribution of the randomly sorted 100 million products is different and distinct in each world. One classifier is trained for each world in order to assign searches to the three bins only, rather than the products within.
“Now, a search is fed to the first classifier and it says bin three,” continued Medini. “The same search is fed to the second classifier and says bin one. The most probable class is something that’s common in these two bins. So we need to look at possible intersections between these two bins. By creating six classes (three in each world), we’ve reduced the search space to 1/9th of the entire product space (1/3 x 1/3). If we add one more world with three bins, we have a common search space of 1/27 at the cost of just nine classes. By paying the linear cost, we’re getting an exponential reduction in the search space. Hence, our method is more effective than prior approaches.”
The group is now working to expand MACH to take on computationally intensive challenges like predicting protein structure or locating genome sequences. As the team notes, since their algorithm brings down computational requirements by orders of magnitude, it, therefore, could play a crucial role in improving the efficiency of any deep learning model dealing with a massive number of possible outcomes — whether it’s searching for relevant products online, or a natural language processing model that handles questions.
“There may be several other AI problems for which similar reductions exist,” Shrivastava pointed out. “In the future, we should keep our eyes wide open and look for other fundamental mathematical tools that can make AI more efficient. We cannot hope that hardware will cope with the exponentially growing demands. He noted that recent reports indicate that training a single AI model “can emit as much carbon as five cars in their lifetime, so there is a dire need to find energy-efficient machine learning algorithms. The best hope of sustaining the growth in AI is finding clever and more efficient algorithms.”
Read the team’s paper, or check out MACH’s code on Github.
Images: Rice University