Visual search is becoming important. One recent report revealed that 27 percent of the searches on major websites like Google, eBay, Amazon and others are now for images. Another study indicates that 75 percent of online shoppers regularly or always search for visual content before making a purchase. The prominence of visual search in retail applications has made it a key component of success — but only if it works.
At eBay, we have over one billion live listings at any given time, in what amounts to an enormous digital warehouse, with literally millions of aisles or “leaf categories” such as Audiobooks, Cookbooks, Fiction & Literature for the “tree category” of Books. Predicting the leaf category based on a given image is an important task for our neural network-based classifier, and over time we have developed a considerable amount of expertise.
We begin by selecting a few top potential leaf categories based on probabilities predicted by the softmax layer in our neural network. Next, we need to know how to compare two images (the query and an item on the shelf). We represent each image with a compact signature, which is extracted by the same neural network using weak supervision. The signature has the form of a series of numbers represented as a vector. We extract a binary vector (made up of 1’s and 0’s) by training the network with a sigmoid layer to predict top leaf categories. It is best to use as much supervision as possible at all steps.
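The flow described above can be sketched in a few lines. This is a minimal illustration rather than eBay's implementation: the function names, the 0.5 binarization threshold, and the use of Hamming distance to compare two binary signatures are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Turn raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_categories(logits, k):
    """Indices of the k most probable leaf categories, best first."""
    probs = softmax(logits)
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def binary_signature(pre_activations, threshold=0.5):
    """Binarize sigmoid outputs into a vector of 1's and 0's.
    The 0.5 threshold is an assumption for this sketch."""
    return [1 if sigmoid(a) >= threshold else 0 for a in pre_activations]

def hamming_distance(sig_a, sig_b):
    """Number of bit positions where two signatures differ (assumed metric)."""
    return sum(a != b for a, b in zip(sig_a, sig_b))
```

A query would first be narrowed to a handful of leaf categories with `top_k_categories`, and its signature would then be compared only against items in those categories.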
Here are some more detailed tips based on our experience:
- Understand the data and use stratified sampling.
It’s important to understand the data your neural network needs to categorize. There will often be a mix of complexities including the quality of the image, the attributes of the object itself, and the background.
The size of the training set is determined by four factors:
- The number of labels to be predicted
- The diversity of data within each label
- Memory and compute constraints
- Time budget for training
At eBay we use stratified sampling over leaf category, seller, condition, brand and more, and remove duplicates as the final step. The goal is to fully capture the richness and diversity of the data.
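As a rough sketch of what stratified sampling followed by de-duplication might look like, consider the snippet below. The listing fields (`leaf_category`, `condition`, `image_id`), the per-stratum cap, and the choice to detect duplicates by `image_id` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(listings, strata_keys, per_stratum, seed=0):
    """Draw up to `per_stratum` listings from every combination of the
    strata keys, then drop duplicate images as the final step."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for item in listings:
        strata[tuple(item[k] for k in strata_keys)].append(item)

    sample = []
    for key in sorted(strata):  # sorted so the result does not depend on dict order
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))

    # Remove duplicates; `image_id` is a hypothetical duplicate key.
    seen, unique = set(), []
    for item in sample:
        if item["image_id"] not in seen:
            seen.add(item["image_id"])
            unique.append(item)
    return unique
```

Capping each stratum keeps large, homogeneous groups from crowding out rare ones, which is the point of stratifying in the first place.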
- Use data augmentation, including rotation.
Data augmentation is a critical step in training neural networks when the training data does not capture all the variations that are likely to happen in the real use case. Synthetic data augmentation for smoothing, contrast and brightness, cropping and flipping all help, but by far the most important is rotation, which is often overlooked.
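To make the idea concrete, here is a small dependency-free sketch that treats an image as a grid of grayscale pixel values. It is restricted to 90-degree rotations, a horizontal flip, and a brightness shift to stay short; a real pipeline would likely use an image library and arbitrary rotation angles. The function names and the brightness offset are assumptions for the example.

```python
def rotate90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    """Shift every pixel by `delta`, clamped to the 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]

def augment(image):
    """Return the original plus rotated, flipped and brightened variants."""
    variants = [image]
    rotated = image
    for _ in range(3):  # 90, 180 and 270 degrees
        rotated = rotate90(rotated)
        variants.append(rotated)
    variants.append(flip_horizontal(image))
    variants.append(adjust_brightness(image, 40))  # offset of 40 is arbitrary
    return variants
```

Each training image then contributes six samples instead of one, all sharing the original label.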
- Extract the semantic signature with as much supervision as possible.
It is very important to use as much supervision as possible when extracting the semantic signature. This helps in training the classifier to focus on informational content and discount other non-informational regions. It is best to leverage large, diverse data with low acquisition cost for strong supervision (such as leaf category prediction) when labels are not available for the actual task (measuring similarity between pairs of images).
- Analyze the Entropy of the Signature.
This step is ignored in many system designs for large information retrieval systems. It is critical to assess the entropy of the signature to effectively pack information within its capacity. For example, using 8 bits to represent the binary signature enables representing as many as 2^8 (256) unique concepts. In the optimal case, each bit takes the value of 1 with a frequency of 50%. It is good to allow some slack to account for redundancy in the system.
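One simple way to check this is to compute the entropy of each bit across a sample of signatures. The sketch below assumes signatures are equal-length lists of 0's and 1's; the function names are illustrative. A bit that is 1 half the time carries a full bit of information, while a bit that never changes carries none. Note that the sum of per-bit entropies is only an upper bound on the information in the whole signature, because correlated bits share information.

```python
import math

def bit_entropy(p):
    """Entropy in bits of a binary variable that equals 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a constant bit carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def signature_entropy(signatures):
    """Per-bit entropy over a collection of equal-length binary signatures."""
    n = len(signatures)
    n_bits = len(signatures[0])
    return [bit_entropy(sum(s[i] for s in signatures) / n) for i in range(n_bits)]
```

Bits with entropy close to 0 are candidates for removal or retraining, since they take up space in the signature without helping to distinguish items.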
- Maintain High Within-Class Variance When Labels Are Coarse
We use coarse leaf category labels to train the neural network. Typical classification systems aim for minimal within-class variance and large between-class variance. For such systems, the ideal case is a within-class variance of 0, where all samples from a class collapse to a single point. However, this is not ideal for a fine-grained search, because fine-grained matching is not possible when the points in each cluster collapse to a single point.
For example, all samples from athletic shoes could be collapsed to a single point. But there are numerous unique products that fall under the leaf category “athletic shoes,” and it wouldn’t be possible to find them using signature similarity. For this reason, when labels are coarse and fine-grained search is desired, both between-class variance and within-class variance should be high. This can be measured by looking at the entropy of the signature, as described in the tip on analyzing signature entropy.
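A quick diagnostic for collapse is to compare the spread of signatures inside a class with the spread between classes. The sketch below uses mean pairwise Hamming distance as a stand-in for variance; that proxy and the function names are assumptions made for this example, not a measure named above.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of positions where two binary signatures differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_within_class_distance(signatures):
    """Average distance over all pairs inside one class (0.0 if collapsed)."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def mean_between_class_distance(class_a, class_b):
    """Average distance over all cross-class pairs."""
    pairs = list(product(class_a, class_b))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```

A within-class value of 0 means every item in the class shares one signature, so no two products in it can be told apart by signature similarity.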
- Use a Process of Elimination for Speed and Precision
The process of elimination is a very powerful tool for increasing speed and precision. Using a strong classifier to predict the top few potential partitions (leaf categories) is a very effective way to reduce the search space and also to improve precision: a dress signature will not get confused with a shoe signature.
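The effect can be illustrated with a toy index keyed by leaf category. Everything here is hypothetical: the index layout, the item IDs, and the use of Hamming distance for ranking are assumptions for the sake of a short example.

```python
def hamming(a, b):
    """Number of positions where two binary signatures differ."""
    return sum(x != y for x, y in zip(a, b))

def search(query_signature, candidate_categories, index, top_n=3):
    """Rank only the items stored under the predicted partitions.
    `index` maps a category name to a list of (item_id, signature) pairs."""
    scored = []
    for category in candidate_categories:
        for item_id, signature in index.get(category, []):
            scored.append((hamming(query_signature, signature), item_id))
    scored.sort()  # smallest distance first, ties broken by item_id
    return [item_id for _, item_id in scored[:top_n]]

# Toy index with two partitions (hypothetical data).
index = {
    "dresses": [("d1", [1, 1, 0, 0]), ("d2", [1, 0, 0, 0])],
    "shoes": [("s1", [1, 1, 0, 1])],
}
```

If the classifier predicts only "dresses" for a query, the items under "shoes" are never scored, so the search is both cheaper and immune to cross-category confusion.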
- Use Cumulative Top-K Partitions in High-Confidence Situations
When confidence in the predictions for top partitions is high, there is no need to search other partitions. When confidence is less strong, it is best to include other competing partitions. We recommend using cumulative top-k for better precision and absolute top-k only for those situations where an exact match is desired even at higher cost.
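The two strategies can be contrasted in code. This sketch assumes that "cumulative top-k" means adding partitions in order of descending probability until their combined probability reaches a threshold, with k as an upper limit, while "absolute top-k" always searches exactly k partitions. That reading, the 0.9 threshold, and the function names are assumptions.

```python
def absolute_top_k(probs, k):
    """Always return the k most probable partitions."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def cumulative_top_k(probs, k, mass=0.9):
    """Return the most probable partitions until their combined
    probability reaches `mass`, never more than k of them."""
    selected, total = [], 0.0
    for category in sorted(probs, key=probs.get, reverse=True)[:k]:
        selected.append(category)
        total += probs[category]
        if total >= mass:
            break  # confident enough; skip the remaining partitions
    return selected

# Hypothetical classifier outputs.
confident = {"athletic shoes": 0.95, "boots": 0.03, "sandals": 0.02}
unsure = {"athletic shoes": 0.40, "boots": 0.35, "sandals": 0.20, "heels": 0.05}
```

With the confident prediction, the cumulative variant searches a single partition, while the absolute variant still searches all three; with the unsure prediction, both end up searching three.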