What news from AWS re:Invent last week will have the most impact on you?
Amazon Q, an AI chatbot for explaining how AWS works.
Super-fast S3 Express storage.
New Graviton 4 processor instances.
Emily Freeman leaving AWS.
I don't use AWS, so none of this will affect me.
Frontend Development / Software Development

Uber Eats Carves out a Cache to Deduplicate Images

By implementing a basic hash map and control flow logic, Uber Eats has cut the number of images its servers send out to 1% of what was previously delivered.
Sep 13th, 2022 4:00am by
Featued image for: Uber Eats Carves out a Cache to Deduplicate Images

By implementing a basic hash map and control flow logic structure, popular food delivery service Uber Eats has created a content-addressable caching layer that cuts the number of unique images sent out to 1% of what was previously delivered by its servers.

Given the services serves millions of images to end users every day, the savings in resources is considerable. And the solution very much reads like a LeetCode problem.

“This full system was developed and completely rolled out in the course of less than two months, which improved the latency and reliability of the image service and unblocked projects on our new catalog API development,” noted Uber Software Engineer Kristoffer Andersen in a blog post.

The post came had an air of familiarity. It was hard to draw the full conclusion at first because the tech breakdown was so interesting but after a few reads, it was quite simple. It’s a hash structures problem straight out of the LeetCode code practice site.

The Prompt: Uber Eats has too many images on its storage. It wants to reduce storage costs and improve latency but can’t scrub images across vendors with the image system currently in place. Implement a logic and caching system that not only removes duplicates across vendors but also alleviates the need to process images more than once.”

The Old System: The old system saved a lot of photos of Diet Coke and it sounded like a nightmare for the less tech-savvy restauranteers that rely on Uber Eats. Every upload required a full menu upload and the system knew a certain image needed to be re-downloaded by a required URL change. This kept everything very store and merchant specific. Imagine needing to change one image, uploading the whole menu, only to realize that the URL wasn’t changed.

The High-Level Solution: After an image URL is uploaded by a vendor, implement logic that determines if the URL exists in the database by checking against the images in the database. If the image is new, download and make sure it’s a valid image then reformat and scale to the standard size. Store the final processed image.

High-Level Control Flow Logic and Maps for Deduplicating Images

The solution Uber Eats implemented was deployed across the system rather than on a per vendor basis. The control flow logic was broken down into three main flows which are illustrated in the diagrams below.

Control Flow Logic 1: Known, processed image: This is the simplest flow as it returns previously stored values. It goes through the two yellow boxes.

Control Flow Logic 2: New, unprocessed image: The image is downloaded and processed in this flow. This is illustrated by the two green boxes.

Control Flow Logic 3: Known, unprocessed image: This flow covers the cases where the image is known but has not been processed with the requested specification. This flow is seen on the chart in the yellow box to the left and the orange box to the right.

The maps working alongside the logic are as follows:

Detailed Breakdown of the Control Flow and How the Maps Work Together

The URL map returns the hash of the image. The hash and the processing specification (input requirements, output format, and size) return the processed image. The original image map returns the original image based on the hash of the image.

When image processing is initiated with an image URL and a processing specification, the first step is to check if the URL is already downloaded.

  • Yes? Hash of bytes from the original image map is read.
  • No? Create it! Download the image, update the URL map, and store the original image in the system.

Then both the hash of the image and the processing specification are used. Has the image been downloaded and processed before?

  • Yes? Return the processed image’s URL to the requestor.
  • No? Create it! Process the image, store it, and return the image’s URL to the requestor.

Client-side image errors are also stored inside the processed image map as well. An example of this would be if an original image doesn’t meet the minimum allowed size. That way if ever the same URL comes in with the same error, the error is immediately returned.

Rewriting the Image Value on an Unchanged URL

An additional piece of control flow logic built into this updated image system is how to update an image on a URL that isn’t updated.

Uber Eats employs logic that reads the value of the last-modified HTTP header on these incoming images to determine whether or not it needs to be downloaded again. If the value is newer then the value stored in the URL map, then the image must be downloaded again. If not, then nothing changes. If the image is downloaded again, the URL map is updated with the new hash of the image and the new last modified.

In Conclusion

This caching layer is currently running and has been successful for the last four months. Uber’s system processes millions of images daily which is amazing when compared to the 1% of server calls specified. The benefit of the if/then flow along with the hash map architecture was so clearly illustrated here in this article that it shines a bright light at the end of the LeetCode tunnel.

Group Created with Sketch.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.