System Design Case Study #2: Building An Image Processing Pipeline, Weeding Out Duplicates With Content-Addressable Storage & How Uber Eats Deduplicates Its Images
Picture a scenario where we need to build an image-processing feature for a food aggregator app like Swiggy, Zomato or Uber Eats. The feature should enable restaurants to upload images of their food items and products, which are then processed (deduplicated, compressed, and stored) on the backend to be displayed to the end users.
Since our business is really taking off (OMG :)), image uploads have grown to the scale of millions per hour, and the backend has to process those images accordingly. Weeding out duplicate images is a key functional requirement to cut down on processing and storage costs.
How would we build such a system?
One way to build a system like this is to update the core backend service: create an API endpoint that accepts image uploads, processes the images on the application server, stores the image metadata in a database and the images in cloud storage.
But we are dealing with a system where restaurants upload and update images concurrently at the scale of millions per hour, in real time. We also have specific business rules for processing the images, and new requirements could be added in the near future.
For this, we should build a dedicated image-processing pipeline that ingests images in real time, processes them, and takes further actions based on the defined business logic, such as deduplication, compression, and storing the metadata in a database and the images in cloud storage.
Image Processing Pipeline
Why the need for a dedicated image processing pipeline as opposed to coding the feature in the existing core service?
Because the scale of image uploads and updates runs into the millions per hour, concurrently and in real time. This feature is both CPU- and IO-intensive, requiring heavy compute and horizontal scalability. Coupling it with the existing core backend service would make things complex and exert unnecessary load on the servers running the core application modules.
It's advantageous to set up a dedicated service that handles image processing in real time. This lets us scale the different services separately based on their requirements.
To process the images, we can set up an image processing pipeline using a data/stream processing framework like Apache Flink along with a scalable event queue like Kafka.
Why Flink?
Apache Flink is a real-time stream-processing framework that can process both bounded and unbounded data streams at in-memory speed and web scale. Flink scales horizontally across a cluster of machines, enabling it to handle large datasets with high throughput.
It fits our use case. Flink's ability to scale horizontally enables our system to handle large volumes of image-processing tasks concurrently.
Moreover, if we were already using Kafka in our architecture, we could consider Kafka Streams. If the images were processed in batches as opposed to in real time, Apache Spark could be a strong fit.
If our core workload were hosted on AWS or GCP, we could consider AWS Kinesis or Google Cloud Dataflow.
Ideally, picking the right streaming framework requires extensive research and a POC (Proof Of Concept) implementation. We need to consider different factors such as the programming models of the stream processing framework, programming language support, data storage support, reliability, state management, horizontal scalability and many more.
P.S. I've picked Flink just for the design discussion; it's not a definitive selection.
Back to our case study:
Kafka, in our system architecture, will act as an event queue (a Flink source) that receives the images uploaded by the restaurants and forwards them to Flink.
Why can't we directly stream the images to Flink? What's the need for Kafka?
Streaming images to Kafka and storing them there temporarily enables Flink to process images at its own pace. If the Flink module has issues and goes down, restaurants continue to upload images to Kafka uninterrupted.
Later, Flink can catch up when it bounces back. It can pull the images from Kafka, process them, and store the metadata in the database and the images in the cloud storage. This increases the fault tolerance, reliability and flexibility of the system. Additionally, the ingestion and processing modules of the system are separated, which is an implementation of the separation of concerns principle.
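To make this concrete, here is a minimal sketch of how the processing module could consume image-upload events from Kafka using PyFlink's DataStream API. It's not a definitive implementation: the broker address, topic name, jar path, and process_image stub are illustrative assumptions, and the exact connector setup depends on your Flink version.

```python
# Minimal sketch: Flink consuming image-upload events from Kafka (PyFlink DataStream API).
# Broker address, topic name, jar path, and process_image are illustrative assumptions.
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

env = StreamExecutionEnvironment.get_execution_environment()
# The Kafka connector jar must be available to the job; the path below is a placeholder.
env.add_jars("file:///opt/flink/lib/flink-sql-connector-kafka.jar")

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka:9092")
    .set_topics("image-uploads")
    .set_group_id("image-processing-pipeline")
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

def process_image(event: str) -> str:
    # Placeholder for the real work: fetch the image, deduplicate, compress,
    # write metadata to the database and the image to cloud storage.
    return f"processed: {event}"

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "image-upload-events")
stream.map(process_image).print()
env.execute("image-processing-pipeline")
```

With checkpointing enabled, Flink tracks its Kafka offsets in state, which is what lets it catch up from where it left off after a restart while Kafka keeps buffering new uploads.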
Additional Image Staging Cloud Storage For Modularity And A More Loosely Coupled Architecture
To make our architecture more modular, loosely coupled, and scalable, we can separate the image upload, ingestion, processing, and storage modules of our image processing pipeline by introducing additional cloud storage that temporarily holds image uploads before they are ingested by the event queue.
This additional cloud storage acts as a landing zone for the uploaded images and forms the image upload module. The event queue/message broker, Kafka, forms the image ingestion module. The stream processing framework, Flink, is the processing module, while the database that stores the metadata and the cloud object storage that stores the images form the storage module.
In this image-processing pipeline architecture, all the modules are separated and loosely coupled. Each module can be scaled individually when needed. We could further move data into a search component to make the product images searchable in the app, but let's focus on the current architecture. I'll have a dedicated post on the search component in scalable system architectures in the near future.
Once an image or a batch of images gets uploaded to the landing zone, the system triggers an event for them to be streamed to the event queue, Kafka.
There is one thing we need to bear in mind, though. Since our system ingests images in real time via the API, having image upload cloud storage that acts as a landing zone for images might introduce some latency and complexity into the system.
Whether or not to use the additional cloud storage depends entirely on the business requirements and our latency budget. If we cannot afford the additional latency of a dedicated image upload module, we can bypass it by pushing the images directly to the event queue.
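Here is a rough sketch of what the staged upload path could look like: the raw image lands in the landing-zone bucket, and only a lightweight event (bucket and key, not the image bytes) is published to Kafka. The bucket name, topic name, and broker address are assumptions for illustration, using boto3 and kafka-python.

```python
# Sketch of the upload module: stage the raw image in a landing-zone bucket,
# then publish a lightweight event (not the image bytes) to Kafka.
# Bucket, topic, and broker names are illustrative assumptions.
import json
import uuid

import boto3
from kafka import KafkaProducer  # kafka-python

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

LANDING_BUCKET = "food-images-landing-zone"
TOPIC = "image-uploads"

def handle_upload(restaurant_id: str, image_bytes: bytes) -> str:
    # 1. Stage the raw image in the landing zone (cloud object storage).
    key = f"{restaurant_id}/{uuid.uuid4()}.jpg"
    s3.put_object(Bucket=LANDING_BUCKET, Key=key, Body=image_bytes)

    # 2. Publish a small event so the processing module can pull the image at its own pace.
    producer.send(TOPIC, {"bucket": LANDING_BUCKET, "key": key})
    producer.flush()
    return key
```

Bypassing the landing zone would simply mean publishing the image reference (or the image itself, if small enough) straight to the topic from the API layer.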
So, this was a high-level discussion on building our image processing pipeline. We can go into more detail, but that's not really required right now. As I publish more system design case studies, I’ll continue to delve deeper into specific components of scalable system architectures, giving you a holistic view of working with scalable distributed services.
Now, let's discuss the image deduplication requirement. How do we ensure the images moving to the data storage are unique?
For instance, n restaurants may sell a can of Red Bull, and all of them may upload the same can image provided by the company. In this case, why store the same image n times? Why not store a single copy of the image, with all references pointing to it, thus saving storage and compute?
Content-Addressable Storage: Ensuring the Data Stored Does Not Have Duplicates
Content-addressable storage is a technique of storing data where, instead of saving it by file name or file location, it is stored under a unique hash generated from its content. This helps eliminate duplicate content, reducing storage space and costs significantly.
I've discussed content-addressable storage before on my blog in a case study: How GitHub indexes code for blazing-fast search and retrieval. You can give it a read.
In content-addressable storage, the hashes are computed from the actual content of the data with hash functions such as MD5, SHA-1, or SHA-256. Since the hashes are unique to the content and change when the data is updated, this approach provides strong data integrity.
If the data changes over time, different hashes will be created, enabling us to access different versions of data while maintaining a version history.
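As a quick illustration (using Python's hashlib; the byte strings stand in for real image data), the address of a piece of content is derived purely from its bytes, so identical content always maps to the same address and any change produces a new one:

```python
import hashlib

original = b"red-bull-can-product-shot"     # stand-in for the original image bytes
updated = b"red-bull-can-product-shot-v2"   # the same image after an edit

addr_v1 = hashlib.sha256(original).hexdigest()
addr_v2 = hashlib.sha256(updated).hexdigest()

# Identical content always hashes to the same address...
assert hashlib.sha256(original).hexdigest() == addr_v1
# ...while any change yields a new address, giving us versioning and integrity checks.
assert addr_v1 != addr_v2
```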
CDNs use this technique to efficiently cache and distribute data in edge regions worldwide. GitHub leveraged this approach to reduce 115TB of data to 28TB of unique content.
Hash-based content-addressable storage is a great way to weed out duplicate data when storing massive volumes of it. We can leverage this in the processing module of our image processing pipeline to discard duplicate content, keeping only the original data in the data storage.
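Applied to our processing module, the idea is simply to key the blob store by the content hash and skip the write (and the downstream processing) when that key already exists. A minimal in-memory sketch, with a dict standing in for cloud object storage:

```python
import hashlib

class ContentAddressableStore:
    """Toy content-addressable store; a dict stands in for cloud object storage."""

    def __init__(self):
        self._blobs = {}  # content hash -> image bytes

    def put(self, image_bytes: bytes) -> tuple[str, bool]:
        """Store the image if unseen. Returns (content_hash, was_duplicate)."""
        content_hash = hashlib.sha256(image_bytes).hexdigest()
        if content_hash in self._blobs:
            return content_hash, True          # duplicate: nothing to process or store
        self._blobs[content_hash] = image_bytes
        return content_hash, False

store = ContentAddressableStore()
red_bull_can = b"...image bytes..."
print(store.put(red_bull_can))   # first upload: stored and processed
print(store.put(red_bull_can))   # same can uploaded by another restaurant: skipped
```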
Let's peek into how Uber Eats deduplicates and stores images in their system efficiently using the same technique.
How Uber Eats Deduplicates And Stores Images In Their System Efficiently
Uber Eats handles product image updates at the scale of several hundred million every hour. To keep image processing, storage, and CDN costs down, Uber Eats uses content-addressable storage to detect duplicates in its image processing pipeline.
Img src: Uber
To manage the images, they have three maps (key-value pairs): The original image map, the processed image map and the image URL map.
The original image map has the hash of the original image as the key and the original image as the value.
The processed image map has the hash of the image combined with the processing specification as the key and the processed image as the value.
And the image URL map has the image URL as the key and the hash of the image as the value.
Since the original image map and the processed image map contain the images themselves, they are stored in a blob store called Terrablob, which is similar to AWS S3 (an object storage service). The image URL map, on the other hand, contains only metadata, so it goes into Docstore, Uber's multi-model database.
When an image gets uploaded to the pipeline, the system first checks the image URL map to see if the image URL already exists. If it does, the image hash is read, and the processed image map is then checked to see if the image has already been processed. If it has, the system knows the image is a duplicate and needs no processing, thus saving compute and storage space.
If the image has not been processed, the system processes it and updates the required maps. If the URL is not present, the system processes the new image and updates all the maps.
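Here's a simplified sketch of that lookup flow, following the description above. The in-memory dicts merely stand in for Terrablob (the image maps) and Docstore (the URL map), and the function and field names are my own illustrative assumptions, not Uber's actual code:

```python
import hashlib

# Stand-ins for Uber's storage: Terrablob would hold the image maps, Docstore the URL map.
original_image_map = {}   # image hash                    -> original image bytes
processed_image_map = {}  # (image hash, processing spec) -> processed image bytes
image_url_map = {}        # image URL                     -> image hash

def apply_spec(image_bytes: bytes, spec: str) -> bytes:
    # Placeholder for the real processing (resize, compress, format conversion, ...).
    return image_bytes

def ingest(image_url: str, image_bytes: bytes, spec: str) -> bytes:
    image_hash = image_url_map.get(image_url)

    # URL already seen and already processed with this spec: duplicate, nothing to do.
    if image_hash and (image_hash, spec) in processed_image_map:
        return processed_image_map[(image_hash, spec)]

    # URL seen but not yet processed with this spec: process and update the processed map.
    if image_hash:
        processed = apply_spec(original_image_map[image_hash], spec)
        processed_image_map[(image_hash, spec)] = processed
        return processed

    # New URL: hash the content, store the original, process it, and update all three maps.
    image_hash = hashlib.sha256(image_bytes).hexdigest()
    image_url_map[image_url] = image_hash
    original_image_map.setdefault(image_hash, image_bytes)
    processed = apply_spec(image_bytes, spec)
    processed_image_map[(image_hash, spec)] = processed
    return processed
```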
Here is their image processing system architecture flow chart:
Img src: Uber
There are a bunch of possible combinations when an image is uploaded to their system, and they keep both the original uploaded image and the processed image in their storage. You can refer to this article for details. But you get the idea of how we can build a data streaming system that weeds out duplicates for system efficiency and reduced storage costs.
If you want to learn to design large-scale distributed systems from the bare bones, along with the discussion on the fundamental concepts starting right from zero, check out the Zero to Software Architecture Proficiency learning path.
It comprises three courses I have authored, intended to educate you, step by step, in the domains of software architecture, cloud infrastructure and distributed system design.
Additionally, if you wish to learn to code distributed systems from the bare bones, I am running a series on it in this newsletter. Do check it out here.
If you wish to practice coding distributed systems like Redis, Docker, Git, a DNS server and more from the bare bones in the programming language of your choice, check out CodeCrafters (Affiliate). With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it’s done.
You can use my unique link to get 40% off if you decide to make a purchase.
Here is the previous case study if you haven’t read it yet:
Implementing Caching In A Distributed Architecture. How DoorDash Did It In Their Microservices Architecture
You can get a 50% discount on my courses by sharing my posts with your network. Based on referrals, you can unlock course discounts. Check out the leaderboard page for details.
If you are reading the web version of this post, consider subscribing to my newsletter.
You can find me on LinkedIn & X.
I'll see you in the next post. Until then, Cheers!