System Design Case Study #1: Implementing Caching In A Distributed Architecture. How DoorDash Did It In Their Microservices Architecture
Picture a scenario where several microservices in our distributed system architecture leverage caching for performance: to reduce database hits and lower operational costs.
Every microservice team plugs caching libraries such as Redis or Caffeine directly into its code without a standardized interface. This tight coupling of external tech with the local code not only makes the overall system code messy but also leaves us with minimal control and observability over how caching is implemented across the system.
But why do we need control and observability over the cache implementation in our system architecture?
It's critical that the cache stays in sync with the original data source. If the data goes stale, it can, depending on the use case, break the business logic. Different business use cases have different staleness tolerances, and fixing issues arising from data staleness can be time-consuming and complex.
With real-time control over the cache implementation, we can turn the cache on and off in our system without any code redeployment. If the cache needs tuning or adjustment, we can do that as well across our system architecture.
In addition, we can route a percentage of traffic to the original data source and continuously compare the results against the cache to ensure the cached data stays consistent with the data source. This technique is called cache shadowing.
We can further study cache hit rates, error percentages, and other such metrics for deeper observability. Keeping tabs on this data helps in devising an effective cache invalidation strategy.
Having control over the cache implementation in our distributed architecture helps us bail out on the existing implementation and plug in a different caching tech altogether when needed.
This is primarily possible if we use a standardized caching interface in our code as opposed to tightly coupling the cache directly.
Caching In Microservices Architecture At DoorDash
DoorDash faced a similar issue when using Caffeine as a local cache and Redis as a distributed cache in their microservices architecture.
Most microservices teams at DoorDash directly plugged these libraries into their code, which made things messy. To tackle this, DoorDash developed a single caching interface and a multi-layered caching system for cache implementation across their system architecture.
The first layer is a request-local cache backed by a simple HashMap. It holds data only for the lifetime of a single request, so repeated lookups within that request are served straight from memory.
The second layer (the Caffeine implementation) is scoped to the JVM: the cached data is visible to all workers within a single JVM. The DoorDash article doesn't explicitly define what a worker means here; I reckon it's the worker threads running in a JVM, not the application instances.
The third layer contains the cached data for all the pods interacting within the same Redis cluster. A pod contains one or more containers running on a single node in a Kubernetes cluster.
This multi-layered caching system could be integrated transparently with every microservice with minimal disruption. A cache request progresses through the layers until the value for the key is found; if the value is retrieved from a later layer, it is stored in the earlier layers for faster access on subsequent requests.
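Here's a quick sketch of how such a layered read-through lookup might work. The Layer interface, class names, and backfill logic below are my own illustration of the idea, not DoorDash's actual code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a multi-layered read-through lookup.
public class LayeredCache {

    // Minimal contract each layer (request-local, JVM-wide, Redis) satisfies.
    public interface Layer {
        String get(String key);
        void set(String key, String value);
    }

    // Layer 1: a request-scoped HashMap, discarded when the request ends.
    public static class RequestLocalLayer implements Layer {
        private final Map<String, String> map = new HashMap<>();
        public String get(String key) { return map.get(key); }
        public void set(String key, String value) { map.put(key, value); }
    }

    private final List<Layer> layers; // ordered fastest to slowest

    public LayeredCache(List<Layer> layers) {
        this.layers = layers;
    }

    // Walk the layers until a hit; backfill the earlier layers on the way out.
    public String get(String key, Supplier<String> sourceOfTruth) {
        for (int i = 0; i < layers.size(); i++) {
            String hit = layers.get(i).get(key);
            if (hit != null) {
                for (int j = 0; j < i; j++) {
                    layers.get(j).set(key, hit); // populate the faster layers
                }
                return hit;
            }
        }
        // Full miss: read the source of truth and populate every layer.
        String value = sourceOfTruth.get();
        if (value != null) {
            layers.forEach(layer -> layer.set(key, value));
        }
        return value;
    }
}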
To measure cache performance, cache hit and miss ratios are recorded. To ensure data freshness, a cache shadowing mechanism is implemented: for a percentage of cache reads, the fallback of a given cache layer is invoked as well, and the cached and fallback values are compared for equality.
This is done on an ongoing basis, and metrics on successful and unsuccessful matches are graphed and alerted on. This data is key to building an effective cache invalidation strategy.
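Here's a minimal sketch of what such a shadowed read might look like. The sampling rate, counters, and function names are my assumptions for illustration, not DoorDash's implementation:

import java.util.Objects;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical sketch of cache shadowing for a single layer.
public class ShadowedRead {
    private final Function<String, String> cacheLookup;  // e.g., a Caffeine or Redis read
    private final Function<String, String> fallbackRead; // the next layer or source of truth
    private final double shadowRate;                     // e.g., 0.01 shadows 1% of reads
    private final AtomicLong matches = new AtomicLong();
    private final AtomicLong mismatches = new AtomicLong();

    public ShadowedRead(Function<String, String> cacheLookup,
                        Function<String, String> fallbackRead,
                        double shadowRate) {
        this.cacheLookup = cacheLookup;
        this.fallbackRead = fallbackRead;
        this.shadowRate = shadowRate;
    }

    public String get(String key) {
        String cached = cacheLookup.apply(key);
        if (cached == null) {
            return fallbackRead.apply(key); // plain cache miss, nothing to shadow
        }
        if (ThreadLocalRandom.current().nextDouble() < shadowRate) {
            // Shadow read: also fetch from the fallback and compare for equality.
            String fresh = fallbackRead.apply(key);
            if (Objects.equals(cached, fresh)) {
                matches.incrementAndGet();
            } else {
                mismatches.incrementAndGet(); // graph and alert on this metric
            }
        }
        return cached;
    }
}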
In the multi-layered caching system, individual caches can be turned off based on the requirements by setting their TTL to zero. For use cases that can tolerate a degree of cache staleness, all the caching layers are leveraged; where data consistency is critical, some or all layers are turned off so that requests hit the database, the single source of truth, directly.
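To illustrate, here's a minimal sketch of how a TTL of zero could act as a per-layer kill switch. The config structure and layer names are my own assumptions, not DoorDash's actual setup:

import java.time.Duration;
import java.util.Map;

// Hypothetical sketch: a TTL of zero disables a cache layer entirely,
// letting us switch layers off per use case without a redeploy.
public class CacheLayerConfig {

    // In a real system this map would come from a dynamic runtime config service.
    private final Map<String, Duration> ttlPerLayer;

    public CacheLayerConfig(Map<String, Duration> ttlPerLayer) {
        this.ttlPerLayer = ttlPerLayer;
    }

    public boolean isEnabled(String layerName) {
        Duration ttl = ttlPerLayer.getOrDefault(layerName, Duration.ZERO);
        return !ttl.isZero(); // TTL == 0 means the layer is switched off
    }
}

For a strict-consistency use case, the config might set the JVM and Redis layers to Duration.ZERO so reads skip them and hit the source of truth directly.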
Key System Design And Backend Engineering Lessons From This Case Study:
1. Always implement an abstraction layer when integrating third-party tech into your code. Tightly coupling third-party code with ours isn't a great idea: it makes the code messy and prevents us from bailing out on a technology when required.
Here is a quick code example for this:
Let's say we intend to integrate multiple caching libraries like Redis and Caffeine into our code. We will create a CacheService interface as opposed to directly using the Redis or Caffeine code in our classes.
The CacheService serves as an abstraction that averts the need for significant code refactoring if we need to switch to a different caching tech.
public interface CacheService {
    String getValue(String key);
    void setValue(String key, String value);
}
We will have separate classes for every cache library implementation.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.stereotype.Service;
import java.time.Duration;

@Service
public class CaffeineCacheService implements CacheService {
    private final Cache<String, String> cache;

    public CaffeineCacheService() {
        // A size-bounded, TTL-bounded in-process Caffeine cache
        this.cache = Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(Duration.ofMinutes(5))
                .build();
    }

    @Override
    public String getValue(String key) {
        // Returns null on a cache miss
        return cache.getIfPresent(key);
    }

    @Override
    public void setValue(String key, String value) {
        cache.put(key, value);
    }
}
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class RedisCacheService implements CacheService {
    private final StringRedisTemplate redisTemplate;

    // The Redis client is injected by Spring, configured elsewhere
    public RedisCacheService(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    @Override
    public String getValue(String key) {
        // Returns null if the key is absent in Redis
        return redisTemplate.opsForValue().get(key);
    }

    @Override
    public void setValue(String key, String value) {
        redisTemplate.opsForValue().set(key, value);
    }
}
Now, if I want to use these caches in my DataService class, the class will interact with the CacheService interface as opposed to directly interacting with the Redis and Caffeine code.
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;

@Service
public class DataService {
    private final CacheService caffeineCacheService;
    private final CacheService redisCacheService;

    // Two beans implement CacheService, so each injection is qualified explicitly
    @Autowired
    public DataService(@Qualifier("caffeineCacheService") CacheService caffeineCacheService,
                       @Qualifier("redisCacheService") CacheService redisCacheService) {
        this.caffeineCacheService = caffeineCacheService;
        this.redisCacheService = redisCacheService;
    }

    public String getCaffeineCachedData(String key) {
        return caffeineCacheService.getValue(key);
    }

    public String getRedisCachedData(String key) {
        return redisCacheService.getValue(key);
    }
}
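Notice that DataService depends only on the CacheService abstraction. If we later swap Redis for another distributed cache, we only add a new CacheService implementation; DataService and every other consumer stay untouched.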
2. The ability to configure a tech implementation with minimal rollbacks and redeployments, along with infrastructure observability, is super important for the reliability, availability, and scalability of our system architecture. We need to keep this in mind when designing and implementing our architecture.
If you wish to take a deep dive into the fundamentals of designing a large-scale service, check out the Zero to Software Architecture Proficiency learning path, comprising three courses I have authored to educate you, step by step, on software architecture, cloud infrastructure, and distributed system design.
This learning path offers you a structured learning experience, taking you right from having no knowledge on the domain to making you a pro in designing web-scale distributed systems like YouTube, Netflix, ESPN and the like. Check it out.
Additionally, if you wish to learn to code distributed systems from the bare bones, I am running a series on it in this newsletter. Do check it out here.
If you wish to practice coding distributed systems like Redis, Docker, Git, a DNS server and more from the bare bones in the programming language of your choice, check out CodeCrafters (Affiliate). With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it’s done.
You can use my unique link to get 40% off if you decide to make a purchase.
If you found the content insightful, do share it with your network for more reach and consider subscribing to my newsletter.
You can read the previous system design case studies on my blog:
Exploring Slack’s Real-time Messaging Architecture
How Discord Scaled Their Member Update Feature Benchmarking Different Data Structures
How GitHub Indexes Code For Blazing Fast Search & Retrieval
Why Stack Overflow Picked Svelte for their Overflow AI Feature And the Website UI
How WalkMe Engineering Scaled their Stateful Service Leveraging Pub-Sub Mechanism
In-Memory Storage & In-Memory Databases – Storing Application Data In-Memory To Achieve Sub-Second Response Latency
I’ll see you in the next post. Until then, Cheers!