System Design Case Study #5: Serverless Compute & Storage At the Edge With Stateless & Stateful Functions
Picture a scenario where we need to set up the inventory management infrastructure globally across different cloud regions for a massive sports apparel company.
The business operates with a localized approach: products are tailored to the local preferences and demands of each region. Since the products are consumed locally, it makes sense to manage the inventory locally in the respective cloud regions as opposed to streaming all the data to a central cloud region.
This means we need cloud region-specific deployments to manage the local inventory. However, we may also need to stream some data to the central cloud region for overall aggregated inventory analytics and such.
I have discussed independent cloud region-specific deployments in my earlier post: Distributing our database in different cloud regions globally to manage load & latency. If you haven't read it, it's a recommended read.
This post focuses on serverless deployments and the intricacies involved in managing state in event-driven serverless services.
To begin with, why do we even need serverless compute and a serverless database? Why not go with a conventional API-driven backend?
Why Serverless Compute & Serverless Database In Cloud-Region Specific Deployments
Because our use case is event-driven. A set of operations is triggered only when a product is added to or updated in the inventory. We do not have to run our servers all the time, which saves us significant compute costs.
For instance, when a new product is added to our inventory, it triggers a set of operations: the product image goes into the S3 object store; the description, price, count, etc., are processed by a serverless function; the system may further resize the image, extract metadata and so on. Eventually, the product data is stored in the local cloud region's serverless database.
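To make this concrete, here is a minimal sketch of what the product-addition handler could look like as an AWS Lambda function triggered by an S3 image upload. The bucket layout, table name, and fields are hypothetical assumptions, not the company's actual schema.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical per-region inventory table; the real name would come from config.
inventory_table = dynamodb.Table("regional-inventory")

def handle_product_added(event, context):
    """Triggered when a new product image lands in the S3 object store."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # In a real pipeline we might resize the image and extract metadata here.
        product_id = key.split("/")[0]  # assumes keys like "<product_id>/image.jpg"

        # Store the product data in the local cloud region's serverless database.
        inventory_table.put_item(
            Item={
                "product_id": product_id,
                "image_location": f"s3://{bucket}/{key}",
                "status": "PENDING_DETAILS",  # description, price, count arrive via another event
            }
        )
    return {"statusCode": 200, "body": json.dumps("processed")}
```

The function only spins up when the upload event fires, which is exactly where the cost savings over an always-on server come from.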
If it weren't for serverless, we would have to run our servers continually, which would mean ongoing infrastructure provisioning and management, in addition to the cost of idle servers.
Serverless infrastructure lets us focus on code and business logic instead of scaling and managing the underlying infrastructure. Developers just write code and run it on serverless functions without worrying about how the backend will scale under rising traffic.
With serverless functions, the compute runs only when required, i.e., when events are triggered. Moreover, our current use case does not require the backend to store any user state; the product upload process is mostly stateless. All of this makes serverless functions and a serverless database a good fit for our use case.
Now, let's look at our serverless architecture.
Serverless Architecture
A new-product-addition event is handled by a serverless function, which then writes the product data to the serverless database.
Local serverless databases at respective cloud regions help us keep track of product availability, which is helpful in managing restocking and optimizing the retail store shelves.
Also, as products are added to, updated in, and removed from the local serverless databases, certain data is streamed to the central cloud region, synchronously or asynchronously, for overall analytics.
The central database may contain the master inventory dataset, giving an overview of product availability in all the global stores and a centralized view of the entire inventory. This helps the business run analytics for demand forecasting and other related scenarios to understand the whole retail chain.
Cloud providers charge for data transfer across regions, so we need to keep that in mind when designing our architecture as well. This will be a key factor in deciding what data goes to the central cloud region and what stays in the local cloud region databases.
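As a sketch of the cross-region streaming piece, the local function could publish only a compact summary record, rather than the full product document, to a stream consumed in the central region, keeping cross-region transfer down. The stream name, central region, and fields below are assumptions.

```python
import json
import boto3

# Hypothetical stream in the central cloud region that aggregates inventory changes.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_inventory_change(product_id: str, region: str, delta: int):
    """Send a small summary record to the central region instead of the full product row."""
    summary = {
        "product_id": product_id,
        "region": region,
        "count_delta": delta,  # only the change, not the whole inventory record
    }
    kinesis.put_record(
        StreamName="central-inventory-changes",
        Data=json.dumps(summary),
        PartitionKey=product_id,
    )
```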
New System Requirements
The above serverless architecture enabled the production warehouses to update the product data in the inventory system. This gave the company insights into its inventory in real time.
Now, we have a new requirement. Since all the inventory data is stored in local serverless databases, we intend to make them the single source of truth for inventory information.
We need to integrate the inventory management system with the retail shops' IoT devices/systems as well. Whenever a purchase transaction happens in a local retail shop, the shop's system will trigger an event that is handled by the retail transaction serverless function, which updates the product count in the same inventory serverless database.
This way, all the product purchases from the retail stores in a specific cloud region will update the available inventory count in the same inventory database, ensuring a consistent inventory. We do not have to set up a separate retail transaction inventory database and sync it with our production warehouse inventory database in real time. This will keep things simple.
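A rough sketch of the retail transaction function, assuming the same DynamoDB-style inventory table as before: the purchase event decrements the available count atomically and rejects the update if stock would go negative. The table, key, and field names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
inventory_table = dynamodb.Table("regional-inventory")  # same table the warehouse function writes to

def handle_retail_purchase(event, context):
    """Triggered by the retail shop's IoT/point-of-sale system on a purchase."""
    product_id = event["product_id"]
    quantity = event["quantity"]

    try:
        # Atomic, conditional decrement keeps the inventory count consistent
        # even when multiple stores sell the same product concurrently.
        inventory_table.update_item(
            Key={"product_id": product_id},
            UpdateExpression="SET available_count = available_count - :qty",
            ConditionExpression="available_count >= :qty",
            ExpressionAttributeValues={":qty": quantity},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return {"status": "OUT_OF_STOCK", "product_id": product_id}
        raise
    return {"status": "OK", "product_id": product_id}
```

Because both the warehouse function and the retail function write to the same regional database, the available count stays consistent without a separate sync pipeline.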
Our initial production-warehouse inventory update use case was stateless, but in the retail purchase scenario, we need to store state with the serverless functions on the backend for efficient processing. For this, we need stateful serverless functions.
Let's look at our retail transaction serverless architecture.
Retail Transaction Serverless Architecture
Our retail stores are equipped with IoT devices that trigger specific events for product sales, updating customer data, etc.
These events are handled by specific serverless functions on the backend. For some of these events, we need to store state with the functions for efficient processing. This helps with managing long-lived transactions, handling orders based on customer preferences, and so on. Storing state with the serverless functions also reduces the load on the database, which improves application latency.
In case you are hazy on what application state is, check out the detailed post on application state - A discussion on stateless & stateful services (Managing user state on the backend), on my blog.
The general notion is that serverless is stateless and best suits stateless use cases. However, we can also store state with serverless functions via stateful serverless functions.
Besides the inventory database, the retail store backend will have a separate database to manage customer data, purchases and other information. This information will also be streamed, synchronously or asynchronously, to the central cloud region to enable the business to gauge regional performance and customer behavior across regions and compute overall analytics.
Separate Stateless & Stateful Serverless Functions
For the product updates from the production warehouse, we will have stateless serverless functions handling the events. For the retail product purchases, we will have stateful serverless functions handling the events.
The stateless functions will handle short-lived, stateless tasks like product additions or updates. The stateful functions will manage long-lived state spanning multiple requests and more complex business logic.
Both types of serverless functions will have separate code and deployments since the use cases are different and the cloud platform can scale and optimize their performance separately based on the load.
Now, let's look into stateful functions in detail to understand them better.
Stateful Serverless Functions
Stateful functions typically maintain state across requests by leveraging external storage like a latency-optimized database or a key-value store. These stores are optimized for quick reads and writes, providing low-latency access to the state information.
In addition, some serverless cloud products provide in-memory storage for functions to hold state during a request. This avoids repeated database round trips for state information, improving system efficiency.
So, to manage state across requests, we have external storage, and to manage state temporarily within the span of a request, we have in-memory storage.
Some serverless solutions also provide built-in storage with the serverless functions to manage state, offering an efficient way to work with it.
The stored state is keyed by some unique request attribute, such as a client ID or authentication token, that enables the serverless platform to recognize requests from the same client and associate them with the stored state.
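As an illustration of this pattern, here is a minimal sketch of a function keeping per-client state in an external key-value store (Redis here), keyed by the client ID from the request, with a small in-memory cache that avoids hitting the store on every warm invocation. The host, key layout, and field names are assumptions.

```python
import json
import redis

# External low-latency store for state that must survive across requests.
state_store = redis.Redis(host="state-store.internal", port=6379, decode_responses=True)

# In-memory cache; lives only as long as the function's execution environment stays warm.
local_cache = {}

def handle_event(event, context):
    client_id = event["client_id"]  # unique attribute used to associate requests with state
    key = f"client-state:{client_id}"

    # Prefer the in-memory copy, fall back to the external store.
    state = local_cache.get(key)
    if state is None:
        raw = state_store.get(key)
        state = json.loads(raw) if raw else {"pending_orders": []}

    # ... apply the business logic for this event to the state ...
    state["pending_orders"].append(event["order_id"])

    # Persist the updated state for subsequent requests from the same client.
    state_store.set(key, json.dumps(state), ex=3600)  # expire stale state after an hour
    local_cache[key] = state
    return {"status": "OK"}
```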
Stateful serverless functions are an extension of the existing serverless infrastructure, enabling it to run orchestrated workflows, work across multiple requests, and so on. For instance, Apache Flink Stateful Functions are built on top of Apache Flink and leverage several of Flink's features. For consistent state guarantees, including fault tolerance and scalability, they rely on Flink's co-location of state and messaging in the cluster.
With Flink's Stateful Functions, state is managed by the functions themselves and shared with other stateful functions when required. Devs do not have to manually store the state in an external database; Flink handles this for us.
Similarly, Azure Durable Functions are an extension of Azure Functions. Their durable state enables devs to run complex workflows, long-running tasks, etc.
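To give a feel for how durable state enables long-running workflows, below is a rough orchestrator sketch using the Azure Durable Functions Python programming model. The activity names ("ReserveStock", "ChargeCustomer", "UpdateInventory") are hypothetical; the framework checkpoints the orchestration state at each yield, so the workflow survives restarts and can span many requests.

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    """Long-running purchase workflow; state is checkpointed by the framework at each yield."""
    order = context.get_input()

    # Hypothetical activity functions; each runs as its own short-lived function.
    reservation = yield context.call_activity("ReserveStock", order)
    payment = yield context.call_activity("ChargeCustomer", order)
    yield context.call_activity("UpdateInventory", {"order": order, "reservation": reservation})

    return {"order_id": order["order_id"], "payment_status": payment}

main = df.Orchestrator.create(orchestrator_function)
```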
We learned that state can be added to serverless functions via external storage, but why add state to a serverless function that is built to be inherently stateless? Again, why not deploy a conventional API-driven backend instead?
Serverless vs. Conventional API-driven Backend
Picking a conventional API-driven backend or a serverless architecture largely depends upon the business use case, application characteristics, requirements, and complexity. Distributed system design is nuanced. There are so many factors and trade-offs to consider when designing a scalable and available system.
However, in our case, the primary reason for picking a serverless architecture, and then adding state to the functions, is that our workload is event-driven. Our service does not need to run all the time, unlike, for instance, a video streaming service where users are active on the website most of the time and we have to keep the servers running.
Our stateful serverless functions run only when a retail purchase happens. So, we save money by not running servers idle most of the time, and we do not have to invest resources in managing the infrastructure and optimizing it to scale with increased load. Everything is handled by the cloud provider; there is little operational overhead.
Additionally, stateful serverless fits well with short-lived stateful computations. In contrast, if the computation were long-running, complex and less event-driven, a conventional API backend would be a better fit.
An example would be a WebSocket implementation with a persistent connection to the backend. This sort of use case would be challenging to implement with serverless functions.
However, at the same time, it's essential to know that since the serverless infrastructure is managed by the cloud provider, we are locked in with that vendor. Adapting our code to new, complex requirements may get tricky, to the point of rewriting several modules of the service from the bare bones, since we have minimal control over the infrastructure.
The best bet for deciding the right architecture and technology stack for our use case is to do a POC (Proof of Concept) and benchmark it.
I wrote a blog article earlier on how Discord scaled their member update feature benchmarking different data structures. It is a good read. Check it out.
If you wish to learn the fundamentals of distributed system design, including concepts like serverless, running complex workflows, vendor lock-in, cloud infrastructure, how large-scale services are deployed across different cloud regions and availability zones globally, fundamentals of web architecture, how to pick the right technology for your use case and more, check out the Zero to Software Architecture Proficiency learning path.
It's a series of three courses authored by me intended to help you master the fundamentals and the intricacies of designing distributed systems like ESPN, Netflix, YouTube, and more.
Additionally, if you wish to learn to code distributed systems from the bare bones, I am running a series on it in this newsletter. Do check it out here.
If you wish to practice coding distributed systems like Redis, Docker, Git, a DNS server and more from the bare bones in the programming language of your choice, check out CodeCrafters (Affiliate). With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it’s done.
You can use my unique link to get 40% off if you decide to make a purchase.
If you found this newsletter post helpful, consider sharing it with your friends for more reach.
If you are reading the web version of this post, consider subscribing to get my posts delivered to your inbox as soon as they are published.
You’ll find the previous system design case study here:
You can get a 50% discount on my courses by sharing my posts with your network. Based on referrals, you can unlock course discounts. Check out the leaderboard page for details.
You can find me on LinkedIn & X and can chat with me on Substack chat as well. I'll see you in the next post. Until then, Cheers!