Understanding Database Consistency Levels And Applying Them To A Single Web Service
In my former newsletter post, I discussed how we can distribute our database across different cloud regions to manage the load and latency of our service when being hit with concurrent global traffic in the scale of millions.
When the database nodes are spread across different cloud regions and availability zones across the globe, a trade-off arises between consistency and availability (CAP Theorem) and low latency and consistency (PACELC theorem: an extension to the CAP Theorem).
In this post, I will discuss different database consistency levels with their application in a real-world messaging service like Discord.
Database Consistency Levels
Two standard consistency levels that primarily all distributed databases support are Strong and Eventual consistency. Besides these two, some databases offer other tunable consistency levels, such as causal consistency, read-your-writes consistency, monotonic read consistency, monotonic write consistency, session consistency, etc.
I'll begin the discussion with Strong and Eventual consistency and will then get into other consistency levels.
Strong Consistency
A service, system or an application is said to be strongly consistent when all the users of the service in different cloud regions globally see the same value of an entity at any point in time.
Strong consistency is crucial in several use cases like online trading, online booking systems, banking use cases, healthcare, and so on.
Now, for instance, in our messaging service, if we need to ensure all the users across the globe in a certain channel or a group see consistent message edits or deletions at any point in time for a seamless user experience, we need to make the functionality strongly consistent.
Without strong consistency, users might experience discrepancies in message content. Some nodes will show the edited message, while others may show the original unedited version, with users ending up having different interpretations of their conversations at a point in time.
Strong consistency guarantees that all users, regardless of their cloud region or the node they are connected to, see the same version of edited or deleted messages.
This will happen when the system acknowledges or reflects edits only when the edit is successfully replicated to all the distributed nodes of the system across cloud regions.
Ensuring strong consistency in a system implemented across cloud regions is a little tricky since this involves consistency, latency and availability trade-offs.
In this scenario, ensuring strong consistency in the system globally means the user edits will only be displayed once they are replicated across all the cloud regions. But this replication will take some time (aka replication lag), which means all the users will be looking at message edits only after a while, not in ultra-low latency, without any delay.
Also, if there were an entity that multiple users were editing concurrently in real-time as opposed to just one, things would get more tricky. In this scenario, we would have to restrict writes to a single/few cloud regions or only to some availability zones to ensure strong consistency.
Eventual Consistency
If the same message editing feature were eventually consistent, then after the edit, users of the same cloud region would immediately see the updated message, while the users of distant cloud regions would see the updated message after the edit is replicated across nodes in those cloud regions after a short while. Meanwhile, they will see the stale value of the message since the database nodes are not locked for reads to ensure strong consistency.
In this case, different users in a certain channel or a group will see different versions (original or edited) of a certain message at a point in time. The message content will eventually be consistent globally after replication.
However, eventual consistency enables the system to be available at the cost of consistency. For instance, if multiple users were updating the same entity concurrently, they could do that without the system needing to lock down nodes to ensure strong consistency or moving the writes to a certain cloud region.
In the case of strong consistency, we have to lock down the nodes to avoid reads and writes or queue the writes based on the use case when the system is replicating data globally. Users cannot perform writes in real-time until the replication is complete.
Going through my distributed database post is recommended for a deeper understanding.
Causal Consistency
Causal consistency preserves the order of updates made to an entity. Imagine we have a certain message in our channel on which we have several replies.
Preserving the order of replies is essential for the user experience because a certain reply made to that message may lead to more reactionary replies. Mixing the order of replies would leave the users confused.
This feature needs to be causally consistent. It guarantees that the order of the messages and their replies will be preserved in our messaging channels.
If you are wondering how do distributed systems maintain the ordering of messages, there are various techniques and approaches, such as leveraging vector clocks, logical clocks, distributed commit algorithms, quorum-based systems, and so on.
Vector clocks are a common technique to track the ordering of events across the nodes in distributed systems.
Read Your Writes Consistency
Read your write consistency ensures if a user has performed a write in the system, they will always see that write when performing a read after the write.
But why won't a user not see their write immediately after doing it?
In master-replica DB setup, read replicas reduce the read load on the DB by handling the read requests, whereas all the write operations happen on the master node.
When the user performs the write operation, the write happens on the master node and the updates are replicated to the read replica nodes either synchronously or asynchronously. The time it takes to update the replica from the master node is called the replication lag.
During this time, if the user's read request hits a certain replica, they may not see their latest write.
But when the system ensures read-your-write consistency, the user always sees their writes. The system updates the read replicas synchronously when the write operation happens to ensure all the read replicas are strongly consistent with the master node. However, this adds a bit of latency to the system since the write operation is not considered complete unless the read replicas are updated. So, the response latency will have an additional replication lag time added to it.
When the system is not read-your-write consistent: in our messaging service, when the user edits a message, they may not see their edits immediately after the edit write operation. Or if the user joins a certain channel, they may not see themselves as a member of the channel immediately due to the replication lag. This may make the user perform the write events again or leave them confused, which is not a desired user experience.
Monotonic Reads Consistency
Monotonic read consistency ensures that when a user sees a certain value of an object or an entity after a read operation, they always see either the same or newly updated versions of that entity (if available) on subsequent reads. They will never see any earlier value of that entity than what they have already seen.
Applying this consistency level to our messaging channel use case ensures that the users will always see the same or updated versions of the messages, enjoying a consistent user experience.
Similarly, if a user reads a message before it is deleted, any subsequent reads by that user will not return the message rather the updated value of that message. The user will consistently see the message as deleted.
Monotonic Writes Consistency
Just like monotonic reads, the monotonic writes consistency level always ensures that the writes made by a user are applied in the same order in which they were issued and the user always sees the latest value of an updated entity.
Applying this consistency level to our messaging channel use case ensures that when a user edits a message multiple times with different values, they always see the updated values of the message every time in the correct order as edited, having a consistent user experience.
Isn't the monotonic read-write consistency model similar to the causal consistency we discussed before? Both maintain the order of operations, right?
Difference Between Monotonic Read-Write and Causal Consistency
The difference is subtle. Monotonic read-write consistency preserves simple order of operations for a single process or user. In contrast, causal consistency goes beyond the simple order of operations, ensuring the causally related operations maintain their order across all processes or users in the system. This allows unrelated operations to have a flexible order as long as the causally related operations are observed in the correct order.
So, for instance, in a monotonic consistent system, if a user sends three messages: A, B & C. The system will ensure the user sees the messages in the same order (A, B, C).
In a causally consistent system, if a user sends a message A and another user replies to it with message A1, and then the user sends a message B, the system will ensure that all users will observe message A and A1 before message B, maintaining the causal relationship. The causal messages will be given preference in the order.
Monotonic consistency is typically discussed in the context of events performed by a single user. In contrast, causal consistency is addressed in the context of all the users in the system.
Session Consistency
The session consistency model guarantees consistent order of operations performed by a user within a single user session. It does not guarantee the order of operations outside a user session or at a global level.
So, for instance, if a user edits a message multiple times within a single user session, they would see the edits consistently in the order they were made.
Another use case for this in our messaging service is chat sessions. Applying session consistency to users' chat sessions ensures the correct order of messages within a specific user's chat session. This enables users to follow a consistent conversation flow.
The system can assign a unique sequence number or timestamps to the user's messages for consistent ordering within the session. When a user retrieves their chat history, the messages are displayed in the sequence based on the sequence number or timestamps.
Since the chat involves multiple users, the ordering of messages in session consistency is guaranteed only within respective user sessions. Different users may have different message orders due to network delays or variations in message delivery by the message broker. So, different users may see slightly varied message ordering.
A real-world complex distributed service may have varying consistency requirements based on the service needs and different use cases. This is where different consistency levels come in handy. We may have scenarios where we may want to provide stronger consistencies within a user session and a bit relaxed consistency at a global level. With different consistency levels, we can optimize our application consistency.
The choice of consistency levels in a real-world service deployed across different cloud regions is nuanced and often involves considerations specific to each cloud region or availability zone. Different cloud regions may have different latency requirements, network conditions, etc., leading to variations in the consistency guarantees provided.
If you wish to delve into how distributed databases work, how large-scale service manage their database growth, how they deal with global concurrent traffic and distributed data conflicts, including vector clocks, how distributed services are deployed globally and much more, check out the Zero to Software Architecture Proficiency learning path.
It's a series of three courses authored by me intended to help you master the fundamentals and the intricacies of designing distributed systems like ESPN, Netflix, YouTube, and more.
Additionally, if you wish to learn to code distributed systems from the bare bones, I am running a series on it in this newsletter. Do check it out here.
If you wish to practice coding distributed systems like Redis, Docker, Git, a DNS server and more from the bare bones in the programming language of your choice, check out CodeCrafters (Affiliate). With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it’s done.
You can use my unique link to get 40% off if you decide to make a purchase.
If you found the content insightful, do share it with your friends for more reach and consider subscribing to my newsletter if you are reading the web version of this newsletter post.
You can get a 50% discount on my courses by sharing my posts with your network. Based on referrals, you can unlock course discounts. Check out the leaderboard page for details.
You can find me on LinkedIn & X and can chat with me on Substack chat as well.
I'll see you in the next post. Until then, Cheers!