Distributed Programming Part 1 - Coding Distributed Systems from the Bare Bones
Hello there, welcome to this introductory post on distributed programming. You can consider it a 101, our first step into the realm of distributed systems programming.
Distributed Systems, Distributed Services, Distributed Programming and Application Development
In the past, as a developer, I've had the opportunity to work on several large-scale distributed services receiving millions of hits every single day. I spent quite some time of my life building components of these services from the bare bones, in addition to supporting them in production. However, I hadn't delved much into distributed systems programming; come to think of it, not one bit.
What do the terms distributed services, distributed systems, and distributed programming mean?
Distributed services are highly available, reliable, and scalable services such as e-commerce platforms like Amazon.com, movie or music streaming services like Netflix and Spotify, online food ordering and delivery platforms like Uber Eats, Swiggy, Zomato, and so on, which are used by millions of users at any point in time.
Distributed systems, on the other hand, are systems like Kafka, MongoDB, Redis, Google Cloud Spanner, Apache Spark, and so on, which are leveraged by distributed services to achieve the desired functionality, including scalability, availability, and fault tolerance.
Distributed systems form a network of nodes that communicate with each other in clusters across different cloud regions, availability zones, and data centers to achieve a common goal. Distributed services leverage these systems to add more nodes or resources to their clusters or infrastructure to handle the increased traffic load at peak times. These systems ensure the services withstand increased load without experiencing any downtime or performance bottlenecks.
Distributed systems provide mechanisms for redundancy and fault tolerance by replicating data across multiple nodes, ensuring high service availability. Even if a few nodes fail in the cluster, these services remain unaffected. In addition, these systems implement techniques such as replication, data consistency protocols, consensus algorithms, and more to ensure distributed services remain fault-tolerant, maintaining data integrity.
Building or coding distributed systems is called distributed programming, while coding distributed services is commonly known as application development.
The terms distributed services and distributed systems are used interchangeably colloquially in the industry, but we are now aware that there are subtle differences in their meaning. One instance of this is the use of the term 'distributed system design'.
Distributed system design, or system design in general, is a popular term in the software industry, given that it is an essential part of most software engineering interviews. However, these distributed system design interviews mainly entail questions on designing distributed services as opposed to designing distributed systems. The questions typically revolve around designing scalable, available and reliable software architectures for specific service functionalities.
Now, let's have a quick look at the distributed service infrastructure stack, which will make the distinction clearer.
Distributed Service Infrastructure Stack
The top layer in the stack is the application software layer, which holds the application or the distributed service, for instance, Spotify. The application layer is followed by the cluster infrastructure layer that holds distributed systems like distributed databases, message brokers, distributed caches, distributed file systems, cluster management systems like ZooKeeper, and so on.
These systems enable distributed services running on top to scale across hundreds of thousands of nodes running across clusters. This layer is accessed by the distributed services via APIs provided by the respective systems. The application programmer typically does not have to worry about the intricacies of how these systems auto-scale under the hood by dynamically adjusting the size of the clusters, how tasks are preferentially distributed to the less occupied server nodes, how low-level complex memory operations are executed, how the system ensures fault tolerance, and so on.
The cluster infrastructure layer is followed by the platform software layer, which holds the operating system, device drivers, firmware, etc. This layer provides an abstraction enabling the distributed systems to interact with the underlying hardware they run on.
The platform software layer is followed by the physical infrastructure layer, which consists of the compute servers mounted in a server rack, physical storage consisting of the hard drives and Flash SSDs, the RAM attached to the compute servers, and the data center network that connects all the physical infrastructure.
There is another vertical layer called the stack monitoring layer that applies to all the layers of the distributed service stack. This layer contains monitoring software or systems that ensure good service health and continual uptime.
Monitoring the stack, including keeping track of the server uptime, performance, latency, throughput, resource consumption, etc., helps in weeding out bugs, getting rid of system bottlenecks, optimizing the resource consumption, making the code efficient, and more. This is vital in building and running highly-performant software.
All the infrastructure stats are ideally monitored with respective tools powered by their web-based dashboards. Grafana, an open-source monitoring tool, is one good example; Google Cloud's Stackdriver Monitoring, a cloud-based monitoring tool, is another, and so on.
Check out my earlier newsletter post on observability-driven development where I've discussed this in detail.
Distributed Computing
I'll quickly want to bring in another term, distributed computing, which is a broader concept that encompasses both distributed systems and distributed services with the entire range of related technologies, principles, methodologies, and best practices.
Now, with the distinction between the terms clarified, let's delve further.
Kickstarting the Journey to Code a Distributed System from the Bare Bones
Given that I had zilch experience with coding distributed systems and it would be a fun learning experience, I decided to code a distributed message broker, like a Kafka prototype, from the bare bones.
I started my research and figured out multiple concepts or topics that needed to be covered in a sequential or interconnected manner to be able to implement a distributed system. I am listing them out below with insights into how they are applicable when building a distributed system to serve as a reference, a learning path, or a roadmap (whatever you wanna call it).
Coding Distributed Systems Learning Path/Roadmap
1. Distributed Systems Fundamentals
Having an understanding of fundamental distributed systems concepts such as data consistency, partition tolerance, availability, CAP Theorem, scalability, consensus algorithms like Raft, Paxos, etc., mechanisms for nodes to discover each other in a distributed environment, distributed node coordination, replication, distributed transactions, handling concurrent requests, distributed systems patterns like publish-subscribe, circuit breaker, etc., are crucial for coding a distributed system from the bare bones.
These are foundational concepts that will be applicable when implementing a scalable, available, reliable, and resilient system. If we take Kafka, for instance, it replicates the log for each topic's partitions across a configurable number of nodes. This allows automatic failover to these replicas when a server in the cluster fails. This way, messages remain available even in the presence of failures, providing fault tolerance.
It is also designed to scale horizontally by adding more nodes on the fly to handle increased throughput. Consensus algorithms are central to all distributed systems, and Kafka leverages them to elect leaders for effective node coordination. Kafka earlier used ZooKeeper for node coordination; it has since migrated to an internal Raft-based controller quorum (KRaft) for the same.
In addition, it is built around the publish-subscribe messaging pattern, where producers publish messages to topics and consumers subscribe to topics to receive messages. To code a system like Kafka, we need to be aware of the distributed systems fundamentals.
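To make the pattern concrete, here is a bare-bones, single-process sketch of publish-subscribe in Java. The class and topic names are my own illustrative choices, nothing resembling Kafka's actual API:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// A minimal, single-process sketch of the publish-subscribe pattern.
// Class and topic names are illustrative, not Kafka's API.
public class MiniBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    public void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    public void publish(String topic, String message) {
        // Deliver the message to every subscriber of the topic.
        subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(message));
    }

    public static void main(String[] args) {
        MiniBroker broker = new MiniBroker();
        StringBuilder received = new StringBuilder();
        broker.subscribe("orders", received::append);
        broker.publish("orders", "order-1");
        System.out.println(received); // order-1
    }
}
```

A real broker layers persistence, partitioning, and network transport on top of this, but the topic-to-subscribers mapping is the heart of the pattern.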
There are a bunch of other associated concepts as well; I've listed some of the important ones to give you an overview. This roadmap will be actively updated in the future as and when I discover new stuff.
Next up is system design.
2. System Design/Software Architecture Concepts
System design concepts are closely related to the distributed systems fundamental concepts. If we take a look back at the distributed service stack, the distributed systems fundamentals revolve around the cluster infrastructure layer and the system design or software architecture concepts revolve around the application software layer.
Fundamentals like the client-server architecture, basics of message queues, databases, load balancing, strong and eventual consistency (check out a detailed post done by me on different consistency levels here), API design, communication protocols like REST, RPC, etc., concurrent request processing, data modeling, and design patterns like the publish-subscribe, event sourcing, saga, circuit-breaker, leader-follower, CQRS, and more are key to building scalable, available, and reliable systems.
Taking Kafka as an example again, it is designed to be highly scalable; principles such as horizontal scaling, data partitioning, load balancing, concurrency models, synchronization techniques, modularity, maintainability, etc., are crucial to building such a performant and extendable system.
Understanding data modeling with concepts like normalization, denormalization, etc., helps us design the distributed message broker topics and partitions with overall data persistence to hold message data flexibly. Having modular code helps us build several connectors, plugins, and extensions for our system without much fuss.
Both the system design/software architecture concepts and the distributed systems fundamentals are a vital part of system design interviews as well as of coding distributed systems.
If you wish to master the fundamentals, check out my Zero to Software Architecture Proficiency learning path, comprising three courses that go through all the concepts starting from zero in an easy-to-understand language. The courses educate you, step by step, on the domain of software architecture, cloud infrastructure and distributed services design.
You can also check out several system design case studies and blog articles that I have written in this newsletter and my blog.
Next up: picking the fitting programming language.
3. Programming Language
While you can pick any server-side programming language to code a distributed system, there are a few factors that need to be considered while selecting a language, such as its:
Ecosystem
Performance
Community and support
Compatibility with our system requirements, such as support for async I/O, performant multi-threaded behavior leveraging multicore processors, efficient memory management, low-level networking, etc.
Developer expertise in that language
Existing distributed systems running in production coded in that language, and so on.
Some of the popular programming languages to code distributed systems are Java, Go, Rust, Scala, Erlang & Python.
What programming language did I pick?
I chose Java primarily because:
1. It's a robust and performant language to code a distributed system from scratch: it offers built-in support for concurrency (supporting both I/O and CPU-intensive applications), provides networking APIs, has a mature ecosystem, and the code can be deployed across any environment that supports the JVM.
For instance, Kafka is written in Scala, which is a JVM-based language and is interoperable with Java. Many of Kafka's libraries and connectors are written in Java.
2. Java is battle-tested and heavily used in the industry (which means a solid ecosystem). A bunch of popular distributed systems are implemented in Java, such as Apache Hadoop, ZooKeeper, Elasticsearch, Apache Storm, Flink, Cassandra, HBase, Apache Ignite, etc.
3. I am proficient in Java with years of industry experience, having coded several scalable modules from the bare bones, in addition to being involved in bug fixing, and production support for these services.
4. Data Structures & Algorithms
This goes without saying: data structures and algorithms are crucial for building performant distributed systems. When our systems get hit with concurrent traffic at the scale of millions, data structures and algorithms enable them to process data and requests efficiently at blazing speed.
For instance, knowledge of B-trees and hash tables is crucial for implementing efficient data storage and retrieval systems. Merkle trees are leveraged for data synchronization in a distributed environment. Bloom filters can help improve lookup times, reduce memory consumption, and optimize query performance in distributed databases and caches.
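As a quick illustration, here is a minimal Bloom filter sketch in Java. The bit-array size and the double-hashing trick are my illustrative assumptions, not a production-tuned design; the key property is that it may report false positives but never false negatives:

```java
import java.util.BitSet;

// A minimal Bloom filter sketch: a probabilistic set membership test.
// Sizing and hash derivation here are illustrative, not production-tuned.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int index(String key, int i) {
        // Derive k indices from one hash code (a simple double-hashing trick).
        int h = key.hashCode();
        int combined = h + i * (h >>> 16 | 1);
        return Math.floorMod(combined, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(index(key, i));
    }

    public boolean mightContain(String key) {
        // If any of the k bits is unset, the key was definitely never added.
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(key, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1024, 3);
        filter.add("user-42");
        System.out.println(filter.mightContain("user-42")); // true
    }
}
```

A distributed cache or database can consult such a filter before hitting disk: a negative answer guarantees the key is absent, skipping the expensive lookup entirely.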
Algorithms, such as distributed consensus algorithms, help us manage varied levels of data consistency across the nodes. Consistent hashing is leveraged to distribute and replicate data efficiently across the cluster. Techniques like weighted routing and other dynamic routing algorithms help optimize resource utilization, system performance, and so on.
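Consistent hashing, for instance, can be sketched in a few lines with a sorted map acting as the ring. The node names, virtual-node count, and use of String.hashCode are illustrative assumptions on my part; real systems use stronger hash functions:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// A bare-bones consistent hashing ring. Node names, virtual-node counts,
// and String.hashCode as the hash function are illustrative choices only.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public HashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        // Place several virtual points per node to spread load evenly.
        for (int i = 0; i < virtualNodes; i++)
            ring.put((node + "#" + i).hashCode(), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++)
            ring.remove((node + "#" + i).hashCode());
    }

    public String nodeFor(String key) {
        // Walk clockwise to the first node at or after the key's hash,
        // wrapping around to the start of the ring if necessary.
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        return (tail.isEmpty() ? ring.firstEntry().getValue()
                               : tail.entrySet().iterator().next().getValue());
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing(16);
        ring.addNode("node-a");
        ring.addNode("node-b");
        System.out.println(ring.nodeFor("some-key"));
    }
}
```

Because each node owns only the arc between itself and its predecessor on the ring, adding or removing a node remaps only a small fraction of the keys, which is exactly why systems like Cassandra and distributed caches rely on it.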
You get the idea. DSA is foundational in building optimized and performant systems.
5. Networking
Knowledge of low-level networking, including socket programming and how network communication happens in a distributed environment, is crucial in implementing distributed systems. Knowing network protocols such as TCP/IP, HTTP, UDP, etc., helps us understand how data is transmitted across machines/nodes in a network.
Distributed systems at a low level rely on protocols such as TCP/IP to reliably transmit data across a distributed environment. HTTP, as we are all aware, is what the web runs on; it's the fundamental protocol for client-server communication.
In addition, to design distributed systems capable of running across data centers in different geographical regions globally, we should understand the network topologies. This enables us to create scalable software capable of functioning with optimized network latency and minimal communication overhead.
As an example, Cassandra can detect network partitions and adjust clusters dynamically to maintain data consistency and availability.
Here are a few books I recommend that will get you up to speed on networking fundamentals:
TCP/IP Illustrated by W. Richard Stevens: the book provides an in-depth exploration of the TCP/IP protocol suite, the foundation of communication over the internet. It provides insights into how different protocols operate and interact within the TCP/IP framework and their implementation in real-world systems.
HTTP: The Definitive Guide by O'Reilly Media: a comprehensive understanding of the HTTP protocol, including its intricacies.
High-Performance Browser Networking by Ilya Grigorik: this book explores the intricacies of modern web networks, including protocols and performance optimization techniques. It covers topics such as TCP, UDP, HTTP, HTTPS, content delivery networks (CDNs), and more.
6. Operating Systems
Operating systems knowledge is another crucial aspect of implementing distributed systems.
Why?
Understanding memory management concepts such as virtual memory, paging, memory allocation, resource management, inter-process communication and scheduling algorithms helps optimize memory usage, including efficient task execution in distributed systems.
Knowledge of processes, threads, locks, semaphores, monitors, etc., enables us to code efficient systems capable of handling high-volume concurrent requests, ensuring optimum resource utilization. This augments our system throughput and responsiveness.
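For instance, a broker buffering in-flight messages leans on exactly these primitives. Here is a small producer-consumer sketch built from semaphores and a lock; the capacity and message values are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

// A producer-consumer sketch built from locks and semaphores, the kind
// of primitive a broker uses to buffer in-flight messages. Capacity and
// message values are illustrative.
public class BoundedBuffer {
    private final Deque<String> buffer = new ArrayDeque<>();
    private final Semaphore slots;                    // free capacity
    private final Semaphore items = new Semaphore(0); // available messages

    public BoundedBuffer(int capacity) { this.slots = new Semaphore(capacity); }

    public void put(String msg) throws InterruptedException {
        slots.acquire();                 // block if the buffer is full
        synchronized (buffer) { buffer.addLast(msg); }
        items.release();
    }

    public String take() throws InterruptedException {
        items.acquire();                 // block if the buffer is empty
        String msg;
        synchronized (buffer) { msg = buffer.removeFirst(); }
        slots.release();
        return msg;
    }

    public static void main(String[] args) throws Exception {
        BoundedBuffer buf = new BoundedBuffer(2);
        Thread producer = new Thread(() -> {
            try { for (int i = 0; i < 5; i++) buf.put("msg-" + i); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start();
        // The producer blocks whenever the buffer is full; the consumer drains it.
        for (int i = 0; i < 5; i++) System.out.println(buf.take());
        producer.join();
    }
}
```

The two semaphores enforce backpressure: the producer parks when capacity runs out instead of overwhelming the consumer, which is the same idea brokers apply at a much larger scale.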
I keep sharing gists of what I learn on my social media handles. Here are a couple of recent posts of mine on programs, processes, and threads, and how our web browsers leverage them to stay responsive, secure, and performant.
You can follow me on LinkedIn and X to receive insights and updates in your home feed. In addition, you can also subscribe to this newsletter to receive the content in your inbox as soon as it is published.
In addition to the above OS concepts, understanding file systems, I/O operations, and such helps us code systems intended to handle large-scale data storage and retrieval like Google BigTable, Hadoop, etc.
Here is a Kafka example that will help you comprehend better why understanding operating systems is vital in coding performant distributed systems:
Kafka's performance relies on the zero-copy optimization. Rather than copying message data from the OS page cache into application buffers and back out to the network, Kafka uses the sendfile system call to transfer bytes from the page cache directly to the socket. This approach minimizes latency and maximizes throughput by eliminating redundant data copies and context switches between kernel and user space.
I can get into the details of this, but it would unnecessarily extend the length of this post, so I'd better skip it for now. The essence is: for performance hacks like these, we ought to know the OS fundamentals.
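That said, the core idea can be glimpsed from Java itself: NIO's FileChannel.transferTo maps to sendfile on Linux, moving bytes between channels without a detour through user space. Here's a small sketch using temporary files created just for the demo:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// A sketch of the zero-copy idea using FileChannel.transferTo, which maps
// to sendfile on Linux: bytes move from the page cache to the destination
// channel without passing through a user-space buffer. The file paths are
// temporary files created only for this demo.
public class ZeroCopyDemo {
    public static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long transferred = 0, size = in.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (transferred < size)
                transferred += in.transferTo(transferred, size - transferred, out);
            return transferred;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc-src", ".log");
        Path dst = Files.createTempFile("zc-dst", ".log");
        Files.writeString(src, "hello zero copy");
        System.out.println(copy(src, dst));        // 15
        System.out.println(Files.readString(dst)); // hello zero copy
    }
}
```

In Kafka's case the destination channel is a network socket rather than a file, but the mechanism is the same system call.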
7. Software Engineering
Up to this point, we discussed the core ingredients of building a distributed system. When we start with the project, this goes without saying: a bunch of software engineering principles will apply, such as writing modular and extendable code, using fitting design principles, testing our code thoroughly (unit, integration, stress, and whatever other tests are required), and efficient logging and system monitoring.
8. Deployment & Cloud Computing
Our software needs to be deployed in a distributed environment to test its functioning. This may require leveraging technologies such as Docker and Kubernetes. I haven't dug deep into what tech exactly will be required to deploy our system yet; I'll update the roadmap as I proceed with the development.
We also need to cover the fundamentals of cloud computing to efficiently work our way around deployment. We may have to leverage a public cloud platform like AWS, GCP, or Azure to deploy our code.
If you are hazy on cloud fundamentals, my Zero to Mastering Software Architecture learning path has got you covered. It contains a dedicated course that covers cloud fundamentals including concepts like serverless, running complex workflows, vendor lock-in, cloud infrastructure, how large-scale services are deployed across different cloud regions and availability zones globally, deployment workflow and the associated infrastructure and technologies.
Moreover, the cloud is just a marketing term. Under the covers, it's all distributed systems.
The Good Part
In case you are overwhelmed after going through the detailed roadmap, please don't be. We do not have to know all the topics extensively to start our coding journey. We can start step by step and delve into the details as and when required.
I reckon software engineering fundamentals, data structures and algorithms, knowing a programming language, distributed systems fundamentals, and system design are the norm for software engineers.
Writing concurrent code using threads, networking, socket programming, and operating systems knowledge may be hazy, or you might have never delved into it at all. That's fine. Learning something new is always exciting. Besides, as I proceed in my journey towards coding a distributed system, I will provide insights into what concepts are applicable with examples and how to learn them.
If you already wish to practice coding distributed systems like Redis, Docker, Git, a DNS server and more step-by-step from the bare bones in the programming language of your choice, check out CodeCrafters (Affiliate). With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it's done. It's a head start into becoming an OSS contributor.
You can use my unique link to get 40% off if you decide to make a purchase. Cheers!
How I Wrapped My Head Around Beginning to Code a Complex System Like Kafka
When I got down to writing code after brushing up on the networking and Java multithreading fundamentals, I couldn't wrap my head around where to start. What is the first line of code, or the class, or the package, that I should create, since even a prototype would entail implementing multiple features and concepts?
I tried looking into the Kafka production code, and as I clicked in, I just stared at the vast expanse of code and complexity as if walking through endless stretches of land, unsure where to begin. Every line of code felt like a riddle waiting to be unraveled, acting as a doorway to another realm of complexity.
I hope you feel me.
I quietly clicked out of the repo and took a deep breath.
I eventually decided to break things down. Divide & Conquer.
Consensus algorithms are the core of any distributed system, so I decided to implement a consensus algorithm like Raft first. That would give insight into how nodes coordinate in a concurrent distributed scenario.
Researching implementing Raft brought me to replicated state machines. Distributed systems leverage state machines to efficiently manage state across different nodes in a cluster in addition to ensuring a fault-tolerant system. It was a good idea to implement them while implementing a consensus algorithm.
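The idea can be sketched with a toy key-value state machine: every replica that applies the same ordered log of commands ends up in the same state. The "SET key value" command format below is an illustrative choice of mine, not Raft's actual log format:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of the replicated state machine idea: each node applies the
// same ordered log of commands to a deterministic state machine, so all
// nodes converge to the same state. The "SET key value" command format
// is an illustrative choice, not Raft's wire format.
public class KvStateMachine {
    private final Map<String, String> state = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    public void append(String command) { log.add(command); }

    // Apply every log entry in order; deterministic given the same log.
    public void applyAll() {
        for (String command : log) {
            String[] parts = command.split(" ", 3); // SET key value
            if (parts[0].equals("SET")) state.put(parts[1], parts[2]);
        }
    }

    public String get(String key) { return state.get(key); }

    public static void main(String[] args) {
        // Two replicas fed the same log converge to the same state.
        KvStateMachine a = new KvStateMachine(), b = new KvStateMachine();
        for (KvStateMachine sm : List.of(a, b)) {
            sm.append("SET user alice");
            sm.append("SET user bob"); // the later entry wins
            sm.applyAll();
        }
        System.out.println(a.get("user")); // bob
    }
}
```

The hard part that Raft solves is getting every node to agree on that log's contents and order in the presence of failures; applying it, as above, is the easy part.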
To implement a distributed state machine, I needed to first get several nodes up and running, chit-chatting with each other. This required coding a TCP/IP server.
So, I coded a bare-bones TCP/IP server and tested it via Telnet (Woohoo), and will now incrementally add features to it, like the ability to handle concurrent requests, serve requests over HTTP in a blocking and non-blocking fashion, and more.
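For reference, the skeleton of such a server fits in a few lines of Java. This is my minimal sketch of the idea, not the exact code from the project; port 0 asks the OS for any free port, and the in-process client stands in for Telnet:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// A bare-bones single-threaded TCP echo server. Port 0 asks the OS for
// any free port; the in-process client below stands in for Telnet.
public class EchoServer {
    public static void serveOnce(ServerSocket server) throws IOException {
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line = in.readLine();   // blocking read of one line
            out.println("echo: " + line);  // echo it back to the client
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();
            // Exercise the server with an in-process client thread.
            Thread t = new Thread(() -> {
                try (Socket s = new Socket("localhost", port);
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()))) {
                    out.println("hello");
                    System.out.println(in.readLine()); // echo: hello
                } catch (IOException e) { throw new RuntimeException(e); }
            });
            t.start();
            serveOnce(server); // accept and serve a single connection
            t.join();
        }
    }
}
```

From this blocking, single-connection skeleton, the incremental features mentioned above (concurrent request handling, non-blocking I/O, HTTP) can be layered on one at a time.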
In my next post I’ve discussed the implementation of a single-threaded blocking bare bones TCP/IP server and the associated concepts. Do check it out. If you found this post interesting, stay tuned. I'll keep sharing insights into my development process.
If you found this newsletter post helpful, consider sharing it with your friends for more reach.
If you are reading the web version of this post, consider subscribing to get my posts delivered to your inbox as soon as they are published.
You can get a 50% discount on my courses by sharing my posts with your network; discounts are unlocked based on referrals. Check out the leaderboard page for details.
You can find me on LinkedIn & X and can chat with me on Substack chat as well. I'll see you in the next post. Until then, Cheers!