How to wrap our heads around large codebases and open-source GitHub repositories

Jul 12, 2024

When we face an unfamiliar large codebase, whether at our workplace or in an open-source GitHub repository, our first thought is, 'How do I understand the high-level architecture of this project? I need to understand it inside out to learn from it and be able to contribute to it.'

We then click through endless code files and inevitably get overwhelmed. Trying to understand the high-level project architecture and the business use cases, which span several complex system modules, by reading code alone is the wrong move.

The advisable way is to go through the project documentation, including the design docs (if you have them available), to understand the high-level architecture, including various flows and business use cases.

Let's understand this better with the help of an example. Say I browse through the GitHub repo of Elasticsearch. Elasticsearch is the de facto distributed search and analytics solution used in enterprise projects. The repo is continually evolving, with support for vector search, RAG, generative AI apps, real-time search over massive datasets, etc.

To gain insights into what the project does, including the high-level architecture, we should go through the README file, the project documentation, and the blog as opposed to trying to figure these things out via code.

Project docs provide comprehensive information about the product architecture, different modules, internal specifics, deployment, etc. Blogs ideally contain more related information about the product in addition to simple hands-on examples. This is where we get our foot in the door.

Wrapping our heads around large codebases and open-source GitHub repositories is not trivial and needs a significant time investment.

Hoping to start contributing to a large codebase with minimal time investment is like walking into an organization whose codebase has been developed for years and expecting to push code to production on day one.

The open-source contribution insanity

A surprisingly large number of influencers are misguiding beginners into contributing to open source to grow their skills and improve their resumes, and their followers are falling for it en masse. This is like expecting a baby to climb Mt. Everest.

And in projects with minimal or no documentation, it is like having to climb without supplemental oxygen. What do you expect the outcome to be?

I have always advocated against this. Open-source contribution is not for beginners. If you want to grow your skills, develop projects from the bare bones. Know thy fundamentals; be good with low-level and high-level design patterns, system architecture, databases and everything else you would need to build a product from scratch.

I have aggregated a bunch of industry-relevant coding projects to help you get more hands-on practice and improve your development skills. Check those out.

If you want to master web architecture and system design, check out my Zero to Software Architecture Proficiency learning path, comprising three courses that go through all the concepts starting from zero in an easy-to-understand language. The courses educate you, step by step, on the domain of web/software architecture, cloud infrastructure and distributed system design.

Do not get caught up in the idea of contributing to open source right at the beginning of your career. Be deft at writing code, build scalable services from scratch and then maybe, if you feel like it, delve into open-source contribution.

Moreover, if you do want to write open-source code as a beginner, then instead of grappling with a mature, complex codebase, pick a proprietary product and build an open-source version of it from scratch. You'll learn dramatically more than you would by contributing a few lines of code to an existing repo.

Coming back to the main topic: to understand the high-level architecture of a repository, we should look into the project docs.

What if there is no documentation?

If the documentation doesn't exist, then it is going to be an uphill battle. It will be like getting into a battle royale mode where we have to scavenge for resources after being dropped into an arena.

Check the README file, go through the commit history, and try to get in touch with the project collaborators and core devs for insights into the project.
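To make the commit-history step a bit more concrete, here is a minimal Python sketch that counts how often each file shows up in the output of git log. Files that change frequently are often where the active development is happening, so they can be a reasonable place to start reading. The function names count_file_changes and most_changed_files are made up for this illustration; the sketch assumes git is installed and that the path points to a cloned repository.

```python
import subprocess
from collections import Counter


def count_file_changes(log_output: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output."""
    paths = (line.strip() for line in log_output.splitlines())
    # Blank lines separate commits in the output, so skip them.
    return Counter(path for path in paths if path)


def most_changed_files(repo_path: str, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` most frequently changed files in the repo at `repo_path`."""
    # --pretty=format: suppresses the commit headers, leaving only file names.
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_file_changes(result.stdout).most_common(top)
```

Calling most_changed_files(".") from inside a cloned repository returns (path, count) pairs, with the most frequently changed files first.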

Getting our hands dirty

The next step is to check the test cases to understand the business logic and other code functionality. If the functions are well-documented, that would help a lot in putting the pieces together.
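A quick way to see where the tests live is to scan the repo for files that follow common test naming conventions. The sketch below is only an illustration: the find_test_files name and the list of patterns are my assumptions, and a given project may name its tests differently.

```python
from pathlib import Path

# Assumed naming conventions; adjust them for the project's language and test framework.
TEST_PATTERNS = ("test_*.py", "*_test.py", "*Test.java", "*Tests.java", "*.test.ts", "*.spec.ts")


def find_test_files(repo_root: str) -> list[Path]:
    """Return test files under `repo_root`, relative to it and sorted by path."""
    root = Path(repo_root)
    found = set()
    for pattern in TEST_PATTERNS:
        # rglob searches the root directory and all of its subdirectories.
        for path in root.rglob(pattern):
            if path.is_file():
                found.add(path.relative_to(root))
    return sorted(found)
```

Reading the tests for one module at a time, starting with the module you care about, usually gives a faster picture of the expected behavior than reading the implementation first.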

Deploy the code on your local machine and start debugging. We may not have to understand every nook and cranny of the codebase; rather, we can focus on specific flows.

If the project has a UI, for instance, if it's a web-based application, check which endpoints the different sections of a webpage are hitting. Find those endpoints in the backend, set breakpoints on them, and step through the code to figure out the flow.
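Once we know which URL a page section is calling, finding the matching backend code is mostly a text search. Here is a rough Python sketch of that search. The find_endpoint name, the list of file extensions and the /api/orders path used in the note below are all made up for illustration; a plain search in your IDE or with grep does the same job.

```python
from pathlib import Path

# Assumed set of source extensions; extend it for the project's stack.
SOURCE_SUFFIXES = {".py", ".java", ".js", ".ts", ".go", ".rb"}


def find_endpoint(repo_root: str, url_fragment: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, line) for every source line mentioning `url_fragment`."""
    matches = []
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        # errors="ignore" keeps the scan going past files with odd encodings.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if url_fragment in line:
                matches.append((str(path.relative_to(repo_root)), number, line.strip()))
    return matches
```

For example, find_endpoint(".", "/api/orders") lists every source line that mentions that path. The route definition among those hits is where the first breakpoint goes.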

This is the best way to understand large codebases. After enough debugging, you'll be able to make sense of larger parts of the codebase.

Feature development and production support

We prefer product development roles over production support jobs that largely involve bug fixing and site reliability tasks. However, debugging and fixing varied bugs across the system provides us with a deep knowledge of system design, which may not be possible during feature development, where we are focused on a specific part of the system.

Furthermore, while fixing bugs, we get exposed to the different technologies used in the project, collaborate with different teams like testing, ops, etc., and delve into things like system observability, efficiency, finding and removing bottlenecks, infrastructure scaling and related things.

Through this, we develop critical problem-solving skills from a system and infrastructure standpoint. Regularly fixing varied bugs hones our troubleshooting and diagnostic skills, making us adept at identifying and resolving issues quickly.

I've written a detailed post on system observability in case you want to read it.

When we deploy a project on our local machine and debug through the code, we get an experience similar to working as a developer on prod support. Debugging helps us understand the code flow and how different classes and functions interact, for example through design patterns, and it enhances our knowledge of low-level code design.

As a beginner, you can learn from open-source, but don't be hasty about contributing to it just for the heck of it. Instead, build your own projects from scratch.

I am saying this from experience, as in my career, I've developed products and features from the bare bones, in addition to working as a developer on support on massive codebases, and I have learned immensely, having gotten the best of both worlds.

Can we leverage LLMs to understand a GitHub repo's code?

I searched extensively for an AI tool that could scan through a codebase and lay out the information for easy understanding.

I came across a few tools, and almost all have a paywall, so I couldn't check how effective they were. Rightly so, as scanning through a large repo and figuring things out would require significant compute. Also, before scanning third-party code with AI, we first need to check the license of the repository we intend to scan. Do go through its usage rights and respect them.

Also, my experience with AI is that you shouldn't take everything it says at face value. It regularly makes subtle mistakes that are hard to catch, especially if you have little experience with the topic you are taking AI's advice on. Cross-verify when you have doubts.

Know your domain.

If you found this post helpful, consider sharing it with your friends for more reach. If you are reading the web version of this post, consider subscribing to get my posts delivered to your inbox as soon as they are published.

Check out the list of industry-relevant coding projects you can do on the side for more hands-on practice.

You can find me on LinkedIn & X and can chat with me on Substack chat as well. I'll see you in the next post. Until then, Cheers!

Web Scale (Systems Architecture & Systems Programming)
