Picture running a complex distributed service like a global e-commerce website powered by several backend microservices, such as the product catalog service, inventory service, order service, payment service, and so on. These individual microservices may themselves be composed of further microservices, depending on module requirements. This gives us an idea of the complexity of our service architecture.
At peak times, orders are created in the system at a healthy rate, but over time, customers start to experience delays in payment processing. A significant percentage of cart payments begin to fail.
What do we do? How do we pinpoint the bug or issue in our system? How do we figure out which part or component of our system is at fault?
This is where end-to-end system observability saves the day.
What is observability?
Understanding Observability
Observability refers to the degree or level of understanding we have of our distributed system running in production. System observability enables us to analyze the system's behavior and to pinpoint and fix the issues it experiences in production in minimal time. Without it, we would be left in the dark, with no idea of what had gone wrong.
Distributed services such as massive e-commerce sites, movie streaming platforms, social networks, etc., deployed across the world in different cloud regions and availability zones are complex in nature. To ensure their smooth functioning, we need to have real-time production insights.
System observability helps us ensure system reliability, availability, scalability, and much more in distributed systems, which I am going to discuss in this post.
We'll begin with telemetry, which is a fundamental component of observability.
What Is Telemetry?
Telemetry is the automated process of collecting and transmitting insightful data from different parts of a distributed system to a centralized location for monitoring and analysis. It enables the platform, development, and infrastructure teams to understand what is going wrong when the system starts to behave unpredictably.
What is this insightful telemetry data I am talking about?
Telemetry data comprises logs, metrics, traces, and other relevant contextual information. Continually sending this data, in real time or at regular intervals, from the different microservices and other parts of the distributed system to a centralized location is a key part of the continuous monitoring process.
This helps the teams gain insight into the behavior and performance of the system, enabling them to identify and fix infrastructural issues, in addition to performing future capacity planning.
Let's take a quick look at logs, metrics, and traces.
Telemetry Data (Logs, Metrics, Traces)
Logs
When we code an application, we add logs alongside the main code to give us insight into the code flow, both during the development phase and after the application is deployed to production.
During development, logs help us debug the code, and in production, they give us a clear insight into the application flow, helping us understand the system's behavior. This is how we get visibility into the functioning of our software.
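As a quick illustration, here is a minimal sketch using Python's standard logging module; the payment-service name, function, and log fields are hypothetical, purely for demonstration:

```python
import logging

# Configure a logger for a (hypothetical) payment service with
# timestamps and severity levels so production logs are searchable.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("payment-service")

def charge_card(order_id: str, amount_cents: int) -> None:
    # Log the key event with contextual fields (order id, amount) so it
    # can be correlated with other telemetry later.
    logger.info("charging card order_id=%s amount_cents=%d", order_id, amount_cents)
    try:
        ...  # call the payment gateway here
    except Exception:
        # Capture the full stack trace on failure for production debugging.
        logger.exception("payment failed order_id=%s", order_id)
        raise

charge_card("ord-42", 1999)
```

The key habit is attaching contextual identifiers (like the order ID) to every log line, so a single failing request can be followed across log entries.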
Metrics
Metrics primarily help us gauge the system's performance. Typical metrics that we analyze are response times, rate of specific user events, throughput, CPU, memory, disk utilization, network latency, error rates, system availability metrics (typically in %), and so on.
So, for instance, say I develop a certain application module or feature and deploy it on a server. The logs will help me understand the code flow; metrics will help me understand server resource usage. With metrics, I can gauge the resource consumption of the virtual machines, bare-metal servers, or cloud platform our workload is hosted on.
This helps with capacity planning, reveals which service features are resource-hungry, and ensures our infrastructure has enough capacity to handle peak traffic.
Besides the infrastructure resource consumption, metrics help us understand the service response times, throughput, user events, error rates, etc., as well.
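For a flavor of how such metrics are emitted from code, here is a minimal sketch using the Prometheus Python client (one popular option, not the only one); the metric names and endpoint are illustrative assumptions:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# A counter for completed orders and a histogram of payment latencies;
# a Prometheus server scrapes both from the /metrics endpoint below.
ORDERS_TOTAL = Counter("orders_created_total", "Total orders created")
PAYMENT_LATENCY = Histogram("payment_latency_seconds", "Payment processing latency")

def process_payment() -> None:
    with PAYMENT_LATENCY.time():               # records how long this block takes
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real payment work
    ORDERS_TOTAL.inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        process_payment()
```

A dashboard like the one below would then graph these series over time.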
[Image: a Grafana dashboard displaying production metrics. Img src: Grafana]
There are no strict rules on what events, insights, or data can be termed as metrics. Whatever helps us understand our system over a period of time can be deemed a helpful metric.
Traces
Traces capture the flow of requests as they travel through the different components of a distributed system. So, if a product purchase request goes through the product catalog service, to the inventory service, to the payments service, and so on, the entire journey can be reconstructed from its trace. This helps us understand how requests flow through the system and whether any bottlenecks are causing throughput issues.
Traces are crucial in distributed system observability as they provide insights into the end-to-end journey of a request as it travels through different components in the system architecture, such as the load balancers, proxies, caches, API gateways, backend servers, databases and so on. The more observability our system has, the better. There should ideally be no blind spots.
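Here is a minimal sketch of what emitting traces can look like with the OpenTelemetry Python SDK, printing spans to the console for simplicity; a real deployment would export them to a tracing backend, and the service and span names here are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Set up a tracer that prints spans to the console; in production the
# exporter would point at a collector or tracing backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

def purchase(product_id: str) -> None:
    # One parent span for the whole purchase; child spans mark each hop,
    # so the end-to-end journey can be reconstructed from the trace.
    with tracer.start_as_current_span("purchase") as span:
        span.set_attribute("product.id", product_id)
        with tracer.start_as_current_span("inventory-check"):
            ...  # call the inventory service
        with tracer.start_as_current_span("payment"):
            ...  # call the payment service

purchase("sku-123")
```

In a real system, context propagation carries the trace ID across service boundaries, so spans from different microservices stitch together into one end-to-end trace.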
Summarizing what we've learned so far: logs help us understand a specific part or component of our system; metrics help us understand the behavior and resource consumption of specific components as well as the system as a whole; and traces provide end-to-end visibility into the system.
Besides the logs, metrics and traces, there is another element that is key to observability: Continuous Profiling.
Continuous Profiling
Continuous profiling provides a deeper insight into the production infrastructure in comparison to the telemetry data (logs, metrics, and traces) we discussed above.
Continuous profiling, which happens in production, is similar to code profiling and microbenchmarking that we developers do in our local systems before pushing our code to the remote repo.
If you are hazy on code profiling and microbenchmarking, here is the gist:
Code profiling, with the help of specific tools, helps us measure the code performance (it can be specific modules or the entire codebase) to gauge performance bottlenecks, excessive resource usage, memory leaks, and other issues. Code profilers collect data during code execution and provide insights into the code behavior.
Similarly, microbenchmarking focuses on profiling specific units of code at a very fine-grained level, aiming for high precision. In microbenchmarking, the scope is narrower than code profiling. Specific functions/methods or code snippets are tested in isolation to identify their performance.
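For instance, here is a tiny microbenchmark using Python's built-in timeit module, comparing two ways of building a string (an illustrative example of the technique, not taken from any particular codebase):

```python
import timeit

# Time each snippet in isolation, repeated enough times for a stable reading.
concat = timeit.timeit("s = ''\nfor i in range(100): s += str(i)", number=10_000)
joined = timeit.timeit("''.join(str(i) for i in range(100))", number=10_000)

print(f"+= concatenation: {concat:.3f}s, str.join: {joined:.3f}s")
```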
Along the same lines, continuous profiling in production provides visibility into data-structure memory issues, the duration of function calls, memory allocation issues, CPU consumption, disk I/O, etc., at both the kernel and userspace levels.
The process gives us a map of the hot spots in our infrastructure from a resource-consumption standpoint, at a depth that is not possible with the three types of telemetry data above, making continuous profiling a critical part of system observability.
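Wiring a service up for continuous profiling is often just a few lines. As a sketch, assuming the open-source Pyroscope profiler and its Python SDK (the application name and server address are placeholders):

```python
import pyroscope  # pip install pyroscope-io

# Attach a sampling profiler to the running service; profiles stream
# continuously to the Pyroscope server for flame-graph analysis.
pyroscope.configure(
    application_name="payment-service",      # hypothetical service name
    server_address="http://pyroscope:4040",  # placeholder server URL
)

# The service's normal work continues from here; the profiler samples
# it in the background with low overhead.
```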
Back To Our E-commerce Distributed Service Use Case
When customers start experiencing delays in payment processing, we can check the logs to analyze the code flow and identify if any errors or exceptions are reported by the payment microservice or any other microservice that is part of the payment flow.
We have metrics to gauge issues at the infrastructure level. The resource consumption metrics will tell us if any of the servers are overloaded and need horizontal scaling. In addition, metrics on the rate of order flow, product purchases, and other related user events are how we realized in the first place that the rate of orders created on the website was dropping.
We can further leverage traces to follow requests through the entire product purchase flow and pinpoint the specific system components experiencing issues. These components can be load balancers, caches, microservices, API gateways, databases, message queues, and so on.
Finally, we have the continuous profiling data to analyze our system at a much deeper level, pinpointing whether any specific code path, service, or component is hogging resources.
This is how system observability is vital in running reliable real-world distributed services. We cannot do without it.
Observability-Driven Development
Since observability is key to building modern distributed services, there are quite a number of observability tools/stacks leveraged in the industry that provide end-to-end observability solutions, such as the ELK (Elasticsearch, Logstash, Kibana) stack, Prometheus, Grafana, Datadog, New Relic, OpenTelemetry, Google Cloud Profiler, etc., with every tool or solution having its own use case.
Now, let's understand how distributed services are built with observability in mind.
How Distributed Services Are Built With Observability In Mind
When we write code, we add monitoring code along with the business logic implementation to enable the monitoring tools to understand the code behavior when the application runs in production. Adding logs to our code is one example of it.
The process of adding observability code alongside our main code is known as instrumentation. When the service runs in production, the telemetry data is streamed to monitoring servers in an automated fashion, enabling the dev and support teams to analyze the data on the dashboards of specific observability solutions. This gives us insight into what is happening in our live service.
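To make this concrete, here is a minimal sketch of instrumentation living alongside business logic, reusing the hypothetical names and the logging, metrics, and tracing libraries from the earlier snippets:

```python
import logging

from opentelemetry import trace
from prometheus_client import Counter

logger = logging.getLogger("order-service")
tracer = trace.get_tracer("order-service")
FAILED_PAYMENTS = Counter("failed_payments_total", "Failed payment attempts")

def place_order(order_id: str) -> None:
    # The span, log lines, and metric are instrumentation; only the
    # body of the try block is business logic.
    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.id", order_id)
        logger.info("placing order order_id=%s", order_id)
        try:
            ...  # business logic: reserve stock, charge the card
        except Exception:
            FAILED_PAYMENTS.inc()
            logger.exception("order failed order_id=%s", order_id)
            raise
```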
Here is a step-by-step process, from writing code on our local machine to running it in production, of how we can ensure our code is performant as well as observable.
When writing code, we need to ensure relevant logs, error statements, and exceptions are added to help understand the code flow in production. Key events should be logged with appropriate contextual information.
During the process, we can also micro-benchmark specific lines of code to gauge performance. Once we are done writing code, it's a good idea to profile it with a code profiler to catch performance bottlenecks.
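A quick sketch of the latter using Python's built-in cProfile (the function here is a stand-in for real application code):

```python
import cProfile
import pstats

def hot_path() -> int:
    # Stand-in for a real code path we suspect is expensive.
    return sum(i * i for i in range(100_000))

# Profile the call, save the stats, and print the five most
# expensive entries by cumulative time.
cProfile.run("hot_path()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```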
The code should have excellent test coverage with unit and integration tests. Well, this goes without saying. Tests validate the correctness of code.
Static code analysis with relevant tools is also done during this phase to analyze the code for memory leaks, adherence of the code to the organization's code style, duplicate sections, vulnerabilities, etc. The whole process is more like an automated code review.
You may or may not see static code analysis directly related to performance and observability; I brought it up since it's a good development practice.
As the code is pushed to the remote repo, an automated build is triggered on the CI (Continuous Integration) server, which ideally runs the same checks the dev ran on their local system, in addition to running additional scripts.
After the successful build, the code is deployed on the staging, testing or pre-prod environment based on the organization's practice, where it is stress-tested under simulated traffic. This is where metrics come in handy.
They give us insights into the system's behavior, bottlenecks, other scalability issues, and more when subjected to heavy traffic, as I discussed at the beginning of the post.
Once the code is deployed to production, continuous monitoring kicks in to keep tabs on infrastructure resource usage and application behavior in real time.
Along with keeping an eye on the infrastructure, we analyze errors, exceptions, and logs in real time, leveraging error-tracking tools. Alerts and notifications are set up to trigger if error rates go beyond a certain threshold.
Before Beginning to Write Code: Observability Planning at the System Design Stage
Determining which points, metrics, and processes to monitor is crucial before we start coding the service. Observability planning should happen in the design phase of our distributed service. Planning observability at design time enables us to accurately collect system metrics from containers, services, servers, and other specific components of our system architecture.
Before beginning to write code, we should know where to instrument and what contextual information to attach to the telemetry data. When we are aware of these observability specifics, we can design our code in a way that lets us modify and adapt system observability on an ongoing basis without significant refactoring.
Folks, I believe this newsletter post gave you a detailed insight into the process of designing and developing performant and observable distributed services.
If you wish to delve deeper into the fundamentals of designing large-scale distributed systems, along with an understanding of the cloud infrastructure on which web-scale services run, check out the Zero to Software Architecture Proficiency learning path authored by me. It comprises three courses that take you right from zero to a comprehensive understanding of the fundamentals of distributed system design.
Additionally, if you wish to learn to code distributed systems from the bare bones, I am running a series on it in this newsletter. Do check it out here.
If you wish to practice coding distributed systems like Redis, Docker, Git, a DNS server and more from the bare bones in the programming language of your choice, check out CodeCrafters (Affiliate). With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it’s done.
You can use my unique link to get 40% off if you decide to make a purchase.
You can get a 50% discount on my courses by sharing my posts with your network. Based on referrals, you can unlock course discounts. Check out the leaderboard page for details.
If you are reading the web version of this post, consider subscribing to my newsletter.
You can find me on LinkedIn & X. I'll see you in the next post. Until then, Cheers!