Distributed Tracing with OpenTelemetry and Jaeger: A Hands-on Guide

From Tracing API Requests to Context Propagation: A Step-by-Step Journey in Monitoring Your System

Monitoring a system is not easy, and it becomes critical when something breaks and you need to find out what's going on. If you can't see what is wrong with the system, a specific request, or the data, you need to re-evaluate your tracing, metrics, and logs.

We will learn distributed tracing with the OpenTelemetry library.

OpenTelemetry (OTel) is an open-source observability framework that has become the de facto industry standard and is available for most major programming languages.

A basic tracing example for an API is seeing the response time of an endpoint, which critical functions or layers run during the request, and what kind of query the database layer executes.

Those sections of a trace, the critical functions or layers I just mentioned, are called spans in OTel. Span is a general term, so you will also see it in other distributed tracing products, such as New Relic (a paid product).

OpenTelemetry puts it this way: "A span represents a unit of work or operation. Spans are the building blocks of Traces."

Jaeger is an open-source tracing backend with a UI that stores and visualizes OpenTelemetry traces. The UI makes it easy to find root causes. Check out their docs for more details.

Let's see an example trace within Jaeger.

GetOrderById Span

This trace has four spans (sections): one main span and its child spans, nested like a ladder.

The API's name is Order Api, and our endpoint is "orders/:id". The total response time is 7.65 ms.

We have a Controller, Service, and Repository pattern here. Let's expand the Controller span.

This span has correlation.id, format, order.id, and span.kind tags. I added CorrelationId and OrderId manually in code; the library generates the other two. You can attach extra information like this to spans, and it makes debugging and monitoring much easier: copy a correlation ID, paste it into the Jaeger UI, and BAM! You can see the whole trace for that correlation ID.

The extra information I added, OrderId and CorrelationId, is called Attributes. OpenTelemetry says: "Attributes are key-value pairs that contain metadata that you can use to annotate a Span to carry information about the operation it is tracking." That is exactly what we did; we carried some IDs.

Let's get to the Service span.

This span has a Logs section, which you can populate manually. I added an event there, such as "calling the X Repository". You can use events for business logs to make your monitoring easier.

And let's get to our last section: Repository Span.

This span has two child spans: one I created, and one from Gorm, an ORM library for Golang. As far as I understand, it is the Entity Framework of the Golang world.

The important thing I want to talk about here is the Gorm section; let's expand its tags.

In this span, you can see the DB statement, the exact SQL that ran during your request, which can sometimes be critical. When we use an ORM everywhere, we can end up with complex, heavy queries that we really don't need. That is why I think this is a critical section.

I will share the complete code on GitHub. To keep this section short, I will only cover the essential parts.

To create traces in your app, you need a TracerProvider. You can create it while bootstrapping your application, with a function like the one below.
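
A simplified version of that bootstrap function looks roughly like this, assuming the go.opentelemetry.io/otel/exporters/jaeger package and a local Jaeger collector (newer OTel releases recommend the OTLP exporter instead; the service name and endpoint here are placeholders):

```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracerProvider wires a Jaeger exporter into a TracerProvider and
// registers it globally so otel.Tracer() works everywhere in the app.
func initTracerProvider(collectorURL string) (*tracesdk.TracerProvider, error) {
	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(collectorURL)))
	if err != nil {
		return nil, err
	}
	tp := tracesdk.NewTracerProvider(
		tracesdk.WithBatcher(exp),
		tracesdk.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("Order Api"), // shows up as the service name in Jaeger
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```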

To see Gorm queries in Jaeger, Gorm has an official OpenTelemetry plugin that generates a span for every query.
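
Wiring it up is a one-liner on the *gorm.DB instance. A sketch, assuming the gorm.io/plugin/opentelemetry/tracing package and a placeholder Postgres DSN:

```go
import (
	"gorm.io/driver/postgres"
	"gorm.io/gorm"
	"gorm.io/plugin/opentelemetry/tracing"
)

func initDB(dsn string) (*gorm.DB, error) {
	db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		return nil, err
	}
	// Register Gorm's OpenTelemetry plugin so every query gets its own span.
	if err := db.Use(tracing.NewPlugin()); err != nil {
		return nil, err
	}
	return db, nil
}
```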

These two code sections can live in your main.go file.

Let's see the Controller file.

In this endpoint, you get a Tracer from the global TracerProvider by name, then create a span with the Start function.

OTel generates an ID for every span; you can access your TraceId with span.SpanContext().TraceID().

Remember the tags in our Controller section, OrderId and CorrelationId? You can add them with the span.SetAttributes() function.

If anything goes wrong in your function, your business logic, or anywhere else, you can set the span's status to Error with the span.SetStatus() function.

There are three statuses in OpenTelemetry: Error, Ok, and Unset.

Unset is the default, and it means the span completed without an error being set. Ok is optional: if you want to explicitly mark a span as succeeded, you can use it; otherwise, Unset is fine as well.
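
Putting those pieces together, the controller could look roughly like this. This is a sketch, not the exact project code: the net/http handler shape, the X-Correlation-Id header name, and the Order and OrderService types are assumptions.

```go
import (
	"encoding/json"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("OrderController") // the tracer name is an assumption

type OrderController struct {
	service OrderService // hypothetical service type, shown in the next section
}

func (c *OrderController) GetOrderById(w http.ResponseWriter, r *http.Request) {
	ctx, span := tracer.Start(r.Context(), "GetOrderById Controller")
	defer span.End()

	orderID := r.PathValue("id")                      // Go 1.22+ path parameter
	correlationID := r.Header.Get("X-Correlation-Id") // hypothetical header name

	// Every span carries a generated trace ID; handy for logging.
	_ = span.SpanContext().TraceID().String()

	span.SetAttributes(
		attribute.String("order.id", orderID),
		attribute.String("correlation.id", correlationID),
	)

	order, err := c.service.GetOrderById(ctx, orderID)
	if err != nil {
		span.SetStatus(codes.Error, err.Error())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	_ = json.NewEncoder(w).Encode(order)
}
```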

This is what our endpoint does in a nutshell. Let's see our Service Function.

This function is more straightforward than our endpoint function. It starts a span called "GetOrderById Service" with the Start function.

Then we add the Event we saw in the Logs section in Jaeger.
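
A minimal sketch of that service method (the Order type and the repository are assumed):

```go
import "context"

func (s *OrderService) GetOrderById(ctx context.Context, id string) (Order, error) {
	ctx, span := s.tracer.Start(ctx, "GetOrderById Service")
	defer span.End()

	// This event shows up under the span's Logs section in Jaeger.
	span.AddEvent("calling the Order Repository")

	return s.repository.GetOrderById(ctx, id)
}
```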

We created our service.tracer object like this (sketched here with assumed names):
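
```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

type OrderService struct {
	tracer     trace.Tracer
	repository OrderRepository // hypothetical repository type, shown below
}

func NewOrderService(repository OrderRepository) OrderService {
	return OrderService{
		// otel.Tracer pulls a named Tracer from the global TracerProvider.
		tracer:     otel.Tracer("OrderService"), // the name is an assumption
		repository: repository,
	}
}
```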

Let's see our Repository section.

We start a new span in the repository, add an event to get some logs into the trace, and add a second event at the end of the function so we can be sure everything went well.

Here, the db.WithContext() function is critical: Gorm creates its own span, and we need to connect it to our context. Otherwise, we won't see the generated query section in Jaeger.
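
Here is a sketch of the repository method with those events and the WithContext call (type and table names assumed):

```go
import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"gorm.io/gorm"
)

type OrderRepository struct {
	tracer trace.Tracer
	db     *gorm.DB
}

func (r *OrderRepository) GetOrderById(ctx context.Context, id string) (Order, error) {
	ctx, span := r.tracer.Start(ctx, "GetOrderById Repository")
	defer span.End()

	span.AddEvent("querying the orders table")

	var order Order
	// WithContext links Gorm's auto-generated query span to this trace;
	// without it, the generated SQL span never appears in Jaeger.
	if err := r.db.WithContext(ctx).First(&order, "id = ?", id).Error; err != nil {
		return Order{}, err
	}

	span.AddEvent("order fetched successfully")
	return order, nil
}
```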

That, in a nutshell, is distributed tracing with spans, Jaeger, and OTel.

I have added a second project to GitHub. It is a small app that calls our GetOrderById endpoint with a CorrelationId. You can call it at http://localhost:3000/call-service-b. After that, you will see your traces in the Jaeger UI at http://localhost:16686.

The project has two dependencies, Postgres and Jaeger. I have added the Docker commands to the GitHub repository's Readme file.


Now we can talk about the "Distributed" part of tracing. When you run microservices or multiple services, you can use the same TraceId across them. That means you can see all your spans across services in one trace in the Jaeger UI. That is the most critical feature of distributed tracing.

You move your context to the next service; this feature is called Context Propagation.

In a nutshell, the library carries a value called traceparent in HTTP headers. The receiving service extracts the TraceId from it and creates its spans under that ID.

To propagate context, we need to set a Propagator when we initialize our Tracer. We also need to add this configuration to both the API and the client projects.
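
With the Go SDK, that registration could look like this minimal sketch (the W3C Trace Context propagator is the one that carries the traceparent header):

```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// Called once at bootstrap in both the API and the client project.
func initPropagator() {
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // W3C traceparent header
		propagation.Baggage{},      // optional key-value baggage
	))
}
```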

In the client project, we need to inject the traceparent value into the outgoing request headers before making the HTTP call.
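
A sketch of that injection (the URL and function name are placeholders):

```go
import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// callOrderAPI performs an HTTP call with the current trace context attached.
func callOrderAPI(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Inject writes the traceparent header into the outgoing request.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}
```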

In Order API, we need to access the Propagator, extract the context from the incoming request, and create a new span with that context.
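
In the handler, that extraction could replace the first lines of the earlier controller sketch:

```go
// Rebuild the caller's context from the incoming traceparent header,
// then start a span that becomes a child of the client's span.
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
ctx, span := tracer.Start(ctx, "GetOrderById Controller")
defer span.End()
```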

As you can see in the code, I am reading the "traceparent" key from the context and printing it to the console. An example traceparent value is "00-8b1273d5ba625f440e6fcce14fece8af-091e4e756e49e1f3-01".

So, what does this value mean?

Version "00": Indicates the version of the Trace Context specification. There is a specification for propagating the Context: the W3C Trace Context specification.

Trace Id "8b1273d5ba625f440e6fcce14fece8af": This is the Trace ID we created in the client project, and now Order Api has the same One.

Span Id "091e4e756e49e1f3": The Id of Span. Every span has a unique ID.

Trace Flag "01": Trace Flag is the value of whether the trace should be sampled. "01 means yes.

Now, when we call the Order Api GetOrderById endpoint with our client project, we can see a trace like this in Jaeger:

As you can see, we have "service-a" (our client project) and Order Api within the same trace; the two projects share the same Trace ID.

You can reach the Order API codebase with Context Propagation here, and the client project here.

Sampling

Sampling is a technique used in distributed tracing to control the amount of data collected by recording only a subset of requests. This is crucial for reducing the overhead of collecting, storing, and analyzing every trace in high-traffic systems. 1% sampling is common for high-throughput systems.

There are two sampling types in OpenTelemetry: head sampling and tail sampling.

In head sampling, the decision to sample a trace is made at the start of the request. The entire trace (including all child spans) is recorded if the request is selected for sampling. This approach is simple and efficient, but it may miss important events later in the request if they were not sampled initially.
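
In the Go SDK, head sampling can be configured on the TracerProvider. A sketch, reusing the exporter from the bootstrap example, with an illustrative 1% ratio:

```go
tp := tracesdk.NewTracerProvider(
	tracesdk.WithBatcher(exp),
	// Sample roughly 1% of new traces. ParentBased makes child spans follow
	// the decision made at the root, so traces are never recorded halfway.
	tracesdk.WithSampler(tracesdk.ParentBased(tracesdk.TraceIDRatioBased(0.01))),
)
```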

In tail sampling, the sampling decision is made at the end of the request, allowing for more intelligent choices based on the outcome of the trace. For example, a trace may be selected for sampling if an error occurs during the request. Tail sampling is ideal for capturing rare or critical events, but it is more complex to implement and requires buffering trace data until the decision is made.


I created this Order Api in a YouTube video and later added Swagger, Gorm, and Postgres integrations. Now that we have added OpenTelemetry, we can make a detailed YouTube video about it as well. You can see the Golang videos here.

I add a new feature to the API in every video to show the steps of building a service.


That's it for distributed tracing with OpenTelemetry. OpenTelemetry has many more monitoring features, such as logs and metrics, and I plan to write blog posts about them as well.

If you have any feedback, please write it down in the comments.

May the force be with you!