Distributed tracing in microservices — OpenTelemetry, span context, sampling
Distributed tracing reconstructs one logical operation across N services as a trace made of spans linked by a shared trace_id.
Core vocabulary
- Trace: the whole request, represented as one tree of spans.
- Span: one unit of work (HTTP call, DB query). Has `span_id`, `parent_span_id`, start/end time, and attributes.
- Context propagation: the W3C `traceparent` header carries `trace_id` + `parent_span_id` (plus the sampling flag) between services.
- Sampler: decides which traces to keep. Head-based (decide at trace start) vs. tail-based (decide after observing the whole trace).
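To make the `traceparent` layout concrete, here is a minimal sketch that parses one header value with the .NET base class library (.NET 5 or later assumed). The header value is the example used in the W3C Trace Context specification; the `TraceParentDemo` class and its `Parse` helper are hypothetical names chosen for illustration.

```csharp
using System.Diagnostics;

public static class TraceParentDemo
{
    // Example value from the W3C Trace Context specification:
    // version - trace_id - parent span id - trace flags
    public const string Sample = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

    // Returns null when the header is missing or malformed.
    public static (string TraceId, string ParentSpanId, bool Sampled)? Parse(string traceparent)
    {
        if (!ActivityContext.TryParse(traceparent, null, out ActivityContext ctx))
            return null;

        // The low bit of the flags byte marks the trace as sampled ("recorded").
        bool sampled = (ctx.TraceFlags & ActivityTraceFlags.Recorded) != 0;
        return (ctx.TraceId.ToHexString(), ctx.SpanId.ToHexString(), sampled);
    }
}
```

Calling `TraceParentDemo.Parse(TraceParentDemo.Sample)` yields the 32-character trace ID, the 16-character parent span ID, and `Sampled = true`. In an instrumented ASP.NET Core service this parsing happens automatically; the sketch only shows what the header contains.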
OpenTelemetry — .NET setup
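```csharp
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("orders-api"))   // service name reported with every span
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()                   // incoming HTTP requests
        .AddHttpClientInstrumentation()                   // outgoing HTTP calls
        .AddEntityFrameworkCoreInstrumentation()          // EF Core queries
        // 4317 is the default OTLP/gRPC port
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")))
    .WithMetrics(m => m.AddAspNetCoreInstrumentation().AddRuntimeInstrumentation());
```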
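Spans are emitted automatically for incoming HTTP, outgoing HTTP, and EF Core queries. To add your own, start an activity from an `ActivitySource` whose name is registered with `.AddSource(...)` inside `WithTracing`; otherwise `StartActivity` returns null and nothing is exported:

```csharp
using var activity = MyActivitySource.StartActivity("ChargeCard");
// activity is null when no listener samples this source, hence the null-conditional calls.
activity?.SetTag("order.id", orderId);
activity?.SetTag("amount.usd", amount);
```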
Sampling — the cost knob
Sampling 100% of traces is usually unaffordable at scale. Common configurations:
| Strategy | Pros | Cons |
|---|---|---|
| Always-on | Full fidelity | High storage cost |
| Probabilistic 1% | Cheap | Rare bugs invisible |
| Parent-based | Decision propagates, consistent traces | Trace-start service decides for the whole graph |
| Tail sampling | Keep all errors + a % of success | Requires collector with buffering memory |
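The probabilistic and parent-based rows can be sketched as plain decision functions. This is a simplified illustration, not the OpenTelemetry SDK's implementation (the SDK ships its own `TraceIdRatioBasedSampler` and `ParentBasedSampler`); `HeadSamplingSketch` and both method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class HeadSamplingSketch
{
    // Ratio decision derived from the trace ID, so the answer is
    // repeatable for a given trace no matter which service computes it.
    public static bool RatioDecision(ActivityTraceId traceId, double ratio)
    {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;

        // Interpret the first 16 hex characters (8 bytes) as an unsigned integer.
        ulong value = Convert.ToUInt64(traceId.ToHexString().Substring(0, 16), 16);
        return value < (ulong)(ratio * ulong.MaxValue);
    }

    // Parent-based: follow the caller's decision when there is one,
    // fall back to the ratio decision only at the start of a trace.
    public static bool ParentBasedDecision(ActivityContext? parent, ActivityTraceId traceId, double rootRatio)
    {
        if (parent is null) return RatioDecision(traceId, rootRatio);
        return (parent.Value.TraceFlags & ActivityTraceFlags.Recorded) != 0;
    }
}
```

Because a downstream service only reads the parent's sampled flag, a trace is either kept in every service or dropped in every service, which is the "consistent traces" benefit listed in the table.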
What good looks like
- Errors and slow traces: always keep.
- Successful sub-100 ms traces: sample 1 to 5%.
- Add `enduser.id`, `db.statement`, and `http.status_code` as standard attributes.
- Limit cardinality on tag values (never use `userId` as a metric label; it is fine as a span attribute).
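The keep/sample rules above amount to a tail-sampling policy, evaluated once the whole trace has been observed. The sketch below expresses that policy as a pure function; in practice the rule is usually configured in a collector rather than written in application code. `SpanSummary`, `TailSamplingSketch`, and the injected `roll` parameter (a random number in [0, 1), passed in to keep the function deterministic) are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal per-span facts the policy needs.
public sealed record SpanSummary(bool IsError, TimeSpan Duration);

public static class TailSamplingSketch
{
    // 100 ms is the "slow" threshold used in the list above.
    private static readonly TimeSpan SlowThreshold = TimeSpan.FromMilliseconds(100);

    public static bool Keep(IReadOnlyList<SpanSummary> spans, double successRatio, double roll)
    {
        bool hasError = spans.Any(s => s.IsError);
        bool isSlow = spans.Any(s => s.Duration >= SlowThreshold);

        // Errors and slow traces are always kept.
        if (hasError || isSlow) return true;

        // Fast, successful traces are kept with probability successRatio.
        return roll < successRatio;
    }
}
```

A caller would pass something like `Random.Shared.NextDouble()` as `roll` and a `successRatio` between 0.01 and 0.05 to match the 1 to 5% guideline.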
Common pitfalls
- Lost trace context across async boundaries: `Activity.Current` flows with the async execution context, but background tasks and queued work need the context captured explicitly and passed in as the parent.
- Logs without `trace_id`: instrument your logger to enrich every entry with the active trace and span IDs so logs and traces correlate.
- Trace storage explosion: span attributes are cheap individually, lethal in aggregate. Set a budget per service.
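The first pitfall can be illustrated with a minimal sketch of explicit context capture for queued work, assuming .NET 5 or later (W3C ID format by default). The source name `Demo.QueuedWork`, the class name, and the in-memory queue are hypothetical; the local `ActivityListener` stands in for the OpenTelemetry SDK, which registers its own listener in a real service.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class QueuedWorkContextSketch
{
    private static readonly ActivitySource Source = new("Demo.QueuedWork");

    public static (string RequestTraceId, string WorkerTraceId) Run()
    {
        // Without a listener, StartActivity returns null.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Demo.QueuedWork",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
        };
        ActivitySource.AddActivityListener(listener);

        var queue = new Queue<ActivityContext>();
        string requestTraceId;

        using (var request = Source.StartActivity("HandleRequest"))
        {
            requestTraceId = request!.TraceId.ToHexString();
            // Capture the context and store it with the work item.
            queue.Enqueue(request.Context);
        }

        // Later, on the worker: the request activity has ended, so
        // Activity.Current no longer points at it. Pass the captured
        // context as the explicit parent instead.
        ActivityContext parent = queue.Dequeue();
        using var work = Source.StartActivity("ProcessQueuedItem", ActivityKind.Consumer, parent);
        return (requestTraceId, work!.TraceId.ToHexString());
    }
}
```

Both returned trace IDs are equal, so the queued work appears in the same trace as the request that scheduled it. The same `Activity.Current?.TraceId` and `SpanId` values are what a logging enricher would attach to each log entry.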