Observability: Logs, Metrics, and Traces — What to Instrument

When production breaks at 2 a.m., the only thing that matters is how fast you can answer what’s wrong and why. That answer comes from observability — and a system you can’t observe is one you’re operating blind. Knowing what to instrument (not just which tool to buy) is a core senior-backend skill.

The problem

A request is failing. Is it the database, a downstream service, a bad deploy, a single sick instance? Without the right signals you’re reduced to guessing and redeploying. With them, you go from symptom to cause in minutes.

How to approach it

Lean on the three pillars, each answering a different question:

flowchart TB
    Logs[Logs<br/>what happened] --> Insight[Fast root cause]
    Metrics[Metrics<br/>how much / how often] --> Insight
    Traces[Traces<br/>where the time went] --> Insight

  • Logs: discrete events. What happened, exactly?
  • Metrics: aggregates over time. How often, how fast, how many?
  • Traces: a request’s path across services. Where did the time or the error go?
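To make the third pillar concrete, here is a toy, single-process sketch of what a trace records: named spans that share one trace ID, each with a parent and a duration. It is not OpenTelemetry and not a real tracer; `Tracer`, `Span`, and `slowest_leaf` are hypothetical names invented for illustration, and using span names as identifiers is a simplification that only works when names are unique.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str            # shared by every span of one request
    name: str                # doubles as the span's ID in this toy version
    parent: Optional[str]    # name of the enclosing span, None for the root
    duration_ms: float = 0.0


class Tracer:
    """Toy tracer: records nested spans for a single request in one process."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[str] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(self.trace_id, name, parent)
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the duration even if the wrapped work raised.
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(current)


def slowest_leaf(spans: List[Span]) -> Span:
    """Return the slowest span that has no children: 'where the time went'."""
    parents = {s.parent for s in spans}
    leaves = [s for s in spans if s.name not in parents]
    return max(leaves, key=lambda s: s.duration_ms)


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("gateway"):
        with tracer.span("auth"):
            time.sleep(0.01)   # stand-in for a fast downstream call
        with tracer.span("db"):
            time.sleep(0.05)   # stand-in for a slow query
    print(slowest_leaf(tracer.spans).name)
```

In a real system the trace ID would travel between services in request headers and the spans would be exported to a tracing backend; the sketch only shows the shape of the data that makes "which hop was slow?" answerable.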

What tech to use where

  • Structured logging. Emit JSON with consistent fields, not free text. A correlation ID per request, propagated across services, lets you stitch one request’s whole story together. Centralize logs so they’re searchable (an ELK-style stack, as I used on Study Giveaway, makes this practical).
  • Metrics that matter. For request-driven services, track RED — Rate, Errors, Duration. For resources, USE — Utilization, Saturation, Errors. Alert on these, not on vanity counts.
  • Distributed tracing. In a microservice system, a single request crosses many services; tracing (OpenTelemetry-style) shows exactly which hop was slow or failed — invaluable for a gateway-fronted system like SHOB.COM.BD.
  • Actionable alerts. Alert on user-facing symptoms (error rate, latency, SLO burn), with enough context to act — not on every metric.
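As a minimal sketch of the structured-logging bullet, the snippet below emits one JSON object per log line and attaches a correlation ID to every record using only the Python standard library. The header name `X-Correlation-ID`, the logger name `checkout`, and the field names are assumptions chosen for illustration, not something a particular stack requires.

```python
import contextvars
import json
import logging
import sys
import uuid

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a consistent set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Optional structured context passed as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, headers: dict) -> dict:
    """Reuse the caller's correlation ID or mint one; return headers for downstream calls."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info(
            "payment declined",
            extra={"fields": {"order_id": 1234, "reason": "card_expired"}},
        )
        # Forward the same ID so the next service logs under it too.
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    handle_request(log, {"X-Correlation-ID": "abc123"})
```

Keeping the ID in a `ContextVar` means code deep in the call stack can log without having the ID passed to it explicitly, and searching the centralized logs for that one value returns the request's whole story.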

Pitfalls to watch for

  • Logging everything or nothing. Noise hides signal and costs money; silence leaves you blind. Log decisions and errors with context.
  • No correlation IDs. Without them you can’t follow one request across services, and it is one of the most common gaps.
  • Alert fatigue. Alerts that don’t require action get ignored, including the real one.
  • Vanity metrics. “Total requests” looks nice but rarely tells you something’s wrong.
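The difference between a vanity count and a RED-style signal can be sketched in a few lines. The example below is an illustration only: `Request`, `red_summary`, `should_page`, and the thresholds (1% errors, 500 ms p95) are hypothetical, and a real setup would compute these in a metrics system instead of over an in-memory list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int


def red_summary(requests: List[Request], window_seconds: int) -> Dict[str, float]:
    """Rate, Errors, Duration for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    if total:
        # Nearest-rank p95, computed with integer math to avoid float rounding.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


def should_page(summary: Dict[str, float],
                max_error_ratio: float = 0.01,   # assumed threshold
                max_p95_ms: float = 500.0) -> bool:  # assumed threshold
    """Page only on user-facing symptoms: too many errors or too much latency."""
    return summary["error_ratio"] > max_error_ratio or summary["p95_ms"] > max_p95_ms


if __name__ == "__main__":
    healthy = [Request(100.0, 200)] * 100
    degraded = [Request(100.0, 200)] * 97 + [Request(900.0, 503)] * 3
    # Both windows have the same "total requests"; only one should page.
    print(should_page(red_summary(healthy, 60)), should_page(red_summary(degraded, 60)))
```

Both windows report 100 total requests, which is why that number alone says little; the error ratio is what separates them.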

Takeaways

Instrument deliberately: structured logs with correlation IDs, RED/USE metrics, and distributed traces across service boundaries — then alert only on what’s actionable. A real-time security system like Data Citadel or a high-traffic marketplace like SHOB lives or dies on this: you can’t respond to what you can’t see.

This post is licensed under CC BY 4.0 by the author.