SG · 01° 17′ N
§ F03 · Field Note PUBLISHED

Zero-Overhead Observability — Designing Debuggability Without Shipping a Monitoring Tax

The dominant observability story right now is: install an agent on every host, run a sidecar in every pod, ship logs to a vendor that charges by the gigabyte, and pray the bill stays inside your budget. Every layer of that story is a tax on production. We pay five to fifteen percent of our compute budget for the privilege of knowing roughly what just happened, and most of the data we collect gets sampled out before anyone reads it.

There’s a better model. Treat debuggability as a property of the code, not a layer bolted on top. Design your services so that when something breaks, the answer is already in the artifacts you’d produce anyway.

What the tax actually buys

Run a flame graph on a typical APM-instrumented service and you’ll see the agent doing real work: propagating context, sampling spans, serializing payloads, pushing them over the network. None of that work helps your users. It helps you, sometimes, when something goes wrong. The peak value of that help is high. The average value is not.

Worse, most agents collect data that gets aggressively sampled at ingest. The high-cardinality fields you’d actually want during an incident — the customer ID, the request shape, the upstream service — are exactly what gets dropped.

The alternative: structured events as the source of truth

We design every service to emit a small number of structured events at every meaningful state transition. One event per request, one per state change, one per significant decision. Each event is a JSON record with a stable schema, a request ID, and the high-cardinality fields we care about.
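To make the shape concrete, here is a minimal sketch of one such record in Python. The helper name make_event, the schema_version field, and the sample values (cus_123, payments-gateway) are our own illustrations, not a fixed standard; the point is only the envelope: a stable name, a timestamp, a request ID, and whatever high-cardinality context the event needs.

```python
import json
import time
import uuid


def make_event(name, request_id, **fields):
    """Build one structured event as a plain dict with a fixed envelope."""
    return {
        **fields,                  # high-cardinality context (customer, upstream, ...)
        # Envelope keys come last so a stray field can never overwrite them.
        "event": name,             # stable, reviewed event name
        "schema_version": 1,       # hypothetical versioning field
        "ts": time.time(),         # Unix timestamp in seconds
        "request_id": request_id,  # shared by every event of the same request
    }


request_id = str(uuid.uuid4())
event = make_event(
    "billing.charge.attempted",
    request_id,
    customer_id="cus_123",          # illustrative values only
    amount_cents=4200,
    upstream="payments-gateway",
)
line = json.dumps(event, sort_keys=True)  # one JSON record per line
print(line)
```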

That’s it. No spans, no traces, no agents. The events are the trace. When something breaks, we query them as a stream.

The rules:

  • Stable names. Every event has a name like request.received or billing.charge.attempted. The names are versioned and reviewed. They don’t change without a migration.
  • Correlation IDs propagate. A request ID enters at the edge and is stamped on every downstream event. Joining is a database query, not a tracing system.
  • High cardinality is fine. Customer IDs, request fingerprints, model versions — all of it goes in. The backend is a column store; cardinality is cheap.
  • Sampling is per-request, not per-event. Either you keep every event for a request or you keep none of them. Half a request history is useless.
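One way to honor the sampling rule above without any coordination between services is to derive the keep-or-drop decision from the request ID itself. The sketch below is an assumption on our part (the function name keep_request and the choice of SHA-256 are illustrative): every service that sees the same request ID and the same rate reaches the same answer, so a request is either kept everywhere or dropped everywhere.

```python
import hashlib


def keep_request(request_id: str, sample_rate: float) -> bool:
    """Decide once per request, deterministically, whether to keep its events."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the request if its bucket falls below the rate-scaled threshold.
    return bucket < int(sample_rate * 2**64)


# The same ID always yields the same decision, in any process on any host.
print(keep_request("req-42", 0.25) == keep_request("req-42", 0.25))
```

Because the decision depends only on the ID, changing the rate is a config change, not a protocol change. The trade is that all services must agree on the rate for a given class of request.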

What you give up

Auto-instrumentation, mostly. With this model, an event has to be written deliberately, by a human or by a coding agent (not a monitoring agent). You don’t get a free flame graph of every function call.

We think that’s the right trade. Free flame graphs are too cheap to be useful — they show you everything, which is the same as showing you nothing. A small number of well-chosen events tells the story of a request in a way you can read.

The substrate

You need three things and only three:

  1. A logger in each service that produces structured JSON with a stable schema. This is about twenty lines of code.
  2. A queryable backend. ClickHouse, DuckDB, even a partitioned Parquet bucket with Athena. Whatever you can SQL.
  3. A propagation convention for the request ID. One header, inherited everywhere.
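Items 1 and 3 fit together in a few lines. What follows is a sketch under our own assumptions, not a reference implementation: the header name X-Request-ID and the function names begin_request, outbound_headers, and emit are hypothetical, and the headers argument is treated as a plain dict (real HTTP headers are case-insensitive, which your framework normally handles).

```python
import contextvars
import json
import sys
import time
import uuid

# Hypothetical header name; the convention only requires that there is one.
REQUEST_ID_HEADER = "X-Request-ID"

# Holds the current request ID for this thread or async task.
_request_id = contextvars.ContextVar("request_id", default=None)


def begin_request(headers):
    """Adopt the caller's request ID, or mint a new one at the edge."""
    rid = headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
    _request_id.set(rid)
    return rid


def outbound_headers():
    """Headers to attach to every downstream call made for this request."""
    return {REQUEST_ID_HEADER: _request_id.get()}


def emit(name, stream=None, **fields):
    """Write one event as a single JSON line, stamped with the request ID."""
    record = {**fields, "event": name, "ts": time.time(),
              "request_id": _request_id.get()}
    out = stream if stream is not None else sys.stdout
    out.write(json.dumps(record, sort_keys=True) + "\n")
    return record


begin_request({"X-Request-ID": "req-42"})
emit("request.received", path="/charge", customer_id="cus_123")
```

Using a context variable means call sites never pass the ID by hand: anything that calls emit inside the request inherits it.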

That’s the entire observability stack. No agent, no sidecar, no vendor tier. Production runs at native speed. Incident resolution is a series of SQL queries against the events you would have written anyway.
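Here is roughly what that looks like during an incident. The sketch uses Python's built-in sqlite3 as a stand-in for whatever backend you actually run, and the table layout, the status column, the request.completed event name, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real event store
conn.execute(
    "CREATE TABLE events "
    "(ts REAL, request_id TEXT, event TEXT, customer_id TEXT, status TEXT)"
)
# Invented sample rows: two requests, one of which fails at billing.
rows = [
    (1.0, "req-1", "request.received", "cus_123", None),
    (1.2, "req-1", "billing.charge.attempted", "cus_123", "declined"),
    (1.3, "req-1", "request.completed", "cus_123", "error"),
    (2.0, "req-2", "request.received", "cus_456", None),
    (2.1, "req-2", "billing.charge.attempted", "cus_456", "ok"),
    (2.2, "req-2", "request.completed", "cus_456", "ok"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

# Query 1: which requests ended in an error?
failed = [r[0] for r in conn.execute(
    "SELECT request_id FROM events "
    "WHERE event = 'request.completed' AND status = 'error'"
)]

# Query 2: replay one failed request, in time order, by its request ID.
timeline = conn.execute(
    "SELECT ts, event, status FROM events WHERE request_id = ? ORDER BY ts",
    (failed[0],),
).fetchall()

for ts, event, status in timeline:
    print(ts, event, status)
```

The second query is the whole "trace": every event carrying the same request ID, sorted by time.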

The principle

Observability that costs nothing in production isn’t a thing you buy. It’s a thing you design. If you can’t tell what your service is doing without strapping a monitor to it, that’s a code smell — not a monitoring problem.