Observability — OpenTelemetry traces

Linkworld instruments the platform with OpenTelemetry (OTel) distributed tracing. When activated, every HTTP request becomes a root span; every LLM call, tool dispatch, and downstream HTTP call nests under it. Traces export over OTLP/HTTP to whichever backend you point them at.

This sits alongside the existing Prometheus metrics — the metrics path (counters, histograms, dashboards) is unchanged. OTel adds per-request drill-down for debugging “what happened in that specific request.”

Activate

Set the following environment variables on the platform process. Tracing stays inert until OTEL_EXPORTER_OTLP_ENDPOINT is set.

# Required — the OTLP endpoint to send to.
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.your-backend.example/v1/traces

# Optional — auth headers your backend requires.
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer xyz123"

# Optional — service name shown in the backend (default: linkworld-core).
export OTEL_SERVICE_NAME=linkworld-prod

# Optional — extra resource attrs (deployment.environment, etc.).
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,service.version=0.x.x"

Restart the API container. On startup, the lifespan handler boots the OTel SDK and auto-instruments FastAPI and httpx. The logs then print otel: enabled (service=…, endpoint=…).

To stop emitting, unset OTEL_EXPORTER_OTLP_ENDPOINT and restart. There’s no runtime toggle — the SDK initializes once per process.
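The activation rule above (inert unless the endpoint is set, with linkworld-core as the default service name) can be sketched in Python. This is a simplified illustration rather than the platform's actual startup code: `OtelConfig`, `load_otel_config`, and `parse_pairs` are hypothetical names, and the real OTel SDK applies its own, stricter parsing to the header and resource-attribute variables.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional


def parse_pairs(raw: str) -> Dict[str, str]:
    """Parse a comma-separated key=value list such as "a=1,b=2".

    Simplified: no percent-decoding, and items without "=" are skipped.
    """
    pairs: Dict[str, str] = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


@dataclass(frozen=True)
class OtelConfig:
    """Hypothetical container for the settings read at process start."""
    endpoint: str
    service_name: str
    headers: Dict[str, str] = field(default_factory=dict)
    resource_attributes: Dict[str, str] = field(default_factory=dict)


def load_otel_config(env: Mapping[str, str] = os.environ) -> Optional[OtelConfig]:
    """Return the tracing config, or None when tracing should stay inert."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        return None  # endpoint unset: no SDK init, no spans exported
    return OtelConfig(
        endpoint=endpoint,
        service_name=env.get("OTEL_SERVICE_NAME", "linkworld-core"),
        headers=parse_pairs(env.get("OTEL_EXPORTER_OTLP_HEADERS", "")),
        resource_attributes=parse_pairs(env.get("OTEL_RESOURCE_ATTRIBUTES", "")),
    )
```

Because the config is read once when the process starts, changing any of these variables only takes effect after a restart, which matches the lack of a runtime toggle.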

What gets traced

Layer	Span name	Attributes
HTTP request	`<HTTP method> <route>`	`http.method`, `http.route`, `http.status_code` (auto via FastAPIInstrumentor)
LLM call	`chat <model>`	`gen_ai.system`, `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.request.max_tokens`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.response.model`, `gen_ai.linkworld.role`, `gen_ai.linkworld.cache.{read,create}_tokens`
Platform tool dispatch	`tool <tool_name>`	`linkworld.tool.name`, `linkworld.tenant_id`, `linkworld.app_id`, `linkworld.tool.blocked` (on Security Gate denial), `linkworld.tool.block_reason`, `linkworld.tool.error`
Outbound HTTP	`<HTTP method>`	Standard semconv (auto via HTTPXClientInstrumentor) — covers Anthropic, OpenAI, Microsoft Graph, etc.

LLM spans follow the OpenTelemetry GenAI semantic conventions so GenAI-aware backends (Langfuse, Arize Phoenix, Datadog LLM Observability) recognize them as LLM calls and render token / cost / latency dashboards automatically.
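To make the LLM row of the table concrete, here is a minimal sketch of how the span name and attribute set for one call could be assembled. The attribute keys and the `chat <model>` naming come from the table above; the function name `build_llm_span`, its parameters, and the example values are assumptions for illustration, not the platform's code.

```python
from typing import Dict, Optional, Tuple, Union

AttrValue = Union[str, int]


def build_llm_span(
    system: str,
    request_model: str,
    max_tokens: int,
    input_tokens: int,
    output_tokens: int,
    response_model: Optional[str] = None,
    role: Optional[str] = None,
    cache_read_tokens: int = 0,
    cache_create_tokens: int = 0,
) -> Tuple[str, Dict[str, AttrValue]]:
    """Return (span_name, attributes) for one LLM call (hypothetical helper)."""
    attrs: Dict[str, AttrValue] = {
        "gen_ai.system": system,
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": request_model,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.linkworld.cache.read_tokens": cache_read_tokens,
        "gen_ai.linkworld.cache.create_tokens": cache_create_tokens,
    }
    # Optional attributes are only set when the caller knows them.
    if response_model is not None:
        attrs["gen_ai.response.model"] = response_model
    if role is not None:
        attrs["gen_ai.linkworld.role"] = role
    return f"chat {request_model}", attrs


# Example values ("example-model", "planner") are placeholders.
name, attrs = build_llm_span(
    system="anthropic",
    request_model="example-model",
    max_tokens=1024,
    input_tokens=900,
    output_tokens=120,
    role="planner",
)
```

Note that nothing in the attribute set carries prompt or completion text; only counts and identifiers are recorded.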

Backends that work

The OTLP/HTTP exporter is vendor-neutral. Tested with:

Langfuse: set OTEL_EXPORTER_OTLP_ENDPOINT=https://us.cloud.langfuse.com/api/public/otel/v1/traces and the auth header. Renders LLM token usage and cost.
Arize Phoenix: local self-hosted, http://localhost:6006/v1/traces.
Grafana Tempo: https://tempo.your-grafana.example/otlp/v1/traces.
Google Cloud Observability: needs the google-cloud-trace exporter rather than OTLP/HTTP, but the GenAI attributes render natively in the GCP trace UI.
Honeycomb / Datadog / New Relic: native OTLP support.

The platform only knows about one OTLP endpoint. To send traces to several backends side by side, put an OpenTelemetry Collector in front of them, point the platform at the Collector, and fan out in the Collector's config.

Retention and PII

The platform does not put prompt or response text into spans; it records only operational attributes (model, token counts, cache hits, agent role, tool name, status). That’s deliberate, so traces stay safe to ship to third-party SaaS without exfiltrating tenant content.
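One way to enforce an "operational attributes only" rule is an explicit allowlist applied before attributes are attached to a span. The sketch below assumes that approach; the source does not say how the platform enforces it. The key set is taken from the table under "What gets traced", while `OPERATIONAL_KEYS`, `operational_attributes`, and the `gen_ai.prompt` / `gen_ai.completion` keys in the example are hypothetical.

```python
from typing import Dict, Mapping, Union

AttrValue = Union[str, int, bool]

# Exact keys allowed onto LLM and tool spans (assumed allowlist).
OPERATIONAL_KEYS = frozenset({
    "gen_ai.system",
    "gen_ai.operation.name",
    "gen_ai.request.model",
    "gen_ai.request.max_tokens",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.response.model",
    "gen_ai.linkworld.role",
    "gen_ai.linkworld.cache.read_tokens",
    "gen_ai.linkworld.cache.create_tokens",
    "linkworld.tool.name",
    "linkworld.tenant_id",
    "linkworld.app_id",
    "linkworld.tool.blocked",
    "linkworld.tool.block_reason",
    "linkworld.tool.error",
})


def operational_attributes(candidate: Mapping[str, AttrValue]) -> Dict[str, AttrValue]:
    """Keep only allowlisted keys; anything else (e.g. prompt text) is dropped."""
    return {k: v for k, v in candidate.items() if k in OPERATIONAL_KEYS}


# Example: content-bearing keys never reach the span.
safe = operational_attributes({
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 900,
    "gen_ai.prompt": "tenant text that must not leave the platform",
    "gen_ai.completion": "model output that must not leave the platform",
})
```

An exact-key allowlist is stricter than a prefix match: a prefix such as `gen_ai.` would also admit content-bearing keys that share the namespace.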

If you want prompt/response capture for a specific tenant during debugging, the recommended approach is to use Langfuse / Arize client SDKs directly inside the LLM client behind a feature flag — not the OTel transport. (Future work; not shipped.)

Trace context propagation

The platform respects W3C traceparent headers on inbound HTTP requests, so traces from upstream services (e.g. a tenant-shell gateway) thread into platform spans. App containers (Phase 2 Open App Platform) currently don’t propagate traceparent — that’s planned.
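For reference, a W3C traceparent header has four dash-separated lowercase hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters). The sketch below parses that layout using only the standard library. It is for illustration; in the platform the OTel instrumentation handles propagation, and `TraceParent` and `parse_traceparent` are hypothetical names. The parser is deliberately strict and accepts only the four-field layout.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


# Example header value from the W3C Trace Context specification.
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0eb0736d-00f067aa0ba902b7-01")
```

A span created for the inbound request reuses `trace_id` and records `parent_id` as its parent, which is what lets an upstream gateway's trace thread into platform spans.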

Future work

Propagate traceparent through the MCP / cross-app bus so app handler spans nest under the originating platform request.
Optional prompt/response capture behind a per-tenant flag for debugging (off by default; PII review required).
Heartbeat dispatcher span (cron-like; would need an artificial root span per fire — design open).
Migrate Prometheus metrics to OTel Metrics SDK while keeping Prometheus as the primary backend (Phase B; not started).