Observability — OpenTelemetry traces
Linkworld instruments the platform with OpenTelemetry (OTel) distributed tracing. When activated, every HTTP request becomes a root span; every LLM call, tool dispatch, and downstream HTTP call nests under it. Traces export over OTLP/HTTP to whichever backend you point them at.
This sits alongside the existing Prometheus metrics — the metrics path (counters, histograms, dashboards) is unchanged. OTel adds per-request drill-down for debugging “what happened in that specific request.”
Activate
Set environment variables on the platform process. Tracing stays inert until the required endpoint variable is set.

    # Required: the OTLP endpoint to send to.
    export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.your-backend.example/v1/traces

    # Optional: auth headers your backend requires.
    export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer xyz123"

    # Optional: service name shown in the backend (default: linkworld-core).
    export OTEL_SERVICE_NAME=linkworld-prod

    # Optional: extra resource attrs (deployment.environment, etc.).
    export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,service.version=0.x.x"

Restart the API container; the lifespan boots the OTel SDK and auto-instruments FastAPI + httpx on startup. Logs print otel: enabled (service=…, endpoint=…).
To stop emitting, unset OTEL_EXPORTER_OTLP_ENDPOINT and restart.
There’s no runtime toggle — the SDK initializes once per process.
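As a rough illustration of the gating described above, the sketch below reads the same environment variables and returns nothing when the endpoint is unset. It is not the platform's actual startup code: the names `load_otel_config` and `parse_kv_list` and the shape of the returned dict are assumptions made for this example, and a real setup would pass these values on to the OTel SDK.

```python
import os
from typing import Mapping, Optional


def parse_kv_list(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' (the format used by OTEL_EXPORTER_OTLP_HEADERS
    and OTEL_RESOURCE_ATTRIBUTES) into a dict."""
    pairs: dict[str, str] = {}
    for item in raw.split(","):
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = item.partition("=")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


def load_otel_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return exporter settings, or None when tracing should stay inert.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        return None  # no endpoint set: skip SDK initialization entirely

    resource = parse_kv_list(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    service_name = env.get("OTEL_SERVICE_NAME", "").strip()
    if service_name:
        resource["service.name"] = service_name
    else:
        # Default service name taken from the documentation above.
        resource.setdefault("service.name", "linkworld-core")

    return {
        "endpoint": endpoint,
        "headers": parse_kv_list(env.get("OTEL_EXPORTER_OTLP_HEADERS", "")),
        "resource": resource,
    }
```

Calling `load_otel_config()` with no arguments reads the live process environment; passing a plain dict instead makes the gating easy to exercise in a test.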
What gets traced
| Layer | Span name | Attributes |
|---|---|---|
| HTTP request | <HTTP method> <route> | http.method, http.route, http.status_code (auto via FastAPIInstrumentor) |
| LLM call | chat <model> | gen_ai.system, gen_ai.operation.name, gen_ai.request.model, gen_ai.request.max_tokens, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.model, gen_ai.linkworld.role, gen_ai.linkworld.cache.{read,create}_tokens |
| Platform tool dispatch | tool <tool_name> | linkworld.tool.name, linkworld.tenant_id, linkworld.app_id, linkworld.tool.blocked (on Security Gate denial), linkworld.tool.block_reason, linkworld.tool.error |
| Outbound HTTP | <HTTP method> | Standard semconv (auto via HTTPXClientInstrumentor) — covers Anthropic, OpenAI, Microsoft Graph, etc. |
LLM spans follow the OpenTelemetry GenAI semantic conventions so GenAI-aware backends (Langfuse, Arize Phoenix, Datadog LLM Observability) recognize them as LLM calls and render token / cost / latency dashboards automatically.
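To make the LLM row of the table concrete, here is a minimal sketch that assembles the span name and attribute set for one call. The helper name `llm_span` and its parameters are hypothetical, and the `"chat"` value for `gen_ai.operation.name` is inferred from the span name rather than confirmed by the platform source. Only the attribute keys come from the table.

```python
from typing import Optional


def llm_span(model: str, system: str, role: str, max_tokens: int,
             input_tokens: int, output_tokens: int,
             response_model: Optional[str] = None,
             cache_read_tokens: int = 0,
             cache_create_tokens: int = 0) -> tuple[str, dict]:
    """Build (span_name, attributes) for one LLM call.

    Hypothetical helper for illustration; attribute keys mirror the table.
    """
    name = f"chat {model}"  # span name pattern from the table: chat <model>
    attributes = {
        "gen_ai.system": system,
        "gen_ai.operation.name": "chat",  # assumed value, see lead-in
        "gen_ai.request.model": model,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Fall back to the requested model when the provider reports none.
        "gen_ai.response.model": response_model or model,
        "gen_ai.linkworld.role": role,
        "gen_ai.linkworld.cache.read_tokens": cache_read_tokens,
        "gen_ai.linkworld.cache.create_tokens": cache_create_tokens,
    }
    return name, attributes
```

Note that every value here is a model identifier, a role label, or a count; nothing in the set carries prompt or response text.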
Backends that work
The OTLP/HTTP exporter is vendor-neutral. Tested with:
- Langfuse: set OTEL_EXPORTER_OTLP_ENDPOINT=https://us.cloud.langfuse.com/api/public/otel/v1/traces and the auth header. Renders LLM token usage + cost.
- Arize Phoenix: local self-hosted, localhost:6006/v1/traces.
- Grafana Tempo: https://tempo.your-grafana.example/otlp/v1/traces.
- Google Cloud Observability: needs the google-cloud-trace exporter rather than OTLP/HTTP, but the GenAI attrs render natively in the GCP trace UI.
- Honeycomb / Datadog / New Relic: native OTLP support.
You can run multiple backends side-by-side by placing an OpenTelemetry Collector in front of them and fanning out in its config; the platform itself only knows about one OTLP endpoint.
Retention and PII
The platform does not put prompt or response text into spans, only operational attributes (model, token counts, cache hits, agent role, tool name, status). That’s deliberate, so traces stay safe to ship to third-party SaaS without exfiltrating tenant content.
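One way to picture this policy is an allowlist applied before attributes reach a span. The sketch below is purely illustrative: `SAFE_SPAN_KEYS` and `scrub_span_attributes` are hypothetical names, the key set is copied from the LLM and tool rows of the table above, and nothing here is confirmed to match how the platform enforces the rule internally.

```python
from typing import Any, Mapping

# Operational keys taken from the LLM and tool rows of the table above.
SAFE_SPAN_KEYS = frozenset({
    "gen_ai.system",
    "gen_ai.operation.name",
    "gen_ai.request.model",
    "gen_ai.request.max_tokens",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.response.model",
    "gen_ai.linkworld.role",
    "gen_ai.linkworld.cache.read_tokens",
    "gen_ai.linkworld.cache.create_tokens",
    "linkworld.tool.name",
    "linkworld.tenant_id",
    "linkworld.app_id",
    "linkworld.tool.blocked",
    "linkworld.tool.block_reason",
    "linkworld.tool.error",
})


def scrub_span_attributes(attributes: Mapping[str, Any]) -> dict[str, Any]:
    """Drop every attribute that is not on the operational allowlist.

    Hypothetical helper: unknown keys (for example anything carrying
    prompt or response text) are removed rather than passed through.
    """
    return {k: v for k, v in attributes.items() if k in SAFE_SPAN_KEYS}
```

An allowlist fails closed: a newly introduced attribute stays out of exported traces until someone deliberately adds its key.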
If you want prompt/response capture for a specific tenant during debugging, the recommended approach is to use Langfuse / Arize client SDKs directly inside the LLM client behind a feature flag — not the OTel transport. (Future work; not shipped.)
Trace context propagation
The platform respects W3C traceparent headers on inbound HTTP
requests, so traces from upstream services (e.g. a tenant-shell
gateway) thread into platform spans. App containers (Phase 2 Open
App Platform) currently don’t propagate traceparent — that’s
planned.
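For reference, a version-00 traceparent value in the W3C Trace Context format has four dash-separated lowercase hex fields: version, trace ID, parent span ID, and flags. The sketch below parses one by hand to show what an upstream gateway needs to send. The platform presumably gets this behavior from the OTel instrumentation rather than parsing headers itself, so `parse_traceparent` and `ParentContext` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class ParentContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[ParentContext]:
    """Parse a version-00-shaped traceparent header, or return None.

    Strict sketch: anything that does not match the four-field layout
    is rejected, as are the all-zero IDs and the reserved version 'ff'.
    """
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the 'sampled' flag.
    return ParentContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

When a header is missing or malformed, returning `None` corresponds to starting a fresh trace, so the request still gets a root span of its own.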
Future work
- Propagate traceparent through the MCP / cross-app bus so app handler spans nest under the originating platform request.
- Optional prompt/response capture behind a per-tenant flag for debugging (off by default; PII review required).
- Heartbeat dispatcher span (cron-like; would need an artificial root span per fire — design open).
- Migrate Prometheus metrics to OTel Metrics SDK while keeping Prometheus as the primary backend (Phase B; not started).