Datadog LLM Observability

Datadog LLM Observability is whatifd’s fourth supported trace source, shipped in v0.3 as the optional whatifd-datadog package. whatifd reads previously ingested LLM spans from the LLM Observability Export API and projects each trace into the same RawTrace shape every other adapter produces, so you can fork Datadog-traced agent turns, replay them under a proposed change, and gate a PR on the verdict.

whatifd is read-only on Datadog. It never writes spans back.

What whatifd reads from Datadog

For each selected span (via GET /api/v2/llm-obs/v1/spans/events):

  • input / output — the agent’s input and original output, returned by the Export API as SearchedIO ({value, messages}). Wrapped as Sensitive[str] at the boundary (cardinal #5); value is preferred, falling back to concatenated message content.

  • trace_id + parent_id — group spans into traces and identify the root.

  • span_kind (agent / workflow / llm / tool / …) — identifies the root (orchestration kinds) and projects child spans, including tool spans, into the trace’s tool-call structure.

  • tags / other attributes — pass through to RawTrace.metadata; PII-registered keys are wrapped as Sensitive[str].
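As a rough illustration of the projection described above, the sketch below groups span dicts by trace_id, picks a root, and applies the value-then-messages fallback for input text. It is not the adapter's actual code: the helper names, the span_id field, and the exact set of orchestration kinds are assumptions made for the example.

```python
from collections import defaultdict

# Assumed set of orchestration kinds; the text above names agent and workflow.
ORCHESTRATION_KINDS = {"agent", "workflow"}


def io_text(io):
    """Prefer `value`; otherwise join the content of each message."""
    if not io:
        return ""
    if io.get("value"):
        return io["value"]
    return "\n".join(m.get("content", "") for m in io.get("messages") or [])


def group_by_trace(spans):
    """Bucket spans by trace_id, keeping input order within each trace."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def find_root(trace_spans):
    """Pick a span whose parent is not in the trace, preferring orchestration kinds."""
    ids = {s["span_id"] for s in trace_spans}  # span_id is an assumed field name
    candidates = [s for s in trace_spans if s.get("parent_id") not in ids]
    for span in candidates:
        if span.get("span_kind") in ORCHESTRATION_KINDS:
            return span
    return candidates[0] if candidates else None


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "span_kind": "agent",
     "input": {"value": "What is the weather?"}},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "span_kind": "tool",
     "input": {"messages": [{"content": "lookup"}, {"content": "Paris"}]}},
]
traces = group_by_trace(spans)
root = find_root(traces["t1"])
tool_spans = [s for s in traces["t1"] if s["span_kind"] == "tool"]
```

Treating "parent not present in the trace" as the root test, rather than checking for a specific empty value, keeps the sketch independent of how the API encodes a missing parent.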

Install

uv pip install whatifd whatifd-datadog

The adapter core is dependency-light and offline-testable. The Export-API HTTP client needs httpx, which is pulled in by the [live] extra:

uv pip install "whatifd-datadog[live]"

The official datadog-api-client SDK does not expose a spans-read path, so whatifd-datadog wraps the documented Export API with a thin httpx client.

Credentials

Read from the environment (never from config):

Variable      Notes
DD_API_KEY    Datadog API key.
DD_APP_KEY    Datadog Application key. The Export API requires both keys, not just the API key.
DD_SITE       Region; defaults to datadoghq.com (use datadoghq.eu, us3.datadoghq.com, …).
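If you build the client yourself, it can help to fail early when a key is missing rather than at the first API call. A minimal sketch, assuming a hypothetical helper named load_datadog_credentials (not part of whatifd-datadog) and the defaults listed above:

```python
import os


def load_datadog_credentials(env=None):
    """Return (api_key, app_key, site), raising if either key is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ("DD_API_KEY", "DD_APP_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    # DD_SITE is optional and falls back to the documented default region.
    return env["DD_API_KEY"], env["DD_APP_KEY"], env.get("DD_SITE", "datadoghq.com")
```

Passing a plain dict as env makes the helper easy to exercise in tests without touching the real environment.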

Wiring

DatadogTraceSource is span-iterator-shaped (like the Phoenix adapter): it takes a zero-arg spans_provider yielding normalized span dicts. whatifd_datadog.client.make_spans_provider builds one from the Export API:

import os
from whatifd_datadog import DatadogTraceSource
from whatifd_datadog.client import DatadogExportClient, make_spans_provider

client = DatadogExportClient(
    api_key=os.environ["DD_API_KEY"],
    app_key=os.environ["DD_APP_KEY"],
    site=os.environ.get("DD_SITE", "datadoghq.com"),
)

source = DatadogTraceSource(
    spans_provider=make_spans_provider(
        client,
        ml_app="my-agent",
        from_ts="now-24h",     # ALWAYS set a window — see below
    ),
    cohort_classifier=lambda spans: (
        "failure" if any("whatifd:failure" in (s.get("tags") or []) for s in spans)
        else "baseline"
    ),
)
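The cohort_classifier lambda can equally be written as a named function, which is easier to unit-test on its own. The sketch below uses the same tag check as the snippet above; the name classify_cohort is only illustrative.

```python
def classify_cohort(spans):
    """Return "failure" if any span in the trace carries the whatifd:failure tag."""
    for span in spans:
        # `tags` may be missing or None on a span, so fall back to an empty list.
        if "whatifd:failure" in (span.get("tags") or []):
            return "failure"
    return "baseline"


tagged = [{"tags": ["env:prod"]}, {"tags": ["whatifd:failure"]}]
untagged = [{"tags": None}, {}]
```

It can then be passed as cohort_classifier=classify_cohort in place of the lambda.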

Warning

Always bound the time window. If no window is given, the Export API defaults to the last 15 minutes, so a forgotten window would silently yield a near-empty cohort. To prevent that, make_spans_provider requires an explicit from_ts (e.g. "now-24h") and raises ValueError when it is missing.

From config (whatifd.config.yaml)

source:
  adapter: datadog
  dd_ml_app: my-agent
  dd_from: now-24h      # required for adapter: datadog
  dd_to: now            # optional, defaults to "now"
  dd_query: ""          # optional span-search filter

whatifd fork reads DD_API_KEY / DD_APP_KEY / DD_SITE from the environment, builds the client, and wires a default tag-based cohort classifier (whatifd:failure).
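To make the required and optional keys concrete, here is a sketch of how a parsed source: mapping could be validated and mapped onto provider arguments. This is not whatifd's loader: the function name and the to_ts and query output keys are assumptions, while ml_app and from_ts mirror the keyword arguments shown under Wiring.

```python
def resolve_datadog_source(source_cfg):
    """Validate a parsed `source:` mapping and fill in the documented defaults."""
    if source_cfg.get("adapter") != "datadog":
        raise ValueError("not a datadog source")
    if not source_cfg.get("dd_from"):
        # Mirrors the rule that the time window must always be bounded.
        raise ValueError("dd_from is required for adapter: datadog")
    return {
        "ml_app": source_cfg.get("dd_ml_app"),
        "from_ts": source_cfg["dd_from"],
        "to_ts": source_cfg.get("dd_to") or "now",   # optional, defaults to "now"
        "query": source_cfg.get("dd_query") or "",   # optional span-search filter
    }


resolved = resolve_datadog_source(
    {"adapter": "datadog", "dd_ml_app": "my-agent", "dd_from": "now-24h"}
)
```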

Statistical honesty

cluster_key_support() returns an empty tuple — whatifd does not mine Datadog session_id / trace_id as cluster keys (cardinal #10: no unannounced inferential commitments).

Sending verdict metrics back to Datadog

whatifd-datadog also ships a CI-side metrics emitter — the inverse direction. After whatifd fork writes its report, push the verdict + cohort metrics to Datadog so dashboards and monitors can track Ship-rate and regression trends:

whatifd fork --config whatifd.config.yaml
whatifd-datadog-emit reports/whatifd-fork-2026-06-04.json --tag service:my-agent

Emits gauges (agentless, POST /api/v1/series, needs DD_API_KEY):

  • whatifd.verdict.code: 0=ship / 1=dont_ship / 2=inconclusive (matches the CLI exit code; alert on > 0).

  • whatifd.cohort.{selected,replayed,scored,improved,regressed,unchanged,median_delta,ci_lower,ci_upper,floor_passed,regression_ratio,improvement_ratio} (tagged cohort:<name>).

  • whatifd.findings.blocking.
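For orientation, the gauges listed above map onto the v1 series payload roughly as sketched below. The payload shape (series, metric, points, type, tags) follows Datadog's v1 metrics API; the build_series function and its arguments are hypothetical, since the layout of the report file is not described here.

```python
import time

VERDICT_CODES = {"ship": 0, "dont_ship": 1, "inconclusive": 2}


def gauge(metric, value, tags, ts):
    """One gauge entry in the v1 series format."""
    return {"metric": metric, "type": "gauge", "points": [[ts, float(value)]], "tags": list(tags)}


def build_series(verdict, cohorts, blocking_findings, tags=(), ts=None):
    """Build the JSON body for POST /api/v1/series from already-parsed values."""
    ts = int(time.time()) if ts is None else ts
    series = [
        gauge("whatifd.verdict.code", VERDICT_CODES[verdict], tags, ts),
        gauge("whatifd.findings.blocking", blocking_findings, tags, ts),
    ]
    for name, metrics in cohorts.items():
        cohort_tags = [*tags, f"cohort:{name}"]  # per-cohort metrics carry a cohort tag
        for key, value in metrics.items():
            series.append(gauge(f"whatifd.cohort.{key}", value, cohort_tags, ts))
    return {"series": series}


payload = build_series(
    "dont_ship",
    {"failure": {"selected": 10, "regressed": 2}},
    blocking_findings=1,
    tags=["service:my-agent"],
    ts=1700000000,
)
```

The sketch only builds the body; submitting it would additionally need the API key header and an HTTP client.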

The emitter lives outside whatifd core (it only reads the already-written report) and soft-fails by default (exit 0 on emission error), so a metrics hiccup can never turn a green verdict red in CI. Pass --strict to make emission failures fail the run, or --dry-run to print the metrics without submitting them.
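The soft-fail behaviour amounts to catching emission errors and choosing the exit code afterwards. A minimal sketch of that pattern, assuming a hypothetical emitter_exit_code wrapper and an exit code of 1 in strict mode (the actual non-zero code is not stated here):

```python
import sys


def emitter_exit_code(emit, strict=False):
    """Run `emit()` and map the outcome to a process exit code."""
    try:
        emit()
    except Exception as exc:
        print(f"metric emission failed: {exc}", file=sys.stderr)
        # Soft-fail: only a strict run lets an emission error fail the job.
        return 1 if strict else 0
    return 0


def failing_emit():
    raise RuntimeError("network unreachable")
```

With this shape, the verdict gate (the exit code of whatifd fork) and the metrics push stay independent unless strict mode is requested.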

Known limitation

Datadog tool spans carry input as a rendered string, not structured args, so whatifd-datadog leaves ToolSpan.args unpopulated — the use-original tool cache (#108) does not fill from Datadog traces. This matches the other adapters’ status; structured-args extraction is tracer-specific follow-up work.