Langfuse

Langfuse is whatifd’s v0.1 default trace source. It ships as a separate distribution (whatifd-langfuse) so you only install it if you use it. If you’re already using Langfuse for tracing, whatifd reads from it directly — no re-instrumentation, no parallel data pipeline.

What `whatifd` reads from Langfuse

For each selected trace:

Inputs — what the user sent to the agent (wrapped as Sensitive[str] at the boundary per cardinal #5).
Outputs — what the agent returned (the original output the verdict is diffed against; also wrapped).
Spans + tool calls — populated into the ToolCache for use-original replay.
Scores — used by the failure / baseline selectors.
Metadata + tags — used for cohort filtering.
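
The `Sensitive[str]` wrapping mentioned above can be pictured with a minimal sketch. The class below is a hypothetical stand-in written for illustration, not whatifd's actual type; the `reveal` method name is an assumption. It shows the general idea of a wrapper that keeps a value out of logs and reprs unless the caller explicitly asks for it.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Sensitive(Generic[T]):
    """Hypothetical stand-in: holds a value but keeps it out of repr/str."""

    __slots__ = ("_value",)

    def __init__(self, value: T) -> None:
        self._value = value

    def __repr__(self) -> str:
        # Never include the wrapped value in debug or log output.
        return "Sensitive(<redacted>)"

    __str__ = __repr__

    def reveal(self) -> T:
        """Explicit opt-in to read the wrapped value (assumed method name)."""
        return self._value


user_input = Sensitive("my card number is 4111...")
print(user_input)                   # Sensitive(<redacted>)
print(len(user_input.reveal()))     # reading the value requires reveal()
```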

whatifd is read-only on Langfuse. It never writes traces back.

Install

uv pip install whatifd whatifd-langfuse whatifd-inspect-ai

whatifd-langfuse declares the Langfuse Python SDK as a runtime dependency. The core whatifd package does not pull Langfuse — you can use a different trace source (your own adapter, or the synthetic whatifd.adapters.stub.StubTraceSource) without ever installing it.

Configuration

whatifd fork is config-file driven. There are no --source / --target / --change CLI flags; everything goes in whatifd.config.yaml:

source:
  adapter: langfuse
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
    filter: "score-below:0.6,since:24h"
  baseline_cohort:
    limit: 20
    filter: "score-above:0.8,since:24h"
change:
  system_prompt: "You are a senior on-call SRE..."
scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
decision: {}
reporting:
  profile: default
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0
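
The `runner` and `score_fn` values above look like `python:<module>:<attribute>` references. How whatifd resolves them is not described here, so the following is only a sketch under that assumed format. It may be handy as a pre-flight check that a path such as `python:my_agent.replay:run` is importable before starting a run. `resolve_spec` is a hypothetical helper name.

```python
import importlib
from typing import Any


def resolve_spec(spec: str) -> Any:
    """Resolve 'python:<module>:<attribute>' to the object it names.

    The spec format is inferred from the config example, not from
    whatifd's source.
    """
    scheme, _, rest = spec.partition(":")
    module_name, _, attr = rest.partition(":")
    if scheme != "python" or not module_name or not attr:
        raise ValueError(f"expected 'python:<module>:<attribute>', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated with a standard-library target so the sketch runs anywhere.
join = resolve_spec("python:os.path:join")
print(join("a", "b"))
```

An `ImportError` or `AttributeError` from this check points at a typo in the config or a package missing from the environment.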

Set Langfuse credentials via environment variables (the adapter reads them at startup):

export LANGFUSE_HOST="https://cloud.langfuse.com"   # or self-hosted URL
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

whatifd fork --config whatifd.config.yaml

The full config schema is documented in the config reference.
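
Because the credentials come from the environment, a small pre-flight check can catch a missing variable before a run starts. This is a sketch, not part of whatifd: `missing_langfuse_vars` is a hypothetical helper, and treating all three variables listed above as required is an assumption.

```python
import os

# The three variables named in this section; assumed to all be required.
REQUIRED_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_vars(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "LANGFUSE_HOST": "https://cloud.langfuse.com",
    "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
}
print(missing_langfuse_vars(example_env))  # ['LANGFUSE_SECRET_KEY']
```

Calling `missing_langfuse_vars()` with no argument inspects the real process environment.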

Selector grammar

The filter field on each cohort accepts comma-separated selectors:

| Selector | Effect |
| --- | --- |
| `score-below:N` | Traces with at least one score below `N`. |
| `score-above:N` | Traces with at least one score above `N`. |
| `since:DURATION` | Bound the time window (e.g., `24h`, `7d`). |
| `tag:T` | Filter by Langfuse tag. |
| `sample:random,seed:N` | Random subsample with a fixed seed (use the same seed in CI for reproducibility). |

Selectors compose: "score-below:0.6,since:24h,tag:critical". The limit field lives at the cohort level (a sibling of filter), not inside the selector string.
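
To make the grammar concrete, here is a minimal sketch of how a filter string of this shape could be split into key/value pairs. It is not whatifd's parser: `parse_filter` and `parse_duration` are hypothetical names, and the sketch only understands the `h` and `d` duration units shown in the examples.

```python
from datetime import timedelta

# Only the units that appear in the examples; anything else is rejected.
_UNITS = {"h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Turn '24h' or '7d' into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def parse_filter(expr: str) -> dict[str, str]:
    """Split 'key:value,key:value' into a dict, rejecting malformed parts.

    A repeated key overwrites the earlier value in this sketch.
    """
    parsed: dict[str, str] = {}
    for part in expr.split(","):
        key, sep, value = part.strip().partition(":")
        if not sep or not key or not value:
            raise ValueError(f"malformed selector: {part!r}")
        parsed[key] = value
    return parsed


selectors = parse_filter("score-below:0.6,since:24h,tag:critical")
print(selectors)
# {'score-below': '0.6', 'since': '24h', 'tag': 'critical'}
print(parse_duration(selectors["since"]))
# 1 day, 0:00:00
```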

Tool-call cache

Langfuse traces include tool spans with their inputs and outputs. whatifd’s default cache policy (use-original) replays your agent against these stored outputs — destructive side effects don’t re-fire.

If a trace has incomplete tool spans (e.g., a span missing its output), whatifd records the trace as a structured replay failure in the report’s Replay validity section rather than raising an unhandled exception (cardinal #1: failures-as-data). v0.3 adds a live cache policy with per-tool allowlists for cases where use-original fails because tool outputs were time-sensitive.
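
The idea behind use-original replay can be sketched as a lookup table built from recorded tool spans. The class below is illustrative only and is not whatifd's ToolCache: the span field names (`name`, `input`, `output`), the key scheme, and the `ReplayFailure` record are all assumptions. It shows the two behaviours described above: recorded outputs are returned without calling the tool, and a missing output becomes data instead of an exception.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ReplayFailure:
    """Structured record of a cache miss, returned instead of raised."""
    tool: str
    reason: str


class ToolCache:
    """Hypothetical use-original cache keyed by tool name and arguments."""

    def __init__(self, spans: list[dict[str, Any]]) -> None:
        self._outputs: dict[str, Any] = {}
        for span in spans:
            # A span without an output is stored as None and reported on lookup.
            self._outputs[self._key(span["name"], span["input"])] = span.get("output")

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> str:
        # Sort keys so argument order does not change the cache key.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def lookup(self, tool: str, args: dict[str, Any]) -> Any:
        key = self._key(tool, args)
        if key not in self._outputs:
            return ReplayFailure(tool, "no recorded span for these arguments")
        output = self._outputs[key]
        if output is None:
            return ReplayFailure(tool, "recorded span has no output")
        return output


cache = ToolCache([
    {"name": "get_ticket", "input": {"id": 7}, "output": {"status": "open"}},
    {"name": "restart_service", "input": {"svc": "api"}},  # output missing
])
print(cache.lookup("get_ticket", {"id": 7}))
print(cache.lookup("restart_service", {"svc": "api"}))
```

Note that `restart_service` is never actually invoked: the replayed agent only ever sees what the original trace recorded.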

Scope of v0.1 support

| Feature | v0.1 | Notes |
| --- | --- | --- |
| Read traces by score / tag / time | ✅ | Core selectors. |
| Read tool spans for cache | ✅ | `use-original` policy. |
| Multi-tenant Langfuse projects | ✅ | Adapter reads project from credentials. |
| Self-hosted Langfuse | ✅ | Set `LANGFUSE_HOST`. |
| Write scores back to Langfuse | ❌ (planned) | Read-only today. |
| Langfuse “experiments” feature integration | ❌ | `whatifd`’s flow is parallel; it doesn’t write into Langfuse experiments. |

Limitations

Trace completeness matters. Sparse-span traces replay poorly under use-original. Watch the Replay validity section of the report; if replay validity falls below the floor (min_replay_validity), the verdict is Inconclusive.
Custom span types (non-LLM, non-tool spans) are passed through to the runner via trace_input.metadata but not interpreted by whatifd itself.
Sampling rate. If your Langfuse instrumentation samples (e.g., 1% of production), the cohort selectors operate on the sampled traces, not on all production traffic. Set each cohort's limit with that in mind.
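
The replay-validity floor can be illustrated with a short sketch. How whatifd computes validity is not specified here, so the definition below (the fraction of selected traces that replayed without a structured failure) is an assumption, as are the function names, the 0.8 floor, and the placeholder verdict label "Ship".

```python
def replay_validity(replayed_ok: int, selected: int) -> float:
    """Fraction of selected traces that replayed cleanly (assumed definition)."""
    if selected <= 0:
        raise ValueError("selected must be positive")
    return replayed_ok / selected


def gate_verdict(validity: float, min_replay_validity: float, verdict: str) -> str:
    """Downgrade any verdict to 'Inconclusive' when validity is below the floor."""
    return verdict if validity >= min_replay_validity else "Inconclusive"


# 13 of 20 selected traces replayed cleanly; the floor here is illustrative.
validity = replay_validity(replayed_ok=13, selected=20)
print(validity)  # 0.65
print(gate_verdict(validity, min_replay_validity=0.8, verdict="Ship"))  # Inconclusive
```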