Langfuse¶
Langfuse is whatifd’s v0.1 default trace source. It ships as a separate distribution (whatifd-langfuse) so you only install it if you use it. If you’re already using Langfuse for tracing, whatifd reads from it directly — no re-instrumentation, no parallel data pipeline.
What whatifd reads from Langfuse¶
For each selected trace:
Inputs: what the user sent to the agent (wrapped as Sensitive[str] at the boundary per cardinal #5).
Outputs: what the agent returned (the original output the verdict is diffed against; also wrapped).
Spans + tool calls: populated into the ToolCache for use-original replay.
Scores: used by the failure / baseline selectors.
Metadata + tags: used for cohort filtering.
whatifd is read-only on Langfuse. It never writes traces back.
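To illustrate what wrapping inputs and outputs "at the boundary" can mean in practice, here is a minimal sketch of a Sensitive-style wrapper. It is a hypothetical stand-in, not whatifd's actual class: the reveal method and the redacted repr are assumptions chosen for illustration.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Sensitive(Generic[T]):
    """Hypothetical stand-in for a Sensitive[T] wrapper (not whatifd's class)."""

    def __init__(self, value: T) -> None:
        self._value = value

    def reveal(self) -> T:
        # Explicit opt-in: callers must ask for the raw value by name.
        return self._value

    def __repr__(self) -> str:
        # Keep the wrapped payload out of logs, tracebacks, and f-strings.
        return "Sensitive(<redacted>)"

    __str__ = __repr__
```

The point of a wrapper like this is that accidental formatting or logging shows only the redacted placeholder, while deliberate access stays one explicit call away.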
Install¶
uv pip install whatifd whatifd-langfuse whatifd-inspect-ai
whatifd-langfuse declares the Langfuse Python SDK as a runtime dependency. The core whatifd package does not pull in Langfuse, so you can use a different trace source (your own adapter, or the synthetic whatifd.adapters.stub.StubTraceSource) without ever installing it.
Configuration¶
whatifd fork is config-file driven. There are no --source / --target / --change CLI flags; everything goes in whatifd.config.yaml:
source:
  adapter: langfuse
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
    filter: "score-below:0.6,since:24h"
  baseline_cohort:
    limit: 20
    filter: "score-above:0.8,since:24h"
change:
  system_prompt: "You are a senior on-call SRE..."
scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
decision: {}
reporting:
  profile: default
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0
Set Langfuse credentials via environment variables (the adapter reads them at startup):
export LANGFUSE_HOST="https://cloud.langfuse.com" # or self-hosted URL
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
whatifd fork --config whatifd.config.yaml
The full config schema is documented in the config reference.
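Because the credentials come from the environment, a small preflight check before calling whatifd fork can catch a missing variable early. The sketch below is not part of whatifd; the helper name missing_langfuse_env is hypothetical, and treating all three variables as required is an assumption based on the export lines above.

```python
import os
from typing import Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

A CI step could call missing_langfuse_env() and abort with a readable message if the returned list is non-empty.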
Selector grammar¶
The filter field on each cohort accepts comma-separated selectors:
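| Selector | Effect |
|---|---|
| score-below:<threshold> | Traces with at least one score below the threshold. |
| score-above:<threshold> | Traces with at least one score above the threshold. |
| since:<duration> | Bound the time window (e.g., since:24h). |
| tag:<name> | Filter by Langfuse tag. |
| | Random subsample with a fixed seed (use the same seed in CI for reproducibility). |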
Selectors compose: "score-below:0.6,since:24h,tag:critical". limit lives at the cohort level (sibling to filter), not inside the selector string.
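The comma-separated name:value shape is easy to reason about with a tiny parser. The following is an illustrative sketch only, not whatifd's own parser: parse_filter and Selector are hypothetical names, and the strictness rules (rejecting parts without a colon) are assumptions.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    name: str
    value: str


def parse_filter(filter_string: str) -> list[Selector]:
    """Split 'name:value,name:value' into (name, value) pairs."""
    selectors = []
    for part in filter_string.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas and stray whitespace
        # Split on the first colon only, so values may contain colons.
        name, sep, value = part.partition(":")
        if not sep or not name or not value:
            raise ValueError(f"malformed selector: {part!r}")
        selectors.append(Selector(name, value))
    return selectors
```

For example, parse_filter("score-below:0.6,since:24h,tag:critical") yields three pairs in order, one per selector.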
Tool-call cache¶
Langfuse traces include tool spans with their inputs and outputs. whatifd’s default cache policy (use-original) replays your agent against these stored outputs — destructive side effects don’t re-fire.
If a trace has incomplete tool spans (e.g., a span missing its output), whatifd records the trace as a structured replay failure in the report’s Replay validity section (cardinal #1: failures-as-data, never an unhandled exception). v0.3 adds a live cache policy with per-tool allowlists for cases where replaying from the original cache fails because tool outputs were time-sensitive.
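The replay idea can be sketched as a lookup table keyed by tool name and arguments. This is a simplified illustration under stated assumptions, not whatifd's ToolCache: the class names OriginalToolCache and ReplayFailure, the JSON-based key, and the choice to return None on a miss are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ReplayFailure:
    """Structured record of a tool call that could not be replayed."""
    tool: str
    reason: str


@dataclass
class OriginalToolCache:
    """Hypothetical cache: serves recorded outputs, never calls the real tool."""
    outputs: dict = field(default_factory=dict)
    failures: list = field(default_factory=list)

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # Canonical JSON so argument order does not affect the lookup.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def record(self, tool: str, args: dict, output) -> None:
        self.outputs[self._key(tool, args)] = output

    def replay(self, tool: str, args: dict):
        output = self.outputs.get(self._key(tool, args))
        if output is None:
            # Failure-as-data: note the gap and return None instead of raising.
            self.failures.append(ReplayFailure(tool, "no recorded output"))
            return None
        return output
```

Because replay only reads from the recorded outputs, a tool such as an email sender is never invoked again, and a span with no stored output surfaces as an entry in failures.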
Scope of v0.1 support¶
| Feature | v0.1 | Notes |
|---|---|---|
| Read traces by score / tag / time | ✅ | Core selectors. |
| Read tool spans for cache | ✅ | |
| Multi-tenant Langfuse projects | ✅ | Adapter reads project from credentials. |
| Self-hosted Langfuse | ✅ | Set LANGFUSE_HOST. |
| Write scores back to Langfuse | ❌ (planned) | Read-only today. |
| Langfuse “experiments” feature integration | ❌ | |
Limitations¶
Trace completeness matters. Sparse-span traces replay poorly under use-original. Watch the Replay validity section of the report; below the floor (min_replay_validity) the verdict is Inconclusive.
Custom span types (non-LLM, non-tool spans) are passed through to the runner via trace_input.metadata but not interpreted by whatifd itself.
Sampling rate. If your Langfuse instrumentation samples (e.g., 1% of production), the cohort selectors operate on the sampled traces, not all production traffic. Plan cohort limit accordingly.
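The replay-validity floor can be pictured as a simple ratio check. The sketch below assumes validity is the fraction of selected traces that replayed cleanly; how whatifd actually computes it is not stated here, and the function name verdict_gate and the "Proceed" label are hypothetical.

```python
def verdict_gate(replayed_ok: int, total: int, min_replay_validity: float) -> str:
    """Return 'Inconclusive' when too few traces replayed cleanly."""
    if total <= 0:
        return "Inconclusive"  # nothing was replayed, so nothing can be judged
    validity = replayed_ok / total  # assumed definition: clean replays / selected traces
    if validity < min_replay_validity:
        return "Inconclusive"
    return "Proceed"  # hypothetical label: continue to the normal verdict logic
```

With a cohort limit of 20 and a floor of 0.8, for instance, 14 clean replays (0.7) would fall below the floor, which is one reason to size limit with sparse traces in mind.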