Langfuse

Langfuse is whatifd’s v0.1 default trace source. It ships as a separate distribution (whatifd-langfuse) so you only install it if you use it. If you’re already using Langfuse for tracing, whatifd reads from it directly — no re-instrumentation, no parallel data pipeline.

What whatifd reads from Langfuse

For each selected trace:

  • Inputs — what the user sent to the agent (wrapped as Sensitive[str] at the boundary per cardinal #5).

  • Outputs — what the agent returned (the original output the verdict is diffed against; also wrapped).

  • Spans + tool calls — populated into the ToolCache for use-original replay.

  • Scores — used by the failure / baseline selectors.

  • Metadata + tags — used for cohort filtering.

whatifd is read-only on Langfuse. It never writes traces back.
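
The boundary wrapping described above can be pictured with a small sketch. This is not whatifd's actual `Sensitive` type or trace model: the class bodies, the `reveal()` method, and the `TraceRecord` field names are all hypothetical, chosen only to show how a wrapper can keep raw inputs and outputs out of logs and reprs.

```python
from dataclasses import dataclass, field
from typing import Generic, TypeVar

T = TypeVar("T")


class Sensitive(Generic[T]):
    """Hypothetical stand-in: holds a value but never prints it."""

    def __init__(self, value: T) -> None:
        self._value = value

    def reveal(self) -> T:
        # Single explicit access point for the raw value.
        return self._value

    def __repr__(self) -> str:
        return "Sensitive(<redacted>)"

    __str__ = __repr__


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical shape of what is read for one trace."""

    trace_id: str
    user_input: Sensitive[str]    # what the user sent to the agent
    agent_output: Sensitive[str]  # the original output
    scores: dict[str, float] = field(default_factory=dict)
    tags: tuple[str, ...] = ()


record = TraceRecord(
    trace_id="t-1",
    user_input=Sensitive("my password is hunter2"),
    agent_output=Sensitive("ok"),
    scores={"faithfulness": 0.4},
    tags=("critical",),
)
print(record)  # the raw input and output do not appear in the printed repr
```

The design point is that accidental logging of a record prints a placeholder, while code that genuinely needs the text has to call `reveal()` on purpose.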

Install

uv pip install whatifd whatifd-langfuse whatifd-inspect-ai

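whatifd-langfuse declares the Langfuse Python SDK as a runtime dependency. The core whatifd package does not pull in Langfuse, so you can use a different trace source (your own adapter, or the synthetic whatifd.adapters.stub.StubTraceSource) without ever installing it. The install line above also includes whatifd-inspect-ai, which corresponds to the inspect_ai scorer adapter used in the example config below.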

Configuration

whatifd fork is config-file driven. There are no --source / --target / --change CLI flags; everything goes in whatifd.config.yaml:

source:
  adapter: langfuse
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
    filter: "score-below:0.6,since:24h"
  baseline_cohort:
    limit: 20
    filter: "score-above:0.8,since:24h"
change:
  system_prompt: "You are a senior on-call SRE..."
scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
decision: {}
reporting:
  profile: default
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0

Set Langfuse credentials via environment variables (the adapter reads them at startup):

export LANGFUSE_HOST="https://cloud.langfuse.com"   # or self-hosted URL
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

whatifd fork --config whatifd.config.yaml

The full config schema is documented in the config reference.
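
Because the adapter reads credentials at startup, a CI job may want to fail fast when one is missing. The following is a minimal sketch of such a preflight check, not part of whatifd: the function name `missing_langfuse_env` is hypothetical, and it assumes all three variables shown above must be set (whether the adapter treats LANGFUSE_HOST as optional is not stated here).

```python
import os
from collections.abc import Mapping

# The three variables named in the export example above.
REQUIRED_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "LANGFUSE_HOST": "https://cloud.langfuse.com",
    "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
}
print(missing_langfuse_env(example_env))
```

Called with no argument, the function inspects the real process environment; a wrapper script could exit non-zero when the returned list is non-empty, before invoking whatifd fork.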

Selector grammar

The filter field on each cohort accepts comma-separated selectors:

| Selector | Effect |
|---|---|
| score-below:N | Traces with at least one score below N. |
| score-above:N | Traces with at least one score above N. |
| since:DURATION | Bound the time window (e.g., 24h, 7d). |
| tag:T | Filter by Langfuse tag. |
| sample:random,seed:N | Random subsample with a fixed seed (use the same seed in CI for reproducibility). |

Selectors compose: "score-below:0.6,since:24h,tag:critical". limit lives at the cohort level (sibling to filter), not inside the selector string.
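
To make the grammar concrete, here is a minimal sketch of how a comma-separated filter string could be split into typed selectors. It is an illustration, not whatifd's parser: `parse_filter` and `parse_duration` are hypothetical names, only the `h` and `d` duration units shown in the table are handled, and collecting repeated `tag:` selectors into a list is an assumption.

```python
import re
from datetime import timedelta

_DURATION = re.compile(r"^(\d+)([hd])$")
_UNITS = {"h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Turn '24h' or '7d' into a timedelta; other units are rejected."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def parse_filter(spec: str) -> dict:
    """Split 'key:value,key:value' into a dict of typed selector values."""
    parsed: dict = {}
    tags: list[str] = []
    for token in spec.split(","):
        key, sep, value = token.strip().partition(":")
        if not sep or not value:
            raise ValueError(f"malformed selector: {token!r}")
        if key in ("score-below", "score-above"):
            parsed[key] = float(value)
        elif key == "since":
            parsed[key] = parse_duration(value)
        elif key == "seed":
            parsed[key] = int(value)
        elif key == "sample":
            parsed[key] = value
        elif key == "tag":
            tags.append(value)
        else:
            raise ValueError(f"unknown selector: {key!r}")
    if tags:
        parsed["tag"] = tags
    return parsed


print(parse_filter("score-below:0.6,since:24h,tag:critical"))
```

Note that `sample:random,seed:N` arrives as two tokens under this scheme, which is why `seed` is handled as its own key.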

Tool-call cache

Langfuse traces include tool spans with their inputs and outputs. whatifd’s default cache policy (use-original) replays your agent against these stored outputs instead of calling the tools again, so destructive side effects don’t re-fire.

If a trace has incomplete tool spans (e.g., a span missing its output), whatifd records the trace as a structured replay failure in the report’s Replay validity section (cardinal #1: failures-as-data, never an unhandled exception). v0.3 adds a live cache policy with per-tool allowlists for cases where use-original fails because tool outputs were time-sensitive.
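
The idea behind use-original replay and failures-as-data can be sketched as follows. This is not whatifd's ToolCache: the class `OriginalOutputCache`, the `ReplayFailure` record, and the choice to key entries by tool name plus JSON-serialized arguments are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ReplayFailure:
    """Structured record of a lookup that could not be served."""

    trace_id: str
    tool: str
    reason: str


def _key(tool: str, args: dict[str, Any]) -> str:
    # Stable key: tool name plus arguments serialized in sorted-key order.
    return f"{tool}:{json.dumps(args, sort_keys=True)}"


@dataclass
class OriginalOutputCache:
    trace_id: str
    outputs: dict[str, Any] = field(default_factory=dict)
    failures: list[ReplayFailure] = field(default_factory=list)

    def record(self, tool: str, args: dict[str, Any], output: Any) -> None:
        """Store the output captured in the original trace's tool span."""
        self.outputs[_key(tool, args)] = output

    def lookup(self, tool: str, args: dict[str, Any]) -> object:
        """Return the stored output, or a ReplayFailure instead of raising."""
        key = _key(tool, args)
        if key not in self.outputs:
            failure = ReplayFailure(self.trace_id, tool, "no stored output for this call")
            self.failures.append(failure)
            return failure
        return self.outputs[key]


cache = OriginalOutputCache(trace_id="t-1")
cache.record("search_docs", {"query": "disk full"}, {"hits": 3})

print(cache.lookup("search_docs", {"query": "disk full"}))  # served from the trace
print(cache.lookup("restart_service", {"name": "api"}))     # recorded as a failure, tool never runs
```

The real tool is never invoked in either case, which is what keeps side effects from re-firing; a miss becomes data the report can count rather than an exception that aborts the run.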

Scope of v0.1 support

| Feature | v0.1 | Notes |
|---|---|---|
| Read traces by score / tag / time | ✅ | Core selectors. |
| Read tool spans for cache | ✅ | use-original policy. |
| Multi-tenant Langfuse projects | ✅ | Adapter reads project from credentials. |
| Self-hosted Langfuse | ✅ | Set LANGFUSE_HOST. |
| Write scores back to Langfuse | ❌ (planned) | Read-only today. |
| Langfuse “experiments” feature integration | | whatifd’s flow is parallel; it doesn’t write into Langfuse experiments. |

Limitations

  • Trace completeness matters. Sparse-span traces replay poorly under use-original. Watch the Replay validity section of the report: if replay validity falls below the configured floor (min_replay_validity), the verdict is Inconclusive.

  • Custom span types (non-LLM, non-tool spans) are passed through to the runner via trace_input.metadata but not interpreted by whatifd itself.

  • Sampling rate. If your Langfuse instrumentation samples (e.g., 1% of production), the cohort selectors operate on the sampled traces, not all production traffic. Size each cohort’s limit with that in mind.
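
The replay-validity floor mentioned in the first limitation can be illustrated with a small calculation. This is a sketch under stated assumptions, not whatifd's implementation: it assumes replay validity is the fraction of selected traces that replayed successfully, the function names are hypothetical, and the 0.8 floor is an example value rather than a documented default.

```python
def replay_validity(replayed_ok: int, selected: int) -> float:
    """Fraction of selected traces that replayed without a structured failure."""
    if selected <= 0:
        return 0.0
    return replayed_ok / selected


def gate(replayed_ok: int, selected: int, min_replay_validity: float) -> str:
    """Return 'Inconclusive' below the floor, otherwise 'proceed'."""
    if replay_validity(replayed_ok, selected) < min_replay_validity:
        return "Inconclusive"
    return "proceed"


# 14 of 20 traces replayed cleanly: 0.7 is below a 0.8 floor.
print(gate(14, 20, 0.8))
# 18 of 20 traces replayed cleanly: 0.9 clears the same floor.
print(gate(18, 20, 0.8))
```

Under this reading, a cohort with many sparse-span traces can push the run to Inconclusive even when the surviving replays look good, which is why cohort size and trace completeness both matter.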