Langfuse

Langfuse is whatif’s v0.1 default source adapter. If you’re already using Langfuse for tracing, whatif reads from it directly: no re-instrumentation, no parallel data pipeline.

What whatif reads from Langfuse

For each selected trace:

  • Inputs: what the user sent to the agent.

  • Outputs: what the agent returned (the original output the verdict is diffed against).

  • Spans + tool calls: populated into the ToolCache for use-original replay.

  • Scores: used by the failure / baseline selectors (score-below:0.6, score-above:0.8).

  • Metadata + tags: used for cohort filtering (tag:incident-triage).

whatif is read-only on Langfuse. It never writes traces back.
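As a rough mental model, the fields listed above can be pictured as one immutable record per trace. The sketch below is illustrative only: `TraceRecord`, `ToolCall`, and `has_score_below` are hypothetical names invented here, not whatif's or Langfuse's actual types.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool span (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]
    output: Optional[Any] = None  # None models a span that is missing its output


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical read-only view of what is read from one Langfuse trace."""
    trace_id: str
    inputs: Any                      # what the user sent to the agent
    outputs: Any                     # what the agent originally returned
    tool_calls: tuple[ToolCall, ...] = ()
    scores: dict[str, float] = field(default_factory=dict)
    tags: frozenset[str] = frozenset()
    metadata: dict[str, Any] = field(default_factory=dict)


def has_score_below(trace: TraceRecord, threshold: float) -> bool:
    """True if at least one score on the trace is below the threshold."""
    return any(value < threshold for value in trace.scores.values())
```

Freezing the dataclasses mirrors the read-only contract: nothing in this model can be written back to the source.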

Configuration

export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

whatif fork \
  --source langfuse \
  --target "python:my_agent.replay:run" \
  --failures "score-below:0.6,since:24h,limit:20" \
  --baseline "score-above:0.8,since:24h,limit:20" \
  --change "system_prompt=prompts/v3.txt" \
  --score "inspect_ai:faithfulness"

The project and endpoint are set in the config file:

source:
  type: langfuse
  project: incident-triage-prod
  endpoint: ${LANGFUSE_HOST}
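Before invoking the CLI from a script or CI job, it can help to confirm that the three environment variables above are actually set. This is a minimal sketch of such a pre-flight check; `missing_langfuse_vars` is a hypothetical helper and not part of whatif.

```python
import os
from typing import Mapping

# The three variables exported in the configuration example.
REQUIRED_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "LANGFUSE_HOST": "https://cloud.langfuse.com",
    "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
}
print(missing_langfuse_vars(example_env))  # prints ['LANGFUSE_SECRET_KEY']
```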

Selector grammar

| Selector | Effect |
| --- | --- |
| score-below:N | Traces with at least one score below N. |
| score-above:N | Traces with at least one score above N. |
| since:DURATION | Bound the time window (e.g., 24h, 7d). |
| limit:N | Cap the cohort size. |
| tag:T | Filter by Langfuse tag. |
| sample:random,seed:N | Random subsample with a fixed seed (use the same seed in CI). |

Selectors compose with commas: "score-below:0.6,since:24h,limit:20,tag:critical".
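To make the comma-composed grammar concrete, here is a small sketch of how such a selector string could be split into typed clauses, and why a fixed seed gives a reproducible subsample. It is an illustration under assumptions, not whatif's parser: `parse_selector`, `parse_duration`, and `subsample` are hypothetical names, and only the `h` and `d` duration units shown above are handled.

```python
import random
import re
from datetime import timedelta

_DURATION_UNITS = {"h": "hours", "d": "days"}  # only the units shown in the docs


def parse_duration(text: str) -> timedelta:
    """Turn '24h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_DURATION_UNITS[unit]: amount})


def parse_selector(spec: str) -> dict[str, object]:
    """Split a comma-composed selector into a dict of typed clauses."""
    parsed: dict[str, object] = {}
    for clause in spec.split(","):
        key, sep, value = clause.strip().partition(":")
        if not sep or not value:
            raise ValueError(f"malformed clause: {clause!r}")
        if key in ("score-below", "score-above"):
            parsed[key] = float(value)
        elif key in ("limit", "seed"):
            parsed[key] = int(value)
        elif key == "since":
            parsed[key] = parse_duration(value)
        elif key in ("tag", "sample"):
            parsed[key] = value
        else:
            raise ValueError(f"unknown selector: {key!r}")
    return parsed


def subsample(trace_ids: list[str], k: int, seed: int) -> list[str]:
    """Seeded random subsample: the same seed and input give the same cohort."""
    rng = random.Random(seed)
    return rng.sample(trace_ids, min(k, len(trace_ids)))


print(parse_selector("score-below:0.6,since:24h,limit:20,tag:critical"))
```

Because `sample:random,seed:N` is itself comma-separated, it falls out of the same clause loop as two entries, `sample` and `seed`.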

Tool-call cache

Langfuse traces include tool spans with their inputs and outputs. whatif’s use-original cache policy replays your agent against these stored outputs, so destructive side effects don’t re-fire.

If a trace has incomplete tool spans (e.g., a span missing its output), whatif records the trace as a replay failure in the report’s Replay validity section. v0.3 adds a live cache policy with per-tool allowlists for cases where use-original fails because tool outputs were time-sensitive.
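The idea behind replaying against stored outputs can be sketched as a lookup keyed by tool name and arguments: a hit returns the recorded output without running the tool, and a miss or an incomplete span is surfaced as an error rather than silently calling the real tool. This is a simplified illustration, not whatif's ToolCache; `OriginalToolCache`, `ReplayMiss`, and the keying scheme are assumptions made for the example.

```python
import json
from typing import Any


class ReplayMiss(Exception):
    """Raised when a tool call has no usable recorded output."""


def _key(tool: str, arguments: dict[str, Any]) -> tuple[str, str]:
    # Canonical JSON so that argument order does not affect matching.
    return tool, json.dumps(arguments, sort_keys=True, default=str)


class OriginalToolCache:
    """Hypothetical cache that serves recorded tool outputs during replay."""

    def __init__(self) -> None:
        self._outputs: dict[tuple[str, str], Any] = {}
        self._incomplete: set[tuple[str, str]] = set()

    def record(self, tool: str, arguments: dict[str, Any],
               output: Any = None, has_output: bool = True) -> None:
        """Store one tool span; has_output=False marks a span missing its output."""
        key = _key(tool, arguments)
        if has_output:
            self._outputs[key] = output
        else:
            self._incomplete.add(key)

    def call(self, tool: str, arguments: dict[str, Any]) -> Any:
        """Return the recorded output; never executes the real tool."""
        key = _key(tool, arguments)
        if key in self._outputs:
            return self._outputs[key]
        reason = ("span is missing its output" if key in self._incomplete
                  else "no matching span")
        raise ReplayMiss(f"{tool}: {reason}")


cache = OriginalToolCache()
cache.record("lookup_incident", {"id": 42}, output={"status": "open"})
print(cache.call("lookup_incident", {"id": 42}))  # prints {'status': 'open'}
```

Raising on a miss, instead of falling through to the live tool, is what keeps destructive side effects from re-firing in this model.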

Scope of v0.1 support

| Feature | v0.1 | Notes |
| --- | --- | --- |
| Read traces by score / tag / time | :white_check_mark: | Core selectors. |
| Read tool spans for cache | :white_check_mark: | use-original policy. |
| Multi-tenant Langfuse projects | :white_check_mark: | Pass project: in config. |
| Self-hosted Langfuse | :white_check_mark: | Set endpoint:. |
| Write scores back to Langfuse | :x: (planned for v0.3) | Read-only in v0.1. |
| Langfuse “experiments” feature integration | :x: | whatif’s flow is parallel; we don’t write into Langfuse experiments. |

Limitations

  • Trace completeness matters. Langfuse traces with sparse spans replay poorly under use-original. Watch the Replay validity section of the report.

  • Custom span types (non-LLM-call, non-tool-call spans) are passed through to the runner via trace_input.metadata but not interpreted by whatif itself.

  • Sampling rate. If your Langfuse instrumentation samples (e.g., 1% of production), the failure cohort selectors operate on the sampled traces, not all production traffic. Set the cohort limit with that in mind: a low sampling rate may leave fewer matching traces in the window than the limit asks for.
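The sampling point is simple arithmetic, sketched below with illustrative numbers only. Both helper names are hypothetical, and the estimate assumes sampling is uniform across production traffic.

```python
def expected_sampled_traces(production_count: int, sample_rate: float) -> float:
    """Expected number of matching traces that reach Langfuse under sampling."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return production_count * sample_rate


def production_traffic_represented(cohort_size: int, sample_rate: float) -> float:
    """Rough number of production requests a sampled cohort stands in for."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return cohort_size / sample_rate


# With 1% sampling, 500 failing production requests yield about 5 stored traces,
# so a selector with limit:20 would come back with fewer traces than requested.
print(expected_sampled_traces(500, 0.01))

# Conversely, a full cohort of 20 sampled failures stands in for roughly
# 2000 failing production requests.
print(production_traffic_represented(20, 0.01))
```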