Langfuse
Langfuse is whatif’s v0.1 default source adapter. If you’re already using Langfuse for tracing, whatif reads from it directly, with no re-instrumentation and no parallel data pipeline.
What whatif reads from Langfuse
For each selected trace:
- Inputs: what the user sent to the agent.
- Outputs: what the agent returned (the original output the verdict is diffed against).
- Spans + tool calls: populated into the `ToolCache` for `use-original` replay.
- Scores: used by the failure / baseline selectors (`score-below:0.6`, `score-above:0.8`).
- Metadata + tags: used for cohort filtering (`tag:incident-triage`).
whatif is read-only on Langfuse. It never writes traces back.
Configuration
```shell
# Langfuse host and credentials
export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

# Run a fork against traces read from Langfuse
whatif fork \
  --source langfuse \
  --target "python:my_agent.replay:run" \
  --failures "score-below:0.6,since:24h,limit:20" \
  --baseline "score-above:0.8,since:24h,limit:20" \
  --change "system_prompt=prompts/v3.txt" \
  --score "inspect_ai:faithfulness"
```
```yaml
source:
  type: langfuse
  project: incident-triage-prod
  endpoint: ${LANGFUSE_HOST}
```
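A quick preflight check can catch a missing variable before a run starts. The following is a minimal sketch, not part of whatif or the Langfuse SDK: `missing_langfuse_env` is a hypothetical helper, and it only checks that the three variables named above are non-empty.

```python
import os

# Environment variables named in the configuration above.
REQUIRED_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_env(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; whatif may validate differently.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_langfuse_env()` with no arguments inspects the current process environment; passing a dictionary makes the check easy to exercise in tests.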
Selector grammar
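| Selector | Effect |
|---|---|
| `score-below:0.6` | Traces with at least one score below the threshold (here 0.6). |
| `score-above:0.8` | Traces with at least one score above the threshold (here 0.8). |
| `since:24h` | Bound the time window (e.g., `24h`). |
| `limit:20` | Cap the cohort size. |
| `tag:critical` | Filter by Langfuse tag. |
| | Random subsample with a fixed seed (use the same seed in CI). |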
Selectors compose with commas: "score-below:0.6,since:24h,limit:20,tag:critical".
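To make the composition rule concrete, here is a minimal sketch of how a comma-separated selector string breaks down into key/value pairs. It is illustrative only: `parse_selectors` is a hypothetical helper and makes no claim about how whatif itself parses or validates selectors.

```python
def parse_selectors(spec):
    """Split 'key:value,key:value' into a list of (key, value) pairs.

    Hypothetical helper for illustration; not whatif's parser.
    """
    pairs = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        # Split on the first colon only, so values may contain colons.
        key, sep, value = part.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"malformed selector: {part!r}")
        pairs.append((key, value))
    return pairs


print(parse_selectors("score-below:0.6,since:24h,limit:20,tag:critical"))
# [('score-below', '0.6'), ('since', '24h'), ('limit', '20'), ('tag', 'critical')]
```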
Tool-call cache
Langfuse traces include tool spans with their inputs and outputs. whatif’s `use-original` cache policy replays your agent against these stored outputs, so destructive side effects don’t re-fire.
If a trace has incomplete tool spans (e.g., a span missing its output), whatif records the trace as a replay failure in the report’s Replay validity section. v0.3 adds a `live` cache policy with per-tool allowlists for cases where replay from the original cache fails because tool outputs were time-sensitive.
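The idea behind `use-original` can be sketched in a few lines: recorded tool outputs are looked up by tool name and arguments, and a span without an output surfaces as a replay failure instead of triggering a live call. This is an illustration under stated assumptions, not whatif’s `ToolCache`: the class, the span dictionary shape (`name`, `input`, `output`), and `ReplayFailure` are all hypothetical.

```python
import json


class ReplayFailure(Exception):
    """Raised when no recorded tool output is available for replay."""


class OriginalOutputCache:
    """Hypothetical cache that serves recorded outputs and never calls the tool."""

    def __init__(self, spans):
        self._outputs = {}
        for span in spans:
            # Spans recorded without an output cannot be replayed; skip them.
            if span.get("output") is not None:
                key = self._key(span["name"], span["input"])
                self._outputs[key] = span["output"]

    @staticmethod
    def _key(name, args):
        # Canonical JSON so argument order does not affect the lookup.
        return name, json.dumps(args, sort_keys=True)

    def call(self, name, args):
        try:
            return self._outputs[self._key(name, args)]
        except KeyError:
            raise ReplayFailure(f"no recorded output for {name}") from None


spans = [
    {"name": "lookup_ticket", "input": {"id": 7}, "output": {"status": "open"}},
    {"name": "page_oncall", "input": {"team": "db"}, "output": None},  # incomplete span
]
cache = OriginalOutputCache(spans)
print(cache.call("lookup_ticket", {"id": 7}))  # {'status': 'open'}
```

In this sketch, calling `cache.call("page_oncall", {"team": "db"})` raises `ReplayFailure` because that span has no stored output, and the real tool is never invoked.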
Scope of v0.1 support
| Feature | v0.1 | Notes |
|---|---|---|
| Read traces by score / tag / time | :white_check_mark: | Core selectors. |
| Read tool spans for cache | :white_check_mark: | |
| Multi-tenant Langfuse projects | :white_check_mark: | Pass |
| Self-hosted Langfuse | :white_check_mark: | Set |
| Write scores back to Langfuse | :x: (planned for v0.3) | Read-only in v0.1. |
| Langfuse “experiments” feature integration | :x: | |
Limitations
- Trace completeness matters. Langfuse traces with sparse spans replay poorly under `use-original`. Watch the Replay validity section of the report.
- Custom span types (non-LLM-call, non-tool-call spans) are passed through to the runner via `trace_input.metadata` but not interpreted by `whatif` itself.
- Sampling rate. If your Langfuse instrumentation samples (e.g., 1% of production), the failure cohort selectors operate on the sampled traces, not all production traffic. Plan cohort `limit` accordingly.
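As a rough illustration of the sampling point, the number of failing traces a selector can draw from scales with the sampling rate. The helper and the traffic figures below are hypothetical, chosen only to show the arithmetic, and they assume uniform sampling.

```python
def expected_sampled_failures(requests, failure_rate, sampling_rate):
    """Expected number of failing traces that reach Langfuse under uniform sampling."""
    return requests * failure_rate * sampling_rate


# Hypothetical numbers: 100,000 requests in the window, 2% fail, 1% are traced.
available = expected_sampled_failures(100_000, 0.02, 0.01)
print(round(available))  # 20
```

With these figures, a `limit:20` would be expected to take every sampled failure in the window, so a larger limit would not enlarge the cohort.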