Your first experiment

This page walks you from a fresh install to a verdict report. Estimated time: 15 minutes (assuming you have a Langfuse project with at least 20 traces).

Note

The flow described here is the v0.1 design. Until v0.1 ships, treat this as the documented contract whatif is being built against.

1. Set credentials

export LANGFUSE_HOST="https://cloud.langfuse.com"   # or your self-hosted URL
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

export ANTHROPIC_API_KEY="sk-ant-..."   # for the Inspect AI judge
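A quick sanity check before running anything can save a confusing failure later. This is a small, hypothetical helper (not part of whatif) that reports which of the variables exported above are unset or empty:

```python
import os

# The four variables exported in step 1.
REQUIRED = [
    "LANGFUSE_HOST",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "ANTHROPIC_API_KEY",
]

def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

Run it in the same shell where you did the exports; an empty list means you are ready for step 2.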

2. Implement the runner contract

Create my_agent/replay.py:

from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput
from anthropic import Anthropic

client = Anthropic()

def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    # Apply the proposed system_prompt change. tool_cache is unused here
    # because this minimal runner makes no tool calls.
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=config.system_prompt or "You are a helpful assistant.",
        messages=[{"role": "user", "content": trace_input.user_message}],
    )
    return ReplayOutput(text=response.content[0].text)

This is the smallest possible runner: no tool use, no multi-step logic. Real agents will be longer; the contract is the same.
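For an agent that does make tool calls, the same contract applies; the tool cache decides whether a call replays the recorded result or executes live. The sketch below uses stand-in types so it runs on its own (the real ones come from whatif.contract), and the get interface on the cache is an assumption, not the confirmed v0.1 API:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCache:
    # Stand-in for whatif.contract.ToolCache: cached results keyed by
    # (tool_name, frozen args). The real interface may differ.
    entries: dict = field(default_factory=dict)

    def get(self, tool_name, args):
        return self.entries.get((tool_name, tuple(sorted(args.items()))))

def call_tool(tool_cache, tool_name, args, live_fn):
    """Replay the original tool result when cached; fall back to a live call."""
    cached = tool_cache.get(tool_name, args)
    if cached is not None:
        return cached          # the --tool-cache use-original path
    return live_fn(**args)     # cache miss: execute the tool for real
```

With --tool-cache use-original, every hit replays the recorded result, which keeps the replay deterministic even when the underlying tools are not.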

3. Prepare the proposed change

Create prompts/v3.txt with the new system prompt you want to test:

You are a senior on-call SRE...
[your improved instructions here]

4. Run the experiment

whatif fork \
    --source langfuse \
    --target "python:my_agent.replay:run" \
    --failures "score-below:0.6,since:24h,limit:20" \
    --baseline "score-above:0.8,since:24h,limit:20" \
    --change "system_prompt=prompts/v3.txt" \
    --tool-cache use-original \
    --score "inspect_ai:faithfulness" \
    --report ./reports/$(date +%F)-prompt-v3.md \
    --json   ./reports/$(date +%F)-prompt-v3.json \
    --fail-on-regression
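The --failures and --baseline selectors are comma-separated key:value clauses. A hypothetical parser, shown only to illustrate how an expression like the one above decomposes (the real CLI may parse these differently):

```python
def parse_cohort_filter(expr: str) -> dict:
    """Split a selector like 'score-below:0.6,since:24h,limit:20' into parts."""
    out = {}
    for clause in expr.split(","):
        # partition on the first colon so values like URLs would survive
        key, _, value = clause.partition(":")
        out[key.strip()] = value.strip()
    return out
```

So the failure cohort above means: traces scored below 0.6, from the last 24 hours, capped at 20.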

5. Read the verdict

Open the Markdown report:

cat ./reports/$(date +%F)-prompt-v3.md

You’ll see the five mandatory sections:

  1. Verdict: Ship | Don't Ship | Inconclusive

  2. Stats: improvement / regression counts per cohort

  3. Replay validity: how many traces actually replayed

  4. Baseline integrity: non-failure regression check

  5. Evidence: top 3 improvements + top 3 regressions with judge rationale

If the verdict is Ship, the exit code is 0. If Don't Ship, the exit code is 1 (this is what blocks the PR check in Path Z). If Inconclusive (setup or replay failure), the exit code is 2.
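That exit-code mapping is what lets CI consume the run directly. As a sketch, assuming the JSON report carries a top-level verdict field (the schema is not confirmed here), the mapping the document describes is:

```python
# Exit codes as documented: 0 = Ship, 1 = Don't Ship, 2 = Inconclusive.
EXIT_CODES = {"Ship": 0, "Don't Ship": 1, "Inconclusive": 2}

def exit_code_for(report: dict) -> int:
    """Map a parsed JSON report to the documented process exit code."""
    # Treat anything unrecognized as Inconclusive (2) to fail safe in CI.
    return EXIT_CODES.get(report.get("verdict"), 2)
```

A PR check only needs the process exit status; the JSON report is there when you want to post the details back to the PR.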

6. Decide

Use the report to decide. Either:

  • Ship: open the PR with the report attached as evidence.

  • Don’t ship: read the regression cases, refine the prompt, run again.

The point of the loop is that you can iterate in minutes rather than hours, with evidence at every step.

Next