Your first experiment

This page walks you from a fresh install to a verdict report. Estimated time: 15 minutes (assuming you have a Langfuse project with at least 20 traces).

Note

The flow described here is the v0.1 design. Until v0.1 ships, treat this as the documented contract whatif is being built against.

1. Set credentials

export LANGFUSE_HOST="https://cloud.langfuse.com"   # or your self-hosted URL
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

export ANTHROPIC_API_KEY="sk-ant-..."   # for the Inspect AI judge
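A quick sanity check before running anything can save a confusing failure later. This is a small, hypothetical helper (not part of whatif) that reports which of the variables exported above are unset or empty:

```python
import os

# The four variables exported in step 1.
REQUIRED = [
    "LANGFUSE_HOST",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "ANTHROPIC_API_KEY",
]

def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

Run it in the same shell where you did the exports; an empty list means you are ready for step 2.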

2. Implement the runner contract

Create my_agent/replay.py:

from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput
from anthropic import Anthropic

client = Anthropic()

def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    # Apply the proposed system_prompt change. tool_cache is unused here
    # because this minimal runner makes no tool calls.
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=config.system_prompt or "You are a helpful assistant.",
        messages=[{"role": "user", "content": trace_input.user_message}],
    )
    return ReplayOutput(text=response.content[0].text)

This is the smallest possible runner: no tool use, no multi-step logic. Real agents will be longer; the contract is the same.
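For an agent that does make tool calls, the same contract applies; the tool cache decides whether a call replays the recorded result or executes live. The sketch below uses stand-in types so it runs on its own (the real ones come from whatif.contract), and the get interface on the cache is an assumption, not the confirmed v0.1 API:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCache:
    # Stand-in for whatif.contract.ToolCache: cached results keyed by
    # (tool_name, frozen args). The real interface may differ.
    entries: dict = field(default_factory=dict)

    def get(self, tool_name, args):
        return self.entries.get((tool_name, tuple(sorted(args.items()))))

def call_tool(tool_cache, tool_name, args, live_fn):
    """Replay the original tool result when cached; fall back to a live call."""
    cached = tool_cache.get(tool_name, args)
    if cached is not None:
        return cached          # the --tool-cache use-original path
    return live_fn(**args)     # cache miss: execute the tool for real
```

With --tool-cache use-original, every hit replays the recorded result, which keeps the replay deterministic even when the underlying tools are not.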

3. Prepare the proposed change

Create prompts/v3.txt with the new system prompt you want to test:

You are a senior on-call SRE...
[your improved instructions here]

4. Run the experiment

whatif fork \
    --source langfuse \
    --target "python:my_agent.replay:run" \
    --failures "score-below:0.6,since:24h,limit:20" \
    --baseline "score-above:0.8,since:24h,limit:20" \
    --change "system_prompt=prompts/v3.txt" \
    --tool-cache use-original \
    --score "inspect_ai:faithfulness" \
    --report ./reports/$(date +%F)-prompt-v3.md \
    --json   ./reports/$(date +%F)-prompt-v3.json \
    --fail-on-regression
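The --failures and --baseline selectors are comma-separated key:value clauses. A hypothetical parser, shown only to illustrate how an expression like the one above decomposes (the real CLI may parse these differently):

```python
def parse_cohort_filter(expr: str) -> dict:
    """Split a selector like 'score-below:0.6,since:24h,limit:20' into parts."""
    out = {}
    for clause in expr.split(","):
        # partition on the first colon so values like URLs would survive
        key, _, value = clause.partition(":")
        out[key.strip()] = value.strip()
    return out
```

So the failure cohort above means: traces scored below 0.6, from the last 24 hours, capped at 20.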

5. Read the verdict

Open the Markdown report:

cat ./reports/$(date +%F)-prompt-v3.md

You’ll see the five mandatory sections:

  1. Verdict: Ship | Don't Ship | Inconclusive

  2. Stats: improvement / regression counts per cohort

  3. Replay validity: how many traces actually replayed

  4. Baseline integrity: non-failure regression check

  5. Evidence: top 3 improvements + top 3 regressions with judge rationale

If the verdict is Ship, the exit code is 0. If Don't Ship, the exit code is 1 (this is what blocks the PR check in Path Z). If Inconclusive (setup or replay failure), the exit code is 2.
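That exit-code mapping is what lets CI consume the run directly. As a sketch, assuming the JSON report carries a top-level verdict field (the schema is not confirmed here), the mapping the document describes is:

```python
# Exit codes as documented: 0 = Ship, 1 = Don't Ship, 2 = Inconclusive.
EXIT_CODES = {"Ship": 0, "Don't Ship": 1, "Inconclusive": 2}

def exit_code_for(report: dict) -> int:
    """Map a parsed JSON report to the documented process exit code."""
    # Treat anything unrecognized as Inconclusive (2) to fail safe in CI.
    return EXIT_CODES.get(report.get("verdict"), 2)
```

A PR check only needs the process exit status; the JSON report is there when you want to post the details back to the PR.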

6. Decide

Use the report to decide. Either:

  • Ship: open the PR with the report attached as evidence.

  • Don’t ship: read the regression cases, refine the prompt, run again.

The point of the loop is that you can iterate in minutes rather than hours, with evidence at every step.

Next