Your first experiment

This page walks you from a fresh install to a verdict report along two paths. The programmatic API works fully today against any custom delta_fn. The whatifd fork CLI is config-file driven; it works fully against the synthetic stub adapters and, once credentials are wired, against real Langfuse and Inspect AI.

The walkthrough below covers the failure_rescue experiment shape (the v0.1 default — a known-bad set of traces plus a proposed fix). For the other shape shipped in v0.2 — regression_check (a known-good baseline plus a candidate change, verdict on whether the change introduces regressions) — see the regression-check walkthrough. The CLI surface, exit codes, and report wire format are identical between the two shapes; only the verdict-policy guard chain and the required cohort set differ.

Path A — Programmatic API (no credentials needed)

The fastest way to see a real ReportV01 come out the other end is to drive the pipeline directly with the in-tree synthetic stub source. This runs entirely offline.

from whatifd.adapters.stub import StubTraceSource, StubTraceSpec
from whatifd.adapters.protocols import RawTrace
from whatifd.cache.summary import CachePolicySnapshot, CacheSummary
from whatifd.pipeline import run_pipeline
from whatifd.serialization import encode_report_v01
from whatifd.types.manifest import EnvironmentFingerprint, RunManifest
from whatifd.types.policy import DecisionPolicy, TrustFloor
from whatifd.types.statistical import (
    BootstrapMethodDisclosure,
    EffectSizeDisclosure,
    JudgeMethodDisclosure,
    MethodologyDisclosure,
    MultiplicityDisclosure,
)

specs = [
    StubTraceSpec(trace_id=f"f-{i:02d}", user_message=f"failure {i}",
                  original_response=f"resp {i}", cohort="failure")
    for i in range(20)
] + [
    StubTraceSpec(trace_id=f"b-{i:02d}", user_message=f"baseline {i}",
                  original_response=f"resp {i}", cohort="baseline")
    for i in range(20)
]
source = StubTraceSource(specs=specs)

# In a real run, this comes from your runner + scorer.
# Here we make it deterministic so the example is reproducible.
def delta_fn(rt: RawTrace) -> float:
    return 0.4 if rt.cohort == "failure" else 0.05

report = run_pipeline(
    source=source,
    delta_fn=delta_fn,
    floor=TrustFloor(),
    policy=DecisionPolicy(),
    manifest=RunManifest(
        run_id="demo-run",
        whatifd_version="0.1.0",
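        env=EnvironmentFingerprint(...),       # fields elided; see the reference example
    ),
    methodology=MethodologyDisclosure(...),    # fields elided; see the reference example
    cache_summary=CacheSummary(
        mode="off",
        profile=CachePolicySnapshot(...),      # fields elided; see the reference example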
        hits=0, misses=0, writes=0,
        stale_hits=0, corrupted_entries=0,
        schema_version="v1", key_version="v1",
        models_distribution={},
    ),
)
print(report.verdict_state)        # "ship" / "dont_ship" / "inconclusive"
print(encode_report_v01(report))   # JSON

The full reference example with realistic methodology and cache-summary fields is in the library repo at docs/getting-started.md and tests/integration/test_real_adapters.py.
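
The delta_fn in the example above is hard-coded. As a rough sketch of what "comes from your runner + scorer" could look like, a per-trace delta can be read as the score of the replayed response minus the score of the original response. The names make_delta_fn, replay, and score are hypothetical and not part of whatifd, and the SimpleNamespace trace is a stand-in that borrows the user_message and original_response field names from StubTraceSpec; whether RawTrace exposes the same fields is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Callable


def make_delta_fn(
    replay: Callable[[str], str],
    score: Callable[[str], float],
) -> Callable[[Any], float]:
    """Build a delta_fn: score(replayed response) - score(original response)."""

    def delta_fn(trace: Any) -> float:
        new_response = replay(trace.user_message)
        return score(new_response) - score(trace.original_response)

    return delta_fn


# Toy stand-ins so the sketch runs offline (hypothetical, not whatifd APIs).
def toy_replay(user_message: str) -> str:
    return f"fixed: {user_message}"


def toy_score(response: str) -> float:
    return 1.0 if response.startswith("fixed:") else 0.0


toy_delta_fn = make_delta_fn(toy_replay, toy_score)
trace = SimpleNamespace(user_message="failure 0", original_response="resp 0")
print(toy_delta_fn(trace))  # 1.0
```

A positive delta means the replayed response scored higher than the original; the same callable shape can be passed as delta_fn to run_pipeline.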

Path B — whatifd fork CLI (config-file driven)

Create whatifd.config.yaml:

source:
  adapter: langfuse        # or "stub" for offline testing
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
  baseline_cohort:
    limit: 20
change:
  system_prompt: "You are a senior on-call SRE..."
scorer:
  adapter: inspect_ai      # or "stub" for offline smoke tests
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
decision: {}               # use defaults
reporting:
  profile: default         # or "minimal" / "review" / "forensic"
timeouts:
  replay_seconds: 30.0
  score_seconds: 30.0
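
The runner and score_fn values in the config use a "python:<module>:<attribute>" reference. How whatifd resolves these internally is not shown here; the sketch below is one plausible reading of that format using only the standard library, and resolve_python_ref is a hypothetical helper name. It can be handy for checking that a reference is importable before starting a run.

```python
import importlib
from typing import Any


def resolve_python_ref(ref: str) -> Any:
    """Resolve 'python:<module>:<attribute>' to the object it names."""
    scheme, sep, rest = ref.partition(":")
    if scheme != "python" or not sep:
        raise ValueError(f"expected 'python:<module>:<attr>', got {ref!r}")
    module_name, sep, attr = rest.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'python:<module>:<attr>', got {ref!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated with a standard-library target so the sketch runs anywhere.
dumps = resolve_python_ref("python:json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

With the config above, resolve_python_ref("python:my_agent.replay:run") would only succeed if my_agent is importable from the directory where the CLI is invoked.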

Set credentials when using real adapters:

export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export ANTHROPIC_API_KEY="sk-ant-..."   # for the Inspect AI judge
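
Before launching a run against the real adapters, a small preflight check can confirm that the four variables above are present. This is an optional sketch, not a whatifd feature; missing_credentials is a hypothetical helper, and the list of names is taken directly from the exports above.

```python
import os
from typing import Mapping

# Variable names copied from the export lines above.
REQUIRED_ENV_VARS = (
    "LANGFUSE_HOST",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "ANTHROPIC_API_KEY",
)


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```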

Run the experiment:

whatifd fork --config whatifd.config.yaml

Implement the runner contract

Create my_agent/replay.py:

from whatifd.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput
from anthropic import Anthropic

client = Anthropic()

def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=config.system_prompt or "You are a helpful assistant.",
        messages=[{"role": "user", "content": trace_input.user_message}],
    )
    return ReplayOutput(text=response.content[0].text)

This is the smallest possible runner — no tool use, no multi-step. Real agents will be longer; the contract is the same. See Runner contract for richer agent shapes.
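
To smoke-test the runner shape without an API key, a deterministic runner that calls no model can be useful. The sketch below is self-contained: the three dataclasses are hypothetical stand-ins for the whatifd.contract types, reduced to the fields the example above uses (user_message, system_prompt, text). In a real runner, import the real types instead; their full field sets are not shown on this page.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical stand-ins for whatifd.contract types (assumed fields only).
@dataclass(frozen=True)
class TraceInput:
    user_message: str


@dataclass(frozen=True)
class ReplayConfig:
    system_prompt: Optional[str] = None


@dataclass(frozen=True)
class ReplayOutput:
    text: str


def run(trace_input: TraceInput, config: ReplayConfig, tool_cache: Any) -> ReplayOutput:
    """Deterministic runner: echoes the prompt and message, calls no model."""
    system = config.system_prompt or "You are a helpful assistant."
    return ReplayOutput(text=f"[{system}] {trace_input.user_message}")


out = run(
    TraceInput(user_message="disk is full"),
    ReplayConfig(system_prompt="You are a senior on-call SRE..."),
    tool_cache=None,  # unused in this sketch
)
print(out.text)  # [You are a senior on-call SRE...] disk is full
```

Because the output depends only on the inputs, repeated runs produce identical responses, which makes it easier to tell runner problems apart from model variance.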

Read the verdict

whatifd fork writes a JSON report and a Markdown summary to ./reports/:

ls reports/
cat reports/whatifd-fork-*.md
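
The JSON report can also be read from a script. The sketch below makes two assumptions that this page does not confirm: that the JSON file follows the same whatifd-fork-* naming as the Markdown summary, and that the top-level object carries a verdict_state key mirroring the report.verdict_state attribute from Path A. latest_report is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional


def latest_report(reports_dir: str = "reports") -> Optional[dict]:
    """Return the most recently modified JSON report as a dict, or None."""
    candidates = sorted(
        Path(reports_dir).glob("whatifd-fork-*.json"),  # assumed file pattern
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text(encoding="utf-8"))


report = latest_report()
if report is None:
    print("No JSON report found in ./reports/")
else:
    # "verdict_state" as a JSON key is an assumption; check the report anatomy.
    print(report.get("verdict_state", "<no verdict_state key>"))
```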

You’ll see the five mandatory sections:

  1. Verdict: Ship | Don't Ship | Inconclusive
  2. Stats: improvement / regression counts per cohort
  3. Replay validity: how many traces actually replayed
  4. Baseline integrity: non-failure regression check
  5. Evidence: top improvements + top regressions with judge rationale

Exit codes

| Verdict      | Exit code | When this happens |
|--------------|-----------|-------------------|
| Ship         | 0         | Floor passed; no blocks_ship finding |
| Don't Ship   | 1         | Floor passed; at least one blocks_ship finding (the regression case) |
| Inconclusive | 2         | Floor failed, or a blocks_all finding fired, or setup failure (bad config, unknown adapter, runner load error) |

Exit 1 is what blocks the PR check when whatifd fork runs as part of a pull-request check.
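
A wrapper script can branch on these exit codes. The sketch below maps each documented code to the verdict strings shown in Path A; run_fork and verdict_from_exit_code are hypothetical helper names, and the only CLI invocation used is the whatifd fork --config form shown earlier on this page.

```python
import subprocess

# Mapping taken from the exit-code table above.
EXIT_CODE_TO_VERDICT = {0: "ship", 1: "dont_ship", 2: "inconclusive"}


def verdict_from_exit_code(code: int) -> str:
    """Translate a whatifd fork exit code into a verdict label."""
    return EXIT_CODE_TO_VERDICT.get(code, "unknown")


def run_fork(config_path: str = "whatifd.config.yaml") -> str:
    """Run the CLI and return the verdict label for its exit code."""
    completed = subprocess.run(
        ["whatifd", "fork", "--config", config_path],
        check=False,  # non-zero exits are expected outcomes, not errors
    )
    return verdict_from_exit_code(completed.returncode)


print(verdict_from_exit_code(1))  # dont_ship
```

Calling run_fork() requires whatifd on the PATH. Note that exit code 2 also covers setup failures, so an "inconclusive" label from this mapping may mean the run never started; the report or the CLI output is needed to tell the two apart.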

Decide

Use the report to decide what to do next:

  • Ship: open the PR with the report attached as evidence.

  • Don’t ship: read the regression cases, refine the prompt, run again.

  • Inconclusive: act on the registered fix suggestion (every Inconclusive carries one — see the report anatomy).

The point of the loop is iterating in minutes rather than hours, with evidence at every step.

Next