Your first experiment¶
This page walks you from a fresh install to a verdict report. Two paths are shown: the programmatic API (works fully today against any custom delta_fn) and the whatifd fork CLI (config-file driven; works fully against the synthetic stub adapters and against real Langfuse + Inspect AI when credentials are wired).
The walkthrough below covers the failure_rescue experiment shape (the v0.1 default — a known-bad set of traces plus a proposed fix). For the other shape shipped in v0.2 — regression_check (a known-good baseline plus a candidate change, verdict on whether the change introduces regressions) — see the regression-check walkthrough. The CLI surface, exit codes, and report wire format are identical between the two shapes; only the verdict-policy guard chain and the required cohort set differ.
Path A — Programmatic API (no credentials needed)¶
The fastest way to see a real ReportV01 come out the other end is to drive the pipeline directly with the in-tree synthetic stub source. This runs entirely offline.
from whatifd.adapters.stub import StubTraceSource, StubTraceSpec
from whatifd.adapters.protocols import RawTrace
from whatifd.cache.summary import CachePolicySnapshot, CacheSummary
from whatifd.pipeline import run_pipeline
from whatifd.serialization import encode_report_v01
from whatifd.types.manifest import EnvironmentFingerprint, RunManifest
from whatifd.types.policy import DecisionPolicy, TrustFloor
from whatifd.types.statistical import (
    BootstrapMethodDisclosure,
    EffectSizeDisclosure,
    JudgeMethodDisclosure,
    MethodologyDisclosure,
    MultiplicityDisclosure,
)
# 20 known-bad traces plus 20 baseline traces for the failure_rescue shape.
specs = [
    StubTraceSpec(trace_id=f"f-{i:02d}", user_message=f"failure {i}",
                  original_response=f"resp {i}", cohort="failure")
    for i in range(20)
] + [
    StubTraceSpec(trace_id=f"b-{i:02d}", user_message=f"baseline {i}",
                  original_response=f"resp {i}", cohort="baseline")
    for i in range(20)
]
source = StubTraceSource(specs=specs)
# In a real run, this comes from your runner + scorer.
# Here we make it deterministic so the example is reproducible.
def delta_fn(rt: RawTrace) -> float:
    return 0.4 if rt.cohort == "failure" else 0.05
report = run_pipeline(
    source=source,
    delta_fn=delta_fn,
    floor=TrustFloor(),
    policy=DecisionPolicy(),
    manifest=RunManifest(
        run_id="demo-run",
        whatifd_version="0.1.0",
        env=EnvironmentFingerprint(...),
    ),
    methodology=MethodologyDisclosure(...),
    cache_summary=CacheSummary(
        mode="off",
        profile=CachePolicySnapshot(...),
        hits=0, misses=0, writes=0,
        stale_hits=0, corrupted_entries=0,
        schema_version="v1", key_version="v1",
        models_distribution={},
    ),
)
print(report.verdict_state) # "ship" / "dont_ship" / "inconclusive"
print(encode_report_v01(report)) # JSON
The full reference example with realistic methodology and cache-summary fields is in the library repo at docs/getting-started.md and tests/integration/test_real_adapters.py.
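The constant `delta_fn` above stands in for a real runner plus scorer. As a rough sketch of how the two might combine, assuming the delta is the score of the replayed response minus the score of the original response, the snippet below builds a `delta_fn` from two callables. `Trace`, `make_delta_fn`, `toy_replay`, and `toy_score` are hypothetical names for illustration and are not part of whatifd; `Trace` only mirrors the fields used by `StubTraceSpec` in the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """Stand-in for whatifd's RawTrace; only the fields this sketch reads."""
    trace_id: str
    user_message: str
    original_response: str
    cohort: str


def make_delta_fn(
    replay: Callable[[Trace], str],
    score: Callable[[str, str], float],
) -> Callable[[Trace], float]:
    """Build a delta_fn that returns score(replayed) - score(original)."""
    def delta_fn(trace: Trace) -> float:
        candidate = replay(trace)
        before = score(trace.user_message, trace.original_response)
        after = score(trace.user_message, candidate)
        return after - before
    return delta_fn


# Toy replay and scorer so the sketch runs offline (assumptions, not real adapters).
def toy_replay(trace: Trace) -> str:
    return trace.original_response + " (with citation)"


def toy_score(question: str, answer: str) -> float:
    return 1.0 if "citation" in answer else 0.0


delta_fn = make_delta_fn(toy_replay, toy_score)
example = Trace("f-00", "failure 0", "resp 0", "failure")
print(delta_fn(example))
```

Keeping the replay and scoring steps as separate callables makes it easy to swap either one for a deterministic stub while iterating.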
Path B — whatifd fork CLI (config-file driven)¶
The CLI is config-file driven. Create whatifd.config.yaml:
source:
  adapter: langfuse            # or "stub" for offline testing
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
  baseline_cohort:
    limit: 20
change:
  system_prompt: "You are a senior on-call SRE..."
scorer:
  adapter: inspect_ai          # or "stub" for offline smoke tests
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
decision: {}                   # use defaults
reporting:
  profile: default             # or "minimal" / "review" / "forensic"
timeouts:
  replay_seconds: 30.0
  score_seconds: 30.0
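The `runner` and `score_fn` values use a `python:<module>:<attribute>` reference string. How whatifd resolves these internally is not documented on this page; the sketch below shows one plausible way such a reference could be turned into a callable with the standard library, which can be handy for checking that a reference is importable before a run. `resolve_ref` is a hypothetical helper, not a whatifd function.

```python
import importlib
from typing import Any


def resolve_ref(ref: str) -> Any:
    """Resolve 'python:<module>:<attribute>' to the named object."""
    scheme, _, rest = ref.partition(":")
    module_path, _, attr = rest.rpartition(":")
    if scheme != "python" or not module_path or not attr:
        raise ValueError(f"expected 'python:<module>:<attribute>', got {ref!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Demonstrated with a stdlib target; a project would pass
# "python:my_agent.replay:run" instead.
dumps = resolve_ref("python:json:dumps")
print(dumps({"ok": True}))
```

A misspelled module raises `ModuleNotFoundError` and a misspelled attribute raises `AttributeError`, so either mistake surfaces before any trace is replayed.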
Set credentials when using real adapters:
export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export ANTHROPIC_API_KEY="sk-ant-..." # for the Inspect AI judge
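A missing variable is easier to diagnose before the run than halfway through it. The following is a small pre-flight sketch that reports which of the four variables above are unset or blank. `REQUIRED_VARS` and `missing_credentials` are hypothetical names, and whether whatifd performs a similar check itself is not stated here.

```python
import os
from typing import List, Mapping, Optional

# The variables exported above for the Langfuse source and the Inspect AI judge.
REQUIRED_VARS = (
    "LANGFUSE_HOST",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "ANTHROPIC_API_KEY",
)


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_credentials()
if missing:
    print("Missing credentials: " + ", ".join(missing))
else:
    print("All credentials are set.")
```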
Run the experiment:
whatifd fork --config whatifd.config.yaml
Implement the runner contract¶
Create my_agent/replay.py:
from whatifd.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput
from anthropic import Anthropic
client = Anthropic()
def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    # Replay the original user message under the proposed system prompt.
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=config.system_prompt or "You are a helpful assistant.",
        messages=[{"role": "user", "content": trace_input.user_message}],
    )
    return ReplayOutput(text=response.content[0].text)
This is the smallest possible runner: no tool use and no multi-step flow. Real agents will be longer, but the contract is the same. See Runner contract for richer agent shapes.
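A runner of this shape can be exercised offline by injecting the client instead of creating it at import time. The sketch below assumes that approach. `TraceInput`, `ReplayConfig`, and `ReplayOutput` here are local stand-ins that carry only the fields used in the example above, not the real `whatifd.contract` types, and `FakeClient` and `make_runner` are hypothetical names.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, Optional


# Local stand-ins for the whatifd.contract types (only the fields used above).
@dataclass
class TraceInput:
    user_message: str


@dataclass
class ReplayConfig:
    system_prompt: Optional[str] = None


@dataclass
class ReplayOutput:
    text: str


class FakeMessages:
    """Records each call and returns a response shaped like content[0].text."""
    def __init__(self) -> None:
        self.calls: list = []

    def create(self, **kwargs: Any) -> SimpleNamespace:
        self.calls.append(kwargs)
        return SimpleNamespace(content=[SimpleNamespace(text="stub reply")])


class FakeClient:
    def __init__(self) -> None:
        self.messages = FakeMessages()


def make_runner(client: Any) -> Callable[..., ReplayOutput]:
    """Build a run() function bound to the given client."""
    def run(trace_input: TraceInput, config: ReplayConfig,
            tool_cache: Any = None) -> ReplayOutput:
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1024,
            system=config.system_prompt or "You are a helpful assistant.",
            messages=[{"role": "user", "content": trace_input.user_message}],
        )
        return ReplayOutput(text=response.content[0].text)
    return run


fake = FakeClient()
run = make_runner(fake)
output = run(TraceInput("disk is full"), ReplayConfig("You are a senior on-call SRE..."))
print(output.text)
print(fake.messages.calls[0]["system"])
```

The recorded call arguments make it possible to check that the proposed system prompt actually reaches the model call before spending any API budget.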
Read the verdict¶
whatifd fork writes a JSON report and a Markdown summary to ./reports/:
ls reports/
cat reports/whatifd-fork-*.md
You’ll see the five mandatory sections:
Verdict: Ship | Don't Ship | Inconclusive
Stats: improvement / regression counts per cohort
Replay validity: how many traces actually replayed
Baseline integrity: non-failure regression check
Evidence: top improvements + top regressions with judge rationale
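The JSON report can also be read programmatically. The following sketch loads the most recently modified report from a directory. It assumes the JSON file follows the same `whatifd-fork-*` naming as the Markdown summary and that the verdict is stored under a `verdict_state` key, mirroring the `report.verdict_state` attribute in Path A; both are assumptions, so check the report anatomy for the actual wire format. `latest_report` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional


def latest_report(reports_dir: str = "reports",
                  pattern: str = "whatifd-fork-*.json") -> Optional[dict]:
    """Return the parsed JSON of the newest matching report, or None."""
    candidates = sorted(Path(reports_dir).glob(pattern),
                        key=lambda path: path.stat().st_mtime)
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text(encoding="utf-8"))


report = latest_report()
if report is None:
    print("No report found in ./reports/")
else:
    # "verdict_state" is an assumed key name.
    print(report.get("verdict_state"))
```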
Exit codes¶
| Verdict | Exit code | When this happens |
|---|---|---|
|  |  | Floor passed; no |
|  |  | Floor passed; at least one |
|  |  | Floor failed, or a |
Exit 1 is what blocks the PR check in Path Z.
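In a CI job, the simplest way to let the verdict gate a check is to propagate the process exit code unchanged. The wrapper below is a minimal sketch of that idea using only the standard library. `run_gate` is a hypothetical helper, and the sketch relies only on the convention that a non-zero exit fails the step.

```python
import subprocess
import sys
from typing import Sequence


def run_gate(command: Sequence[str]) -> int:
    """Run the command and return its exit code without raising on failure."""
    completed = subprocess.run(list(command), check=False)
    return completed.returncode


# Demonstrated with a stand-in command that exits with status 1.
# In CI the command would be:
#   ["whatifd", "fork", "--config", "whatifd.config.yaml"]
code = run_gate([sys.executable, "-c", "import sys; sys.exit(1)"])
print(code)
```

A CI script would end with `sys.exit(code)` so that the job status mirrors the exit code of the run.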
Decide¶
Use the report to decide what to do next:
Ship: open the PR with the report attached as evidence.
Don’t ship: read the regression cases, refine the prompt, run again.
Inconclusive: act on the registered fix suggestion (every Inconclusive carries one — see the report anatomy).
The point of the loop is iterating in minutes rather than hours, with evidence at every step.
Next¶
Read the report anatomy to understand what each section is telling you.
Read Runner contract for richer agent shapes.
See the config reference for the full whatifd.config.yaml schema.