Report anatomy¶
A whatif verdict report is the unit of trust. Every report has five mandatory sections in this order. A report missing any section is a bug in whatif, not a trade-off.
1. Verdict¶
One line:
Verdict: Ship | Don't Ship | Inconclusive
Derived from your declared decision policy in the config (regression threshold, minimum replay validity, baseline coverage requirement).
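The derivation can be sketched as a pure function from run stats plus the declared policy to a verdict. This is a minimal illustration; the field and threshold names (`replay_validity`, `max_regressed_fraction`, etc.) are assumptions for the sketch, not whatif's actual schema.

```python
def derive_verdict(stats: dict, policy: dict) -> str:
    """Apply a declared decision policy to one run's stats.

    Field names here are illustrative, not whatif's real schema.
    """
    # Too few traces replayed: the numbers are not meaningful.
    if stats["replay_validity"] < policy["min_replay_validity"]:
        return "Inconclusive"
    # Baseline cohort must meet the declared coverage requirement.
    if stats["baseline_coverage"] < policy["min_baseline_coverage"]:
        return "Inconclusive"
    # Regressions above the declared threshold block the ship.
    if stats["regressed_fraction"] > policy["max_regressed_fraction"]:
        return "Don't Ship"
    return "Ship"
```

The point of making this a pure function of config-declared thresholds is that the verdict line is reproducible from the stats section alone; nothing about it is editorial.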
2. Stats¶
Improved / unchanged / regressed counts, median deltas, and bootstrap confidence intervals, broken out by cohort (failures vs. baseline). Example:
Failures (20): improved 14 unchanged 4 regressed 2 median Δ +0.31 CI [+0.18, +0.44]
Baseline (20): improved 3 unchanged 16 regressed 1 median Δ +0.02 CI [-0.01, +0.05]
Per-cohort breakdown is non-negotiable. Aggregate-only stats hide the silent-regression failure mode.
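A percentile bootstrap over one cohort's score deltas is enough to reproduce the median-and-CI line above. A minimal sketch, assuming deltas are plain floats per trace (the sample data below is invented for illustration):

```python
import random
import statistics

def bootstrap_median_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a cohort's median delta."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    medians = sorted(
        statistics.median(rng.choices(deltas, k=len(deltas)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = medians[int(n_resamples * alpha / 2)]
    hi = medians[int(n_resamples * (1 - alpha / 2)) - 1]
    return statistics.median(deltas), (lo, hi)

# Invented per-trace score deltas for a failures cohort.
failure_deltas = [0.31, 0.40, 0.22, 0.50, -0.10, 0.35, 0.28, 0.00, 0.45, 0.30]
median, (lo, hi) = bootstrap_median_ci(failure_deltas)
```

Run per cohort, never on the pooled traces: pooling is exactly how the silent-regression failure mode hides.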
3. Replay validity¶
How many traces could actually be replayed, and why any were skipped:
Replayed: 17/20 failures, 18/20 baseline
Skipped: 3 failures (2 missing tool outputs in cache, 1 schema mismatch)
2 baseline (live-only tool not allowlisted)
Without this section, users can't trust the failure numbers: they don't know whether a “0% improvement” is a real result or just a replay miss.
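The section above is simple arithmetic plus a reason breakdown. A sketch, using the failures-cohort numbers from the example (the skip-reason strings are paraphrased, not a fixed schema):

```python
from collections import Counter

# Failures cohort from the example report: 17 of 20 traces replayed.
replayed, attempted = 17, 20
skip_reasons = [
    "missing tool output in cache",
    "missing tool output in cache",
    "schema mismatch",
]

validity = replayed / attempted           # 0.85
summary = Counter(skip_reasons)           # counts per skip reason
print(f"Replayed: {replayed}/{attempted} ({validity:.0%})")
```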
4. Baseline integrity¶
If baseline cohort ran: improvement / regression rates broken out separately so silent regression on previously-good traces is visible.
If baseline did not run (selection.mode: failures_only):
Danger
Baseline integrity: NOT TESTED. This run only evaluated known failures. The proposed change may regress previously-successful traces. Verdict confidence: limited.
The warning is structural: if it is missing when the baseline cohort didn't run, the report itself is a bug.
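"Structural" means the section is computed, not composed: the same function emits either the green line or the NOT TESTED warning, so the warning cannot be forgotten. A hedged sketch; the `baseline` / `regressed` field names are assumptions for illustration:

```python
def baseline_integrity_line(report: dict) -> str:
    """Render section 4. Field names are illustrative, not whatif's schema."""
    baseline = report.get("baseline")
    if baseline is None:
        # failures_only run: the warning is the only valid output.
        return "Baseline integrity: NOT TESTED (failures_only run)"
    if baseline["regressed"] > 0:
        return (f"Baseline integrity: {baseline['regressed']} "
                "previously-good trace(s) regressed")
    return "Baseline integrity: green"
```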
5. Evidence¶
Top three representative improvements (trace IDs, before/after snippets, score deltas, and the judge's rationale for why the scorer marked each one improved) and top three representative regressions (the same fields, with rationale for the regression), drawn from both cohorts, plus links back to the source traces for full context.
Tip
Numbers without rationale are not trustworthy enough to ship from. The judge rationale is what makes the report reviewable. A reviewer should be able to read three improved cases and three regressed cases and form an informed opinion in under five minutes.
Why this shape¶
The report is designed for a single use:
A reviewer opens a PR. The CI check has run whatif. The reviewer reads the verdict, scans the stats and replay validity, confirms baseline integrity is green, reads the six evidence cases, and forms an opinion.
Five minutes. Defensible decision. The shape exists to make that interaction reliable.
Output formats¶
whatif fork produces both:
Markdown: human-readable, designed to be pasted into a PR comment or read in an editor.
JSON: machine-readable, designed to be parsed by CI logic, dashboards, or downstream automation.
The JSON output is stable across patch versions; breaking changes require a minor version bump and a CHANGELOG note.
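A typical consumer of the JSON output is a CI gate that turns the verdict into an exit code. A minimal sketch; the top-level `verdict` field name is an assumption about the report schema, not a documented contract:

```python
import json

def ci_gate(report_json: str) -> int:
    """Map a whatif JSON report to a process exit code.

    Assumes a top-level "verdict" field; adjust to the real schema.
    """
    report = json.loads(report_json)
    verdict = report["verdict"]
    if verdict != "Ship":
        print(f"whatif verdict: {verdict}; blocking merge")
        return 1
    return 0
```

In CI you would read the report file, call `ci_gate`, and pass the result to `sys.exit`; because the JSON is stable across patch versions, the gate only needs revisiting on a minor version bump.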
What the report is not¶
Not a safety certification.
whatif enforces your declared decision policy. It does not certify the absence of bugs, harms, or regressions outside the dimensions you scored.
Not a replacement for human review.
It's input to human review. The five-section shape exists to make that review fast and grounded.
Not a substitute for monitoring. Production drift is a different problem; for it, see the Path Z trajectory and your existing observability stack.