Inspect AI¶
Inspect AI is whatif’s v0.1 default scorer. It’s a thoughtfully designed evaluation framework from the UK AI Safety Institute; whatif wraps its scorers rather than reimplementing scoring from scratch.
How whatif uses Inspect AI¶
For each (original_output, replayed_output) pair, whatif:
1. Constructs an Inspect AI `Sample` (input + target).
2. Runs the configured Inspect scorer against the original output, then against the replayed output.
3. Records both scores plus the judge’s rationale.
The diff between the two becomes the per-trace delta in the report.
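In pseudocode, the per-pair flow looks roughly like the sketch below. Everything named `trace` or `score_with_inspect` is a hypothetical stand-in for whatif’s internals; only `Sample` and the `value`/`explanation` fields on the returned scores come from Inspect AI.

```python
# Sketch only -- not whatif's actual implementation.
from inspect_ai.dataset import Sample

def score_pair(trace, score_with_inspect):
    # Build the Inspect AI sample from the trace's input and target.
    sample = Sample(input=trace.input, target=trace.target)

    # Score the original output, then the replayed output, with the same scorer.
    original = score_with_inspect(sample, trace.original_output)
    replayed = score_with_inspect(sample, trace.replayed_output)

    return {
        "original_score": original.value,
        "replayed_score": replayed.value,
        "rationales": (original.explanation, replayed.explanation),
        # The per-trace delta that the report aggregates.
        "delta": replayed.value - original.value,
    }
```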
Configuration¶
--score "inspect_ai:faithfulness"
The colon-separated form is inspect_ai:<task-name>.
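For example, combined with the fork command referenced in the cost section below (whether `--score` is passed directly to `whatif fork` like this is an assumption):

```
whatif fork --score "inspect_ai:faithfulness"
```

The YAML block that follows configures the same scorer type from a config file.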
```yaml
scorer:
  type: inspect_ai
  task: faithfulness_qa
  judge_model: claude-haiku-4-5
```
Built-in scorers (v0.1)¶
| Task | What it measures | When to use |
|---|---|---|
| `faithfulness` | Whether the response is grounded in the provided context (no hallucination). | RAG systems, agents that cite tool outputs. |
| `model_graded_qa` | LLM-as-judge: did the response correctly answer the question? | General-purpose Q&A. |
| `match` | Exact / substring match against a target. | Structured outputs, classification. |
| `f1` | Token-level F1 against a target. | Extraction, summarization. |
Custom Inspect tasks¶
If your team already has Inspect tasks defined, point whatif at them:
```yaml
scorer:
  type: inspect_ai
  task: my_team.tasks.triage_quality  # Python module path
  judge_model: claude-opus-4-7
```
whatif resolves the task from your Python path, so anything Inspect can run, whatif can score against.
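For illustration, a minimal custom task might look like the sketch below, assuming the `my_team.tasks` module path from the config above. The placeholder dataset and grading instructions are invented here, and how whatif substitutes the real trace data is not shown.

```python
# my_team/tasks.py -- illustrative sketch, not part of whatif.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa

@task
def triage_quality():
    return Task(
        # Placeholder sample; in a whatif run the (input, target) pairs
        # come from your traces.
        dataset=[Sample(input="example ticket", target="expected triage")],
        # An LLM-as-judge scorer that grades against the target and
        # produces a rationale by default.
        scorer=model_graded_qa(
            instructions="Grade whether the triage decision matches the target."
        ),
    )
```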
Judge rationale¶
whatif requires the scorer to return both a numeric score and a rationale string. The rationale flows into the Evidence section of the verdict report.
Inspect AI’s model_graded_qa and faithfulness scorers produce rationales by default. If you write a custom scorer, make sure the rationale field is populated; numbers without a rationale are not trustworthy enough to ship from.
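If you do write your own, a minimal sketch of a rationale-bearing Inspect scorer is below. The scorer itself (`keyword_present`) is invented for illustration, and it assumes whatif reads Inspect’s `Score.explanation` field as the rationale.

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def keyword_present():
    async def score(state: TaskState, target: Target) -> Score:
        # Deterministic check: is the target text present in the model output?
        found = target.text.lower() in state.output.completion.lower()
        return Score(
            value=1.0 if found else 0.0,
            # Always populate the explanation; this is what surfaces as the
            # judge rationale in the Evidence section of the verdict report.
            explanation=("found target text in the output"
                         if found else "target text missing from the output"),
        )

    return score
```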
Why Inspect AI rather than X¶
| Alternative | Why not (yet) |
|---|---|
| RAGAS | Will be added in v0.2; covers RAG-specific metrics not in Inspect. |
| Promptfoo | Designed for golden-set CI; less natural fit for trace-driven workflows. |
| Custom in-house scorers | Plugin interface lands in v0.2; current path is a thin Inspect wrapper. |
| Hand-rolled judges | Possible via custom Inspect tasks. |
Cost considerations¶
LLM-as-judge scoring costs LLM tokens. A typical run:
- 40 traces (20 failures + 20 baseline) × 2 scorings (original + replayed) = 80 judge calls.
- With Claude Haiku 4.5 as judge: ~$0.10–0.30 per `whatif fork` invocation.
- With Claude Opus as judge: ~$1–4 per invocation.
Use Haiku for high-frequency CI; reserve Opus for edge cases or release-gate runs.
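As a back-of-the-envelope check (the per-call price is a placeholder, not a published rate; substitute your own token counts and judge pricing):

```python
traces = 40                    # 20 failures + 20 baseline
scorings_per_trace = 2         # original + replayed
cost_per_judge_call = 0.0025   # assumed average for a Haiku-class judge, USD

judge_calls = traces * scorings_per_trace  # 80
print(f"{judge_calls} judge calls ≈ ${judge_calls * cost_per_judge_call:.2f}")  # ≈ $0.20
```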