# Inspect AI

Inspect AI is whatif’s v0.1 default scorer. It’s a thoughtfully designed evaluation framework from the UK AI Safety Institute; whatif wraps its scorers rather than reimplementing scoring from scratch.

## How whatif uses Inspect AI

For each (`original_output`, `replayed_output`) pair, whatif:

1. Constructs an Inspect AI `Sample` (input + target).
2. Runs the configured Inspect scorer against the original output, then against the replayed output.
3. Records both scores plus the judge’s rationale.
4. Takes the diff between the two scores as the per-trace delta in the report (see the sketch below).
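
A minimal sketch of that loop, under stated assumptions: `Sample` is Inspect AI’s real sample type, but the `Trace` shape and the `score_output` callable are hypothetical stand-ins for whatif internals, not its actual API.

```python
# Illustrative sketch of the per-trace scoring loop. Only Sample comes from
# Inspect AI; Trace and score_output are hypothetical stand-ins for whatif
# internals, not its actual API.
from dataclasses import dataclass
from typing import Callable

from inspect_ai.dataset import Sample


@dataclass
class Trace:
    input: str
    target: str
    original_output: str
    replayed_output: str


def score_pair(
    trace: Trace,
    score_output: Callable[[Sample, str], tuple[float, str]],
) -> dict:
    # Step 1: build the Inspect Sample from the trace's input and target.
    sample = Sample(input=trace.input, target=trace.target)

    # Step 2: run the configured scorer against each output.
    original_score, original_rationale = score_output(sample, trace.original_output)
    replayed_score, replayed_rationale = score_output(sample, trace.replayed_output)

    # Steps 3-4: record both scores and rationales; the delta is replayed - original.
    return {
        "original": {"score": original_score, "rationale": original_rationale},
        "replayed": {"score": replayed_score, "rationale": replayed_rationale},
        "delta": replayed_score - original_score,
    }
```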

## Configuration

```
--score "inspect_ai:faithfulness"
```

The colon-separated form is `inspect_ai:<task-name>`.

```yaml
scorer:
  type: inspect_ai
  task: faithfulness_qa
  judge_model: claude-haiku-4-5
```

## Built-in scorers (v0.1)

| Task | What it measures | When to use |
| --- | --- | --- |
| `faithfulness` | Whether the response is grounded in the provided context (no hallucination). | RAG systems, agents that cite tool outputs. |
| `model_graded_qa` | LLM-as-judge: did the response correctly answer the question? | General-purpose Q&A. |
| `match` | Exact / substring match against a target. | Structured outputs, classification. |
| `f1` | Token-level F1 against a target. | Extraction, summarization. |

## Custom Inspect tasks

If your team already has Inspect tasks defined, point whatif at them:

```yaml
scorer:
  type: inspect_ai
  task: my_team.tasks.triage_quality   # Python module path
  judge_model: claude-opus-4-7
```

whatif resolves the task from your Python path, so anything Inspect can run, whatif can score against.
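
For illustration, the module behind that path could look roughly like the sketch below. The task name matches the config above, but the dataset sample and scorer choice are placeholder assumptions, not a prescribed layout.

```python
# Hypothetical my_team/tasks.py. The dataset sample and scorer are placeholders;
# only the @task function name needs to match the configured module path.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate


@task
def triage_quality():
    return Task(
        dataset=[
            Sample(
                input="Customer reports a duplicate charge on their invoice.",
                target="Route to the billing team with priority 'high'.",
            )
        ],
        solver=generate(),
        scorer=model_graded_qa(),
    )
```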

## Judge rationale

whatif requires the scorer to return both a numeric score and a rationale string. The rationale flows into the Evidence section of the verdict report.

Inspect AI’s `model_graded_qa` and `faithfulness` scorers produce rationales by default. If you write a custom scorer, make sure the rationale is populated: a score without a rationale is not trustworthy enough to base a ship decision on.
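
For instance, a minimal custom Inspect scorer that always fills in an explanation might look like this. The keyword check is a toy heuristic; the point is returning `Score(value=..., explanation=...)`, which is presumably where whatif picks up the rationale.

```python
# Toy custom Inspect scorer that always populates an explanation alongside
# the numeric value. The citation check itself is just illustrative.
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def cites_source():
    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion
        cited = "source:" in answer.lower()
        return Score(
            value=1.0 if cited else 0.0,
            explanation=(
                "Response cites a source."
                if cited
                else "No 'source:' marker found in the response."
            ),
        )

    return score
```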

## Why Inspect AI rather than X

| Alternative | Why not (yet) |
| --- | --- |
| RAGAS | Will be added in v0.2; covers RAG-specific metrics not in Inspect. |
| Promptfoo | Designed for golden-set CI; less natural fit for trace-driven workflows. |
| Custom in-house scorers | Plugin interface lands in v0.2; current path is a thin Inspect wrapper. |
| Hand-rolled judges | Possible via custom Inspect tasks. |

## Cost considerations

LLM-as-judge scoring costs LLM tokens. A typical run:

- 40 traces (20 failures + 20 baseline) × 2 scorings (original + replayed) = 80 judge calls.
- With Claude Haiku 4.5 as judge: ~$0.10–0.30 per `whatif fork` invocation.
- With Claude Opus as judge: ~$1–4 per invocation.

Use Haiku for high-frequency CI; reserve Opus for edge cases or release-gate runs.
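
To project judge spend before a run, a rough back-of-the-envelope helper like the one below can be kept next to your CI config. The per-call token count and per-token price are placeholder assumptions, not measured whatif or provider figures; substitute your own numbers.

```python
# Back-of-the-envelope judge-cost estimator. Token counts and prices are
# placeholder assumptions; replace them with your observed values.
def estimate_judge_cost(
    traces: int = 40,                      # e.g. 20 failures + 20 baseline
    scorings_per_trace: int = 2,           # original + replayed
    tokens_per_call: int = 2_000,          # assumed prompt + completion per judge call
    usd_per_million_tokens: float = 1.0,   # assumed blended judge price
) -> float:
    calls = traces * scorings_per_trace    # 40 x 2 = 80 judge calls
    total_tokens = calls * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    print(f"Estimated judge cost per run: ${estimate_judge_cost():.2f}")
```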