Inspect AI¶
Inspect AI is whatif’s v0.1 default scorer. It’s a thoughtfully designed evaluation framework from the UK AI Safety Institute; whatif wraps its scorers rather than reimplementing scoring from scratch.
How whatif uses Inspect AI¶
For each (original_output, replayed_output) pair, whatif:
1. Constructs an Inspect AI `Sample` (input + target).
2. Runs the configured Inspect scorer against the original output, then against the replayed output.
3. Records both scores plus the judge’s rationale.
The diff between the two becomes the per-trace delta in the report.
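In pseudocode, the per-pair flow looks roughly like the sketch below. Everything named `trace` or `score_with_inspect` is a hypothetical stand-in for whatif’s internals; only `Sample` and the `value`/`explanation` fields on the returned scores come from Inspect AI.

```python
# Sketch only -- not whatif's actual implementation.
from inspect_ai.dataset import Sample

def score_pair(trace, score_with_inspect):
    # Build the Inspect AI sample from the trace's input and target.
    sample = Sample(input=trace.input, target=trace.target)

    # Score the original output, then the replayed output, with the same scorer.
    original = score_with_inspect(sample, trace.original_output)
    replayed = score_with_inspect(sample, trace.replayed_output)

    return {
        "original_score": original.value,
        "replayed_score": replayed.value,
        "rationales": (original.explanation, replayed.explanation),
        # The per-trace delta that the report aggregates.
        "delta": replayed.value - original.value,
    }
```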
Configuration¶
--score "inspect_ai:faithfulness"
The colon-separated form is inspect_ai:<task-name>.
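For example, combined with the fork command referenced in the cost section below (whether `--score` is passed directly to `whatif fork` like this is an assumption):

```
whatif fork --score "inspect_ai:faithfulness"
```

The YAML block that follows configures the same scorer type from a config file.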
```yaml
scorer:
  type: inspect_ai
  task: faithfulness_qa
  judge_model: claude-haiku-4-5
```
Built-in scorers (v0.1)¶
| Task | What it measures | When to use |
|---|---|---|
| `faithfulness` | Whether the response is grounded in the provided context (no hallucination). | RAG systems, agents that cite tool outputs. |
| `model_graded_qa` | LLM-as-judge: did the response correctly answer the question? | General-purpose Q&A. |
| `match` | Exact / substring match against a target. | Structured outputs, classification. |
| `f1` | Token-level F1 against a target. | Extraction, summarization. |
Custom Inspect tasks¶
If your team already has Inspect tasks defined, point whatif at them:
```yaml
scorer:
  type: inspect_ai
  task: my_team.tasks.triage_quality  # Python module path
  judge_model: claude-opus-4-7
```
whatif resolves the task from your Python path, so anything Inspect can run, whatif can score against.
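For illustration, a minimal custom task might look like the sketch below, assuming the `my_team.tasks` module path from the config above. The placeholder dataset and grading instructions are invented here, and how whatif substitutes the real trace data is not shown.

```python
# my_team/tasks.py -- illustrative sketch, not part of whatif.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa

@task
def triage_quality():
    return Task(
        # Placeholder sample; in a whatif run the (input, target) pairs
        # come from your traces.
        dataset=[Sample(input="example ticket", target="expected triage")],
        # An LLM-as-judge scorer that grades against the target and
        # produces a rationale by default.
        scorer=model_graded_qa(
            instructions="Grade whether the triage decision matches the target."
        ),
    )
```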
Judge rationale¶
whatif requires the scorer to return both a numeric score and a rationale string. The rationale flows into the Evidence section of the verdict report.
Inspect AI’s model_graded_qa and faithfulness scorers produce rationales by default. If you write a custom scorer, make sure the rationale field is populated; numbers without a rationale are not trustworthy enough to ship from.
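If you do write your own, a minimal sketch of a rationale-bearing Inspect scorer is below. The scorer itself (`keyword_present`) is invented for illustration, and it assumes whatif reads Inspect’s `Score.explanation` field as the rationale.

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def keyword_present():
    async def score(state: TaskState, target: Target) -> Score:
        # Deterministic check: is the target text present in the model output?
        found = target.text.lower() in state.output.completion.lower()
        return Score(
            value=1.0 if found else 0.0,
            # Always populate the explanation; this is what surfaces as the
            # judge rationale in the Evidence section of the verdict report.
            explanation=("found target text in the output"
                         if found else "target text missing from the output"),
        )

    return score
```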
Why Inspect AI rather than X¶
| Alternative | Why not (yet) |
|---|---|
| RAGAS | Will be added in v0.2; covers RAG-specific metrics not in Inspect. |
| Promptfoo | Designed for golden-set CI; less natural fit for trace-driven workflows. |
| Custom in-house scorers | Plugin interface lands in v0.2; current path is a thin Inspect wrapper. |
| Hand-rolled judges | Possible via custom Inspect tasks. |
Cost considerations¶
LLM-as-judge scoring costs LLM tokens. A typical run:
- 40 traces (20 failures + 20 baseline) × 2 scorings (original + replayed) = 80 judge calls.
- With Claude Haiku 4.5 as judge: ~$0.10–0.30 per `whatif fork` invocation.
- With Claude Opus as judge: ~$1–4 per invocation.
Use Haiku for high-frequency CI; reserve Opus for edge cases or release-gate runs.
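As a back-of-the-envelope check (the per-call price is a placeholder, not a published rate; substitute your own token counts and judge pricing):

```python
traces = 40                    # 20 failures + 20 baseline
scorings_per_trace = 2         # original + replayed
cost_per_judge_call = 0.0025   # assumed average for a Haiku-class judge, USD

judge_calls = traces * scorings_per_trace  # 80
print(f"{judge_calls} judge calls ≈ ${judge_calls * cost_per_judge_call:.2f}")  # ≈ $0.20
```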