Config reference

whatif.config.yaml is the version-controlled, GitOps-friendly way to declare an experiment. The file was introduced in v0.2; in v0.1 the same fields are passed as CLI flags.

Full example

# whatif.config.yaml
source:
  type: langfuse
  project: incident-triage-prod
  endpoint: ${LANGFUSE_HOST}

target: "python:my_agent.replay:run"

selection:
  mode: failures_plus_baseline      # or: failures_only
  failures:
    filter: "score-below:0.6"
    since: "24h"
    limit: 50
  baseline:
    filter: "score-above:0.8"
    since: "24h"
    limit: 50
    sample: random                  # subsample at random when matches exceed limit
    seed: 42                        # fixed seed keeps the sample reproducible in CI

scorer:
  type: inspect_ai
  task: faithfulness_qa
  judge_model: claude-haiku-4-5

cache:
  policy: use-original              # v0.3: live | mock
  live_allowlist: []                # populated only when policy=live

decision:
  fail_on_regression: true
  regression_threshold: 0.1         # median score drop > 0.1 = fail
  min_replay_validity: 0.7          # 70%+ traces must replay cleanly per cohort
  min_baseline_coverage: 1          # require baseline cohort or fail

source

Where to ingest traces from.

Field     Type                                                                     Notes
--------  -----------------------------------------------------------------------  -----
type      langfuse (v0.1) | phoenix (v0.2) | langsmith (v0.2) | otel-genai (v0.2)  The tracer adapter.
project   string                                                                   Tracer-specific project / workspace identifier.
endpoint  URL                                                                      Optional override for self-hosted instances.

Environment variables (${VAR}) are interpolated at load time.

target

The user-supplied runner. Must match the runner contract.

Field   Type                           Notes
------  -----------------------------  -----
target  "python:<module.path>:<attr>"  Module-path syntax; whatif imports the module and resolves <attr> at runtime.
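As a rough illustration of what resolving a "python:<module.path>:<attr>" string involves, here is a minimal sketch built on the standard library's importlib. The function name resolve_target is hypothetical and is not whatif's actual loader; the example resolves json.dumps only as a stand-in for a real runner such as my_agent.replay:run.

```python
import importlib


def resolve_target(spec: str):
    """Resolve a 'python:<module.path>:<attr>' string to the object it names.

    Hypothetical helper for illustration; whatif's real loader may differ.
    """
    scheme, _, rest = spec.partition(":")
    module_path, _, attr = rest.rpartition(":")
    if scheme != "python" or not module_path or not attr:
        raise ValueError(f"expected 'python:<module.path>:<attr>', got {spec!r}")
    # Import the module first, then look the attribute up on it at runtime.
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Stand-in target: json.dumps plays the role of a user-supplied runner.
runner = resolve_target("python:json:dumps")
print(runner({"ok": True}))  # {"ok": true}
```

Because the module is imported at runtime, it has to be importable from the environment in which whatif runs.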

selection

Cohort assembly. Baseline runs by default. Disabling baseline (mode: failures_only) produces a loud warning in every report.

selection.mode

Value                             Effect
--------------------------------  ------
failures_plus_baseline (default)  Both cohorts run. Verdict is high-confidence.
failures_only                     Baseline is skipped. The verdict prints “confidence: limited” and the report’s Baseline integrity section becomes a structural warning.

selection.failures and selection.baseline

Field   Type                   Notes
------  ---------------------  -----
filter  selector grammar       e.g., "score-below:0.6", "tag:incident-triage".
since   duration               e.g., "24h", "7d". Bounds the time window.
limit   int                    Maximum number of traces in the cohort.
sample  random | first | last  How to subsample when more traces match than limit allows. random requires seed for reproducibility.
seed    int                    Stable seed for sample: random. Use the same seed in CI to avoid red builds caused by sampling drift.
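The effect of seed on sample: random can be shown with a short sketch. It assumes the matched traces arrive as an ordered list of IDs and that first and last refer to that order; the function name subsample and the trace IDs are hypothetical, not part of whatif.

```python
import random


def subsample(trace_ids, limit, sample="random", seed=None):
    """Return at most `limit` trace IDs using the first / last / random strategy."""
    trace_ids = list(trace_ids)
    if len(trace_ids) <= limit:
        return trace_ids  # nothing to cut, every match fits in the cohort
    if sample == "first":
        return trace_ids[:limit]
    if sample == "last":
        return trace_ids[len(trace_ids) - limit:]
    if sample == "random":
        # A dedicated Random instance seeded from config gives the same draw
        # on every run, independent of the global random state.
        return random.Random(seed).sample(trace_ids, limit)
    raise ValueError(f"unknown sample strategy: {sample!r}")


matches = [f"trace-{i}" for i in range(200)]
run_a = subsample(matches, limit=50, sample="random", seed=42)
run_b = subsample(matches, limit=50, sample="random", seed=42)
print(run_a == run_b)  # True
```

With seed left unset, each run draws a different cohort, so two CI runs over the same traces can disagree on the verdict.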

scorer

Wraps an existing eval framework. Receives a ScoreCase (see the runner contract) and returns a numeric score plus a rationale.

Field        Type                                                   Notes
-----------  -----------------------------------------------------  -----
type         inspect_ai (v0.1) | ragas (v0.2) | custom plugin name  The wrapper to use.
task         string                                                 Scorer-specific task identifier.
judge_model  string                                                 LLM identifier when the scorer uses LLM-as-judge.

cache

Tool-call caching policy enforced during replay.

Field           Type                                                       Notes
--------------  ---------------------------------------------------------  -----
policy          use-original (v0.1, default) | live (v0.3) | mock (v0.3+)  See Runner contract → Cache policies.
live_allowlist  list[string]                                               When policy: live, only the listed tool names may make live calls. An empty list means no live calls.
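The interaction between policy and live_allowlist reduces to a single check. The sketch below only restates the rule in the table; the function name may_call_live and the tool names are hypothetical.

```python
def may_call_live(policy: str, tool_name: str, live_allowlist: list[str]) -> bool:
    """Return True only when policy is 'live' and the tool is explicitly listed."""
    return policy == "live" and tool_name in live_allowlist


print(may_call_live("live", "search_kb", ["search_kb"]))          # True
print(may_call_live("live", "search_kb", []))                     # False
print(may_call_live("use-original", "search_kb", ["search_kb"]))  # False
```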

decision

Policy that turns the diff into an exit code.

Field                  Type             Notes
---------------------  ---------------  -----
fail_on_regression     bool             Exit 1 if the median score drop exceeds regression_threshold.
regression_threshold   float in [0, 1]  Default 0.1. A median score drop above this triggers a Don't Ship verdict.
min_replay_validity    float in [0, 1]  Minimum fraction of traces that must replay cleanly per cohort. Default 0.7. Below this → Inconclusive (exit 2).
min_baseline_coverage  int              Minimum baseline traces required. 0 lets failures_only mode pass this check; >=1 enforces baseline coverage. Default 1.
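One way to read these fields together is as a small decision function. The sketch below rests on several assumptions not stated in this reference: a score drop is the original score minus the replayed score, replay validity is checked before regression, and a passing run exits 0. It leaves out min_baseline_coverage because the exit code for that check is not documented here. The name decide and its arguments are hypothetical.

```python
from statistics import median

# Exit codes 1 and 2 come from the table above; 0 for a pass is an assumption.
EXIT_SHIP, EXIT_DONT_SHIP, EXIT_INCONCLUSIVE = 0, 1, 2


def decide(score_drops, replay_validity, *, fail_on_regression=True,
           regression_threshold=0.1, min_replay_validity=0.7):
    """Turn replay results into an exit code.

    score_drops: per-trace (original - replayed) scores; positive = regression.
    replay_validity: cohort name -> fraction of traces that replayed cleanly.
    """
    if not score_drops:
        raise ValueError("score_drops must contain at least one value")
    # Assumed ordering: an unreliable replay is inconclusive regardless of scores.
    if any(v < min_replay_validity for v in replay_validity.values()):
        return EXIT_INCONCLUSIVE
    if fail_on_regression and median(score_drops) > regression_threshold:
        return EXIT_DONT_SHIP
    return EXIT_SHIP


validity = {"failures": 0.9, "baseline": 0.95}
print(decide([0.20, 0.30, 0.15], validity))                  # 1
print(decide([0.00, 0.05, 0.02], validity))                  # 0
print(decide([0.20], {"failures": 0.5, "baseline": 0.95}))   # 2
```

A median drop exactly equal to regression_threshold passes in this sketch, since the table describes the failure condition as exceeding the threshold.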

Environment variable interpolation

Any string value supports ${VAR} substitution at load time, which is useful for endpoints and credentials. ${VAR:-default} supplies a fallback when the variable is unset:

source:
  type: langfuse
  endpoint: ${LANGFUSE_HOST}
  project: ${LANGFUSE_PROJECT:-incident-triage-prod}   # default if unset
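The substitution shown above can be approximated with a regular expression over each string value. This is a sketch under assumptions, not whatif's loader: the function name interpolate is hypothetical, the default applies only when the variable is unset (matching the comment in the example), and an unset variable without a default raises an error, which this reference does not specify.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}.
_VAR = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}")


def interpolate(value: str, env=None) -> str:
    """Substitute ${VAR} and ${VAR:-default} references in a config string."""
    env = os.environ if env is None else env

    def _sub(match):
        name, default = match.group("name"), match.group("default")
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: fail loudly rather than substitute an empty string.
        raise KeyError(f"environment variable {name} is not set")

    return _VAR.sub(_sub, value)


env = {"LANGFUSE_HOST": "https://langfuse.example.internal"}
print(interpolate("${LANGFUSE_HOST}", env))
# https://langfuse.example.internal
print(interpolate("${LANGFUSE_PROJECT:-incident-triage-prod}", env))
# incident-triage-prod
```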