Config reference¶
whatif.config.yaml is the version-controlled, GitOps-friendly way to declare an experiment. Introduced in v0.2; in v0.1 the same fields are passed as CLI flags.
Full example¶
# whatif.config.yaml
source:
type: langfuse
project: incident-triage-prod
endpoint: ${LANGFUSE_HOST}
target: "python:my_agent.replay:run"
selection:
mode: failures_plus_baseline # or: failures_only
failures:
filter: "score-below:0.6"
since: "24h"
limit: 50
baseline:
filter: "score-above:0.8"
since: "24h"
limit: 50
sample: random # required for CI reproducibility
seed: 42 # stable seed prevents sampling drift
scorer:
type: inspect_ai
task: faithfulness_qa
judge_model: claude-haiku-4-5
cache:
policy: use-original # v0.3: live | mock
live_allowlist: [] # populated only when policy=live
decision:
fail_on_regression: true
regression_threshold: 0.1 # median score drop > 0.1 = fail
min_replay_validity: 0.7 # 70%+ traces must replay cleanly per cohort
min_baseline_coverage: 1 # require baseline cohort or fail
source¶
Where to ingest traces from.
Field |
Type |
Notes |
|---|---|---|
|
|
The tracer adapter. |
|
string |
Tracer-specific project / workspace identifier. |
|
URL |
Optional override for self-hosted instances. |
Environment variables (${VAR}) are interpolated at load time.
target¶
The user-supplied runner. Must match the runner contract.
Field |
Type |
Notes |
|---|---|---|
|
|
Module-path syntax; |
selection¶
Cohort assembly. Baseline runs by default. Disabling baseline (mode: failures_only) produces a loud warning in every report.
selection.mode¶
Value |
Effect |
|---|---|
|
Both cohorts run. Verdict is high-confidence. |
|
Skip baseline. Verdict prints “confidence: limited” and the report’s Baseline integrity section is a structural warning. |
selection.failures and selection.baseline¶
Field |
Type |
Notes |
|---|---|---|
|
selector grammar |
e.g., |
|
duration |
e.g., |
|
int |
Maximum number of traces in the cohort. |
|
|
How to subsample when more matches than |
|
int |
Stable seed for |
scorer¶
Wraps an existing eval framework. Receives a ScoreCase (see the runner contract) and returns a numeric score plus a rationale.
Field |
Type |
Notes |
|---|---|---|
|
|
The wrapper to use. |
|
string |
Scorer-specific task identifier. |
|
string |
LLM identifier when the scorer uses LLM-as-judge. |
cache¶
Tool-call caching policy enforced during replay.
Field |
Type |
Notes |
|---|---|---|
|
|
|
|
list[string] |
When |
decision¶
Policy that turns the diff into an exit code.
Field |
Type |
Notes |
|---|---|---|
|
bool |
Exit |
|
float in |
Default |
|
float in |
Minimum fraction of traces that must replay cleanly per cohort. Default |
|
int |
Minimum baseline traces required. |
Environment variable interpolation¶
Any string value supports ${VAR} substitution at load time. Useful for endpoints and credentials:
source:
type: langfuse
endpoint: ${LANGFUSE_HOST}
project: ${LANGFUSE_PROJECT:-incident-triage-prod} # default if unset