Config reference¶

whatifd fork is config-file driven. The default path is whatifd.config.yaml in the current directory; override it with --config <path>. YAML and JSON are both accepted. The schema is strict (Pydantic v2 extra="forbid"): unknown fields fail at load time with a hint message rather than being silently absorbed.

Full example¶

```yaml
source:
  adapter: langfuse                    # or "stub" for offline use
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
    filter: "score-below:0.6,since:24h"  # optional; adapter-interpreted
  baseline_cohort:
    limit: 20
    filter: "score-above:0.8,since:24h"
change:
  system_prompt: "You are a senior on-call SRE..."
  # model: "claude-haiku-4-5"          # optional model swap
scorer:
  adapter: inspect_ai                  # or "stub"
  cache_mode: auto                     # auto | on | off | read_only | refresh
decision:
  require_baseline: true
  max_baseline_regression_ratio: 0.10
  min_failure_improvement_ratio: 0.50
  practical_delta_epsilon: 0.05
  # max_ci_width: 0.5                  # optional cap on CI width
reporting:
  profile: default                     # default | review | minimal | forensic
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0
```

The five required top-level sections are source, target, selection, change, and scorer. The decision, reporting, and timeouts sections must be present as keys, but every field in them has a default, so an empty block such as decision: {} is enough.
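Because JSON is accepted alongside YAML, a minimal config can be produced from a plain Python dict. The sketch below only illustrates the "required keys, empty blocks" rule described above. It assumes that the stub adapters need nothing beyond `adapter`, which this page does not confirm, and the runner reference is the placeholder from the full example.

```python
import json

# Placeholder values taken from the full example; point the runner
# reference at your own module before using this.
minimal_config = {
    "source": {"adapter": "stub"},
    "target": {"runner": "python:my_agent.replay:run"},
    "selection": {
        "failure_cohort": {"limit": 20},
        "baseline_cohort": {"limit": 20},
    },
    "change": {"system_prompt": "You are a senior on-call SRE..."},
    "scorer": {"adapter": "stub"},
    # Present as keys but empty: every field falls back to its default.
    "decision": {},
    "reporting": {},
    "timeouts": {},
}

# Serialize to JSON text that can be saved and passed via --config <path>.
config_json = json.dumps(minimal_config, indent=2)
print(config_json)
```

The printed JSON can be saved to a file of your choosing and passed with --config <path>.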

`experiment_shape`¶

Top-level field (v0.2). Controls which experiment shape the verdict pipeline runs.

| Value | When to use | Selection requirements |
| --- | --- | --- |
| `failure_rescue` (default) | You have a known-bad set of traces plus a proposed fix; verdict says whether the fix rescues failures without regressing the baseline. | Requires both `failure_cohort` and `baseline_cohort`. |
| `regression_check` | You have a known-good baseline plus a candidate change; verdict says whether the change introduces regressions. | Requires only `baseline_cohort`. `failure_cohort` is rejected at config-load with a named-field error if present. |

The verdict-policy guard chain branches on this field: regression_check skips the failure-cohort-specific guards (practical_delta, improvement_observation) and runs the lean primary_endpoint + ci_availability chain. Floor + decision policy still apply.

Unknown values (e.g., exploratory_ab) fail at config-load with a Pydantic ValidationError naming the field.
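The selection requirements in the table above can be sketched in plain Python. This is an illustration only: `check_selection` is a hypothetical helper, and the real check is performed by the Pydantic schema at config-load, which raises a ValidationError rather than returning a list.

```python
# Hypothetical helper; the real check lives in whatifd's Pydantic schema.
REQUIRED_COHORTS = {
    "failure_rescue": {"failure_cohort", "baseline_cohort"},
    "regression_check": {"baseline_cohort"},
}


def check_selection(experiment_shape: str, selection: dict) -> list:
    """Return a list of problems; an empty list means the block is acceptable."""
    if experiment_shape not in REQUIRED_COHORTS:
        return [f"experiment_shape: unknown value {experiment_shape!r}"]
    required = REQUIRED_COHORTS[experiment_shape]
    problems = [
        f"selection.{name}: missing"
        for name in sorted(required - set(selection))
    ]
    # regression_check rejects a failure cohort instead of ignoring it.
    if experiment_shape == "regression_check" and "failure_cohort" in selection:
        problems.append("selection.failure_cohort: not allowed for regression_check")
    return problems
```

For example, `check_selection("regression_check", {"baseline_cohort": {"limit": 20}})` returns an empty list, while adding a `failure_cohort` key to that selection yields one problem naming the field.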

`source`¶

Trace source adapter reference.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `adapter` | string | required | Adapter name. Shipped today: `langfuse` (via `whatifd-langfuse`), `phoenix` (via `whatifd-phoenix`; Arize Phoenix / OpenInference, v0.2), or `stub` (in-tree synthetic). |

Phoenix / OpenInference example¶

```yaml
source:
  adapter: phoenix
  spans_provider: "python:my_pkg.phoenix_wiring:get_spans"   # required
```

spans_provider is a python:<module>:<attr> reference to a Callable[[], Iterable[dict]] that yields OpenInference-shaped span dicts. The adapter is tracer-neutral: the callable can wrap arize-phoenix-client, read a JSONL dump, or pull from any OTLP destination. See Arize Phoenix / OpenInference for wiring examples.
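As one possible shape for such a callable, the sketch below reads spans from a JSONL dump. It is a minimal example under stated assumptions: the file name `spans.jsonl` and the helper `parse_spans` are hypothetical, and the span dicts are passed through without checking that they are OpenInference-shaped.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

# Hypothetical location of a JSONL export, one span dict per line.
SPANS_PATH = Path("spans.jsonl")


def parse_spans(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one span dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def get_spans() -> Iterable[dict]:
    """Zero-argument callable matching Callable[[], Iterable[dict]]."""
    with SPANS_PATH.open(encoding="utf-8") as handle:
        yield from parse_spans(handle)
```

If this module were importable as `my_pkg.phoenix_wiring`, the reference in the config above would resolve to `get_spans`.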

`target`¶

The user-supplied runner.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `runner` | string | required | `python:<module.path>:<attr>`. Must satisfy the runner contract. Async runners are detected at import via `inspect.iscoroutinefunction`; both sync and async are supported, no flag needed. |
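The reference format and the sync/async detection can be illustrated with the standard library alone. This is a sketch, not whatifd's loader: `resolve_reference` and `call_runner` are hypothetical names, and the arguments a real runner receives are defined by the runner contract, which is not reproduced here.

```python
import asyncio
import importlib
import inspect
from typing import Any, Callable


def resolve_reference(ref: str) -> Callable[..., Any]:
    """Turn 'python:<module.path>:<attr>' into the object it names."""
    scheme, module_path, attr = ref.split(":")
    if scheme != "python":
        raise ValueError(f"unsupported scheme in {ref!r}")
    return getattr(importlib.import_module(module_path), attr)


def call_runner(runner: Callable[..., Any], *args: Any) -> Any:
    """Invoke a runner, driving an event loop only for coroutine functions."""
    if inspect.iscoroutinefunction(runner):
        return asyncio.run(runner(*args))
    return runner(*args)
```

Standard-library callables are enough to try it out: `resolve_reference("python:os.path:basename")` resolves to a sync function, and `resolve_reference("python:asyncio:sleep")` to a coroutine function, and `call_runner` handles both.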

`selection`¶

Per-cohort selection limits. Required sub-blocks depend on experiment_shape. failure_rescue (the v0.1 default) requires both failure_cohort and baseline_cohort, because the failure-rescue verdict needs paired evidence under cardinal #10. regression_check (added in v0.2) requires only baseline_cohort and rejects failure_cohort at config-load; a failure cohort is meaningless when the baseline itself is what’s under test.

| Path | Type | Default | Notes |
| --- | --- | --- | --- |
| `selection.failure_cohort.limit` | int ≥ 1 | required | Max traces in the failure cohort. |
| `selection.failure_cohort.filter` | string | none | Adapter-interpreted selector (e.g., Langfuse `score-below:0.6`). Optional. |
| `selection.baseline_cohort.limit` | int ≥ 1 | required | Max traces in the baseline cohort. |
| `selection.baseline_cohort.filter` | string | none | Adapter-interpreted selector. |

`change`¶

The proposed change. Mirrors whatifd.contract.ReplayConfig keys.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `system_prompt` | string | none | The proposed system-prompt change. |
| `model` | string | none | Optional model swap (e.g., `claude-haiku-4-5`). |

whatifd supports system_prompt and model. Other dimensions (tool list, temperature) remain on the roadmap.

`scorer`¶

Scorer adapter reference.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `adapter` | enum | required | `inspect_ai` (real, via `whatifd-inspect-ai`) or `stub`. Unknown values fail at config-load. |
| `score_fn` | string \| null | `null` | `python:<module.path>:<attr>` reference to the Inspect AI score function. Required when `adapter: inspect_ai`. |
| `judge_provider` | string \| null | `null` | LLM judge provider (e.g., `anthropic`, `openai`). Required when `adapter: inspect_ai`. |
| `judge_model_id` | string \| null | `null` | Judge model identifier (e.g., `claude-haiku-4-5`). Required when `adapter: inspect_ai`. |
| `judge_model_snapshot` | string \| null | `null` | Optional snapshot pin (e.g., `claude-haiku-4-5-20251001`). Hashed into cache keys when set. |
| `rubric_id` | string \| null | `null` | Human-named rubric identifier. Required when `adapter: inspect_ai`. |
| `rubric_text` | string \| null | `null` | Literal rubric text; hashed into cache keys so a rubric edit invalidates entries. Required when `adapter: inspect_ai`. |
| `scoring_parameters` | dict[str, JSON-primitive] | `{}` | Optional knobs (temperature, max_tokens, …) passed through to the scorer. Values must be `str \| int \| float \| bool \| null`; nested structures are rejected at config-load. |
| `cache_mode` | enum | `auto` | One of `auto`, `on`, `off`, `read_only`, `refresh`. `auto` resolves to `on` when `CI=true`; otherwise `auto` is passed through unchanged. See cardinal #10 reproducibility note. |

When adapter: inspect_ai, the five required fields (score_fn, judge_provider, judge_model_id, rubric_id, rubric_text) are enforced by a Pydantic cross-field validator at config-load time; misconfigured runs fail with a named-field error before any adapter machinery starts up. When adapter: stub, the inspect_ai-specific fields are silently accepted, so a config block can be retargeted from stub to inspect_ai during development by changing only the adapter value.
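The cross-field rule and the `scoring_parameters` restriction can be approximated without Pydantic. The helpers below are hypothetical stand-ins for the real validator, and they assume that a field counts as missing only when it is absent or `null`.

```python
INSPECT_AI_REQUIRED = (
    "score_fn",
    "judge_provider",
    "judge_model_id",
    "rubric_id",
    "rubric_text",
)


def missing_inspect_ai_fields(scorer: dict) -> list:
    """Names of required fields that are absent or null for inspect_ai."""
    if scorer.get("adapter") != "inspect_ai":
        # stub accepts these fields but does not require them.
        return []
    return [name for name in INSPECT_AI_REQUIRED if scorer.get(name) is None]


def invalid_scoring_parameters(params: dict) -> list:
    """Names of parameters whose values are not JSON primitives."""
    return [
        key
        for key, value in params.items()
        if not (value is None or isinstance(value, (str, int, float, bool)))
    ]
```

A scorer block with `adapter: inspect_ai` and only `score_fn` set would report the other four names; a `scoring_parameters` entry holding a list or a nested dict would be reported by name.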

`decision`¶

Above-floor policy thresholds. These mirror whatifd.types.policy.DecisionPolicy.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `require_baseline` | bool | `true` | Floor refuses runs without a baseline cohort. |
| `max_baseline_regression_ratio` | float in `[0, 1]` | `0.10` | If more than 10% of baseline traces regress, verdict is `Don't Ship`. |
| `min_failure_improvement_ratio` | float in `[0, 1]` | `0.50` | If less than 50% of failure-cohort traces improve, the change isn’t rescuing → `Don't Ship`. |
| `practical_delta_epsilon` | float ≥ 0 | `0.05` | Minimum effect size to call a delta “practical.” Below this is “no meaningful change.” |
| `max_ci_width` | float ≥ 0 \| null | `null` | Optional cap on CI width. The lever for accepting wider CIs without flipping `ci_meaningful` per cardinal #10. |

The trust floor (which cannot be overridden by config — see cardinal #2) sits structurally below this. Floor failures produce Inconclusive regardless of policy.
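The two ratio thresholds can be read as simple comparisons. The sketch below is an assumption-based illustration, not the DecisionPolicy implementation: it assumes each ratio is a plain per-trace count divided by the cohort size, that both cohorts are non-empty, and it ignores the trust floor, `practical_delta_epsilon`, and `max_ci_width`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionThresholds:
    """Defaults copied from the table above."""
    max_baseline_regression_ratio: float = 0.10
    min_failure_improvement_ratio: float = 0.50


def dont_ship_reasons(
    baseline_regressed: int,
    baseline_total: int,
    failure_improved: int,
    failure_total: int,
    thresholds: DecisionThresholds = DecisionThresholds(),
) -> list:
    """Return the threshold names that would trigger Don't Ship."""
    reasons = []
    # More than the allowed share of baseline traces regressed.
    if baseline_regressed / baseline_total > thresholds.max_baseline_regression_ratio:
        reasons.append("max_baseline_regression_ratio")
    # Fewer than the required share of failure traces improved.
    if failure_improved / failure_total < thresholds.min_failure_improvement_ratio:
        reasons.append("min_failure_improvement_ratio")
    return reasons
```

With the defaults and cohorts of 20 traces each, 2 regressions and 10 improvements sit exactly on both thresholds and trigger neither; 3 regressions or 9 improvements would.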

`reporting`¶

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `profile` | enum | `default` | One of `default`, `review`, `minimal`, `forensic`. Controls `Sensitive[T]` redaction depth and which artifacts get written (cardinal #5). |
| `forensic_acknowledgment` | block | none | Required when `profile: forensic`. Cardinal #7 two-affirmation: this block plus `--profile forensic` on the CLI must both be present. Single-surface attempts fail at startup with `ForensicAffirmationError`. |
| `forensic_acknowledgment.accepted_by` | string | required (in block) | Operator ID. |
| `forensic_acknowledgment.accepted_at` | ISO 8601 string | required (in block) | Date or datetime. |
| `forensic_acknowledgment.reason` | string | required (in block) | Free-text justification recorded in the manifest. |
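The two-affirmation rule amounts to requiring agreement between the config block and the CLI flag. The following sketch is a hypothetical reconstruction: the `ForensicAffirmationError` class here is a local stand-in for the real one, and `forensic_enabled` is not a whatifd function.

```python
from typing import Optional


class ForensicAffirmationError(Exception):
    """Local stand-in for the error whatifd raises at startup."""


REQUIRED_KEYS = ("accepted_by", "accepted_at", "reason")


def forensic_enabled(acknowledgment: Optional[dict], cli_profile: Optional[str]) -> bool:
    """Return True only when both surfaces affirm forensic mode."""
    config_affirmed = acknowledgment is not None
    cli_affirmed = cli_profile == "forensic"
    if config_affirmed != cli_affirmed:
        # Exactly one surface affirmed: refuse to start.
        raise ForensicAffirmationError(
            "forensic mode needs both forensic_acknowledgment and --profile forensic"
        )
    if config_affirmed:
        missing = [key for key in REQUIRED_KEYS if not acknowledgment.get(key)]
        if missing:
            raise ValueError("forensic_acknowledgment is missing: " + ", ".join(missing))
    return config_affirmed
```

Passing a complete block together with `"forensic"` returns True, passing neither returns False, and any single-surface combination raises.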

`timeouts`¶

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `replay_seconds` | float > 0 | `60.0` | Per-trace replay wall-clock budget. Exceeded → `ReplayFailure(code="runner_timeout")`. |
| `score_seconds` | float > 0 | `30.0` | Per-trace scoring budget. |
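For an async runner, a per-trace budget of this kind can be sketched with `asyncio.wait_for`. This is an illustration only: the `ReplayFailure` dataclass is a local stand-in with just the `code` field, `replay_with_budget` is a hypothetical name, and how whatifd enforces the budget for sync runners is not covered.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass(frozen=True)
class ReplayFailure:
    """Local stand-in carrying only the failure code."""
    code: str


async def replay_with_budget(
    runner: Callable[[], Awaitable[Any]], replay_seconds: float
) -> Any:
    """Await the runner, converting a timeout into a ReplayFailure."""
    try:
        return await asyncio.wait_for(runner(), timeout=replay_seconds)
    except asyncio.TimeoutError:
        return ReplayFailure(code="runner_timeout")
```

A runner that finishes inside the budget returns its own result; one that overruns is cancelled and reported as `ReplayFailure(code="runner_timeout")`.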

Validation hints¶

The config is loaded by whatifd.config.load_config(...). On validation failure, format_validation_errors translates Pydantic’s stack-trace-flavored messages into multi-line per-error output with suggestions for the most common typos (e.g., forensic_ackn0wledgment → “did you mean forensic_acknowledgment?”). When no hint applies, the output falls back to the raw Pydantic message, so operators see useful output either way.
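A "did you mean" suggestion of this kind can be produced with `difflib` from the standard library. The sketch below is not the logic inside format_validation_errors: `suggest_field` is a hypothetical helper, and the 0.8 similarity cutoff is an arbitrary choice for illustration.

```python
import difflib
from typing import Optional

# Field names from the reporting section, used as the known vocabulary.
KNOWN_FIELDS = ["profile", "forensic_acknowledgment"]


def suggest_field(unknown: str, known: list) -> Optional[str]:
    """Return the closest known field name, or None when nothing is close."""
    matches = difflib.get_close_matches(unknown, known, n=1, cutoff=0.8)
    return matches[0] if matches else None


def format_hint(unknown: str, known: list) -> str:
    """Build a one-line hint, falling back to a plain message."""
    suggestion = suggest_field(unknown, known)
    if suggestion is None:
        return f"unknown field {unknown!r}"
    return f"unknown field {unknown!r}: did you mean {suggestion!r}?"
```

Calling `format_hint("forensic_ackn0wledgment", KNOWN_FIELDS)` yields a hint naming `forensic_acknowledgment`, while an unrelated name falls back to the plain message.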
