Config reference

whatifd fork is driven by a config file. The default path is whatifd.config.yaml in the current directory; override it with --config <path>. YAML and JSON are both accepted. The schema is strict (Pydantic v2 extra="forbid"): unknown fields fail at load time with a hint message rather than being silently absorbed.

Full example

source:
  adapter: langfuse                    # or "stub" for offline use
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
    filter: "score-below:0.6,since:24h"  # optional; adapter-interpreted
  baseline_cohort:
    limit: 20
    filter: "score-above:0.8,since:24h"
change:
  system_prompt: "You are a senior on-call SRE..."
  # model: "claude-haiku-4-5"          # optional model swap
scorer:
  adapter: inspect_ai                  # or "stub"
  cache_mode: auto                     # auto | on | off | read_only | refresh
decision:
  require_baseline: true
  max_baseline_regression_ratio: 0.10
  min_failure_improvement_ratio: 0.50
  practical_delta_epsilon: 0.05
  # max_ci_width: 0.5                  # optional cap on CI width
reporting:
  profile: default                     # default | review | minimal | forensic
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0

The five required top-level sections are source, target, selection, change, and scorer. The decision, reporting, and timeouts sections are required as keys, but every field in them has a default, so an empty block such as decision: {} is enough.

experiment_shape

Top-level field (v0.2). Controls which experiment shape the verdict pipeline runs.

| Value | When to use | Selection requirements |
| --- | --- | --- |
| failure_rescue (default) | You have a known-bad set of traces plus a proposed fix; verdict says whether the fix rescues failures without regressing the baseline. | Requires both failure_cohort and baseline_cohort. |
| regression_check | You have a known-good baseline plus a candidate change; verdict says whether the change introduces regressions. | Requires only baseline_cohort. failure_cohort is rejected at config-load with a named-field error if present. |

The verdict-policy guard chain branches on this field: regression_check skips the failure-cohort-specific guards (practical_delta, improvement_observation) and runs the lean primary_endpoint + ci_availability chain. The trust floor and the decision policy still apply.

Unknown values (e.g., exploratory_ab) fail at config-load with a Pydantic ValidationError naming the field.
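The selection rules per shape can be sketched in plain Python. This is an illustration of the rules described above, not whatifd's own validator (which is Pydantic-based); the names REQUIRED_COHORTS and check_selection are hypothetical.

```python
from typing import Any, List, Mapping

# Hypothetical lookup table built from the rules above; not whatifd's code.
REQUIRED_COHORTS = {
    "failure_rescue": {"failure_cohort", "baseline_cohort"},
    "regression_check": {"baseline_cohort"},
}
ALL_COHORTS = {"failure_cohort", "baseline_cohort"}


def check_selection(experiment_shape: str, selection: Mapping[str, Any]) -> List[str]:
    """Return named-field problems; an empty list means the block is consistent."""
    if experiment_shape not in REQUIRED_COHORTS:
        return [f"experiment_shape: unknown value {experiment_shape!r}"]
    required = REQUIRED_COHORTS[experiment_shape]
    present = set(selection)
    problems = []
    # Cohorts the shape needs but the config omits.
    for name in sorted(required - present):
        problems.append(f"selection.{name}: required for {experiment_shape}")
    # Cohorts the shape rejects but the config supplies.
    for name in sorted((ALL_COHORTS - required) & present):
        problems.append(f"selection.{name}: not allowed for {experiment_shape}")
    return problems
```

For example, passing a failure_cohort together with experiment_shape regression_check yields a single problem naming selection.failure_cohort.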

source

Trace source adapter reference.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| adapter | string | required | Adapter name. Shipped today: langfuse (via whatifd-langfuse), phoenix (via whatifd-phoenix; Arize Phoenix / OpenInference, v0.2), or stub (in-tree synthetic). |

Phoenix / OpenInference example

source:
  adapter: phoenix
  spans_provider: "python:my_pkg.phoenix_wiring:get_spans"   # required

spans_provider is a python:<module>:<attr> reference to a Callable[[], Iterable[dict]] that yields OpenInference-shaped span dicts. The adapter is tracer-neutral: the callable can wrap arize-phoenix-client, read a JSONL dump, or pull from any OTLP destination. See Arize Phoenix / OpenInference for wiring examples.
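As an illustration of the JSONL case, a provider could look like the sketch below. The factory make_jsonl_provider, the PHOENIX_SPANS_JSONL environment variable, and the default file name are assumptions made for this example; the sketch does not check that the dicts are OpenInference-shaped.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable, Iterator


def make_jsonl_provider(path: str) -> Callable[[], Iterable[dict]]:
    """Build a zero-argument callable that yields one span dict per non-empty JSONL line."""

    def get_spans() -> Iterator[dict]:
        with Path(path).open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)

    return get_spans


# Module-level attribute that spans_provider would point at.
# PHOENIX_SPANS_JSONL and "spans.jsonl" are assumptions for this sketch.
get_spans = make_jsonl_provider(os.environ.get("PHOENIX_SPANS_JSONL", "spans.jsonl"))
```

If this lived in my_pkg/phoenix_wiring.py, the config value python:my_pkg.phoenix_wiring:get_spans from the example above would reference it.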

target

The user-supplied runner.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| runner | string | required | python:<module.path>:<attr>. Must satisfy the runner contract. Async runners are detected at import via inspect.iscoroutinefunction; both sync and async are supported, no flag needed. |
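The reference format and the async detection can be sketched with the standard library alone. The helper names resolve_reference and is_async_runner are hypothetical, and whatifd's actual loader may differ (for example in how it reports errors).

```python
import importlib
import inspect
from typing import Any, Callable


def resolve_reference(ref: str) -> Callable[..., Any]:
    """Resolve a 'python:<module.path>:<attr>' string to the callable it names."""
    scheme, _, rest = ref.partition(":")
    module_path, _, attr = rest.rpartition(":")
    if scheme != "python" or not module_path or not attr:
        raise ValueError(f"expected 'python:<module.path>:<attr>', got {ref!r}")
    target = getattr(importlib.import_module(module_path), attr)
    if not callable(target):
        raise TypeError(f"{ref!r} does not name a callable")
    return target


def is_async_runner(runner: Callable[..., Any]) -> bool:
    """True when the runner is declared with 'async def'."""
    return inspect.iscoroutinefunction(runner)
```

Calling resolve_reference("python:my_agent.replay:run") would import my_agent.replay and return its run attribute; is_async_runner then tells the caller whether the result must be awaited.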

selection

Per-cohort selection limits. Required sub-blocks depend on experiment_shape. failure_rescue (the v0.1 default) requires both failure_cohort and baseline_cohort, because the failure-rescue verdict needs paired evidence under cardinal #10. regression_check (added in v0.2) requires only baseline_cohort; the failure cohort is meaningless when the baseline itself is what is under test.

| Path | Type | Default | Notes |
| --- | --- | --- | --- |
| selection.failure_cohort.limit | int ≥ 1 | required | Max traces in the failure cohort. |
| selection.failure_cohort.filter | string | none | Adapter-interpreted selector (e.g., Langfuse score-below:0.6). Optional. |
| selection.baseline_cohort.limit | int ≥ 1 | required | Max traces in the baseline cohort. |
| selection.baseline_cohort.filter | string | none | Adapter-interpreted selector. |

change

The proposed change. Mirrors whatifd.contract.ReplayConfig keys.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| system_prompt | string | none | The proposed system-prompt change. |
| model | string | none | Optional model swap (e.g., claude-haiku-4-5). |

whatifd supports system_prompt and model. Other dimensions (tool list, temperature) remain on the roadmap.

scorer

Scorer adapter reference.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| adapter | enum | required | inspect_ai (real, via whatifd-inspect-ai) or stub. Unknown values fail at config-load. |
| score_fn | string or null | null | python:<module.path>:<attr> reference to the Inspect AI score function. Required when adapter: inspect_ai. |
| judge_provider | string or null | null | LLM judge provider (e.g., anthropic, openai). Required when adapter: inspect_ai. |
| judge_model_id | string or null | null | Judge model identifier (e.g., claude-haiku-4-5). Required when adapter: inspect_ai. |
| judge_model_snapshot | string or null | null | Optional snapshot pin (e.g., claude-haiku-4-5-20251001). Hashed into cache keys when set. |
| rubric_id | string or null | null | Human-named rubric identifier. Required when adapter: inspect_ai. |
| rubric_text | string or null | null | Literal rubric text; hashed into cache keys so a rubric edit invalidates entries. Required when adapter: inspect_ai. |
| scoring_parameters | dict[str, JSON-primitive] | {} | Optional knobs (temperature, max_tokens, …) passed through to the scorer. Values must be str, int, float, bool, or null; nested structures are rejected at config-load. |
| cache_mode | enum | auto | One of auto, on, off, read_only, refresh. auto resolves to on when CI=true; otherwise auto is passed through unchanged. See the cardinal #10 reproducibility note. |

When adapter: inspect_ai, the five required fields (score_fn, judge_provider, judge_model_id, rubric_id, rubric_text) are enforced by a Pydantic cross-field validator at config-load time; misconfigured runs fail with a named-field error before any adapter machinery starts up. When adapter: stub, the inspect_ai-specific fields are silently accepted, so a config block can be retargeted from stub to inspect_ai by editing a single value during development.
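The cross-field rule and the flat scoring_parameters rule can be approximated in plain Python. This is a sketch of the behavior described here, not the Pydantic validator itself; scorer_problems and the message wording are hypothetical.

```python
from typing import Any, List, Mapping

INSPECT_AI_REQUIRED = ("score_fn", "judge_provider", "judge_model_id", "rubric_id", "rubric_text")
JSON_PRIMITIVES = (str, int, float, bool, type(None))


def scorer_problems(scorer: Mapping[str, Any]) -> List[str]:
    """Return named-field problems for a scorer block; an empty list means it passes."""
    problems = []
    # The five fields are only enforced for the inspect_ai adapter;
    # with adapter: stub they are accepted and ignored.
    if scorer.get("adapter") == "inspect_ai":
        for field in INSPECT_AI_REQUIRED:
            if scorer.get(field) is None:
                problems.append(f"scorer.{field}: required when adapter is inspect_ai")
    # scoring_parameters values must be flat JSON primitives.
    for key, value in scorer.get("scoring_parameters", {}).items():
        if not isinstance(value, JSON_PRIMITIVES):
            problems.append(f"scorer.scoring_parameters.{key}: nested structures are not allowed")
    return problems
```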

decision

Above-floor policy thresholds. These mirror whatifd.types.policy.DecisionPolicy.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| require_baseline | bool | true | Floor refuses runs without a baseline cohort. |
| max_baseline_regression_ratio | float in [0, 1] | 0.10 | If more than 10% of baseline traces regress, verdict is Don't Ship. |
| min_failure_improvement_ratio | float in [0, 1] | 0.50 | If less than 50% of failure-cohort traces improve, the change isn’t rescuing → Don't Ship. |
| practical_delta_epsilon | float ≥ 0 | 0.05 | Minimum effect size to call a delta “practical.” Below this is “no meaningful change.” |
| max_ci_width | float ≥ 0 or null | null | Optional cap on CI width. The lever for accepting wider CIs without flipping ci_meaningful per cardinal #10. |

The trust floor (which cannot be overridden by config — see cardinal #2) sits structurally below this. Floor failures produce Inconclusive regardless of policy.
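To make the two ratio thresholds concrete, the sketch below applies them to raw counts. It is a simplification under stated assumptions: how a trace is classified as regressed or improved, the practical_delta_epsilon and CI guards, and the trust floor are all out of scope, and the names Thresholds and blocking_reasons are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the decision table above.
    max_baseline_regression_ratio: float = 0.10
    min_failure_improvement_ratio: float = 0.50


def blocking_reasons(baseline_regressed: int, baseline_total: int,
                     failure_improved: int, failure_total: int,
                     policy: Thresholds = Thresholds()) -> List[str]:
    """Return the thresholds that would block shipping; an empty list means none do."""
    if baseline_total < 1 or failure_total < 1:
        raise ValueError("both cohorts must contain at least one trace")
    reasons = []
    if baseline_regressed / baseline_total > policy.max_baseline_regression_ratio:
        reasons.append("baseline regression ratio above max_baseline_regression_ratio")
    if failure_improved / failure_total < policy.min_failure_improvement_ratio:
        reasons.append("failure improvement ratio below min_failure_improvement_ratio")
    return reasons
```

With the defaults and 20 traces per cohort, 2 regressions and 10 improvements sit exactly on both thresholds and pass; 3 regressions or 9 improvements would each block.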

reporting

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| profile | enum | default | One of default, review, minimal, forensic. Controls Sensitive[T] redaction depth and which artifacts get written (cardinal #5). |
| forensic_acknowledgment | block | none | Required when profile: forensic. Cardinal #7 two-affirmation: this block plus --profile forensic on the CLI must both be present. Single-surface attempts fail at startup with ForensicAffirmationError. |
| forensic_acknowledgment.accepted_by | string | required (in block) | Operator ID. |
| forensic_acknowledgment.accepted_at | ISO 8601 string | required (in block) | Date or datetime. |
| forensic_acknowledgment.reason | string | required (in block) | Free-text justification recorded in the manifest. |
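The two-affirmation rule can be sketched as a startup check. The exception class below is a local stand-in that only borrows the name ForensicAffirmationError; check_forensic and its return convention are hypothetical, and the real check may treat edge cases differently.

```python
from typing import Any, Mapping, Optional


class ForensicAffirmationError(Exception):
    """Local stand-in for the error named above; not whatifd's real class."""


REQUIRED_ACK_FIELDS = ("accepted_by", "accepted_at", "reason")


def check_forensic(reporting: Mapping[str, Any], cli_profile: Optional[str]) -> bool:
    """Return True when forensic mode is affirmed on both surfaces, False when not requested."""
    ack = reporting.get("forensic_acknowledgment") or {}
    cli_forensic = cli_profile == "forensic"
    requested = cli_forensic or reporting.get("profile") == "forensic"
    if not requested:
        return False
    # Both affirmations must be present: the config block and the CLI flag.
    if not (ack and cli_forensic):
        raise ForensicAffirmationError(
            "forensic needs forensic_acknowledgment in the config and --profile forensic on the CLI"
        )
    missing = [name for name in REQUIRED_ACK_FIELDS if not ack.get(name)]
    if missing:
        raise ForensicAffirmationError(f"forensic_acknowledgment missing: {', '.join(missing)}")
    return True
```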

timeouts

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| replay_seconds | float > 0 | 60.0 | Per-trace replay wall-clock budget. Exceeded → ReplayFailure(code="runner_timeout"). |
| score_seconds | float > 0 | 30.0 | Per-trace scoring budget. |
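For an async runner, a per-trace budget of this kind is commonly enforced with asyncio.wait_for. The sketch below assumes that approach; ReplayFailure here is a local stand-in carrying only the code field mentioned above, replay_with_budget is a hypothetical name, and sync runners are not covered.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Union


@dataclass(frozen=True)
class ReplayFailure:
    """Local stand-in with only the code field; not whatifd's real type."""
    code: str


async def replay_with_budget(run: Callable[[], Awaitable[Any]],
                             replay_seconds: float = 60.0) -> Union[Any, ReplayFailure]:
    """Await one replay, converting a blown wall-clock budget into a ReplayFailure."""
    try:
        return await asyncio.wait_for(run(), timeout=replay_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the pending replay before raising.
        return ReplayFailure(code="runner_timeout")
```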

Validation hints

Configs are loaded by whatifd.config.load_config(...). On validation failure, format_validation_errors translates Pydantic’s stack-trace-flavored messages into multi-line, per-error output with suggestions for the most common typos (e.g., forensic_ackn0wledgment → “did you mean forensic_acknowledgment?”). When no hint applies, the output falls back to the raw Pydantic message, so operators see useful output either way.
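A typo suggestion of this kind can be produced with difflib from the standard library. This is only an illustration of the idea; how format_validation_errors actually computes its hints is not described here, and suggest_field and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import Iterable, Optional


def suggest_field(unknown: str, known: Iterable[str]) -> Optional[str]:
    """Return a 'did you mean' hint for the closest known field, or None if nothing is close."""
    matches = difflib.get_close_matches(unknown, list(known), n=1, cutoff=0.8)
    return f"did you mean {matches[0]}?" if matches else None
```

A caller that gets None back would print the raw Pydantic message instead, which matches the fallback behavior described above.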