FAQ

Honest answers to the questions that came up while designing whatifd.

Positioning

Isn’t this what Braintrust / Langfuse / LangSmith already do?

Pieces of it, yes. The honest reading from the prior-art section of DESIGN.md:

  • Braintrust has trace-to-dataset experiments and CI release gating, but it’s a closed SaaS product bound to its own stack.

  • Langfuse added experiments as a first-class concept in April 2026, but they run inside the Langfuse UI, not as a CLI-native PR gate.

  • LangSmith has backtesting cookbooks, also UI/SDK-driven inside the LangChain stack.

  • Promptfoo is excellent at CLI/CI ergonomics, but its model is golden-set evals, not production-trace-driven experiments.

whatifd’s wedge is open + CLI-native + tracer-neutral + PR-ready: no single existing tool fills all four of those cells.

Why not just write 100 lines of Bash glue?

You can. Many teams do. But:

  • Glue scripts skip baseline cohorts, so they fix failures while silently regressing successes.

  • Glue scripts rarely include replay-validity reporting, so silent cache misses look like real regressions.

  • Glue scripts almost never include LLM-judge rationale in the output, so reviewers can’t validate the verdict.

  • Glue scripts don’t run in CI without a second round of work.

whatifd packages the discipline that the glue version usually skips.

Is this OSS or commercial?

Apache 2.0. OSS. The library is the product.

Workflow

Can I use this without baseline traces?

No. In v0.2, experiment_shape: failure_rescue (the default) requires both selection.failure_cohort and selection.baseline_cohort — a failure-rescue verdict is only defensible against a paired baseline. There is no selection.mode toggle; the config schema rejects single-cohort YAML at load. If you only have a baseline and want to test a candidate change for regressions, use experiment_shape: regression_check (requires only baseline_cohort; rejects failure_cohort).
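The cohort rules in this answer can be sketched as a small validator. This is an illustration only: validate_selection and the plain-dict config are hypothetical and are not whatifd’s actual schema code. Only the key names experiment_shape, failure_cohort, and baseline_cohort come from the answer above.

```python
def validate_selection(config: dict) -> None:
    """Raise ValueError if the cohorts do not match the experiment shape.

    Hypothetical stand-in for the schema check described in the FAQ.
    """
    # failure_rescue is the default shape per the FAQ.
    shape = config.get("experiment_shape", "failure_rescue")
    selection = config.get("selection", {})
    has_failure = "failure_cohort" in selection
    has_baseline = "baseline_cohort" in selection

    if shape == "failure_rescue":
        # Both cohorts are required; single-cohort configs are rejected.
        if not (has_failure and has_baseline):
            raise ValueError(
                "failure_rescue requires both failure_cohort and baseline_cohort"
            )
    elif shape == "regression_check":
        # Only the baseline is allowed here.
        if not has_baseline:
            raise ValueError("regression_check requires baseline_cohort")
        if has_failure:
            raise ValueError("regression_check rejects failure_cohort")
    else:
        raise ValueError(f"unknown experiment_shape: {shape!r}")
```

Under these assumptions, a config with only baseline_cohort fails under the default shape and passes once experiment_shape is set to regression_check.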

What if my agent’s tools have side effects (DB writes, payments, emails)?

The default cache: use-original policy reuses cached tool outputs from the original trace. Side effects don’t re-fire. Live tool replay is opt-in (v0.3) with per-tool allowlists, so destructive tools require explicit consent.
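A minimal sketch of the idea, assuming a replay loop that resolves each tool call by name. The function resolve_tool_call, its parameters, and the dict-based cache are hypothetical names for illustration; whatifd’s real internals may look quite different.

```python
from typing import Any, Callable, Mapping


def resolve_tool_call(
    name: str,
    cached: Mapping[str, Any],
    live_tools: Mapping[str, Callable[[], Any]],
    live_allowlist: frozenset = frozenset(),
) -> Any:
    """Return the output a replayed tool call should see.

    By default the output recorded in the original trace is reused, so the
    tool (and its side effects) never runs. A tool is executed live only if
    it was explicitly allowlisted.
    """
    if name in live_allowlist:
        # Explicit consent: run the real tool.
        return live_tools[name]()
    if name in cached:
        # Default path: reuse the recorded output, no side effects.
        return cached[name]
    # Neither allowlisted nor cached: surface the miss instead of guessing.
    raise KeyError(f"no cached output for tool {name!r}")
```

The point of the design is that the safe path is the default one: a destructive tool can only fire if someone names it in the allowlist.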

What happens if a trace can’t be replayed?

It’s recorded as a replay failure in the Replay validity section. The verdict aggregation excludes it. If too many traces fail to replay (default threshold: 30%), the verdict becomes Inconclusive (exit code 2).
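Roughly, the gate behaves like the sketch below. The function name, the dict shape of each result, and the choice of a strict "greater than" comparison are assumptions made for illustration; only the 30% default and exit code 2 come from the answer above.

```python
INCONCLUSIVE_EXIT_CODE = 2  # documented in the FAQ


def split_replays(results: list, max_failure_rate: float = 0.30):
    """Separate usable replays from failures and decide whether to bail out.

    `results` is a list of dicts with a boolean "replayed" key (hypothetical
    shape). Returns (scored, failure_rate, inconclusive).
    """
    total = len(results)
    # Only successfully replayed traces feed the verdict aggregation.
    scored = [r for r in results if r["replayed"]]
    failed = total - len(scored)
    # Assumption: an empty run is treated as fully failed.
    failure_rate = failed / total if total else 1.0
    # Assumption: the verdict turns Inconclusive when the rate exceeds the threshold.
    inconclusive = failure_rate > max_failure_rate
    return scored, failure_rate, inconclusive
```

With 40 traces, 10 replay failures (25%) would still produce a verdict from the remaining 30, while 13 failures (32.5%) would make the run Inconclusive.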

How long does a typical run take?

For 40 traces (20 + 20) with Claude Haiku 4.5 as judge: ~3–8 minutes wall-clock, depending on concurrency and your runner’s latency. Most of the time is in the LLM calls (replay + scoring).

Engineering

Why a runner contract instead of just running my agent automatically?

A trace is not executable. It’s a record of inputs, outputs, and intermediate state; it has no run() function. You have to tell whatifd how to reconstitute your agent for a single trace, with the proposed config change applied. That’s the runner contract. Three options were considered (user-supplied target, framework-specific replay, single-LLM-call replay) and the user-supplied target won on generality.
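To make the shape concrete, here is a hedged sketch of what a user-supplied target could look like. TraceInput, RunResult, Runner, and my_runner are hypothetical names, and stdlib dataclasses stand in for the project’s actual contract types; consult the whatifd docs for the real signature.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class TraceInput:
    """What the runner receives for one recorded trace (hypothetical fields)."""
    trace_id: str
    user_input: str
    cached_tool_outputs: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RunResult:
    """What the runner hands back for scoring (hypothetical fields)."""
    trace_id: str
    output: str


# The contract: one trace plus the proposed config change in, one result out.
Runner = Callable[[TraceInput, dict], RunResult]


def my_runner(trace: TraceInput, proposed_config: dict) -> RunResult:
    # A real target would rebuild the agent here with proposed_config applied
    # and call the model. This stub only echoes its inputs so it stays runnable.
    prompt = proposed_config.get("system_prompt", "")
    return RunResult(trace_id=trace.trace_id, output=f"[{prompt}] {trace.user_input}")
```

Because the target is just a callable, the same shape works whether the agent behind it is a raw SDK loop or a framework graph.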

Why Pydantic for the contract?

Type clarity at the boundary, validation on construction, serialization for the JSON report, IDE autocomplete for users implementing the runner. Standard choice for modern Python public APIs.

Will you support {LangGraph, AutoGen, CrewAI, OpenAI Assistants}?

The runner contract is framework-agnostic by design: anything that’s a Python callable can be the --target. v0.1 ships a reference adapter for the raw Anthropic SDK; v0.1.1 adds LangChain and LangGraph stubs/docs. AutoGen, CrewAI, and OpenAI Assistants are community-contributable in the same shape.

CI / Path Z

Is there a GitHub Action?

Yes — shipped in v0.2 as a composite action at .github/actions/whatifd-fork/ in the whatifd repo. Wraps whatifd fork --config, posts the Markdown verdict as a PR comment, and surfaces the verdict via a GitHub status annotation. See the GitHub Actions integration page. Marketplace publication is on the v0.3 roadmap; consume directly from the repo via uses: victoralfred/whatifd/.github/actions/whatifd-fork@v0.2.0 today.

Will whatifd support GitLab CI / Jenkins / CircleCI?

The CLI already does: anything that runs a binary and reads exit codes works. The GitHub Action is just a GitHub-specific wrapper around it. GitLab / Jenkins / CircleCI examples are in the GitHub Actions integration page.
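As a sketch of what "reads exit codes" means in practice, a CI step in any system could wrap the CLI like this. The run_gate helper is hypothetical, and the mapping of 0 to a passing verdict and of other non-zero codes to a failing one is an assumption; the only code this FAQ documents is 2 for Inconclusive.

```python
import subprocess

INCONCLUSIVE = 2  # the only exit code this FAQ documents


def run_gate(cmd: list) -> str:
    """Run a command and translate its exit code into a coarse verdict label."""
    code = subprocess.run(cmd, check=False).returncode
    if code == 0:
        return "pass"  # assumption: 0 means the gate passed
    if code == INCONCLUSIVE:
        return "inconclusive"
    return "fail"  # assumption: any other non-zero code is a failing verdict
```

A pipeline step might call run_gate(["whatifd", "fork", "--config", "whatifd.yaml"]), where the config file name is a placeholder, and decide whether to block the merge based on the returned label.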

How do I prevent flaky verdicts in CI?

Two things, both built in:

  1. Stable baseline sampling via selection.baseline.sample: random + seed: 42. Without a stable seed, baseline drift can flip a green build red between runs.

  2. Decision policy thresholds in config: min_replay_validity, regression_threshold. Tune these to your tolerance.
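Why the seed matters can be shown with a few lines of stdlib Python. This is a sketch of seeded sampling in general, not whatifd’s sampler; sample_baseline and its arguments are hypothetical.

```python
import random


def sample_baseline(trace_ids: list, k: int, seed: int = 42) -> list:
    """Draw k baseline traces reproducibly."""
    # A private RNG keeps the draw independent of global random state.
    rng = random.Random(seed)
    # Sorting first means the order traces arrive in cannot change the draw.
    return rng.sample(sorted(trace_ids), k)
```

With a fixed seed, two CI runs over the same trace pool pick the same baseline cohort, so a verdict change between runs reflects the candidate change rather than a different sample.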

Project status

Has v0.2 shipped?

Yes. v0.2.0 is on PyPI as the whatifd distribution (along with whatifd-langfuse, whatifd-inspect-ai, whatifd-phoenix, and whatifd-datadog). Track changes on GitHub releases.

Is this safe to depend on?

Alpha. Public API may change between minor versions until 1.0 — see the CHANGELOG for a record of all breaking changes. The ReportV01 JSON schema (published at https://whatif.codes/schema/report/v0.2.json; v0.1 remains at https://whatif.codes/schema/report/v0.1.json for archived reports) is the most stable surface; consumers reading reports should validate against it.

How do I report a security issue?

Don’t open a public GitHub issue. Use the private vulnerability reporting flow on GitHub. See the project’s SECURITY.md for the disclosure policy.