FAQ¶
Honest answers to the questions that came up while designing whatif.
Positioning¶
Isn’t this what Braintrust / Langfuse / LangSmith already do?
Pieces of it, yes. The honest reading from our prior-art table:
Braintrust has trace-to-dataset experiments and CI release gating, but it’s a closed SaaS product bound to its own stack.
Langfuse added experiments as a first-class concept in April 2026, but they run inside the Langfuse UI, not as a CLI-native PR gate.
LangSmith has backtesting cookbooks, also UI/SDK-driven inside the LangChain stack.
Promptfoo is excellent at CLI/CI ergonomics, but its model is golden-set evals, not production-trace-driven experiments.
whatif’s wedge is open + CLI-native + tracer-neutral + PR-ready; no single existing tool fills all four of those cells.
Why not just write 100 lines of Bash glue?
You can. Many teams do. But:
Glue scripts skip baseline cohorts, so they fix failures while silently regressing successes.
Glue scripts rarely include replay-validity reporting, so silent cache misses look like real regressions.
Glue scripts almost never include LLM-judge rationale in the output, so reviewers can’t validate the verdict.
Glue scripts don’t run in CI without a second round of work.
whatif packages the discipline that the glue version usually skips.
Is this OSS or commercial?
Apache 2.0. OSS. The library is the product.
Workflow¶
Can I use this without baseline traces?
Yes, with `selection.mode: failures_only`. But the report will print “Verdict confidence: limited” and the Baseline integrity section will carry a structural warning. The default is `failures_plus_baseline` for a reason.
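For concreteness, here are the two selection modes as Python-dict equivalents of the YAML config. Only `selection.mode` and the two mode names come from this FAQ; the surrounding shape is an assumption, not the confirmed whatif schema.

```python
# Illustrative config shapes, not the published schema.
default_selection = {
    "selection": {
        "mode": "failures_plus_baseline",  # default: replay failing traces AND a baseline cohort
    }
}

failures_only_selection = {
    "selection": {
        "mode": "failures_only",  # allowed, but the report flags limited verdict confidence
    }
}
```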
What if my agent’s tools have side effects (DB writes, payments, emails)?
The default `cache: use-original` policy reuses cached tool outputs from the original trace. Side effects don’t re-fire. Live tool replay is opt-in (v0.3) with per-tool allowlists, so destructive tools require explicit consent.
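A hedged sketch of how that policy might be expressed, again as a Python dict standing in for the YAML. `cache: use-original` is quoted above; the `live_tools` allowlist shape is only an assumption about the v0.3 opt-in, not a confirmed schema.

```python
# Illustrative replay policy, not the published schema.
replay_policy = {
    "cache": "use-original",       # reuse tool outputs recorded in the original trace; side effects never re-fire
    "live_tools": {                # v0.3 opt-in (shape assumed): live replay requires an explicit allowlist
        "allow": ["search_docs"],  # read-only tools you consent to re-running
        # destructive tools (db_write, send_email, charge_card) stay cached unless listed
    },
}
```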
What happens if a trace can’t be replayed?
It’s recorded as a replay failure in the Replay validity section. The verdict aggregation excludes it. If too many traces fail to replay (default threshold: 30%), the verdict becomes Inconclusive (exit code 2).
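The aggregation rule is simple enough to sketch. The 30% default and exit code 2 are from this FAQ; the function and argument names are illustrative.

```python
# Minimal sketch of the replay-validity gate described above.
def gate_verdict(total_traces: int, replay_failures: int, max_failure_rate: float = 0.30) -> int:
    """Return exit code 2 (Inconclusive) when too many traces failed to replay."""
    if total_traces == 0:
        return 2
    failure_rate = replay_failures / total_traces
    if failure_rate > max_failure_rate:
        return 2  # Inconclusive: not enough valid replays to trust the comparison
    return 0      # otherwise the normal pass/fail decision applies
```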
How long does a typical run take?
For 40 traces (20 + 20) with Claude Haiku 4.5 as judge: ~3–8 minutes wall-clock, depending on concurrency and your runner’s latency. Most of the time is in the LLM calls (replay + scoring).
Engineering¶
Why a runner contract instead of just running my agent automatically?
A trace is not executable. It’s a record of inputs, outputs, and intermediate state; it has no `run()` function. You have to tell whatif how to reconstitute your agent for a single trace, with the proposed config change applied. That’s the runner contract. Three options were considered (user-supplied target, framework-specific replay, single-LLM-call replay), and the user-supplied target won on generality.
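A minimal sketch of what a user-supplied target can look like, assuming nothing about whatif’s actual API: the names (`run_one`, `build_agent`, the dict shapes) are placeholders. The point is that you, not the tool, know how to rebuild the agent for one trace with the proposed change applied.

```python
def build_agent(model: str = "some-model", temperature: float = 0.0):
    """Stand-in for your real agent factory."""
    def agent(prompt: str) -> str:
        return f"[{model} @ temperature={temperature}] answer to: {prompt}"
    return agent

def run_one(trace: dict, proposed_config: dict) -> dict:
    """The runner contract in miniature: one recorded trace in, one fresh output out."""
    agent = build_agent(**proposed_config)   # reconstitute the agent with the change applied
    output = agent(trace["input"])           # replay the recorded input
    return {"trace_id": trace["id"], "output": output}
```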
Why Pydantic for the contract?
Type clarity at the boundary, validation on construction, serialization for the JSON report, IDE autocomplete for users implementing the runner. Standard choice for modern Python public APIs.
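As a sketch of what Pydantic buys at that boundary (the model and field names below are invented for illustration, not whatif’s published contract):

```python
from pydantic import BaseModel

class ReplayRequest(BaseModel):
    trace_id: str
    input: dict
    proposed_config: dict

class ReplayResult(BaseModel):
    trace_id: str
    output: str
    tokens_used: int = 0

# Validation on construction, serialization for the JSON report:
req = ReplayRequest(trace_id="t-001", input={"question": "refund policy?"}, proposed_config={"temperature": 0.0})
print(req.model_dump_json())
```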
Will you support {LangGraph, AutoGen, CrewAI, OpenAI Assistants}?
The runner contract is framework-agnostic by design; anything that’s a Python callable can be the `--target`. v0.1 ships a reference adapter for the raw Anthropic SDK; v0.1.1 adds LangChain and LangGraph stubs/docs. AutoGen, CrewAI, and OpenAI Assistants are community-contributable in the same shape.
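One way to see why a bare callable is enough: a thin adapter that turns a LangChain/LangGraph-style object (anything exposing `.invoke`) into the same callable shape as the sketch above. This is illustrative, not the shipped v0.1.1 adapter, and the `.invoke` entry point is an assumption about your framework object.

```python
def make_target(agent_factory):
    """Adapt a framework agent (LangGraph graph, LangChain runnable, ...) to the runner-contract shape."""
    def target(trace: dict, proposed_config: dict) -> dict:
        agent = agent_factory(**proposed_config)   # rebuild the graph/chain with the proposed change
        output = agent.invoke(trace["input"])      # LangChain-style entry point (assumption)
        return {"trace_id": trace["id"], "output": str(output)}
    return target
```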
CI / Path Z¶
When does the GitHub Action ship?
v0.2 (M11). It’s a thin wrapper (~50 lines) over the v0.1 CLI; the architecture already supports it because `whatif fork` produces JSON + exit codes designed for CI.
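The wrapper idea reduces to “run the CLI, propagate its exit code.” A hedged sketch: the `whatif fork` subcommand is from this FAQ, but the specific flags are placeholders, not confirmed CLI options.

```python
import subprocess
import sys

# Run the experiment; the report lands on disk for a later PR-comment step.
result = subprocess.run(["whatif", "fork", "--config", "whatif.yaml", "--report", "report.json"])

# Exit code convention from this FAQ: 2 = Inconclusive; any non-zero code fails the gate.
sys.exit(result.returncode)
```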
Will whatif support GitLab CI / Jenkins / CircleCI?
The CLI already does: anything that runs a binary and reads exit codes works. The `whatif-action` is just the GitHub-specific wrapper. GitLab / Jenkins / CircleCI examples are in the GitHub Actions integration page.
How do I prevent flaky verdicts in CI?
Two things, both built in:
Stable baseline sampling via `selection.baseline.sample: random` with a fixed `seed: 42`. Without a stable seed, baseline drift can flip a green build red between runs.
Decision policy thresholds in config: `min_replay_validity` and `regression_threshold`. Tune these to your tolerance (a combined sketch follows below).
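A combined sketch of both settings as a Python dict. The key names and `seed: 42` are quoted above; the nesting and the threshold values are assumptions for illustration.

```python
ci_config = {
    "selection": {
        "baseline": {"sample": "random"},
        "seed": 42,                      # pin sampling so reruns compare the same baseline cohort
    },
    "decision_policy": {
        "min_replay_validity": 0.70,     # mirrors the 30% replay-failure default mentioned earlier
        "regression_threshold": 0.05,    # tolerated regression rate before the gate fails (example value)
    },
}
```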
Project status¶
When does v0.1 ship?
Targeted for M10 (the first month of Q4 in the author’s 12-month plan). Track progress on GitHub releases.
Is this safe to depend on?
Pre-alpha through v0.1. Public API may change between minor versions until 1.0; see the CHANGELOG for a record of all breaking changes.
How do I report a security issue?
Don’t open a public GitHub issue. Use the private vulnerability reporting flow on GitHub. See the project’s SECURITY.md for the disclosure policy.