GitHub Actions¶
The CLI is the wedge; CI integration is the moat. This page covers the v0.2 GitHub Action wrapper and the underlying patterns that work in any CI system.
The wrapper Action (v0.2)¶
```yaml
# .github/workflows/llm-regression.yml
name: llm-regression

on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/agent/**"

permissions:
  contents: read
  pull-requests: write # for posting the report as a PR comment

jobs:
  whatif:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: victoralfred/whatif-action@v0
        with:
          config: whatif.config.yaml
          change: system_prompt=prompts/v3.txt
          comment-on-pr: true
        env:
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
The Action is intentionally thin (~50 lines under the hood). It just invokes `whatif fork` and surfaces the verdict. Anything `whatif fork` does, the Action can do.
Generic CI pattern (v0.1)¶
If you’re not on GitHub or want full control, the same pattern works in any CI:
```yaml
# .gitlab-ci.yml example
whatif:
  image: python:3.12-slim
  before_script:
    - pip install "whatif[langfuse,inspect,anthropic]"
  script:
    - whatif fork
      --source langfuse
      --target "python:agent.replay:run"
      --failures "score-below:0.6,since:24h,limit:20"
      --baseline "score-above:0.8,since:24h,limit:20"
      --change "system_prompt=$PROMPT_FILE"
      --score "inspect_ai:faithfulness"
      --report whatif-report.md
      --json whatif-report.json
      --fail-on-regression
  artifacts:
    when: always
    paths:
      - whatif-report.md
      - whatif-report.json
```
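Downstream CI steps can consume the machine-readable report. A minimal sketch, assuming the JSON carries a top-level `verdict` field (the actual schema depends on your whatif version, so check its output first):

```python
# Sketch: read whatif's JSON report in a later CI step and surface the verdict.
# ASSUMPTION: a top-level "verdict" key; verify against your whatif version.
import json


def report_verdict(path: str) -> str:
    """Return the verdict string from a whatif JSON report."""
    with open(path) as f:
        report = json.load(f)
    return report.get("verdict", "unknown")
```

For example, a notification step could call `report_verdict("whatif-report.json")` on the artifact and route the result to Slack or a dashboard.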
Exit-code behavior¶
The CI integration relies entirely on exit codes:
| Exit code | CI behavior | Meaning |
|---|---|---|
| `0` | Job succeeds. PR check is green. | Verdict: Ship. |
| `1` | Job fails. PR check is red. | Verdict: Don't Ship. |
| `2` | Job fails. PR check is red, but with an "Inconclusive" annotation. | Setup / replay / scoring failure. |
In Path Z, exit code 1 is what blocks the merge.
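In a custom wrapper script, the same mapping can be expressed directly. A minimal sketch in Python; `verdict_for` is a hypothetical helper for illustration, not part of whatif:

```python
# Hypothetical helper mapping whatif's exit code to the verdicts above.
import subprocess
import sys

VERDICTS = {0: "ship", 1: "dont-ship"}


def verdict_for(code: int) -> str:
    """Translate a whatif exit code into a verdict label."""
    return VERDICTS.get(code, "inconclusive")


# In a CI wrapper script (whatif invocation elided):
#   code = subprocess.run(["whatif", "fork", ...]).returncode
#   print(f"whatif verdict: {verdict_for(code)}")
#   sys.exit(code)  # preserve the exit code so CI still fails on 1 or 2
```

The key design point is the final `sys.exit(code)`: the wrapper may annotate or log, but it must pass the original exit code through so the merge-blocking behavior is preserved.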
Posting the report as a PR comment¶
The Action’s `comment-on-pr: true` does this automatically. For generic CI, post the Markdown report yourself:
```bash
gh pr comment "$PR_NUMBER" --body-file whatif-report.md
```
Or via the GitHub API directly:
```bash
curl -X POST \
  -H "Authorization: Bearer $GH_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/$REPO/issues/$PR_NUMBER/comments" \
  -d "$(jq -Rs '{body: .}' < whatif-report.md)"
```
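If your CI image has neither `gh` nor `jq`, the same call can be made from the Python standard library. A sketch; the repo, PR number, and token are placeholders you supply from your CI environment:

```python
# Sketch: post whatif-report.md as a PR comment via the GitHub REST API.
import json
from urllib import request


def comment_request(repo: str, pr_number: int, body_path: str, token: str) -> request.Request:
    """Build the POST request; send it with urllib.request.urlopen()."""
    with open(body_path) as f:
        payload = json.dumps({"body": f.read()}).encode()
    return request.Request(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


# e.g.: request.urlopen(comment_request("org/repo", 123, "whatif-report.md", token))
```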
Required permissions¶
| Permission | Why |
|---|---|
| `contents: read` | To check out the PR. |
| `pull-requests: write` | To post the verdict report as a PR comment. |
| `id-token: write` | Only if your scorer or runner uses OIDC (e.g., for cloud LLM auth). |
Secrets¶
| Secret | When needed |
|---|---|
| `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY` | For `--source langfuse`. |
| `ANTHROPIC_API_KEY` | For the runner’s LLM calls AND the Inspect AI judge. |
| Custom tracer creds | When using a `--source` other than Langfuse (v0.2+). |
Use GitHub Environments for production credentials and require a manual approval before the job runs against prod data:
```yaml
jobs:
  whatif:
    environment: production-traces
    # the job won't run until a reviewer approves access to that environment
```
What runs when (recommendations)¶
| Trigger | Cohort size | Cost | Use case |
|---|---|---|---|
| `pull_request` | 20 + 20 | ~$0.20 | Per-PR check on prompt / model changes. |
| `push` to `main` | 50 + 50 | ~$1 | Continuous regression catch on `main`. |
| Release tag | 100 + 100 | ~$3 | Release-gate: full sample before tagging. |
Cost estimates assume Claude Haiku 4.5 as the Inspect judge.
What this enables (Path Z, made concrete)¶
```text
✗ whatif / experiment-runner - Verdict: Don't Ship

Failures: 14/20 improved, but baseline regressed 6/20.
Median baseline Δ: -0.18 (CI: [-0.24, -0.12])
Top regression: trace t_492af ("agent now refuses requests it
  previously handled correctly")

Full report → /artifacts/whatif-report.md
```
That output, on a PR check, is the moment whatif becomes infrastructure. See Path Z for the trajectory.