GitHub Actions¶
The CLI is the wedge; CI integration is the moat. This page covers the v0.2 GitHub Action wrapper and the underlying patterns that work in any CI system.
The wrapper Action (v0.2)¶
```yaml
# .github/workflows/llm-regression.yml
name: llm-regression

on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/agent/**"

permissions:
  contents: read
  pull-requests: write # for posting the report as a PR comment

jobs:
  whatif:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: victoralfred/whatif-action@v0
        with:
          config: whatif.config.yaml
          change: system_prompt=prompts/v3.txt
          comment-on-pr: true
        env:
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
The Action is intentionally thin (~50 lines under the hood). It just invokes `whatif fork` and surfaces the verdict. Anything `whatif fork` does, the Action can do.
Generic CI pattern (v0.1)¶
If you’re not on GitHub or want full control, the same pattern works in any CI:
```yaml
# .gitlab-ci.yml example
whatif:
  image: python:3.12-slim
  before_script:
    - pip install "whatif[langfuse,inspect,anthropic]"
  script:
    - whatif fork
      --source langfuse
      --target "python:agent.replay:run"
      --failures "score-below:0.6,since:24h,limit:20"
      --baseline "score-above:0.8,since:24h,limit:20"
      --change "system_prompt=$PROMPT_FILE"
      --score "inspect_ai:faithfulness"
      --report whatif-report.md
      --json whatif-report.json
      --fail-on-regression
  artifacts:
    when: always
    paths:
      - whatif-report.md
      - whatif-report.json
```
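Downstream CI steps can consume the machine-readable report. A minimal sketch, assuming the JSON carries a top-level `verdict` field (the actual schema depends on your whatif version, so check its output first):

```python
# Sketch: read whatif's JSON report in a later CI step and surface the verdict.
# ASSUMPTION: a top-level "verdict" key; verify against your whatif version.
import json


def report_verdict(path: str) -> str:
    """Return the verdict string from a whatif JSON report."""
    with open(path) as f:
        report = json.load(f)
    return report.get("verdict", "unknown")
```

For example, a notification step could call `report_verdict("whatif-report.json")` on the artifact and route the result to Slack or a dashboard.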
Exit-code behavior¶
The CI integration relies entirely on exit codes:
| Exit code | CI behavior | Meaning |
|---|---|---|
| `0` | Job succeeds. PR check is green. | Verdict: Ship. |
| `1` | Job fails. PR check is red. | Verdict: Don't Ship. |
| `2` | Job fails. PR check is red, but with an "Inconclusive" annotation. | Setup / replay / scoring failure. |
In Path Z, exit code 1 is what blocks the merge.
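In a custom wrapper script, the same mapping can be expressed directly. A minimal sketch in Python; `verdict_for` is a hypothetical helper for illustration, not part of whatif:

```python
# Hypothetical helper mapping whatif's exit code to the verdicts above.
import subprocess
import sys

VERDICTS = {0: "ship", 1: "dont-ship"}


def verdict_for(code: int) -> str:
    """Translate a whatif exit code into a verdict label."""
    return VERDICTS.get(code, "inconclusive")


# In a CI wrapper script (whatif invocation elided):
#   code = subprocess.run(["whatif", "fork", ...]).returncode
#   print(f"whatif verdict: {verdict_for(code)}")
#   sys.exit(code)  # preserve the exit code so CI still fails on 1 or 2
```

The key design point is the final `sys.exit(code)`: the wrapper may annotate or log, but it must pass the original exit code through so the merge-blocking behavior is preserved.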
Posting the report as a PR comment¶
The Action’s `comment-on-pr: true` does this automatically. For generic CI, post the Markdown report yourself:
```bash
gh pr comment "$PR_NUMBER" --body-file whatif-report.md
```
Or via the GitHub API directly:
```bash
curl -X POST \
  -H "Authorization: Bearer $GH_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/$REPO/issues/$PR_NUMBER/comments" \
  -d "$(jq -Rs '{body: .}' < whatif-report.md)"
```
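If your CI image has neither `gh` nor `jq`, the same call can be made from the Python standard library. A sketch; the repo, PR number, and token are placeholders you supply from your CI environment:

```python
# Sketch: post whatif-report.md as a PR comment via the GitHub REST API.
import json
from urllib import request


def comment_request(repo: str, pr_number: int, body_path: str, token: str) -> request.Request:
    """Build the POST request; send it with urllib.request.urlopen()."""
    with open(body_path) as f:
        payload = json.dumps({"body": f.read()}).encode()
    return request.Request(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


# e.g.: request.urlopen(comment_request("org/repo", 123, "whatif-report.md", token))
```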
Required permissions¶
| Permission | Why |
|---|---|
| `contents: read` | To check out the PR. |
| `pull-requests: write` | To post the verdict report as a PR comment. |
| `id-token: write` | Only if your scorer or runner uses OIDC (e.g., for cloud LLM auth). |
Secrets¶
| Secret | When needed |
|---|---|
| `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY` | For `--source langfuse`. |
| `ANTHROPIC_API_KEY` | For the runner’s LLM calls AND the Inspect AI judge. |
| Custom tracer creds | When using a `--source` other than Langfuse (v0.2+). |
Use GitHub Environments for production credentials and require a manual approval before the job runs against prod data:
```yaml
jobs:
  whatif:
    environment: production-traces
    # the job won't run until a reviewer approves access to that environment
```
What runs when (recommendations)¶
| Trigger | Cohort size | Cost | Use case |
|---|---|---|---|
| `pull_request` | 20 + 20 | ~$0.20 | Per-PR check on prompt / model changes. |
| `push` to `main` | 50 + 50 | ~$1 | Continuous regression catch on `main`. |
| Release tag | 100 + 100 | ~$3 | Release-gate: full sample before tagging. |
Cost estimates assume Claude Haiku 4.5 as the Inspect judge.
What this enables (Path Z, made concrete)¶
```text
✗ whatif / experiment-runner - Verdict: Don't Ship

Failures: 14/20 improved, but baseline regressed 6/20.
Median baseline Δ: -0.18 (CI: [-0.24, -0.12])
Top regression: trace t_492af ("agent now refuses requests it
  previously handled correctly")

Full report → /artifacts/whatif-report.md
```
That output, on a PR check, is the moment whatif becomes infrastructure. See Path Z for the trajectory.