GitHub Actions

The CLI is the wedge; CI integration is the moat. This page covers the v0.2 GitHub Action wrapper and the underlying patterns that work in any CI system.

The wrapper Action (v0.2)

# .github/workflows/llm-regression.yml
name: llm-regression
on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/agent/**"

permissions:
  contents: read
  pull-requests: write   # for posting the report as a PR comment

jobs:
  whatif:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: victoralfred/whatif-action@v0
        with:
          config: whatif.config.yaml
          change: system_prompt=prompts/v3.txt
          comment-on-pr: true
        env:
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The Action is intentionally thin (~50 lines under the hood). It just invokes whatif fork and surfaces the verdict. Anything whatif fork does, the Action can do.

Generic CI pattern (v0.1)

If you’re not on GitHub or want full control, the same pattern works in any CI:

# .gitlab-ci.yml example
whatif:
  image: python:3.12-slim
  before_script:
    - pip install "whatif[langfuse,inspect,anthropic]"
  script:
    - whatif fork
        --source langfuse
        --target "python:agent.replay:run"
        --failures "score-below:0.6,since:24h,limit:20"
        --baseline "score-above:0.8,since:24h,limit:20"
        --change "system_prompt=$PROMPT_FILE"
        --score "inspect_ai:faithfulness"
        --report whatif-report.md
        --json whatif-report.json
        --fail-on-regression
  artifacts:
    when: always
    paths:
      - whatif-report.md
      - whatif-report.json
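
Where a script step is easier to maintain than an inline shell command, the same invocation can be assembled in Python. This is a minimal sketch, not part of whatif: `build_fork_argv` and `run_fork` are hypothetical helpers, and they use only the flags shown in the GitLab example above.

```python
import subprocess


def build_fork_argv(prompt_file: str) -> list[str]:
    """Assemble the `whatif fork` command from the GitLab example as an argv list."""
    return [
        "whatif", "fork",
        "--source", "langfuse",
        "--target", "python:agent.replay:run",
        "--failures", "score-below:0.6,since:24h,limit:20",
        "--baseline", "score-above:0.8,since:24h,limit:20",
        "--change", f"system_prompt={prompt_file}",
        "--score", "inspect_ai:faithfulness",
        "--report", "whatif-report.md",
        "--json", "whatif-report.json",
        "--fail-on-regression",
    ]


def run_fork(prompt_file: str) -> int:
    """Run `whatif fork` and return its exit code unchanged."""
    # No shell is involved, so the argument values need no extra quoting.
    completed = subprocess.run(build_fork_argv(prompt_file), check=False)
    # Returning the code as-is lets the CI job status reflect the verdict.
    return completed.returncode
```

A CI script step could then end with `sys.exit(run_fork(os.environ["PROMPT_FILE"]))` so the job inherits whatif's exit code.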

Exit-code behavior

The CI integration relies entirely on exit codes:

| Exit code | CI behavior | Meaning |
| --- | --- | --- |
| 0 | Job succeeds. PR check is green. | Verdict: Ship. |
| 1 | Job fails. PR check is red. | Verdict: Don’t Ship. |
| 2 | Job fails. PR check is red, but with an “Inconclusive” annotation. | Setup / replay / scoring failure. |

In Path Z, exit code 1 is what blocks the merge.
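
For CI systems that need custom handling (for example, a different annotation for an inconclusive run), the exit-code contract above can be encoded as a small lookup. This is a sketch under assumptions: `Outcome` and `interpret_exit_code` are hypothetical names, and treating undocumented codes as failures is a choice made here, not documented whatif behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    job_passes: bool  # whether the CI job should be marked green
    label: str        # short text for the PR check annotation


# Mapping taken from the exit-code table above.
OUTCOMES = {
    0: Outcome(job_passes=True, label="Ship"),
    1: Outcome(job_passes=False, label="Don't Ship"),
    2: Outcome(job_passes=False, label="Inconclusive"),
}


def interpret_exit_code(code: int) -> Outcome:
    """Translate a `whatif fork` exit code into a CI outcome."""
    # Assumption: any code outside the documented 0/1/2 is treated as a failure.
    fallback = Outcome(job_passes=False, label=f"Unexpected exit code {code}")
    return OUTCOMES.get(code, fallback)
```

Keeping the mapping in one place means a wrapper script only has to branch on `job_passes` and print `label`.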

Posting the report as a PR comment

The Action’s comment-on-pr: true does this automatically. For generic CI, post the Markdown report yourself:

gh pr comment "$PR_NUMBER" --body-file whatif-report.md

Or via the GitHub API directly:

curl -X POST \
  -H "Authorization: Bearer $GH_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/$REPO/issues/$PR_NUMBER/comments \
  -d "$(jq -Rs '{body: .}' < whatif-report.md)"
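
If `jq` or `curl` is not available in the CI image, the same request can be built with the Python standard library. This is a minimal sketch: `build_comment_request` and `post_report` are hypothetical helpers, and the repository, PR number, and token are assumed to come from the CI environment as in the curl example.

```python
import json
import urllib.request


def build_comment_request(
    repo: str, pr_number: int, token: str, report_markdown: str
) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    # json.dumps escapes the Markdown safely, like `jq -Rs '{body: .}'`.
    payload = json.dumps({"body": report_markdown}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def post_report(
    repo: str, pr_number: int, token: str, report_path: str = "whatif-report.md"
) -> int:
    """Post the Markdown report as a PR comment and return the HTTP status."""
    with open(report_path, encoding="utf-8") as fh:
        request = build_comment_request(repo, pr_number, token, fh.read())
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status
```

Splitting request construction from sending keeps the payload logic checkable without network access.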

Required permissions

| Permission | Why |
| --- | --- |
| contents: read | To check out the PR. |
| pull-requests: write | To post the verdict report as a PR comment. |
| id-token: write | Only if your scorer or runner uses OIDC (e.g., for cloud LLM auth). |

Secrets

| Secret | When needed |
| --- | --- |
| LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY | For `--source langfuse`. |
| ANTHROPIC_API_KEY (or other LLM API key) | For the runner’s LLM calls and the Inspect AI judge. |
| Custom tracer creds | When using a `--source` other than Langfuse (v0.2+). |

Use GitHub Environments for production credentials and require a manual approval before the job runs against prod data:

jobs:
  whatif:
    environment: production-traces
    # the job won't run until a reviewer approves access to that environment

What runs when (recommendations)

| Trigger | Cohort size | Cost | Use case |
| --- | --- | --- | --- |
| pull_request (paths-filtered) | 20 + 20 | ~$0.20 | Per-PR check on prompt / model changes. |
| schedule (nightly) | 50 + 50 | ~$1 | Continuous regression catch on main. |
| release (manual / pre-tag) | 100 + 100 | ~$3 | Release-gate: full sample before tagging. |

Cost estimates assume Claude Haiku 4.5 as the Inspect judge.
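
To compare the tiers, it can help to divide each estimate by the total number of replayed traces. The sketch below does that arithmetic on the figures above; reading "20 + 20" as failure cohort plus baseline cohort is an assumption based on the `--failures` and `--baseline` limits in the generic CI example, and `cost_per_trace` is a hypothetical helper.

```python
# (trigger, failure cohort, baseline cohort, estimated cost in USD),
# copied from the recommendations above.
RECOMMENDATIONS = [
    ("pull_request", 20, 20, 0.20),
    ("schedule", 50, 50, 1.00),
    ("release", 100, 100, 3.00),
]


def cost_per_trace(failures: int, baseline: int, cost_usd: float) -> float:
    """Estimated cost divided by the total number of replayed traces."""
    return cost_usd / (failures + baseline)


for trigger, failures, baseline, cost in RECOMMENDATIONS:
    total = failures + baseline
    per_trace = cost_per_trace(failures, baseline, cost)
    print(f"{trigger}: {total} traces, ~${per_trace:.3f} per trace")
```

The implied per-trace figure differs from row to row, so the estimates are best read as rough budgets rather than a fixed rate.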

What this enables (Path Z, made concrete)

✗ whatif / experiment-runner - Verdict: Don't Ship

  Failures: 14/20 improved, but baseline regressed 6/20.
  Median baseline Δ: -0.18  (CI: [-0.24, -0.12])

  Top regression: trace t_492af  ("agent now refuses requests it
                                   previously handled correctly")

  Full report → /artifacts/whatif-report.md

That output, on a PR check, is the moment whatif becomes infrastructure. See Path Z for the trajectory.