Context-regression CI

Verified context is only trustworthy while it is re-measured. This guide wires the keys connect eval quality verdict into CI so a pull request that would degrade your graph's quality fails (or warns) before merge, with the regression table posted as a single, continuously updated PR comment. It also covers the scheduled cross-model efficacy run that keeps the published verification claims continuously true as models change.

One comment, ever. The CI action finds its previous PR comment by an invisible marker and edits it in place. Pushing ten times to a PR produces one up-to-date regression table, not ten comments.
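The find-then-edit behavior can be sketched in shell as below. This is an illustration under stated assumptions, not the action's actual code: the marker string, the helper names `find_sticky_id` and `upsert_request`, and the tab-separated input format are all hypothetical. The two endpoints are the standard GitHub issue-comment routes.

```shell
# Hypothetical invisible marker; the real action's marker string is not documented here.
MARKER='<!-- connect-eval-sticky -->'
TAB="$(printf '\t')"

# Print the id of the first comment whose body contains the marker.
# Input on stdin: one comment per line as "<id><TAB><body>", a simplified
# stand-in for the JSON list the comments API would return.
find_sticky_id() {
  while IFS="$TAB" read -r id body; do
    case "$body" in
      *"$MARKER"*) echo "$id"; return 0 ;;
    esac
  done
  return 1
}

# Print the HTTP method and URL to use: edit in place when a previous
# comment exists, create a new one otherwise.
upsert_request() {
  local repo="$1" pr="$2" existing_id="$3"
  local api="${GITHUB_API_URL:-https://api.github.com}"
  if [ -n "$existing_id" ]; then
    echo "PATCH $api/repos/$repo/issues/comments/$existing_id"
  else
    echo "POST $api/repos/$repo/issues/$pr/comments"
  fi
}
```

Because the lookup happens on every run, the second and later pushes to a PR always resolve to the PATCH branch, which is what keeps the PR at a single comment.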

The contract: one command, four exit codes

Everything in this guide is a wrapper around one headless command. It evaluates a graph's quality report against the published G2 bar (≥ 90% supported, ≤ 2% unsupported) and, when given a baseline, diffs against the last accepted state:

keys connect eval \
  --baseline ci/connect-eval-baseline.json \
  --tolerance 1 \
  --output markdown

Exit code | Meaning | CI behavior
0 | Quality bar met, no regression beyond tolerance | Check passes
1 | Graph misses the absolute G2 bar | Check fails (warn mode: reported, exit 0)
2 | Config / usage error: the gate could not evaluate | Check fails, never downgraded by warn mode
3 | Quality regressed vs the committed baseline | Check fails (warn mode: reported, exit 0)

Two quality-failure codes are deliberate: 1 means "this graph is below the published bar, full stop"; 3 means "still above the bar, but worse than the state your team last accepted". Code 2 is separate: it signals that the gate could not evaluate at all. A baseline regression on a passing graph is the early warning you want from CI.
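A script that consumes the contract only needs to capture the exit code and branch on it. The wrapper below is a minimal sketch; `run_gate` is a hypothetical helper, not part of the keys CLI. It runs whatever command it is given, writes a label to stderr so the command's own output stays clean, and returns the code unchanged.

```shell
# Hypothetical wrapper: label the exit code of the wrapped command.
run_gate() {
  local code=0
  "$@" || code=$?          # capture the code without tripping `set -e`
  case "$code" in
    0) echo "gate: pass" >&2 ;;
    1) echo "gate: bar miss" >&2 ;;
    2) echo "gate: config error" >&2 ;;
    3) echo "gate: regression" >&2 ;;
    *) echo "gate: unexpected exit code $code" >&2 ;;
  esac
  return "$code"
}
```

Usage would look like `run_gate keys connect eval --baseline ci/connect-eval-baseline.json --tolerance 1 --output markdown`, with the markdown report on stdout and the label on stderr.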

The composite action

packages/connect-eval-github-action wraps the command for GitHub Actions and Forgejo. Remote mode evaluates the latest completed ingest run of a workspace through the Connect v1 API, authenticated with a gateway key:

# workflow- or job-level permissions
permissions:
  contents: read
  pull-requests: write   # needed to create and update the sticky comment

# one entry in the job's steps list
- uses: ./packages/connect-eval-github-action
  with:
    gateway_key: ${{ secrets.RESTORMEL_GATEWAY_KEY }}   # secret, never logged
    workspace: ${{ vars.RESTORMEL_WORKSPACE_ID }}
    project: ${{ vars.RESTORMEL_PROJECT_ID }}            # optional
    baseline_path: ci/connect-eval-baseline.json
    tolerance: '1'
    github_token: ${{ secrets.GITHUB_TOKEN }}

Local counts mode (counts_path) evaluates a counts or quality-report JSON produced by any pipeline — no network, no key — which is how this repository dogfoods the gate on every PR to the Connect quality pipeline. The gateway key travels to the CLI via environment only, never argv, and is never printed.
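To make the absolute bar concrete, the G2 check on a set of counts can be sketched as follows. This is an illustration only: `g2_verdict` is a hypothetical helper, the three positional counts stand in for whatever the counts JSON actually contains, and rounding half-up before comparing is an assumption based on the percentages being integer-rounded. Treating an empty total as `config_error` is also an assumption.

```shell
# g2_verdict <supported> <unsupported> <total>
# Hypothetical helper: applies the G2 bar (>= 90% supported, <= 2% unsupported)
# to integer-rounded percentages computed from claim counts.
g2_verdict() {
  local supported="$1" unsupported="$2" total="$3"
  local sup_pct unsup_pct
  if [ "$total" -le 0 ]; then
    echo "config_error"    # assumption: nothing to evaluate is not a quality verdict
    return 0
  fi
  # Integer arithmetic with half-up rounding.
  sup_pct=$(( (100 * supported + total / 2) / total ))
  unsup_pct=$(( (100 * unsupported + total / 2) / total ))
  if [ "$sup_pct" -ge 90 ] && [ "$unsup_pct" -le 2 ]; then
    echo "pass"
  else
    echo "quality_fail"
  fi
}
```

For example, `g2_verdict 93 1 100` prints `pass`, while `g2_verdict 95 3 100` prints `quality_fail` because the unsupported share exceeds 2%.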

The action exposes four outputs for downstream steps:

  • verdict: pass · quality_fail · regression · config_error · error
  • exit_code: the raw CLI code (0/1/2/3) before any warn-mode downgrade
  • regression: "true" when the baseline diff flagged a regression
  • commented: "true" when the sticky comment was created or updated

Baseline lifecycle

The baseline is a committed JSON artifact written by keys connect eval --save-baseline <file>. It records the accepted verdict and the source-set fingerprint of the corpus it was measured on. Three rules keep it honest:

  • Fingerprint supersession, not false alarms. When the corpus changes, the fingerprint changes and the diff reports baseline superseded — regression checks are skipped, never reported as failures. Re-save the baseline from the new corpus to re-arm the gate.
  • Tolerance absorbs rounding. G2 percentages are integer-rounded, so the default 1-point tolerance absorbs jitter. Raise it deliberately rather than deleting the baseline.
  • Re-saving is a review event. The baseline lives in git; accepting a lower bar is a visible diff in the PR, not a silent state change.
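The first two rules can be sketched as a small decision function. This is a simplified illustration, not the CLI's diff: `baseline_diff` is a hypothetical helper, it compares only the supported percentage, and it assumes a drop counts as a regression only when it exceeds the tolerance.

```shell
# baseline_diff <baseline_fp> <current_fp> <baseline_supported_pct> <current_supported_pct> [tolerance]
# Hypothetical helper covering fingerprint supersession and tolerance.
baseline_diff() {
  local base_fp="$1" cur_fp="$2" base_pct="$3" cur_pct="$4" tol="${5:-1}"
  # A changed corpus makes the baseline incomparable: skip, do not fail.
  if [ "$base_fp" != "$cur_fp" ]; then
    echo "superseded"
    return 0
  fi
  # Same corpus: only a drop larger than the tolerance is a regression.
  if [ $(( base_pct - cur_pct )) -gt "$tol" ]; then
    echo "regression"
  else
    echo "ok"
  fi
}
```

With the default tolerance of 1, a move from 95 to 94 on the same fingerprint prints `ok`, a move from 95 to 93 prints `regression`, and any comparison across different fingerprints prints `superseded`.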

Warn mode vs blocking

Start non-blocking: warn_only: 'true' reports quality failures and regressions in the summary and the sticky comment but exits 0, so teams can tune tolerance and baselines without red checks. Config errors (exit 2) still fail even in warn mode — a gate that cannot evaluate must be loud, or it rots silently. Flip to blocking once the gate evaluates a live ingest (remote mode) and the baseline has been re-saved from that run.
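The downgrade rule is small enough to state as code. The sketch below uses a hypothetical helper, `gate_status`, that maps the raw CLI exit code and the warn_only setting to the status the CI step would finish with; it is an illustration of the rule as described, not the action's source.

```shell
# gate_status <raw_exit_code> [warn_only]
# Hypothetical helper: prints the final step status after any warn-mode downgrade.
gate_status() {
  local code="$1" warn_only="${2:-false}"
  case "$code" in
    0)
      echo 0 ;;                      # bar met, nothing to downgrade
    1|3)
      # Bar miss or baseline regression: reported, but non-blocking in warn mode.
      if [ "$warn_only" = "true" ]; then echo 0; else echo "$code"; fi ;;
    *)
      echo "$code" ;;                # config/usage error: never downgraded
  esac
}
```

Note that the raw code is still worth keeping alongside the final status, since the exit_code output reports the value before the downgrade.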

Forgejo mirrors

The sticky comment uses the GitHub-compatible issues API at GITHUB_API_URL, which Forgejo also serves, so the same action runs unchanged in .forgejo/workflows. One sharp edge: when a repository has a .forgejo/workflows directory, it overrides .github/workflows on the Forgejo side — a gate added only under .github never runs on the mirror. Ship both variants in the same PR and keep them in sync.
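A cheap guard against the override trap is to check that every workflow under .github/workflows has a same-named counterpart under .forgejo/workflows whenever the latter directory exists. The helper below is a hypothetical sketch; it compares file names only and does not check that the contents are in sync.

```shell
# missing_on_forgejo [repo_root]
# Hypothetical helper: prints the names of workflow files that exist under
# .github/workflows but have no counterpart under .forgejo/workflows.
missing_on_forgejo() {
  local root="${1:-.}" f name
  # Without a .forgejo/workflows directory nothing overrides .github/workflows.
  [ -d "$root/.forgejo/workflows" ] || return 0
  for f in "$root"/.github/workflows/*.yml "$root"/.github/workflows/*.yaml; do
    [ -e "$f" ] || continue          # unmatched glob pattern
    name="$(basename "$f")"
    [ -e "$root/.forgejo/workflows/$name" ] || echo "$name"
  done
  return 0
}
```

Run from the repository root, empty output means every GitHub workflow has a Forgejo twin; any printed name is a gate that would not run on the mirror.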

Scheduled claims-integrity run

Quality bars about verification itself ("the validator catches fabricated claims") cannot be proven once and assumed forever — model and routing changes can silently invalidate them. A weekly scheduled workflow re-runs the verifier-efficacy benchmark under cross-model routing (extraction and validation on different model families, keyed by the OPENAI_API_KEY and TOGETHER_API_KEY CI secrets) and fails if any signed-off bar regresses:

  • fabricated-claim recall ≥ 95%
  • cross-model misattribution recall ≥ 90%
  • supported false-flag rate ≤ 15%
  • affirm-unseen 0% under cross-model routing (the fail-open probe)

Each run uploads a dated results snapshot and prints the bar table in the run summary; the claims ledger points at this workflow as the continuous evidence for its measured rows. A red run means the affected ledger rows — and any marketing copy citing them — are treated as broken until the bar recovers.
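The pass/fail decision over the four bars can be sketched as below. `check_bars` is a hypothetical helper, and it assumes the benchmark results have already been reduced to whole-number percentages; the real workflow's result format is not described here.

```shell
# check_bars <fabricated_recall> <misattribution_recall> <false_flag_rate> <affirm_unseen>
# Hypothetical helper: all arguments are integer percentages.
# Prints one line per regressed bar and returns non-zero if any bar is missed.
check_bars() {
  local fab="$1" misattr="$2" false_flag="$3" affirm_unseen="$4" failed=0
  [ "$fab" -ge 95 ] || { echo "FAIL fabricated-claim recall: ${fab}% < 95%"; failed=1; }
  [ "$misattr" -ge 90 ] || { echo "FAIL cross-model misattribution recall: ${misattr}% < 90%"; failed=1; }
  [ "$false_flag" -le 15 ] || { echo "FAIL supported false-flag rate: ${false_flag}% > 15%"; failed=1; }
  [ "$affirm_unseen" -eq 0 ] || { echo "FAIL affirm-unseen: ${affirm_unseen}% != 0%"; failed=1; }
  return "$failed"
}
```

Reporting every missed bar instead of stopping at the first one keeps the run summary useful when a model change degrades several rows at once.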

Reproduce locally

# evaluate the latest ingest run of your workspace
# (export the key once so the later commands in this session can read it too)
export RESTORMEL_GATEWAY_KEY=…
keys connect eval --workspace ws_… --output pretty

# save the accepted state as the CI baseline
keys connect eval --workspace ws_… --save-baseline ci/connect-eval-baseline.json

# what CI runs on every PR
keys connect eval --workspace ws_… \
  --baseline ci/connect-eval-baseline.json --tolerance 1 --output markdown
echo "exit: $?"   # 0 pass · 1 bar miss · 2 config · 3 regression

Related reading: Verified context (what the verdict measures and how to audit it) and Connect first graph onboarding (producing the ingest runs the gate evaluates).