Context-regression CI
Verified context is only trustworthy while it is re-measured. This guide wires the keys connect eval quality verdict into CI so a pull request that would
degrade your graph's quality fails (or warns) before merge, with the regression table posted as a
single, continuously updated PR comment. It also covers the scheduled cross-model efficacy run that keeps
the published verification claims continuously true as models change.
The contract: one command, four exit codes
Everything in this guide is a wrapper around one headless command. It evaluates a graph's quality report against the published G2 bar (≥ 90% supported, ≤ 2% unsupported) and, when given a baseline, diffs against the last accepted state:
keys connect eval \
  --baseline ci/connect-eval-baseline.json \
  --tolerance 1 \
  --output markdown

| Exit code | Meaning | CI behavior |
|---|---|---|
| 0 | Quality bar met, no regression beyond tolerance | Check passes |
| 1 | Graph misses the absolute G2 bar | Check fails (warn mode: reported, exit 0) |
| 2 | Config / usage error: the gate could not evaluate | Check fails; never downgraded by warn mode |
| 3 | Quality regressed vs the committed baseline | Check fails (warn mode: reported, exit 0) |
Two failure codes are deliberate: 1 means "this graph is below the
published bar, full stop"; 3 means "still above the bar, but worse than
the state your team last accepted". A baseline regression on a passing graph is the early warning you want
from CI.
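A wrapper script that captures the exit code can turn it into a readable label. The sketch below is a minimal illustration, not the action's actual implementation: the function name `verdict_for_exit` is hypothetical, and pairing each code with the verdict labels the action exposes (pass, quality_fail, config_error, regression, error) is an assumption inferred from the table above.

```shell
# Hypothetical helper: map a raw `keys connect eval` exit code to a verdict label.
# The code-to-label pairing is an assumption, not confirmed CLI behavior.
verdict_for_exit() {
  case "$1" in
    0) echo "pass" ;;
    1) echo "quality_fail" ;;
    2) echo "config_error" ;;
    3) echo "regression" ;;
    *) echo "error" ;;   # anything outside the four-code contract
  esac
}

# Example: label a code captured from an earlier step
# (hard-coded here so the sketch runs without the CLI installed).
raw_code=3
echo "verdict: $(verdict_for_exit "$raw_code")"
```

Treating any code outside 0 to 3 as `error` keeps a crashed CLI from being mistaken for a quality verdict.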
The composite action
packages/connect-eval-github-action wraps the command for GitHub Actions
and Forgejo. Remote mode evaluates the latest completed ingest run of a workspace through the
gateway-key-authed Connect v1 API:
permissions:
  contents: read
  pull-requests: write # sticky comment

- uses: ./packages/connect-eval-github-action
  with:
    gateway_key: ${{ secrets.RESTORMEL_GATEWAY_KEY }} # secret, never logged
    workspace: ${{ vars.RESTORMEL_WORKSPACE_ID }}
    project: ${{ vars.RESTORMEL_PROJECT_ID }} # optional
    baseline_path: ci/connect-eval-baseline.json
    tolerance: '1'
    github_token: ${{ secrets.GITHUB_TOKEN }}

Local counts mode (counts_path) evaluates a counts or quality-report JSON
produced by any pipeline — no network, no key — which is how this repository dogfoods the gate on every PR
to the Connect quality pipeline. The gateway key travels to the CLI via environment only, never argv, and
is never printed.
The action exposes four outputs for downstream steps:
- verdict: pass · quality_fail · regression · config_error · error
- exit_code: the raw CLI code (0/1/2/3) before any warn-mode downgrade
- regression: "true" when the baseline diff flagged a regression
- commented: "true" when the sticky comment was created or updated
Baseline lifecycle
The baseline is a committed JSON artifact written by keys connect eval --save-baseline <file>. It records the accepted
verdict and the source-set fingerprint of the corpus it was measured on. Three rules keep
it honest:
- Fingerprint supersession, not false alarms. When the corpus changes, the fingerprint changes and the diff reports baseline superseded — regression checks are skipped, never reported as failures. Re-save the baseline from the new corpus to re-arm the gate.
- Tolerance absorbs rounding. G2 percentages are integer-rounded, so the default 1-point tolerance absorbs jitter. Raise it deliberately rather than deleting the baseline.
- Re-saving is a review event. The baseline lives in git; accepting a lower bar is a visible diff in the PR, not a silent state change.
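To make the tolerance rule concrete, the sketch below shows one plausible reading of "regression beyond tolerance" on integer-rounded percentages. It is an assumption about the comparison, not the CLI's documented logic: the helper name `beyond_tolerance` is hypothetical, and "beyond" is taken to mean strictly more than the tolerance in the bad direction.

```shell
# Hypothetical helper: succeeds (status 0) when an integer percentage got worse
# by more than `tolerance` points.
#   better=higher -> a higher value is better (e.g. supported %)
#   better=lower  -> a lower value is better (e.g. unsupported %)
beyond_tolerance() {
  baseline=$1; current=$2; tolerance=$3; better=$4
  if [ "$better" = "higher" ]; then
    worse_by=$((baseline - current))
  else
    worse_by=$((current - baseline))
  fi
  [ "$worse_by" -gt "$tolerance" ]
}

# supported fell 94 -> 93 with tolerance 1: absorbed as rounding jitter
beyond_tolerance 94 93 1 higher && echo "regression" || echo "ok"
# supported fell 94 -> 92 with tolerance 1: flagged
beyond_tolerance 94 92 1 higher && echo "regression" || echo "ok"
```

Under this reading a one-point move never trips the default gate, which matches the stated purpose of the tolerance.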
Warn mode vs blocking
Start non-blocking: warn_only: 'true' reports quality failures and
regressions in the summary and the sticky comment but exits 0, so teams can tune tolerance and baselines
without red checks. Config errors (exit 2) still fail even in warn mode — a gate that cannot evaluate must
be loud, or it rots silently. Flip to blocking once the gate evaluates a live ingest (remote mode) and the
baseline has been re-saved from that run.
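The warn-mode rule can be sketched as a small function: codes 1 and 3 are downgraded to 0, while code 2 always passes through. This is an illustrative sketch only; `effective_exit` is a hypothetical name, and passing any other unexpected code through unchanged is an assumption.

```shell
# Hypothetical helper: compute the exit code a CI step should return.
#   $1 = raw CLI exit code, $2 = "true" when warn mode is enabled
effective_exit() {
  raw=$1; warn_only=$2
  if [ "$warn_only" = "true" ] && { [ "$raw" -eq 1 ] || [ "$raw" -eq 3 ]; }; then
    echo 0          # quality failure or regression: reported, not blocking
  else
    echo "$raw"     # pass, config error, or anything unexpected stays as-is
  fi
}

# Show the downgrade for each code in the contract
for code in 0 1 2 3; do
  echo "raw=$code warn_only=true -> $(effective_exit "$code" true)"
done
```

The loop prints 0, 0, 2, 0 for the four codes, so only the config error still fails the check in warn mode.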
Forgejo mirrors
The sticky comment uses the GitHub-compatible issues API at GITHUB_API_URL,
which Forgejo also serves, so the same action runs unchanged in .forgejo/workflows. One sharp edge: when a repository has a .forgejo/workflows directory, it overrides .github/workflows on the Forgejo side — a gate added only under .github never runs on the mirror. Ship both variants in the same PR and
keep them in sync.
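A small pre-merge check can catch a gate that exists on only one side. The sketch below is a hypothetical helper, not part of the action: `workflows_in_sync` and the example file name are made up, and requiring the two workflow files to be byte-identical is an assumption that fits only if your Forgejo variant needs no edits.

```shell
# Hypothetical helper: succeed when the named workflow is effectively the same
# on GitHub and on the Forgejo mirror.
#   $1 = repository root, $2 = workflow file name
workflows_in_sync() {
  root=$1; name=$2
  github="$root/.github/workflows/$name"
  forgejo="$root/.forgejo/workflows/$name"
  # No .forgejo/workflows directory: nothing overrides .github/workflows.
  [ -d "$root/.forgejo/workflows" ] || return 0
  # Otherwise both copies must exist and match byte for byte (assumption).
  [ -f "$github" ] && [ -f "$forgejo" ] && cmp -s "$github" "$forgejo"
}
```

A CI step could call it as `workflows_in_sync . connect-eval.yml || exit 1`, where `connect-eval.yml` stands in for whatever your workflow file is called.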
Scheduled claims-integrity run
Quality bars about verification itself ("the validator catches fabricated claims") cannot be
proven once and assumed forever — model and routing changes can silently invalidate them. A weekly
scheduled workflow re-runs the verifier-efficacy benchmark under cross-model routing (extraction and validation on different model families, keyed by the OPENAI_API_KEY and TOGETHER_API_KEY CI
secrets) and fails if any signed-off bar regresses:
- fabricated-claim recall ≥ 95%
- cross-model misattribution recall ≥ 90%
- supported false-flag rate ≤ 15%
- affirm-unseen 0% under cross-model routing (the fail-open probe)
Each run uploads a dated results snapshot and prints the bar table in the run summary; the claims ledger points at this workflow as the continuous evidence for its measured rows. A red run means the affected ledger rows — and any marketing copy citing them — are treated as broken until the bar recovers.
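The four bars above reduce to a simple threshold check. The sketch below is illustrative only: `check_bars` is a hypothetical function, the inputs are assumed to be integer percentages, and the sample numbers are made up rather than taken from any real benchmark run.

```shell
# Hypothetical helper: fail when any signed-off bar regresses.
#   $1 fabricated-claim recall, $2 cross-model misattribution recall,
#   $3 supported false-flag rate, $4 affirm-unseen rate (integer percentages)
check_bars() {
  failed=0
  [ "$1" -ge 95 ] || { echo "fabricated-claim recall $1% < 95%"; failed=1; }
  [ "$2" -ge 90 ] || { echo "cross-model misattribution recall $2% < 90%"; failed=1; }
  [ "$3" -le 15 ] || { echo "supported false-flag rate $3% > 15%"; failed=1; }
  [ "$4" -eq 0 ]  || { echo "affirm-unseen $4% != 0%"; failed=1; }
  return "$failed"
}

# Made-up sample values that satisfy every bar
check_bars 97 92 10 0 && echo "all bars hold"
```

Checking every bar before returning, rather than stopping at the first miss, lets the run summary list all broken rows at once.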
Reproduce locally
# evaluate the latest ingest run of your workspace
RESTORMEL_GATEWAY_KEY=… keys connect eval --workspace ws_… --output pretty
# save the accepted state as the CI baseline
keys connect eval --workspace ws_… --save-baseline ci/connect-eval-baseline.json
# what CI runs on every PR
keys connect eval --workspace ws_… \
  --baseline ci/connect-eval-baseline.json --tolerance 1 --output markdown
echo "exit: $?" # 0 pass · 1 bar miss · 2 config · 3 regression

Related reading: Verified context (what the verdict measures and how to audit it) and Connect first graph onboarding (producing the ingest runs the gate evaluates).