Why evals belong in CI
Agent behavior can regress just like any other code. A prompt template change, a new tool registration, a dependency update, or a model version bump can silently break an eval that passed yesterday. Running evals in CI catches these regressions before they reach production, in the same way unit tests catch logic bugs before they ship.
Exit codes
niceeval’s exit codes map directly to CI pass/fail:

| Condition | Exit code | CI result |
|---|---|---|
| All evals passed (or some missed only a soft threshold, without --strict) | 0 | ✅ Green |
| Any eval failed | Non-zero | ❌ Red |
| Any eval missed a soft threshold, with --strict | Non-zero | ❌ Red |

The distinction between a missed soft threshold and a failure matters: in the former, all hard gate assertions passed but a soft quality threshold (e.g. an LLM-as-judge score) fell below its target. Without --strict, this is yellow: it surfaces in the report but doesn’t break the build. With --strict, it turns red.
GitHub Actions example
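A minimal workflow sketch for running niceeval on pull requests and on pushes to the default branch. The file name, trigger branches, Node version, and npm ci install step are assumptions about a typical Node project; only the npx niceeval command and the secret names come from this page.

```yaml
# .github/workflows/evals.yml (hypothetical file name)
name: evals
on:
  pull_request:
  push:
    branches: [main]  # assumed default branch

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20  # assumed; use your project's Node version
      - run: npm ci  # assumes npm with a committed lockfile
      - name: Run evals
        # A non-zero exit code from niceeval fails this step, and with it the job.
        run: npx niceeval
        env:
          # Include only the keys your agent adapters actually read.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```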
Setting up secrets
niceeval itself doesn’t manage API keys; each agent adapter reads the keys it needs from environment variables. Add your keys as repository secrets in GitHub, then pass them as env in the workflow step:
Add secrets to your repository
Go to Settings → Secrets and variables → Actions and add the secrets your agents need. Common values:
- ANTHROPIC_API_KEY: for Claude Code and the claude-haiku-4-5 judge
- OPENAI_API_KEY: for Codex and GPT-based agents
Pass secrets as env in the workflow
Reference them under the env key of the step that runs npx niceeval.
JUnit reporter
Pass --junit <path> to write a JUnit XML report alongside the run. Most CI systems can ingest JUnit XML and display per-test results in their UI.
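As a sketch, the steps below write the report and keep it as a build artifact. The report file name and artifact name are assumptions chosen for illustration; the --junit flag is the one described above.

```yaml
steps:
  - name: Run evals with JUnit output
    run: npx niceeval --junit niceeval-junit.xml  # file name is an assumption
  - name: Upload JUnit report
    if: always()  # upload the report even when the eval step failed
    uses: actions/upload-artifact@v4
    with:
      name: niceeval-junit  # hypothetical artifact name
      path: niceeval-junit.xml
```

The if: always() condition matters here because a red eval run is exactly when the per-test report is most useful.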
You can also set the JUnit path in niceeval.config.ts so you never have to remember the flag.
Checking discovery without running
Use --dry to verify that your eval files are discovered correctly. It prints the full list of evals that would run, without executing any of them or making any API calls.
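One way to use this in CI is a cheap discovery check that runs before the real eval step. The step name is illustrative, and the sketch assumes the same checkout and install steps as any other niceeval job.

```yaml
steps:
  - name: Check eval discovery
    # --dry lists evals without running them or calling any API,
    # so this step should not need the API key secrets.
    run: npx niceeval --dry
```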
Caching .niceeval/ between runs
niceeval stores fingerprinted results in .niceeval/. On the next run, any eval whose fixture and config fingerprint is unchanged is skipped and its cached result is reused. Persisting .niceeval/ in CI makes repeated runs dramatically faster — only changed evals re-run.
In GitHub Actions, cache the directory using the file hash of your evals and config as the cache key:
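A sketch of such a cache step using actions/cache. The evals/** glob and the config location are assumptions about your repository layout; adjust both to match where your eval files actually live. Place the step before the one that runs npx niceeval.

```yaml
steps:
  - name: Cache niceeval results
    uses: actions/cache@v4
    with:
      path: .niceeval
      # Assumed layout: eval files under evals/, config at the repo root.
      key: niceeval-${{ runner.os }}-${{ hashFiles('evals/**', 'niceeval.config.ts') }}
      # Prefix match: fall back to the most recent cache for this OS.
      restore-keys: |
        niceeval-${{ runner.os }}-
```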
A restore-keys fallback lets a run start from the most recent partial cache even when the exact key doesn’t match. niceeval will then re-run only the evals that have actually changed.
Controlling concurrency in CI
CI runners have fixed CPU and memory budgets. Use--max-concurrency to keep sandbox evals from exhausting resources:
You can also set a default with maxConcurrency in niceeval.config.ts and override it per CI environment.
Recommended CI patterns
PR runs
Run evals on every pull request without --strict and with a moderate --max-concurrency. Focus on fast feedback: if the full suite is too slow for PRs, use --tag smoke to run only a subset of critical evals.
Nightly full matrix
Run the full eval suite nightly with --strict and --no-early-exit to collect complete pass-rate distributions. Use experiments to compare agents and models and track stability over time.
When to use --strict
| Scenario | Recommendation |
|---|---|
| PR gate — block on hard failures only | No --strict |
| Release branch — block on any quality regression | --strict |
| Nightly stability run — measure soft assertion trends | --strict, with results stored for trending |
| Local development iteration | No --strict |
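
The PR and nightly patterns recommended above could be combined in one workflow, as in the sketch below. The cron schedule, concurrency value, Node version, and install step are assumptions; the niceeval flags are the ones named on this page.

```yaml
name: evals
on:
  pull_request:
  schedule:
    - cron: "0 3 * * *"  # nightly at 03:00 UTC (illustrative time)

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20  # assumed
      - run: npm ci  # assumes npm with a committed lockfile
      - name: PR smoke run
        # No --strict: only hard failures block the pull request.
        if: github.event_name == 'pull_request'
        run: npx niceeval --tag smoke --max-concurrency 4
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Nightly full run
        # --strict turns soft-threshold misses red; --no-early-exit runs everything.
        if: github.event_name == 'schedule'
        run: npx niceeval --strict --no-early-exit
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Keeping both patterns in one file shares the setup steps; splitting them into two workflow files works equally well if you prefer separate status checks.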