Skip to main content
Evals are most valuable when they run on every change, not just locally before a deploy. Integrating niceeval into CI gives you an automatic regression gate: the pipeline turns red if any agent behavior regresses, the JUnit XML report surfaces exactly which eval failed in your test dashboard, and fingerprint-based caching keeps repeat runs fast by skipping evals whose fixtures and config haven’t changed.

Why evals belong in CI

Agent behavior can regress just like any other code. A prompt template change, a new tool registration, a dependency update, or a model version bump can silently break an eval that passed yesterday. Running evals in CI catches these regressions before they reach production — in the same way unit tests catch logic bugs before they ship.

Exit codes

niceeval’s exit codes map directly to CI pass/fail:
ConditionExit codeCI result
All evals passed (or passed without --strict)0✅ Green
Any eval failedNon-zero❌ Red
Any eval passed with --strictNon-zero❌ Red
The distinction between passed and failed matters: a passed outcome means all hard gate assertions passed but a soft quality threshold (e.g. an LLM-as-judge score) fell below its target. Without --strict, this is yellow — it surfaces in the report but doesn’t break the build. With --strict, it turns red.
Use --strict on nightly runs or release branches where quality regressions matter. Leave it off on PR runs so soft-assertion noise doesn’t block routine development.

GitHub Actions example

# .github/workflows/evals.yml
name: Evals

on:
  push:
    branches: [main]
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install dependencies
        run: npm ci

      - name: Restore eval cache
        uses: actions/cache@v4
        with:
          path: .niceeval
          key: niceeval-${{ hashFiles('evals/**', 'niceeval.config.ts') }}
          restore-keys: niceeval-

      - name: Run evals
        run: npx niceeval exp ci --strict --junit .niceeval/junit.xml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Publish test results
        uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: .niceeval/junit.xml

Setting up secrets

niceeval itself doesn’t manage API keys — each agent adapter reads the keys it needs from environment variables. Add your keys as repository secrets in GitHub, then pass them as env in the workflow step:
1

Add secrets to your repository

Go to Settings → Secrets and variables → Actions and add the secrets your agents need. Common values:
  • ANTHROPIC_API_KEY — for Claude Code and claude-haiku-4-5 judge
  • OPENAI_API_KEY — for Codex and GPT-based agents
2

Pass secrets as env in the workflow

Reference them under the env key of the step that runs npx niceeval:
- name: Run evals
  run: npx niceeval exp ci --strict
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
3

Verify secrets are available

Use --dry to confirm discovery runs cleanly without hitting the API before burning tokens on a misconfigured run.

JUnit reporter

Pass --junit <path> to write a JUnit XML report alongside the run. Most CI systems can ingest JUnit XML and display per-test results in their UI:
npx niceeval exp ci --junit .niceeval/junit.xml
You can also configure the reporter permanently in niceeval.config.ts so you never have to remember the flag:
// niceeval.config.ts
import { defineConfig } from "niceeval";
import { Console, JUnit } from "niceeval/reporters";

export default defineConfig({
  reporters: [Console(), JUnit(".niceeval/junit.xml")],
  judge: { model: "anthropic/claude-haiku-4-5" },
  maxConcurrency: 4,
});

Checking discovery without running

Use --dry to verify that your eval files are discovered correctly — it prints the full list of evals that would run, without executing any of them or making any API calls:
npx niceeval exp ci --dry
npx niceeval exp ci memory/ --dry   # check a subset
This is useful for debugging filter expressions and confirming that new evals appear in CI before you burn tokens on a full run.

Caching .niceeval/ between runs

niceeval stores fingerprinted results in .niceeval/. On the next run, any eval whose fixture and config fingerprint is unchanged is skipped and its cached result is reused. Persisting .niceeval/ in CI makes repeated runs dramatically faster — only changed evals re-run. In GitHub Actions, cache the directory using the file hash of your evals and config as the cache key:
- name: Restore eval cache
  uses: actions/cache@v4
  with:
    path: .niceeval
    key: niceeval-${{ hashFiles('evals/**', 'niceeval.config.ts') }}
    restore-keys: niceeval-
The restore-keys fallback lets a run start from the most recent partial cache even when the exact key doesn’t match. niceeval will re-run only the evals that have actually changed.

Controlling concurrency in CI

CI runners have fixed CPU and memory budgets. Use --max-concurrency to keep sandbox evals from exhausting resources:
# Conservative setting for a standard 2-CPU GitHub-hosted runner
npx niceeval exp ci --max-concurrency 2

# Higher value for large runners or cloud sandbox backends
npx niceeval exp ci --max-concurrency 8
You can also set maxConcurrency in niceeval.config.ts and override it per CI environment.

PR runs

Run evals on every pull request without --strict and with a moderate --max-concurrency. Focus on fast feedback: use --tag smoke to run only a subset of critical evals if the full suite is too slow for PRs.
run: npx niceeval exp ci --tag smoke --max-concurrency 4

Nightly full matrix

Run the full eval suite nightly with --strict and --no-early-exit to collect complete pass-rate distributions. Use experiments to compare agents and models and track stability over time.
run: npx niceeval exp compare --strict --no-early-exit

When to use --strict

ScenarioRecommendation
PR gate — block on hard failures onlyNo --strict
Release branch — block on any quality regression--strict
Nightly stability run — measure soft assertion trends--strict, with results stored for trending
Local development iterationNo --strict