Run niceeval evals in GitHub Actions and CI pipelines

Evals are most valuable when they run on every change, not just locally before a deploy. Integrating niceeval into CI gives you an automatic regression gate: the pipeline turns red if any agent behavior regresses, the JUnit XML report surfaces exactly which eval failed in your test dashboard, and fingerprint-based caching keeps repeat runs fast by skipping evals whose fixtures and config haven’t changed.

Why evals belong in CI

Agent behavior can regress just like any other code. A prompt template change, a new tool registration, a dependency update, or a model version bump can silently break an eval that passed yesterday. Running evals in CI catches these regressions before they reach production — in the same way unit tests catch logic bugs before they ship.

Exit codes

niceeval’s exit codes map directly to CI pass/fail:

Condition	Exit code	CI result
All evals `passed` (or `passed` without `--strict`)	`0`	✅ Green
Any eval `failed`	Non-zero	❌ Red
Any eval `passed` with `--strict`	Non-zero	❌ Red

The distinction between passed and failed matters: a passed outcome means all hard gate assertions passed but a soft quality threshold (e.g. an LLM-as-judge score) fell below its target. Without --strict, this is yellow — it surfaces in the report but doesn’t break the build. With --strict, it turns red.

Use --strict on nightly runs or release branches where quality regressions matter. Leave it off on PR runs so soft-assertion noise doesn’t block routine development.

GitHub Actions example

# .github/workflows/evals.yml
name: Evals

on:
  push:
    branches: [main]
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install dependencies
        run: npm ci

      - name: Restore eval cache
        uses: actions/cache@v4
        with:
          path: .niceeval
          key: niceeval-${{ hashFiles('evals/**', 'niceeval.config.ts') }}
          restore-keys: niceeval-

      - name: Run evals
        run: npx niceeval exp ci --strict --junit .niceeval/junit.xml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Publish test results
        uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: .niceeval/junit.xml

Setting up secrets

niceeval itself doesn’t manage API keys — each agent adapter reads the keys it needs from environment variables. Add your keys as repository secrets in GitHub, then pass them as env in the workflow step:

Add secrets to your repository

Go to Settings → Secrets and variables → Actions and add the secrets your agents need. Common values:

ANTHROPIC_API_KEY — for Claude Code and claude-haiku-4-5 judge
OPENAI_API_KEY — for Codex and GPT-based agents

Pass secrets as env in the workflow

Reference them under the env key of the step that runs npx niceeval:

- name: Run evals
  run: npx niceeval exp ci --strict
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Verify secrets are available

Use --dry to confirm discovery runs cleanly without hitting the API before burning tokens on a misconfigured run.

JUnit reporter

Pass --junit <path> to write a JUnit XML report alongside the run. Most CI systems can ingest JUnit XML and display per-test results in their UI:

npx niceeval exp ci --junit .niceeval/junit.xml

You can also configure the reporter permanently in niceeval.config.ts so you never have to remember the flag:

// niceeval.config.ts
import { defineConfig } from "niceeval";
import { Console, JUnit } from "niceeval/reporters";

export default defineConfig({
  reporters: [Console(), JUnit(".niceeval/junit.xml")],
  judge: { model: "anthropic/claude-haiku-4-5" },
  maxConcurrency: 4,
});

Checking discovery without running

Use --dry to verify that your eval files are discovered correctly — it prints the full list of evals that would run, without executing any of them or making any API calls:

npx niceeval exp ci --dry
npx niceeval exp ci memory/ --dry   # check a subset

This is useful for debugging filter expressions and confirming that new evals appear in CI before you burn tokens on a full run.

Caching `.niceeval/` between runs

niceeval stores fingerprinted results in .niceeval/. On the next run, any eval whose fixture and config fingerprint is unchanged is skipped and its cached result is reused. Persisting .niceeval/ in CI makes repeated runs dramatically faster — only changed evals re-run. In GitHub Actions, cache the directory using the file hash of your evals and config as the cache key:

- name: Restore eval cache
  uses: actions/cache@v4
  with:
    path: .niceeval
    key: niceeval-${{ hashFiles('evals/**', 'niceeval.config.ts') }}
    restore-keys: niceeval-

The restore-keys fallback lets a run start from the most recent partial cache even when the exact key doesn’t match. niceeval will re-run only the evals that have actually changed.

Controlling concurrency in CI

CI runners have fixed CPU and memory budgets. Use --max-concurrency to keep sandbox evals from exhausting resources:

# Conservative setting for a standard 2-CPU GitHub-hosted runner
npx niceeval exp ci --max-concurrency 2

# Higher value for large runners or cloud sandbox backends
npx niceeval exp ci --max-concurrency 8

You can also set maxConcurrency in niceeval.config.ts and override it per CI environment.

Recommended CI patterns

PR runs

Run evals on every pull request without --strict and with a moderate --max-concurrency. Focus on fast feedback: use --tag smoke to run only a subset of critical evals if the full suite is too slow for PRs.

run: npx niceeval exp ci --tag smoke --max-concurrency 4

Nightly full matrix

Run the full eval suite nightly with --strict and --no-early-exit to collect complete pass-rate distributions. Use experiments to compare agents and models and track stability over time.

run: npx niceeval exp compare --strict --no-early-exit

When to use `--strict`

Scenario	Recommendation
PR gate — block on hard failures only	No `--strict`
Release branch — block on any quality regression	`--strict`
Nightly stability run — measure soft assertion trends	`--strict`, with results stored for trending
Local development iteration	No `--strict`

​Why evals belong in CI

​Exit codes

​GitHub Actions example

​Setting up secrets

​JUnit reporter

​Checking discovery without running

​Caching .niceeval/ between runs

​Controlling concurrency in CI

​Recommended CI patterns

PR runs

Nightly full matrix

​When to use --strict

Why evals belong in CI

Exit codes

GitHub Actions example

Setting up secrets

JUnit reporter

Checking discovery without running

Caching `.niceeval/` between runs

Controlling concurrency in CI

Recommended CI patterns

When to use `--strict`