niceeval CLI: commands, flags, and exit codes reference

The niceeval CLI is your single entry point for discovering, running, and inspecting evals. Eval execution is experiment-first: exp selects a checked-in run configuration, and any positional arguments after the experiment selector filter evals by ID prefix. Agent, model, and feature flags live in experiments/, not in ad-hoc CLI arguments.

Commands

npx niceeval exp [group|config] [id-prefix...]

Run a named experiment group or config. Optional trailing positionals filter eval IDs.

npx niceeval init

Scaffold evals/, niceeval.config.ts, and example eval files.

npx niceeval list

Discover and print all evals without running them.

npx niceeval clean

Delete .niceeval/ historical run artifacts.

npx niceeval view

Open the result viewer to inspect artifacts from the last run.

Output language

niceeval localizes CLI and runtime messages without adding a config key. Set an environment variable when you want deterministic output:

NICEEVAL_LANG=en npx niceeval list
NICEEVAL_LANG=zh-CN npx niceeval list

Locale detection order is NICEEVAL_LANG, NICEEVAL_LOCALE, LC_ALL, LC_MESSAGES, then LANG. Values starting with zh use zh-CN; other languages use en. If none are set, niceeval defaults to zh-CN. This affects terminal/runtime text only; machine result fields and LLM judge prompts are not translated.
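The detection order above can be sketched as a small pure function. This is a hypothetical illustration, not niceeval's source: the function name resolveLocale, the case-insensitive match on zh, and the choice to treat an empty value as unset are all assumptions.

```typescript
// Hypothetical sketch of the documented lookup order; not niceeval's actual code.
type Locale = "zh-CN" | "en";

// Variables are checked in the order the documentation lists them.
const LOCALE_VARS = ["NICEEVAL_LANG", "NICEEVAL_LOCALE", "LC_ALL", "LC_MESSAGES", "LANG"];

function resolveLocale(env: Record<string, string | undefined>): Locale {
  for (const name of LOCALE_VARS) {
    const value = env[name];
    // Assumption: an empty string is treated the same as an unset variable.
    if (value === undefined || value === "") continue;
    // Assumption: the "starts with zh" check ignores case.
    return value.toLowerCase().startsWith("zh") ? "zh-CN" : "en";
  }
  // Nothing set: the documented default is zh-CN.
  return "zh-CN";
}

console.log(resolveLocale({ NICEEVAL_LANG: "en", LANG: "zh_CN.UTF-8" })); // en
console.log(resolveLocale({})); // zh-CN
```

Because the first variable that is set wins, NICEEVAL_LANG=en gives English output even on a machine whose LANG is zh_CN.UTF-8.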

`npx niceeval exp [group|config] [id-prefix...]`

The exp command runs configs from experiments/. The first positional argument selects an experiment group or a single config; any remaining positionals narrow the run to evals whose IDs match those prefixes.

# Run every experiment under experiments/
npx niceeval exp

# Run one experiment group
npx niceeval exp compare-models

# Run only evals whose ID starts with "weather" inside that group
npx niceeval exp compare-models weather

Eval filters are only valid after the experiment selector. A bare npx niceeval weather does not run anything; use npx niceeval exp local weather or npx niceeval exp compare weather instead.
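As a rough mental model, prefix filtering behaves like the sketch below. The helper filterByPrefix is hypothetical, and treating multiple prefixes as a union (an eval runs if it matches any of them) is an assumption; the IDs are the ones shown in the list example.

```typescript
// Hypothetical sketch of ID-prefix filtering; not niceeval's actual code.
function filterByPrefix(ids: string[], prefixes: string[]): string[] {
  // No prefixes means no narrowing: every discovered eval is kept.
  if (prefixes.length === 0) return ids;
  // Assumption: an eval is kept when its ID starts with ANY given prefix.
  return ids.filter((id) => prefixes.some((prefix) => id.startsWith(prefix)));
}

const discovered = ["classify", "weather/brooklyn", "weather/manhattan", "fixtures/button"];

console.log(filterByPrefix(discovered, ["weather"]));
// [ 'weather/brooklyn', 'weather/manhattan' ]
```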

Flags

Experiment & sandbox

--agent

string

Not supported for experiment runs. Add or copy an experiment file and set the agent there so the run configuration is reviewable and reproducible.

--sandbox

string

Temporarily choose the sandbox backend for the selected experiment. Accepted values: docker, vercel, auto.

docker — spin up a local Docker container (no cloud credentials needed).
vercel — use Vercel Sandbox (requires VERCEL_TOKEN or VERCEL_OIDC_TOKEN).
auto — detect automatically: use vercel when a Vercel token is present, otherwise fall back to docker.

npx niceeval exp local --sandbox docker
npx niceeval exp local --sandbox auto
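The auto rule described above amounts to a single check on the environment. The sketch below is a hypothetical illustration (resolveSandbox is not a niceeval API); it only encodes the behavior stated in this section.

```typescript
// Hypothetical sketch of how --sandbox auto could resolve; not niceeval's actual code.
type SandboxChoice = "docker" | "vercel" | "auto";
type SandboxBackend = "docker" | "vercel";

function resolveSandbox(
  choice: SandboxChoice,
  env: Record<string, string | undefined>,
): SandboxBackend {
  // Explicit choices are used as given.
  if (choice !== "auto") return choice;
  // auto: prefer Vercel when either documented token variable is present.
  const hasVercelToken = Boolean(env["VERCEL_TOKEN"] || env["VERCEL_OIDC_TOKEN"]);
  return hasVercelToken ? "vercel" : "docker";
}

console.log(resolveSandbox("auto", {})); // docker
console.log(resolveSandbox("auto", { VERCEL_TOKEN: "example-token" })); // vercel
```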

Run control

--watch

boolean

Re-run matching evals whenever a source file changes. Useful during authoring to get instant feedback as you edit.

npx niceeval exp local weather --watch

--dry

boolean

Discover and print the evals that would be run, but do not execute them. Good for verifying your ID prefix and tag filters before a long run.

npx niceeval exp compare --dry
npx niceeval exp compare fixtures --dry

--force

boolean

Bypass the result cache. Normally, evals whose fixture content and config have not changed since their last passing run are skipped automatically. Use --force to always re-run everything regardless of cached results.

npx niceeval exp compare --force

--runs

number

Number of times to run each matched eval. Use this when you care about pass rate rather than a single pass/fail outcome. Results are grouped and reported as pass@N.

# Run each eval 5 times and report pass@5
npx niceeval exp compare fixtures/button --runs 5

--no-early-exit

boolean

Disable early-exit. By default, once one attempt for a given eval passes, niceeval cancels the remaining attempts for that eval. Pass this flag to run all --runs attempts regardless, which gives you the full pass rate distribution.

npx niceeval exp compare fixtures/button --runs 10 --no-early-exit
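To see why --no-early-exit changes what you can measure, consider this simplified model. It is a sketch under stated assumptions: attempts are processed one after another (a real run may execute them in parallel and cancel the rest), and runAttempts and its return shape are hypothetical names, not part of niceeval.

```typescript
// Hypothetical, sequential model of early-exit; not niceeval's actual scheduler.
interface AttemptSummary {
  attemptsRun: number; // attempts that actually executed
  passes: number;      // how many of those passed
  anyPassed: boolean;  // whether at least one attempt passed
}

function runAttempts(outcomes: boolean[], earlyExit: boolean): AttemptSummary {
  let attemptsRun = 0;
  let passes = 0;
  for (const passed of outcomes) {
    attemptsRun++;
    if (passed) {
      passes++;
      // With early-exit on, the first pass ends the remaining attempts.
      if (earlyExit) break;
    }
  }
  return { attemptsRun, passes, anyPassed: passes > 0 };
}

const outcomes = [false, true, true, false];
console.log(runAttempts(outcomes, true));  // attemptsRun: 2, passes: 1
console.log(runAttempts(outcomes, false)); // attemptsRun: 4, passes: 2
```

With early-exit you learn only that some attempt passed; the 2-of-4 pass rate is visible only when every attempt runs.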

--tag

string

Filter evals by tag. Only evals that declare the given tag in their definition will be included in the run.

npx niceeval exp compare --tag smoke

Performance & limits

--max-concurrency

number

Override the maxConcurrency value set in niceeval.config.ts. Controls how many eval attempts run in parallel.

npx niceeval exp compare --max-concurrency 4

--timeout

number

Per-eval timeout in milliseconds. If an eval takes longer than this value, it is force-cancelled and marked as failed with error: timeout. Overrides timeoutMs from config.

# 10-minute timeout per eval
npx niceeval exp compare --timeout 600000

--budget

number

Cost limit in USD for the entire run. niceeval tracks cumulative estimated spend across all attempts. Once the budget is exceeded, no new attempts are dispatched; in-flight attempts complete normally. Overrides budget from config.

npx niceeval exp compare --budget 2.50
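The budget rule can be read as "check before dispatch, never interrupt". The sketch below models that sequentially; dispatchWithinBudget and the per-attempt cost list are hypothetical, and real runs dispatch concurrently, so treat it as an illustration only.

```typescript
// Hypothetical, sequential model of the budget check; not niceeval's actual code.
function dispatchWithinBudget(
  attemptCostsUsd: number[],
  budgetUsd: number,
): { dispatched: number; spentUsd: number } {
  let spentUsd = 0;
  let dispatched = 0;
  for (const cost of attemptCostsUsd) {
    // Once cumulative spend has exceeded the budget, dispatch nothing new.
    if (spentUsd > budgetUsd) break;
    // An attempt that was dispatched always completes and is counted in full.
    spentUsd += cost;
    dispatched++;
  }
  return { dispatched, spentUsd };
}

console.log(dispatchWithinBudget([1, 1, 1, 1], 2.5)); // dispatched: 3, spentUsd: 3
```

Because the check happens only when dispatching, final spend can overshoot the limit by the cost of the attempts already in flight, as the third attempt does here.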

CI & reporting

--strict

boolean

Treat passed outcomes whose LLM-as-judge score falls below the configured threshold as failures. Without --strict, a run exits non-zero only when an eval's outcome is explicitly failed. With --strict, passed-but-below-threshold evals also cause a non-zero exit. Use this flag in CI to enforce quality thresholds.

npx niceeval exp ci --strict --junit .niceeval/junit.xml

--junit

string

Write a JUnit XML report to the specified path. Compatible with GitHub Actions, CircleCI, and any CI system that ingests JUnit reports.

npx niceeval exp ci --junit .niceeval/junit.xml

`npx niceeval init`

Scaffolds the minimum files needed to start writing evals in the current repository. Running init creates:

your-project/
├─ niceeval.config.ts
└─ evals/
   ├─ hello.eval.ts            # example: conversational eval
   └─ fixtures/
      └─ button/               # example: sandbox coding-agent eval
         ├─ PROMPT.md
         ├─ EVAL.ts
         └─ package.json

npm install -D niceeval
npx niceeval init

After running init, add an experiment under experiments/ that selects the agent, model, runs, and sandbox for your first run.

`npx niceeval list`

Discovers all evals (and, if an experiments/ directory exists, all experiment files) and prints their IDs. Does not run anything. Use this to verify discovery before a run.

npx niceeval list

# Output:
# classify
# weather/brooklyn
# weather/manhattan
# fixtures/button

Use npx niceeval exp <group> --dry to preview how a specific experiment will filter and expand evals.

`npx niceeval exp <group>`

Runs an experiment group. niceeval scans the experiments/ directory for TypeScript files whose default export is a defineExperiment call. The <group> argument is the folder name — all experiment files inside that folder are treated as one comparable group.

# Run all experiments in experiments/compare/
npx niceeval exp compare

Experiments expand into a matrix of agent × model × eval × runs attempts. Results are reported as pass rates grouped by (agent, model, eval), rather than single pass/fail outcomes.
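The size of a run follows directly from that matrix. The sketch below assumes a full cross product and uses hypothetical names throughout (expandMatrix, agent-a, model-x, model-y); it is meant for estimating attempt counts, not as a description of niceeval internals.

```typescript
// Hypothetical sketch of matrix expansion; not niceeval's actual code.
interface Attempt {
  agent: string;
  model: string;
  evalId: string;
  run: number; // 1-based attempt index within (agent, model, eval)
}

function expandMatrix(
  agents: string[],
  models: string[],
  evalIds: string[],
  runs: number,
): Attempt[] {
  const attempts: Attempt[] = [];
  // Assumption: every agent is paired with every model and every eval.
  for (const agent of agents) {
    for (const model of models) {
      for (const evalId of evalIds) {
        for (let run = 1; run <= runs; run++) {
          attempts.push({ agent, model, evalId, run });
        }
      }
    }
  }
  return attempts;
}

// 1 agent x 2 models x 2 evals x 3 runs = 12 attempts
const attempts = expandMatrix(
  ["agent-a"],
  ["model-x", "model-y"],
  ["weather/brooklyn", "fixtures/button"],
  3,
);
console.log(attempts.length); // 12
```

Multiplying the four dimensions before a run is a quick way to judge whether --budget or --max-concurrency needs adjusting.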

`npx niceeval view`

Opens the result viewer in your browser. The viewer reads artifacts from the most recent .niceeval/<timestamp>/ output directory and lets you inspect per-eval results, transcripts, file diffs, and event streams.

npx niceeval view

Exit codes

niceeval’s exit code lets CI systems gate on eval results without parsing output.

| Condition | Exit code |
| --- | --- |
| All evals passed, including any that passed below threshold (without `--strict`) | `0` |
| Any eval failed | non-zero |
| Any eval passed below threshold with `--strict` | non-zero |
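The exit-code rules reduce to one predicate over the results. The sketch below is hypothetical: the EvalResult shape and the belowThreshold field are invented for illustration, and returning 1 is an assumption, since the documentation promises only a non-zero code.

```typescript
// Hypothetical sketch of the exit-code rules; not niceeval's actual code.
interface EvalResult {
  id: string;
  outcome: "passed" | "failed";
  belowThreshold: boolean; // judge score under the configured threshold
}

function exitCodeFor(results: EvalResult[], strict: boolean): number {
  const shouldFail = results.some(
    (result) =>
      result.outcome === "failed" ||
      // Only --strict turns a below-threshold pass into a failure.
      (strict && result.belowThreshold),
  );
  // Assumption: 1 stands in for "non-zero".
  return shouldFail ? 1 : 0;
}

const results: EvalResult[] = [
  { id: "classify", outcome: "passed", belowThreshold: false },
  { id: "weather/brooklyn", outcome: "passed", belowThreshold: true },
];

console.log(exitCodeFor(results, false)); // 0
console.log(exitCodeFor(results, true));  // 1
```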

Common patterns

Local development

# Watch a specific eval while you develop
npx niceeval exp local fixtures/button --sandbox docker --watch

# Dry-run to check which evals will run
npx niceeval exp local --dry

# Force a re-run ignoring the cache
npx niceeval exp local fixtures/button --force

CI pipeline

# .github/workflows/evals.yml
- name: Run evals
  run: npx niceeval exp ci --strict --junit .niceeval/junit.xml
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Pass rate measurement

# Run each eval 10 times; stop early once one attempt passes
npx niceeval exp local fixtures/button --runs 10 --sandbox docker

# Run all 10 attempts regardless (full distribution)
npx niceeval exp local fixtures/button --runs 10 --no-early-exit

Cost-guarded experiment

# Run the comparison experiment but cap spend at $5
npx niceeval exp compare --budget 5.00
