The niceeval CLI is your single entry point for discovering, running, and inspecting evals. Eval execution is experiment-first: exp selects a checked-in run configuration, and any positional arguments after the experiment selector filter evals by ID prefix. Agent, model, and feature flags live in experiments/, not in ad-hoc CLI arguments.

Commands

npx niceeval exp [group|config] [id-prefix...]

Run a named experiment group or config. Optional trailing positionals filter eval IDs.

npx niceeval init

Scaffold evals/, niceeval.config.ts, and example eval files.

npx niceeval list

Discover and print all evals without running them.

npx niceeval clean

Delete historical run artifacts under .niceeval/.

npx niceeval view

Open the result viewer to inspect artifacts from the last run.

Output language

niceeval localizes CLI and runtime messages through environment variables rather than a config key. Set NICEEVAL_LANG when you want deterministic output:
NICEEVAL_LANG=en npx niceeval list
NICEEVAL_LANG=zh-CN npx niceeval list
Locale detection order is NICEEVAL_LANG, NICEEVAL_LOCALE, LC_ALL, LC_MESSAGES, then LANG. Values starting with zh use zh-CN; other languages use en. If none are set, niceeval defaults to zh-CN. This affects terminal/runtime text only; machine result fields and LLM judge prompts are not translated.
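The detection order above can be sketched as a small pure function. This is an illustrative reading of the rules, not niceeval's actual implementation: resolveLocale and LOCALE_VARS are hypothetical names, and treating empty values as unset and matching "zh" case-insensitively are assumptions.

```typescript
// Hypothetical helper; not part of niceeval's public API.
type Locale = "en" | "zh-CN";

// Variables are checked in the order the documentation lists them.
const LOCALE_VARS = ["NICEEVAL_LANG", "NICEEVAL_LOCALE", "LC_ALL", "LC_MESSAGES", "LANG"];

function resolveLocale(env: Record<string, string | undefined>): Locale {
  for (const name of LOCALE_VARS) {
    const value = env[name];
    // Assumption: an empty string counts as "not set".
    if (value) {
      // Assumption: the "zh" prefix check is case-insensitive.
      return value.toLowerCase().startsWith("zh") ? "zh-CN" : "en";
    }
  }
  // Documented default when none of the variables are set.
  return "zh-CN";
}
```

Passing the environment in as an argument (for example process.env in Node) keeps the sketch easy to test without touching the real environment.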

npx niceeval exp [group|config] [id-prefix...]

The exp command runs configs from experiments/. The first positional argument selects an experiment group or a single config; any remaining positionals narrow the run to evals whose IDs match those prefixes.
# Run every experiment under experiments/
npx niceeval exp

# Run one experiment group
npx niceeval exp compare-models

# Run only evals whose ID starts with "weather" inside that group
npx niceeval exp compare-models weather
Eval filters are only valid after the experiment selector. A bare npx niceeval weather does not run anything; use npx niceeval exp local weather or npx niceeval exp compare weather instead.
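As a rough illustration of ID-prefix filtering, the sketch below selects evals whose ID starts with any of the given prefixes. filterByPrefix is a hypothetical helper, and treating multiple prefixes as a union is an assumption; niceeval's real matching logic may differ.

```typescript
// Hypothetical helper; not part of niceeval's public API.
function filterByPrefix(evalIds: string[], prefixes: string[]): string[] {
  // No prefixes means "run everything in the selected experiment".
  if (prefixes.length === 0) return evalIds;
  // Assumption: an eval is kept when it matches ANY of the prefixes.
  return evalIds.filter((id) => prefixes.some((prefix) => id.startsWith(prefix)));
}

const discovered = ["classify", "weather/brooklyn", "weather/manhattan", "fixtures/button"];
const selected = filterByPrefix(discovered, ["weather"]);
// selected is ["weather/brooklyn", "weather/manhattan"]
```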

Flags

Experiment & sandbox

--agent
string
Not supported for experiment runs. Add or copy an experiment file and set the agent there so the run configuration is reviewable and reproducible.
--sandbox
string
Temporarily choose the sandbox backend for the selected experiment. Accepted values: docker, vercel, auto.
  • docker — spin up a local Docker container (no cloud credentials needed).
  • vercel — use Vercel Sandbox (requires VERCEL_TOKEN or VERCEL_OIDC_TOKEN).
  • auto — detect automatically: use vercel when a Vercel token is present, otherwise fall back to docker.
npx niceeval exp local --sandbox docker
npx niceeval exp local --sandbox auto
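The auto rule can be read as the following decision function. This is a sketch under stated assumptions: resolveSandbox is a hypothetical name, and only the two token variables mentioned above are consulted.

```typescript
// Hypothetical helper; not part of niceeval's public API.
type SandboxFlag = "docker" | "vercel" | "auto";
type SandboxBackend = "docker" | "vercel";

function resolveSandbox(
  flag: SandboxFlag,
  env: Record<string, string | undefined>,
): SandboxBackend {
  // Explicit choices are used as given.
  if (flag !== "auto") return flag;
  // auto: prefer Vercel Sandbox when either token is present, else Docker.
  const hasVercelToken = Boolean(env["VERCEL_TOKEN"] || env["VERCEL_OIDC_TOKEN"]);
  return hasVercelToken ? "vercel" : "docker";
}
```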

Run control

--watch
boolean
Re-run matching evals whenever a source file changes. Useful during authoring to get instant feedback as you edit.
npx niceeval exp local weather --watch
--dry
boolean
Discover and print the evals that would be run, but do not execute them. Good for verifying your ID prefix and tag filters before a long run.
npx niceeval exp compare --dry
npx niceeval exp compare fixtures --dry
--force
boolean
Bypass the result cache. Normally, evals whose fixture content and config have not changed since their last passing run are skipped automatically. Use --force to always re-run everything regardless of cached results.
npx niceeval exp compare --force
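The skip decision can be pictured as below. This is only a sketch: CachedRun, its fingerprint field (standing in for "fixture content plus config"), and shouldSkip are hypothetical names, and how niceeval actually fingerprints inputs is not specified here.

```typescript
// Hypothetical cache record; field names are assumptions.
interface CachedRun {
  fingerprint: string; // stands in for a digest of fixture content + config
  passed: boolean;     // whether the last recorded run passed
}

function shouldSkip(
  cached: CachedRun | undefined,
  currentFingerprint: string,
  force: boolean,
): boolean {
  // --force always re-runs, regardless of cached results.
  if (force) return false;
  // Skip only when the last run passed and nothing has changed since.
  return cached !== undefined && cached.passed && cached.fingerprint === currentFingerprint;
}
```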
--runs
number
Number of times to run each matched eval. Use this when you care about pass rate rather than a single pass/fail outcome. Results are grouped and reported as pass@N.
# Run each eval 5 times and report pass@5
npx niceeval exp compare fixtures/button --runs 5
--no-early-exit
boolean
Disable early-exit. By default, once one attempt for a given eval passes, niceeval cancels the remaining attempts for that eval. Pass this flag to run all --runs attempts regardless, which gives you the full pass rate distribution.
npx niceeval exp compare fixtures/button --runs 10 --no-early-exit
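To see how early exit changes what gets measured, here is a simplified sequential sketch. runAttempts and RunSummary are hypothetical names, and the real runner may execute attempts in parallel and cancel them rather than loop; only the counting logic is illustrated.

```typescript
// Hypothetical summary shape; not niceeval's result format.
interface RunSummary {
  attempts: number; // attempts actually executed
  passes: number;   // attempts that passed
  passed: boolean;  // pass@N: at least one attempt passed
}

function runAttempts(
  runs: number,
  attempt: (index: number) => boolean,
  earlyExit: boolean = true,
): RunSummary {
  let attempts = 0;
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    attempts++;
    if (attempt(i)) {
      passes++;
      // Default behavior: stop after the first pass.
      if (earlyExit) break;
    }
  }
  return { attempts, passes, passed: passes > 0 };
}

// Fixed outcomes for illustration: fail, pass, fail, pass, pass.
const outcomes = [false, true, false, true, true];
const withEarlyExit = runAttempts(5, (i) => outcomes[i], true);
const fullDistribution = runAttempts(5, (i) => outcomes[i], false);
// withEarlyExit: 2 attempts, 1 pass; fullDistribution: 5 attempts, 3 passes
```

Both runs report the same pass@5 outcome, but only the second yields a meaningful pass rate (3 of 5).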
--tag
string
Filter evals by tag. Only evals that declare the given tag in their definition will be included in the run.
npx niceeval exp compare --tag smoke

Performance & limits

--max-concurrency
number
Override the maxConcurrency value set in niceeval.config.ts. Controls how many eval attempts run in parallel.
npx niceeval exp compare --max-concurrency 4
--timeout
number
Per-eval timeout in milliseconds. If an eval takes longer than this value, it is force-cancelled and marked as failed with error: timeout. Overrides timeoutMs from config.
# 10-minute timeout per eval
npx niceeval exp compare --timeout 600000
--budget
number
Cost limit in USD for the entire run. niceeval tracks cumulative estimated spend across all attempts. Once the budget is exceeded, no new attempts are dispatched; in-flight attempts complete normally. Overrides budget from config.
npx niceeval exp compare --budget 2.50
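The budget rule means spend can end slightly above the limit, because attempts already dispatched are allowed to finish. A simplified sequential sketch follows; dispatchWithinBudget is a hypothetical name, and the per-attempt costs are made-up inputs.

```typescript
// Hypothetical helper; not part of niceeval's public API.
function dispatchWithinBudget(
  attemptCostsUsd: number[],
  budgetUsd: number,
): { dispatched: number; spentUsd: number } {
  let spentUsd = 0;
  let dispatched = 0;
  for (const cost of attemptCostsUsd) {
    // Once cumulative spend exceeds the budget, dispatch nothing new.
    if (spentUsd > budgetUsd) break;
    // A dispatched attempt always completes and its cost is counted.
    spentUsd += cost;
    dispatched++;
  }
  return { dispatched, spentUsd };
}

// Four attempts at $1 each against a $2.50 budget.
const budgetResult = dispatchWithinBudget([1, 1, 1, 1], 2.5);
// budgetResult: 3 attempts dispatched, $3 spent; the fourth is never started
```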

CI & reporting

--strict
boolean
Treat passed outcomes whose LLM-as-judge score falls below the configured threshold as failures. Without --strict, the run exits non-zero only when an eval's outcome is explicitly failed; below-threshold passes do not affect the exit code. With --strict, below-threshold passes also cause a non-zero exit. Use this flag in CI to enforce quality thresholds.
npx niceeval exp ci --strict --junit .niceeval/junit.xml
--junit
string
Write a JUnit XML report to the specified path. Compatible with GitHub Actions, CircleCI, and any CI system that ingests JUnit reports.
npx niceeval exp ci --junit .niceeval/junit.xml

npx niceeval init

Scaffolds the minimum files needed to start writing evals in the current repository. Running init creates:
your-project/
├─ niceeval.config.ts
└─ evals/
   ├─ hello.eval.ts            # example: conversational eval
   └─ fixtures/
      └─ button/               # example: sandbox coding-agent eval
         ├─ PROMPT.md
         ├─ EVAL.ts
         └─ package.json
npm install -D niceeval
npx niceeval init
After running init, add an experiment under experiments/ that selects the agent, model, runs, and sandbox for your first run.

npx niceeval list

Discovers all evals (and, if an experiments/ directory exists, all experiment files) and prints their IDs. Does not run anything. Use this to verify discovery before a run.
npx niceeval list

# Output:
# classify
# weather/brooklyn
# weather/manhattan
# fixtures/button
Use npx niceeval exp <group> --dry to preview how a specific experiment will filter and expand evals.

npx niceeval exp <group>

Runs an experiment group. niceeval scans the experiments/ directory for TypeScript files whose default export is a defineExperiment call. The <group> argument is the folder name — all experiment files inside that folder are treated as one comparable group.
# Run all experiments in experiments/compare/
npx niceeval exp compare
Experiments expand into a matrix of agent × model × eval × runs attempts. Results are reported as pass rates grouped by (agent, model, eval), rather than single pass/fail outcomes.
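The expansion and grouping can be sketched as follows. This is illustrative only: Attempt, expandMatrix, and passRates are hypothetical names, the agent and model strings in the usage are placeholders, and treating the matrix as a full cross product is an assumption about how experiment files combine.

```typescript
// Hypothetical shape for one attempt; field names are assumptions.
interface Attempt {
  agent: string;
  model: string;
  evalId: string;
  run: number; // 1-based attempt index
}

// Expand agent x model x eval x runs into individual attempts.
function expandMatrix(agents: string[], models: string[], evalIds: string[], runs: number): Attempt[] {
  const attempts: Attempt[] = [];
  for (const agent of agents) {
    for (const model of models) {
      for (const evalId of evalIds) {
        for (let run = 1; run <= runs; run++) {
          attempts.push({ agent, model, evalId, run });
        }
      }
    }
  }
  return attempts;
}

// Pass rate per (agent, model, eval) group, given which attempts passed.
function passRates(attempts: Attempt[], passed: (a: Attempt) => boolean): Map<string, number> {
  const totals = new Map<string, { pass: number; total: number }>();
  for (const a of attempts) {
    const key = `${a.agent}|${a.model}|${a.evalId}`;
    const entry = totals.get(key) ?? { pass: 0, total: 0 };
    entry.total++;
    if (passed(a)) entry.pass++;
    totals.set(key, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach((entry, key) => rates.set(key, entry.pass / entry.total));
  return rates;
}

// Placeholder agent/model names; 1 x 2 x 2 x 3 = 12 attempts in 4 groups.
const matrix = expandMatrix(["agent-a"], ["model-x", "model-y"], ["classify", "weather/brooklyn"], 3);
const rates = passRates(matrix, (a) => a.run === 1);
```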

npx niceeval view

Opens the result viewer in your browser. The viewer reads artifacts from the most recent .niceeval/<timestamp>/ output directory and lets you inspect per-eval results, transcripts, file diffs, and event streams.
npx niceeval view

Exit codes

niceeval’s exit code lets CI systems gate on eval results without parsing output.
Condition                                                        Exit code
All evals passed, including below threshold (without --strict)   0
Any eval failed                                                  non-zero
Any eval passed below threshold with --strict                    non-zero
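The exit-code rules in this section can be sketched as a single function. EvalResult and exitCode are hypothetical names, modeling a below-threshold pass as a boolean flag is an assumption, and the docs only promise "non-zero", so the value 1 is also an assumption.

```typescript
// Hypothetical result shape; not niceeval's artifact format.
interface EvalResult {
  id: string;
  outcome: "passed" | "failed";
  belowThreshold: boolean; // judge score fell below the configured threshold
}

function exitCode(results: EvalResult[], strict: boolean): number {
  const failing = results.some(
    (r) => r.outcome === "failed" || (strict && r.belowThreshold),
  );
  // Assumption: 1 stands in for "non-zero".
  return failing ? 1 : 0;
}

const sampleResults: EvalResult[] = [
  { id: "classify", outcome: "passed", belowThreshold: false },
  { id: "weather/brooklyn", outcome: "passed", belowThreshold: true },
];
// exitCode(sampleResults, false) is 0; exitCode(sampleResults, true) is 1
```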

Common patterns

# Watch a specific eval while you develop
npx niceeval exp local fixtures/button --sandbox docker --watch

# Dry-run to check which evals will run
npx niceeval exp local --dry

# Force a re-run ignoring the cache
npx niceeval exp local fixtures/button --force