exp selects the checked-in run configuration, and any positional arguments after the experiment selector filter evals by ID prefix. Agent, model, and feature flags live in experiments/, not in ad-hoc CLI arguments.
Commands
npx niceeval exp <group> [id-prefix]
Run a named experiment group or config. Optional trailing positionals filter eval IDs.
npx niceeval init
Scaffold evals/, niceeval.config.ts, and example eval files.
npx niceeval list
Discover and print all evals without running them.
npx niceeval clean
Delete .niceeval/ historical run artifacts.
npx niceeval view
Open the result viewer to inspect artifacts from the last run.
Output language
niceeval localizes CLI and runtime messages without adding a config key. Set an environment variable when you want deterministic output. niceeval checks, in order: NICEEVAL_LANG, NICEEVAL_LOCALE, LC_ALL, LC_MESSAGES, then LANG. Values starting with zh use zh-CN; other languages use en. If none are set, niceeval defaults to zh-CN. This affects terminal/runtime text only; machine result fields and LLM judge prompts are not translated.
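The precedence above can be sketched as a small lookup. This is an illustration of the documented rules, not niceeval's actual implementation: `resolveLocale`, `LOCALE_VARS`, and the `Env` type are hypothetical names, and treating empty strings as unset and matching `zh` case-insensitively are assumptions.

```typescript
type Locale = "zh-CN" | "en";
type Env = Record<string, string | undefined>;

// Variables in the order the text lists them, highest priority first.
const LOCALE_VARS = ["NICEEVAL_LANG", "NICEEVAL_LOCALE", "LC_ALL", "LC_MESSAGES", "LANG"];

function resolveLocale(env: Env): Locale {
  for (const name of LOCALE_VARS) {
    const value = env[name];
    // Assumption: an empty string is treated the same as an unset variable.
    if (value) {
      // Assumption: the "starts with zh" check ignores case.
      return value.toLowerCase().startsWith("zh") ? "zh-CN" : "en";
    }
  }
  // None of the variables are set: the documented default is zh-CN.
  return "zh-CN";
}
```

Under these assumptions, setting NICEEVAL_LANG=en overrides a Chinese LANG value, because NICEEVAL_LANG is checked first.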
npx niceeval exp [group|config] [id-prefix...]
The exp command runs configs from experiments/. The first positional argument selects an experiment group or a single config; any remaining positionals narrow the run to evals whose IDs match those prefixes.
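Prefix narrowing can be pictured with a short sketch. `filterByPrefix` and the eval IDs in the usage note are hypothetical, and the sketch assumes that passing no prefixes leaves the selected experiment's evals unfiltered.

```typescript
// Keep only eval IDs that start with at least one of the given prefixes.
function filterByPrefix(evalIds: string[], prefixes: string[]): string[] {
  // Assumption: with no trailing positionals, nothing is filtered out.
  if (prefixes.length === 0) return evalIds;
  return evalIds.filter((id) => prefixes.some((prefix) => id.startsWith(prefix)));
}
```

With hypothetical IDs `weather/basic`, `weather/units`, and `math/add`, the prefix `weather` would keep the first two and drop the third.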
Eval filters only appear after the experiment selector. Bare npx niceeval weather does not run; use npx niceeval exp local weather or npx niceeval exp compare weather.
Flags
Experiment & sandbox
Not supported for experiment runs. Add or copy an experiment file and set the
agent there so the run configuration is reviewable and reproducible.
Temporarily choose the sandbox backend for the selected experiment. Accepted values: docker, vercel, auto.
- docker: spin up a local Docker container (no cloud credentials needed).
- vercel: use Vercel Sandbox (requires VERCEL_TOKEN or VERCEL_OIDC_TOKEN).
- auto: detect automatically, using vercel when a Vercel token is present and otherwise falling back to docker.
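The auto rule can be written as a few lines. This is only a sketch of the behavior described above; `resolveSandbox` and its types are hypothetical names, and the sketch does not validate that a token exists when vercel is chosen explicitly.

```typescript
type SandboxChoice = "docker" | "vercel" | "auto";
type SandboxBackend = "docker" | "vercel";

function resolveSandbox(
  choice: SandboxChoice,
  env: Record<string, string | undefined>,
): SandboxBackend {
  // An explicit choice is passed through unchanged.
  if (choice !== "auto") return choice;
  // auto: prefer Vercel Sandbox when either token variable is set and non-empty.
  const hasVercelToken = Boolean(env["VERCEL_TOKEN"] || env["VERCEL_OIDC_TOKEN"]);
  return hasVercelToken ? "vercel" : "docker";
}
```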
Run control
Re-run matching evals whenever a source file changes. Useful during authoring to get instant feedback as you edit.
Discover and print the evals that would be run, but do not execute them. Good for verifying your ID prefix and tag filters before a long run.
Bypass the result cache. Normally, evals whose fixture content and config have not changed since their last passing run are skipped automatically. Use --force to always re-run everything regardless of cached results.
Number of times to run each matched eval. Use this when you care about pass rate rather than a single pass/fail outcome. Results are grouped and reported as pass@N.
Disable early-exit. By default, once one attempt for a given eval passes, niceeval cancels the remaining attempts for that eval. Pass this flag to run all --runs attempts regardless, which gives you the full pass rate distribution.
Filter evals by tag. Only evals that declare the given tag in their definition will be included in the run.
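The interaction between --runs and early-exit described in this section can be illustrated with a simplified, sequential sketch. niceeval is described as cancelling remaining attempts, which implies attempts may run in parallel; the loop below ignores that and only shows the counting. `runAttempts` and `AttemptSummary` are hypothetical names.

```typescript
interface AttemptSummary {
  executed: number; // attempts that actually ran
  passed: number;   // attempts that passed
}

function runAttempts(
  runs: number,
  attempt: (index: number) => boolean,
  earlyExit: boolean,
): AttemptSummary {
  let executed = 0;
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    executed++;
    if (attempt(i)) {
      passed++;
      // Default behavior: stop after the first passing attempt.
      if (earlyExit) break;
    }
  }
  return { executed, passed };
}
```

With early-exit on, the summary can only say whether some attempt passed; with it off, passed / executed gives the pass rate over all attempts.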
Performance & limits
Override the maxConcurrency value set in niceeval.config.ts. Controls how many eval attempts run in parallel.
Per-eval timeout in milliseconds. If an eval takes longer than this value, it is force-cancelled and marked as failed with error: timeout. Overrides timeoutMs from config.
Cost limit in USD for the entire run. niceeval tracks cumulative estimated spend across all attempts. Once the budget is exceeded, no new attempts are dispatched; in-flight attempts complete normally. Overrides budget from config.
CI & reporting
Treat passed outcomes whose LLM-as-judge score falls below the configured threshold as failures. Without --strict, the run exits non-zero only when an eval's outcome is explicitly failed. With --strict, passed-but-below-threshold evals also cause a non-zero exit. Use this flag in CI to enforce quality thresholds.
Write a JUnit XML report to the specified path. Compatible with GitHub Actions, CircleCI, and any CI system that ingests JUnit reports.
npx niceeval init
Scaffolds the minimum files needed to start writing evals in the current repository. Running init creates evals/, niceeval.config.ts, and example eval files.
npx niceeval list
Discovers all evals (and, if an experiments/ directory exists, all experiment files) and prints their IDs. Does not run anything. Use this to verify discovery before a run.
Use npx niceeval exp <group> --dry to preview how a specific experiment will filter and expand evals.
npx niceeval exp <group>
Runs an experiment group. niceeval scans the experiments/ directory for TypeScript files whose default export is a defineExperiment call. The <group> argument is the folder name — all experiment files inside that folder are treated as one comparable group.
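A group's attempts can be thought of as a cross product over agents, models, evals, and run indices. The sketch below illustrates that fan-out; `expandMatrix` and the `Attempt` shape are hypothetical, and how niceeval actually derives agents and models from the experiment files is not shown.

```typescript
interface Attempt {
  agent: string;
  model: string;
  evalId: string;
  run: number; // 1-based attempt index
}

// Enumerate every (agent, model, eval, run) combination.
function expandMatrix(
  agents: string[],
  models: string[],
  evalIds: string[],
  runs: number,
): Attempt[] {
  const attempts: Attempt[] = [];
  for (const agent of agents) {
    for (const model of models) {
      for (const evalId of evalIds) {
        for (let run = 1; run <= runs; run++) {
          attempts.push({ agent, model, evalId, run });
        }
      }
    }
  }
  return attempts;
}
```

Two agents, one model, three evals, and two runs would yield 2 × 1 × 3 × 2 = 12 attempts, which is why larger groups benefit from a dry run first.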
Experiments expand into a matrix of agent × model × eval × runs attempts. Results are reported as pass rates grouped by (agent, model, eval), rather than single pass/fail outcomes.
npx niceeval view
Opens the result viewer in your browser. The viewer reads artifacts from the most recent .niceeval/<timestamp>/ output directory and lets you inspect per-eval results, transcripts, file diffs, and event streams.
Exit codes
niceeval’s exit code lets CI systems gate on eval results without parsing output.
| Condition | Exit code |
|---|---|
| All evals passed, or passed below threshold (without --strict) | 0 |
| Any eval failed | non-zero |
| Any eval passed below threshold with --strict | non-zero |
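The table maps to a short decision function. This is a sketch only: the outcome labels and `exitCode` are hypothetical names, and returning 1 stands in for the unspecified non-zero value.

```typescript
// Hypothetical outcome labels for illustration.
type Outcome = "passed" | "passed-below-threshold" | "failed";

function exitCode(outcomes: Outcome[], strict: boolean): number {
  const failing = outcomes.some(
    (outcome) =>
      outcome === "failed" ||
      // Below-threshold passes only count as failures under --strict.
      (strict && outcome === "passed-below-threshold"),
  );
  // Assumption: 1 represents "non-zero"; the real value is not documented here.
  return failing ? 1 : 0;
}
```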
Common patterns
- Local development
- CI pipeline
- Pass rate measurement
- Cost-guarded experiment