How the niceeval runner schedules and executes evals

The runner is the execution engine that turns eval files into an actual run. It handles discovery, filtering, scheduling, caching, attempts, reporters, and artifact output.

Discovery

The runner reads the eval files under evals/. To see what it discovers, run:

npx niceeval list

list only discovers evals and prints their IDs; it does not run any eval.
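
A minimal sketch of what discovery could look like, assuming (hypothetically) that eval files end in .eval.ts and that an eval's ID is its path relative to evals/ with that suffix removed, which would produce IDs like weather and fixtures/button. The suffix, the ID rule, and the function names are illustrative assumptions, not niceeval's actual behavior. Like list, the sketch only collects IDs and executes nothing.

```typescript
import { readdirSync } from "node:fs";
import { join, relative, sep } from "node:path";

// HYPOTHETICAL naming convention: niceeval's real eval file pattern is not
// documented here; ".eval.ts" is an illustrative assumption.
const EVAL_SUFFIX = ".eval.ts";

// Recursively collect every regular file path under `dir`.
function walk(dir: string): string[] {
  const out: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) out.push(...walk(full));
    else if (entry.isFile()) out.push(full);
  }
  return out;
}

// Map a path relative to evals/ to an ID such as "fixtures/button".
// Assumes IDs use "/" separators and drop the suffix; returns null for
// files that are not evals.
function evalIdFromRelativePath(rel: string): string | null {
  const posixPath = rel.split(sep).join("/");
  return posixPath.endsWith(EVAL_SUFFIX)
    ? posixPath.slice(0, -EVAL_SUFFIX.length)
    : null;
}

// Discovery only: list sorted IDs, never run anything.
function discoverEvalIds(root: string): string[] {
  return walk(root)
    .map((file) => evalIdFromRelativePath(relative(root, file)))
    .filter((id): id is string => id !== null)
    .sort();
}
```

Under these assumptions, discoverEvalIds("evals") on a tree containing evals/weather.eval.ts and evals/fixtures/button.eval.ts would return ["fixtures/button", "weather"].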

Filtering

In the exp command, positional arguments after the experiment name are eval ID prefixes:

npx niceeval exp local
npx niceeval exp local weather
npx niceeval exp local fixtures/button

Do not put an agent name in the positional arguments. The agent, model, and flags belong in the experiment.
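
The prefix rule can be sketched as a filter over discovered IDs. This assumes plain string-prefix matching with "no prefixes" meaning "run everything"; niceeval might instead match on whole path segments, and filterByPrefixes is a hypothetical name.

```typescript
// Keep IDs that start with any of the given prefixes; no prefixes keeps all.
// ASSUMPTION: plain string-prefix matching, not path-segment matching.
function filterByPrefixes(
  ids: readonly string[],
  prefixes: readonly string[],
): string[] {
  if (prefixes.length === 0) return [...ids];
  return ids.filter((id) => prefixes.some((p) => id.startsWith(p)));
}

const ids = ["fixtures/button", "fixtures/form", "weather", "weather-alerts"];
console.log(filterByPrefixes(ids, ["fixtures/button"]));
```

With pure string prefixes, weather also selects weather-alerts. That is convenient for grouping, but it is also why an agent name in this position would silently be treated as an ID prefix rather than an agent choice.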

Concurrency

npx niceeval exp local --max-concurrency 8

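--max-concurrency limits how many evals run at the same time. Remote HTTP agents can usually handle higher concurrency; local Docker sandboxes usually need lower concurrency to avoid contention for CPU, memory, and disk.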
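
One common way to enforce a limit like --max-concurrency is a fixed pool of workers that each pull the next pending task. The sketch below shows that idea; it is not niceeval's scheduler, and runWithConcurrency is a hypothetical name.

```typescript
// Run `tasks` with at most `limit` in flight at once; results keep task order.
async function runWithConcurrency<T>(
  tasks: ReadonlyArray<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<T>(tasks.length);
  let next = 0;

  // Each worker claims the next unstarted index. JavaScript runs this
  // synchronously between awaits, so no two workers claim the same index.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  const poolSize = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results;
}
```

A small limit keeps a handful of local sandboxes from competing for the same machine. Against a remote HTTP agent, most of the time is spent waiting on the network, so a larger pool mostly raises throughput.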

Attempts

npx niceeval exp local fixtures/button --runs 5 --early-exit

--runs runs each eval several times so you can measure its pass rate. --early-exit stops the remaining attempts for an eval as soon as one attempt passes.
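
A minimal sketch of how the two flags interact for a single eval, assuming each attempt reports only pass/fail. runAttempts and AttemptResult are hypothetical names.

```typescript
type AttemptResult = { passed: boolean };

// Run one eval up to `runs` times. With earlyExit, stop after the first pass.
async function runAttempts(
  attempt: (index: number) => Promise<AttemptResult>,
  runs: number,
  earlyExit: boolean,
): Promise<{ attempts: number; passes: number; passRate: number }> {
  let attempts = 0;
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    const { passed } = await attempt(i);
    attempts++;
    if (passed) {
      passes++;
      if (earlyExit) break; // skip the remaining attempts for this eval
    }
  }
  return { attempts, passes, passRate: attempts === 0 ? 0 : passes / attempts };
}
```

In this sketch, early exit caps passes at 1, so the resulting ratio answers "can it pass at all?" rather than estimating a pass rate. To measure pass rate, run all attempts.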

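Caching

niceeval can skip evals that have already passed, based on a fingerprint of their inputs, configuration, and related files. The cache speeds up iteration, but when you are debugging non-deterministic behavior you should explicitly disable or clear the relevant cache.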
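
The fingerprint idea can be sketched as a hash over a canonical serialization of everything that affects a result. The exact fields niceeval hashes are not specified here, so fingerprint, stableStringify, and PassCache are hypothetical; the sketch only shows why key order must not matter and why any file change must invalidate the cached pass.

```typescript
import { createHash } from "node:crypto";

// Serialize with object keys sorted, so {a, b} and {b, a} hash identically.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  // JSON.stringify(undefined) yields undefined at runtime; map it to "null".
  return JSON.stringify(value) ?? "null";
}

// HYPOTHETICAL fingerprint: SHA-256 over the eval input, the experiment
// config, and the contents of related files (path -> content).
function fingerprint(input: unknown, config: unknown, files: Record<string, string>): string {
  return createHash("sha256").update(stableStringify({ input, config, files })).digest("hex");
}

// In-memory set of fingerprints that already passed. enabled = false models
// turning the cache off while debugging non-deterministic behavior.
class PassCache {
  private readonly passed = new Set<string>();
  private readonly enabled: boolean;
  constructor(enabled = true) {
    this.enabled = enabled;
  }
  shouldSkip(fp: string): boolean {
    return this.enabled && this.passed.has(fp);
  }
  recordPass(fp: string): void {
    this.passed.add(fp);
  }
  clear(): void {
    this.passed.clear();
  }
}
```

Note that a cached pass is skipped even if the eval is flaky. That is exactly the case where a stale pass hides non-deterministic behavior.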

Timeouts and budget

npx niceeval exp local --timeout-ms 300000 --budget 5

The timeout protects each individual eval (300000 ms is five minutes); the budget caps the cost of the whole run.
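
The difference in scope can be sketched as two separate guards: a per-eval timeout wrapped around each eval, and one shared budget that is checked before any new eval starts. withTimeout and Budget are hypothetical names, and the budget unit is whatever --budget is measured in; the text above does not say which.

```typescript
// Per-eval guard: reject if `work` does not settle within timeoutMs.
// This only stops waiting; a real runner would also need to abort the work.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Run-wide guard: accumulate cost and refuse to start evals once spent.
class Budget {
  private spent = 0;
  private readonly limit: number;
  constructor(limit: number) {
    this.limit = limit;
  }
  canStart(): boolean {
    return this.spent < this.limit;
  }
  charge(cost: number): void {
    this.spent += cost;
  }
}
```

In this sketch, the budget only blocks new evals. Evals already in flight can still push total spend past the limit, so the effective overshoot grows with concurrency.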

Reporters

After each eval finishes, the runner hands its result to the reporters.
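
A sketch of the hand-off, assuming a reporter receives each finished result and then the full result list at the end of the run. The EvalResult and Reporter shapes, TextSummaryReporter, and dispatch are hypothetical; niceeval's actual reporter API is not shown here.

```typescript
// HYPOTHETICAL shapes for illustration only.
interface EvalResult {
  id: string;
  passed: boolean;
  durationMs: number;
}

interface Reporter {
  onEvalComplete(result: EvalResult): void;
  onRunComplete(results: readonly EvalResult[]): void;
}

// Collects one line per eval plus a final tally (printing would work the same).
class TextSummaryReporter implements Reporter {
  readonly lines: string[] = [];
  onEvalComplete(r: EvalResult): void {
    this.lines.push(`${r.passed ? "PASS" : "FAIL"} ${r.id} (${r.durationMs} ms)`);
  }
  onRunComplete(results: readonly EvalResult[]): void {
    const passed = results.filter((r) => r.passed).length;
    this.lines.push(`${passed}/${results.length} passed`);
  }
}

// Fan every finished result out to every reporter, then signal run completion.
function dispatch(reporters: readonly Reporter[], results: readonly EvalResult[]): void {
  for (const result of results) {
    for (const reporter of reporters) reporter.onEvalComplete(result);
  }
  for (const reporter of reporters) reporter.onRunComplete(results);
}
```

Keeping reporters behind one small interface lets several outputs (console, files, CI summaries) consume the same results without the runner knowing about any of them.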

Artifacts

Each run writes to .niceeval/<timestamp>/, including the summary, per-eval results, the event stream, transcripts, diffs, and test output.
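
A sketch of how such a layout could be addressed in code. The .niceeval/<timestamp>/ root comes from the text above. The timestamp format, the per-eval subdirectory, and every file name below are hypothetical, since only the kinds of artifacts are listed.

```typescript
import { posix } from "node:path";

// HYPOTHETICAL timestamp format: ISO 8601 with ":" and "." replaced so it is
// safe as a directory name and still sorts chronologically.
function timestampDirName(date: Date): string {
  return date.toISOString().replace(/[:.]/g, "-");
}

// HYPOTHETICAL file names; only the artifact kinds come from the docs.
function artifactPaths(timestamp: string, evalId: string) {
  const runDir = posix.join(".niceeval", timestamp);
  const evalDir = posix.join(runDir, evalId);
  return {
    summary: posix.join(runDir, "summary.json"),
    result: posix.join(evalDir, "result.json"),
    events: posix.join(evalDir, "events.jsonl"),
    transcript: posix.join(evalDir, "transcript.json"),
    diff: posix.join(evalDir, "changes.diff"),
    testOutput: posix.join(evalDir, "test-output.txt"),
  };
}
```

For example, timestampDirName(new Date(0)) gives "1970-01-01T00-00-00-000Z". Because that form sorts lexicographically in time order, the newest run is simply the last directory name after sorting.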