Evals in niceeval: lifecycle, outcomes, and files

An eval is a runnable test case. It is usually the default export of a *.eval.ts file, declared with defineEval.

Anatomy of an eval

import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What's the weather like in Brooklyn today?");
    t.succeeded();
    t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
    t.check(t.reply, includes("sunny"));
  },
});

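Core fields:

| Field | Description |
| --- | --- |
| `description` | Human-readable description; appears in reports |
| `agent` | Which agent adapter to use; can be overridden by config or the CLI |
| `test(t)` | Interaction and assertion logic |

Do not write an id or name by hand. niceeval derives the ID from the file path.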

The path is the identity

evals/weather/brooklyn.eval.ts has the ID weather/brooklyn. Positional arguments after the experiment name filter by ID prefix:

npx niceeval exp local weather
npx niceeval exp local weather/brooklyn

This keeps IDs stable and readable, and naturally in step with the directory structure.
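As a rough illustration of the path-to-ID rule and the prefix filter, here is a minimal sketch. The helper names `evalIdFromPath` and `filterByPrefix` are hypothetical and are not part of niceeval. The sketch also assumes the filter is a plain string-prefix match, which the real CLI may refine.

```typescript
// Hypothetical helpers for illustration; niceeval's real implementation may differ.
const EVAL_SUFFIX = ".eval.ts";

/** Derive an eval ID from a path relative to the project root, or null if it is not an eval file. */
function evalIdFromPath(filePath: string, root = "evals"): string | null {
  const normalized = filePath.replace(/\\/g, "/"); // tolerate Windows separators
  const rootPrefix = root + "/";
  if (!normalized.startsWith(rootPrefix) || !normalized.endsWith(EVAL_SUFFIX)) {
    return null;
  }
  // Strip the root directory and the .eval.ts suffix; what remains is the ID.
  return normalized.slice(rootPrefix.length, normalized.length - EVAL_SUFFIX.length);
}

/** Keep only the IDs that start with the given prefix (assumed: plain string prefix). */
function filterByPrefix(ids: string[], prefix: string): string[] {
  return ids.filter((id) => id.startsWith(prefix));
}

const discovered = [
  "evals/weather/brooklyn.eval.ts",
  "evals/billing/refund.eval.ts",
  "evals/README.md",
]
  .map((p) => evalIdFromPath(p))
  .filter((id): id is string => id !== null);

console.log(discovered); // ["weather/brooklyn", "billing/refund"]
console.log(filterByPrefix(discovered, "weather")); // ["weather/brooklyn"]
```

Because the ID is computed from the path alone, renaming or moving a file changes its ID, and nothing else does.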

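Lifecycle

Discovery

The runner loads the *.eval.ts files and fixture directories under evals/.

Scheduling

Concurrency, caching, runs, attempts, and early-exit settings are combined into an execution plan.

agent.send

t.send() calls the selected agent adapter and returns a standard Turn.

Scoring

niceeval collects value assertions, scoped assertions, judge scores, and test results.

Outcome

All assertion results are folded into a single final outcome.

Report

The console and reporters print the results, and artifacts are written to .niceeval/.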

Outcome types

passed

Every gate assertion passed and no hard failure is outstanding.

failed

At least one gate assertion failed, or the run itself failed.

passed (with soft scores)

No gate failed, but the soft scores are kept as numeric values alongside the outcome.

skipped

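The eval skipped itself, usually via t.skip(reason).

gate vs. soft

A gate is a hard threshold: a failing gate fails the eval. A soft assertion contributes to the score but does not necessarily fail the eval. See Scoring for the complete rules.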
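The following sketch shows one way results could be folded into an outcome under the gate/soft distinction. Everything in it is an assumption made for illustration: the `AssertionResult` shape, the `foldOutcome` function, and the choice to average soft scores are hypothetical, and the authoritative rules are the ones on the Scoring page.

```typescript
// Hypothetical shapes for illustration; niceeval's real types and rules may differ.
type Severity = "gate" | "soft";

interface AssertionResult {
  name: string;
  severity: Severity;
  score: number; // 0..1; a gate counts as passing only at 1
}

type Outcome = "passed" | "failed" | "skipped";

interface Folded {
  outcome: Outcome;
  softScore: number | null; // mean of soft scores, null when there are none
}

function foldOutcome(
  results: AssertionResult[],
  opts: { skipped?: boolean; runError?: boolean } = {},
): Folded {
  // A skipped eval short-circuits: no assertions are considered.
  if (opts.skipped) return { outcome: "skipped", softScore: null };

  // Soft assertions only contribute a score (assumed here: a simple mean).
  const soft = results.filter((r) => r.severity === "soft");
  const softScore =
    soft.length > 0 ? soft.reduce((sum, r) => sum + r.score, 0) / soft.length : null;

  // Any failing gate, or a failure of the run itself, fails the eval.
  const gateFailed = results.some((r) => r.severity === "gate" && r.score < 1);
  const outcome: Outcome = opts.runError || gateFailed ? "failed" : "passed";
  return { outcome, softScore };
}

const folded = foldOutcome([
  { name: "calledTool", severity: "gate", score: 1 },
  { name: "tone", severity: "soft", score: 0.5 },
]);
console.log(folded); // { outcome: "passed", softScore: 0.5 }
```

In this sketch a low soft score never changes the outcome by itself; it only lowers the reported score.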

`*.eval.ts` convention

Only files ending in .eval.ts are discovered. Use directories to express grouping:

evals/
└─ billing/
   └─ refund.eval.ts  # id: billing/refund

Array exports and Dataset Fanout

A file can also default-export an array of defineEval(...) calls to generate multiple cases from the same logic:

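import { defineEval } from "niceeval";
// Assumption: equals is exported next to includes in "niceeval/expect".
import { equals } from "niceeval/expect";

// rows is the dataset (task, prompt, expected), loaded elsewhere in this file.
export default rows.map((row) =>
  defineEval({
    description: row.task,
    async test(t) {
      await t.send(row.prompt);
      t.check(t.reply, equals(row.expected));
    },
  }),
);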

Generated IDs look like sql/0000 and sql/0001. See Dataset Fanout for details.
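To make the ID pattern concrete, here is a small sketch that reproduces IDs of that shape. The `fanoutIds` helper is hypothetical. It assumes the base ID comes from the file path (for example, a file whose ID would be sql) and that the array index is zero-padded to four digits, which matches the examples above but is not confirmed beyond them.

```typescript
// Hypothetical helper for illustration; niceeval's real ID scheme may differ.
function fanoutIds(baseId: string, count: number, width = 4): string[] {
  // One ID per array element: the base ID plus the zero-padded element index.
  return Array.from(
    { length: count },
    (_, index) => `${baseId}/${String(index).padStart(width, "0")}`,
  );
}

console.log(fanoutIds("sql", 3)); // ["sql/0000", "sql/0001", "sql/0002"]
```

If IDs follow array position in this way, reordering the dataset rows changes which case each ID refers to, so appending rows is safer than inserting them.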
