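# NiceEval Assertions: Values, Scoped Facts, Tests, and Cost

> NiceEval's assertion vocabulary (value assertions, scoped assertions, test-as-scoring, and efficiency assertions), plus the gate / soft severity levels and the rules that decide the verdict outcome.

Assertions fold everything an agent does during an eval (every message, every tool call, every file change, every token spent) into one explainable result. NiceEval provides four complementary assertion mechanisms: some check a value immediately, some evaluate the whole run after all turns finish, some run tests inside the sandbox, and some measure efficiency. All four produce the same `Assertion` type and feed the same verdict rules. A fifth mechanism, letting a language model judge open-ended quality, is covered in [Judge](/zh/concepts/judge).

## The four assertion mechanisms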

<CardGroup cols={2}>
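  <Card title="1. Value assertions" icon="equals">
    `t.check(value, matcher)` and `t.require(value, matcher)` immediately evaluate a concrete value against a matcher from `niceeval/expect`. Use them for facts you can verify on the spot.
  </Card>

  <Card title="2. Scoped assertions" icon="crosshairs">
    `t.succeeded()`, `t.calledTool()`, `t.messageIncludes()`, and similar calls are registered inside `test(t)` but evaluated together against the complete turn data only **after** the function returns. Use them for facts about the whole run.
  </Card>

  <Card title="3. Test-as-scoring" icon="flask">
    In a sandbox eval, run the project's tests, build scripts, or ad hoc probe commands from `test(t)`. Use this for coding tasks, where file contents and build results are the ground truth.
  </Card>

  <Card title="4. Efficiency assertions" icon="gauge">
    `t.maxTokens()` and `t.maxCost()` turn token usage and estimated cost into scorable dimensions. An agent that gets the right answer but burns ten times the tokens should not score the same as a frugal one.
  </Card>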
</CardGroup>

## gate and soft severity

Every assertion carries a **severity** that determines how it affects the final verdict. There are only two levels:

<Tabs>
  <Tab title="gate">
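    gate is a hard requirement. If it fails, the whole eval is judged `failed`, regardless of how the other assertions did. Use it for facts that must hold: "the right tool was called", "the output parses as valid JSON", "no shell command errored".

    Most matchers in `niceeval/expect` (`includes`, `equals`, `matches`, `satisfies`) default to gate, as do scoped assertions such as `t.succeeded()` and `t.calledTool()`.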
  </Tab>

  <Tab title="soft">
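    soft is a quality score with a numeric threshold. When the score falls below the threshold, the eval is still `passed` rather than `failed`: it signals a quality regression, not a hard break. A soft failure only counts under `--strict`.

    Use it for continuous "how good" judgments rather than "right or wrong" ones: similarity scores, LLM-as-judge factuality ratings, or a cost budget you want to track without blocking CI.

    Matchers that produce a continuous score (`similarity`) and all judge calls default to soft.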
  </Tab>
</Tabs>

Override the default severity on any matcher or assertion with a chained method:

```ts theme={null}
t.check(t.reply, includes("confirmed"));          // gate by default
t.check(t.reply, similarity(expected).gate());    // promoted to gate
t.maxTokens(80_000).atLeast(0.7);                 // demoted to soft, with a threshold
```

## Verdict rules

Once all assertions are collected, the runner folds them into one result, checking in this order:

```
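Execution error (timeout / exception / author error)         → failed
t.skip(reason) was called explicitly                         → skipped
Any gate assertion failed                                    → failed
All gates passed, but at least one soft is below threshold   → passed (flagged red under --strict)
Otherwise                                                    → passed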
```

<CardGroup cols={2}>
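  <Card title="passed" icon="circle-check" color="#22c55e">
    No errors, every gate assertion passed, and every soft assertion met its threshold.
  </Card>

  <Card title="failed" icon="circle-xmark" color="#ef4444">
    Execution errored, or at least one gate assertion did not pass. A hard failure.
  </Card>

  <Card title="passed (scored)" icon="chart-bar" color="#f59e0b">
    Every gate passed, but at least one soft assertion is below its threshold. A quality regression: not flagged red by default, only under `--strict`.
  </Card>

  <Card title="skipped" icon="forward" color="#6b7280">
    `t.skip("reason")` was called. Excluded entirely from pass-rate statistics.
  </Card>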
</CardGroup>

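With multiple runs (`runs > 1`), a single eval's summary becomes a **pass rate** (the fraction of runs that produced `passed`) and an average duration, rather than a single outcome.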
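The verdict ordering and the pass-rate summary can be sketched in a few lines of TypeScript. This is a minimal illustration, not NiceEval's actual runner: the `AssertionResult`, `RunState`, and `Verdict` shapes and the `foldVerdict` / `passRate` names are hypothetical. It reports a soft regression as a flag and leaves open how `--strict` presents it.

```typescript
// Hypothetical shapes for illustration; not NiceEval's real types.
type Severity = "gate" | "soft";

interface AssertionResult {
  name: string;
  severity: Severity;
  score: number;      // 0..1
  threshold: number;  // minimum score required to count as met
}

interface RunState {
  error?: string;       // timeout, exception, or author error
  skipReason?: string;  // set when t.skip(reason) was called
  assertions: AssertionResult[];
}

interface Verdict {
  outcome: "passed" | "failed" | "skipped";
  softRegression: boolean; // true when a soft assertion is below its threshold
}

// Apply the rules in the documented order: error, skip, gate, soft.
function foldVerdict(run: RunState): Verdict {
  if (run.error !== undefined) return { outcome: "failed", softRegression: false };
  if (run.skipReason !== undefined) return { outcome: "skipped", softRegression: false };

  const below = (a: AssertionResult): boolean => a.score < a.threshold;
  if (run.assertions.some((a) => a.severity === "gate" && below(a))) {
    return { outcome: "failed", softRegression: false };
  }
  const softRegression = run.assertions.some((a) => a.severity === "soft" && below(a));
  return { outcome: "passed", softRegression };
}

// Pass rate over several runs: passed runs divided by non-skipped runs.
function passRate(verdicts: Verdict[]): number {
  const counted = verdicts.filter((v) => v.outcome !== "skipped");
  if (counted.length === 0) return 0;
  return counted.filter((v) => v.outcome === "passed").length / counted.length;
}
```

Because the error check comes first, a run that both errored and called `t.skip` still ends as `failed` in this sketch, matching the order listed above.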

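## 1. Value assertions: `niceeval/expect` matchers

`t.check(value, assertion)` evaluates the assertion immediately and records the result. `t.require(value, assertion)` does the same, but **throws immediately** when the assertion fails, aborting the rest of the `test` function. Use it for preconditions: when a required fact does not hold, there is no point in continuing.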
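The difference between recording and aborting can be shown with a toy recorder. Everything here (`Recorder`, `Scorer`, `includesText`, `runTest`) is a hypothetical stand-in written for this page, not NiceEval's `t` object, and it assumes a score of 1 means the assertion passed.

```typescript
// Hypothetical recorder for illustration; not NiceEval's real `t` object.
type Scorer<T> = (value: T) => number;

interface Recorded {
  label: string;
  passed: boolean;
}

class Recorder {
  readonly results: Recorded[] = [];

  // check: evaluate now, record the result, keep going.
  check<T>(value: T, scorer: Scorer<T>, label: string): boolean {
    const passed = scorer(value) >= 1;
    this.results.push({ label, passed });
    return passed;
  }

  // require: same as check, but abort the caller on failure.
  require<T>(value: T, scorer: Scorer<T>, label: string): void {
    if (!this.check(value, scorer, label)) {
      throw new Error(`required assertion failed: ${label}`);
    }
  }
}

const includesText = (needle: string): Scorer<string> =>
  (s) => (s.includes(needle) ? 1 : 0);

function runTest(reply: string): Recorded[] {
  const t = new Recorder();
  try {
    t.require(reply, includesText("{"), "reply looks like JSON"); // precondition
    t.check(reply, includesText("status"), "mentions status");    // reached only if the precondition held
  } catch {
    // The remaining checks never ran; results recorded so far are kept.
  }
  return t.results;
}
```

With a reply of `"plain text"`, only the failed precondition is recorded; with `'{"status":"ok"}'`, both assertions are recorded as passed.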

Matchers provided by `niceeval/expect`:

```ts theme={null}
import {
  includes,    // substring or regex match            (default: gate)
  equals,      // deep equality                       (default: gate)
  matches,     // Standard Schema (Zod etc.) check    (default: gate)
  similarity,  // normalized Levenshtein, 0–1         (default: soft)
  satisfies,   // custom predicate + label            (default: gate)
} from "niceeval/expect";
```

Usage examples:

```ts theme={null}
// Check that the reply contains a string
t.check(t.reply, includes("Order confirmed"));

// Deep equality on structured output (`turn` is the value returned by `t.send()`)
t.check(turn.data, equals({ status: "refund", amount: 42 }));

// Validate structured output against a Zod schema
t.check(turn.data, matches(z.object({ intent: z.enum(["refund", "ship"]) })));

// Similarity with an explicit threshold
t.check(t.reply, similarity("expected answer").atLeast(0.8));

// Custom predicate
t.check(turn.data, satisfies((d) => d.total > 0, "total is positive"));
```

A matcher is a pure function, `(value) => number`, so you can write your own and pass it to `t.check` without any special registration.
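To make the "pure scoring function" idea concrete, here is a sketch of a similarity scorer based on normalized Levenshtein distance. The source only says `similarity` is a normalized Levenshtein score in the range 0–1; the exact normalization used below (one minus distance divided by the longer length) is an assumption, and `levenshtein` / `similarityScore` are illustrative names rather than library exports.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows.
// Compares UTF-16 code units, which is enough for a sketch.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / length of the longer string.
function similarityScore(expected: string): (value: string) => number {
  return (value) => {
    const longest = Math.max(expected.length, value.length);
    return longest === 0 ? 1 : 1 - levenshtein(expected, value) / longest;
  };
}
```

A scorer like this returns 1 for identical strings and drifts toward 0 as more edits are needed, which is why a continuous threshold such as `.atLeast(0.8)` fits it better than a pass/fail gate.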

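## 2. Scoped assertions

Scoped assertions are registered inside `test(t)` but evaluated only **after** the function returns, against the fully accumulated turn data. They read the standard event stream produced by `t.send()` (see [Drive](/zh/concepts/drive)) and the facts derived from it, so as long as your adapter emits the correct events, these assertions work equally well for any agent.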

<Warning>
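  Scoped assertions only appear on `t` when the agent has declared the corresponding capability. Calling `t.calledTool()` on an agent that has not declared `toolObservability: true` is a compile-time error.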
</Warning>

### Run / session level

```ts theme={null}
t.succeeded();                  // run completed, no failed actions, no unresolved HITL
t.parked();                     // stopped cleanly on a HITL input.requested event
t.messageIncludes("Regards");   // all message events, concatenated, contain this string/regex
```

### Tool / action level

```ts theme={null}
t.calledTool("bash", { input: { command: /^pwd/ }, count: 1 });
t.notCalledTool("shell", { input: { command: /npm i/ } });
t.toolOrder(["read_file", "write_file"]);   // relative order of tool calls
t.usedNoTools();
t.maxToolCalls(5);
t.loadedSkill("memory-v2");                // sugar for calledTool("load_skill", ...)
t.calledSubagent("researcher", { remoteUrl: /api\.example/ });
t.noFailedActions();                       // no tool, subagent, or skill has status "failed"
```

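The `input` argument of `calledTool` / `notCalledTool` supports a small matching language: a plain object does a deep partial match, a `RegExp` matches against the serialized input, and a predicate function receives the raw input value.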
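One way to picture the three matcher forms is the sketch below. It is an illustration under assumptions, not the library's code: `InputMatcher`, `partialMatch`, and `matchesInput` are hypothetical names, "serialized" is taken to mean `JSON.stringify`, and a `RegExp` nested inside an object pattern is assumed to test that field's stringified value (as `{ command: /^pwd/ }` suggests).

```typescript
// Hypothetical matcher type for illustration; NiceEval's real types may differ.
type InputMatcher =
  | RegExp
  | ((input: unknown) => boolean)
  | { [key: string]: unknown };

// Deep partial match: every key in `pattern` must match in `actual`;
// extra keys in `actual` are ignored. A RegExp leaf tests the stringified field.
function partialMatch(pattern: unknown, actual: unknown): boolean {
  if (pattern instanceof RegExp) return pattern.test(String(actual));
  if (Array.isArray(pattern)) {
    return Array.isArray(actual)
      && pattern.length === actual.length
      && pattern.every((p, i) => partialMatch(p, actual[i]));
  }
  if (typeof pattern === "object" && pattern !== null) {
    if (typeof actual !== "object" || actual === null) return false;
    const want = pattern as Record<string, unknown>;
    const got = actual as Record<string, unknown>;
    return Object.keys(want).every((k) => k in got && partialMatch(want[k], got[k]));
  }
  return pattern === actual;
}

function matchesInput(matcher: InputMatcher, input: unknown): boolean {
  if (matcher instanceof RegExp) return matcher.test(JSON.stringify(input)); // serialized input
  if (typeof matcher === "function") return matcher(input);                   // raw input value
  return partialMatch(matcher, input);                                        // plain object
}
```

Under these assumptions, `matchesInput({ command: /^pwd/ }, { command: "pwd -L", cwd: "/tmp" })` is true because the extra `cwd` key is ignored by the partial match.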

### Event stream level (low-level escape hatch)

```ts theme={null}
t.event("input.requested", { count: 1 });
t.notEvent("error");
t.eventOrder(["action.called", "subagent.called"]);
t.eventsSatisfy("read before write", (events) => /* custom predicate */ true);
```

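All the scoped assertions above are sugar over these low-level event stream queries. When no high-level assertion fits, drop down to `eventsSatisfy` and write an arbitrary predicate over the raw `StreamEvent[]`.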
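As an example of such a predicate, a "read before write" check might look like the following. The event shape is a simplified assumption made for this sketch (`SketchEvent` with a `name` field carrying the tool name); the real `StreamEvent` type may name its fields differently.

```typescript
// Hypothetical, simplified event shape; the real StreamEvent may differ.
interface SketchEvent {
  type: string;   // e.g. "action.called", "input.requested"
  name?: string;  // assumed: the tool name on "action.called" events
}

// True when every write_file call is preceded by at least one read_file call.
function readBeforeWrite(events: SketchEvent[]): boolean {
  let hasRead = false;
  for (const e of events) {
    if (e.type !== "action.called") continue;
    if (e.name === "read_file") hasRead = true;
    if (e.name === "write_file" && !hasRead) return false;
  }
  return true;
}
```

In an eval, a predicate of this kind would be passed as the second argument, along the lines of `t.eventsSatisfy("read before write", readBeforeWrite)`.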

### Structured output (on `turn`, not `t`)

```ts theme={null}
const turn = await t.send("Return the result as JSON");
turn.outputEquals({ status: "ok" });                         // deep equality on turn.data
turn.outputMatches(z.object({ status: z.string() }));        // Standard Schema validation
```

### Workspace level (sandbox agents only)

```ts theme={null}
t.fileChanged("src/Button.tsx");
t.fileDeleted("src/old.ts");
t.sandbox.diff.isEmpty();                      // no repository file was changed this turn
t.notInDiff(/sk-[A-Za-z0-9]/);                 // no API-key-like string (sk-...) in the diff
t.check(await t.sandbox.runCommand("npm", ["test"]), commandSucceeded());         // npm test exits with code 0
t.check(await t.sandbox.runCommand("npm", ["run", "build"]), commandSucceeded()); // npm run build exits with code 0
t.noFailedShellCommands();
```

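`t.sandbox.diff` is a queryable object: `t.sandbox.diff.get("src/Button.tsx")` returns the file's content after the change; `t.sandbox.diff.isEmpty()` checks whether any file changed; `t.sandbox.diff.matches(re)` and `t.notInDiff(re)` run a regex against the full diff text.

Scoped assertions follow one rule everywhere: **the receiver determines the scope, not the assertion's name.** `t.*` aggregates every turn in this eval run (including extra sessions opened with `t.newSession()`); `session.*` (the return value of `t.newSession()`) sees only that session; `turn.*` (the return value of `t.send()`) sees only that one turn. Same vocabulary, different receivers. See [Drive](/zh/concepts/drive) for what each receiver is.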
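The receiver rule can be pictured as one check applied to three different slices of the same data. The sketch below uses a made-up data model (`TurnRecord`, `SessionRecord`, `RunRecord`) and helper names purely for illustration; it does not reflect how NiceEval stores turns internally.

```typescript
// Hypothetical data model for illustration only.
interface TurnRecord { messages: string[] }
interface SessionRecord { turns: TurnRecord[] }
interface RunRecord { sessions: SessionRecord[] }

// The check itself is written once, against a flat list of messages.
const messageIncludes = (messages: string[], needle: string): boolean =>
  messages.join("\n").includes(needle);

// Each receiver only decides which messages are in scope.
const turnScope = (turn: TurnRecord): string[] => turn.messages;

const sessionScope = (session: SessionRecord): string[] =>
  session.turns.reduce<string[]>((acc, turn) => acc.concat(turnScope(turn)), []);

const runScope = (run: RunRecord): string[] =>
  run.sessions.reduce<string[]>((acc, s) => acc.concat(sessionScope(s)), []);

// Example data: "refund" is only mentioned in the second session.
const first: SessionRecord = { turns: [{ messages: ["Hello"] }, { messages: ["How can I help?"] }] };
const second: SessionRecord = { turns: [{ messages: ["Your refund is on its way"] }] };
const run: RunRecord = { sessions: [first, second] };

const onRun = messageIncludes(runScope(run), "refund");           // whole run
const onFirst = messageIncludes(sessionScope(first), "refund");   // one session
const onTurn = messageIncludes(turnScope(second.turns[0]), "refund"); // one turn
```

Here the same `messageIncludes` check is true on the run and on the last turn but false on the first session, which is the behavior the receiver rule describes.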

## 3. Test-as-scoring (sandbox evals)

In a sandbox coding eval, run verification commands inside `test(t)` and record the results as assertions:

```ts theme={null}
import { commandSucceeded, includes } from "niceeval/expect";

const testResult = await t.sandbox.runCommand("npm", ["test"]);
t.check(testResult, commandSucceeded());

const src = await t.sandbox.readSourceFiles({ extensions: ["ts"] });
t.check(src.text(), includes(/z\.object\s*\(/));
```

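You can also assert on behavior through the standard event stream: `t.calledTool(...)`, `t.noFailedShellCommands()`, `t.eventsSatisfy(...)`, and diff assertions such as `t.fileChanged(...)`.

## 4. Efficiency / cost assertions

Token usage is a first-class scoring dimension. An agent that gets the right answer at far more than the expected cost should not score the same as a frugal one.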

```ts theme={null}
t.maxTokens(50_000);            // hard token cap for the whole run (gate by default)
t.maxCost(0.5);                 // estimated cost cap in USD (requires a price table in the config)
t.maxTokens(80_000).atLeast(0.7);     // soft variant: still tracked, only flagged red under --strict
t.check(t.usage.outputTokens, satisfies((n) => n < 10_000, "not verbose"));
```

`t.usage` is available anywhere inside `test(t)` and exposes `{ inputTokens, outputTokens, cacheReadTokens?, … }`. For sandbox agents, the adapter extracts token counts from the transcript; remote agents return them directly in `Turn.usage`.
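How a price table turns usage into an estimated cost can be sketched as follows. The `PriceTable` shape (USD per million tokens), the idea of summing per-turn usage into a run total, and the function names are all assumptions for illustration; the configuration format NiceEval actually expects is not shown in this page.

```typescript
// Hypothetical price table shape: USD per one million tokens.
interface PriceTable {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadPerMTok?: number;
}

// Mirrors the fields named above; the real usage object has more.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Sum per-turn usage into a run total (assumed aggregation).
function totalUsage(turns: Usage[]): Usage {
  return turns.reduce<Usage>(
    (acc, u) => ({
      inputTokens: acc.inputTokens + u.inputTokens,
      outputTokens: acc.outputTokens + u.outputTokens,
      cacheReadTokens: (acc.cacheReadTokens ?? 0) + (u.cacheReadTokens ?? 0),
    }),
    { inputTokens: 0, outputTokens: 0, cacheReadTokens: 0 },
  );
}

// Estimated cost in USD for a usage record under a price table.
function estimateCost(usage: Usage, prices: PriceTable): number {
  const perToken = (perMTok: number): number => perMTok / 1_000_000;
  return (
    usage.inputTokens * perToken(prices.inputPerMTok) +
    usage.outputTokens * perToken(prices.outputPerMTok) +
    (usage.cacheReadTokens ?? 0) * perToken(prices.cacheReadPerMTok ?? 0)
  );
}
```

A cap such as `t.maxCost(0.5)` would then amount to comparing an estimate like this against the budget, which is why a missing price table leaves nothing to compare.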

## Custom scorers

A value assertion is just a function `(value) => number | Promise<number>`. Write your own with `makeAssertion`:

```ts theme={null}
import { makeAssertion } from "niceeval/expect";
import type { Assertion } from "niceeval/expect";

function jsonValid(): Assertion {
  return makeAssertion({
    name: "jsonValid",
    severity: "gate",
    score: (value) => {
      try { JSON.parse(String(value)); return 1; }
      catch { return 0; }
    },
  });
}

t.check(t.reply, jsonValid());
```

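Custom matchers support the same chained methods as the built-in ones: `.gate()` and `.atLeast(0.7)`.

## Related reading

* [Drive](/zh/concepts/drive): `t.send()`, `t.newSession()`, and HITL; how the Turn data these assertions read is produced.
* [Judge](/zh/concepts/judge): the fifth scoring mechanism, for open-ended quality that cannot be written as fixed rules.
* [Agents & Adapters](/zh/concepts/agents-adapters): how the standard event stream is produced and what scoped assertions depend on from it.
* [Evals](/zh/concepts/evals): how assertions fold into the eval lifecycle and outcome types.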