# niceeval scoring: assertions, judge calls, and outcomes

> niceeval scoring: five mechanisms — value assertions, scoped assertions, LLM-as-judge, test-as-scoring, and efficiency checks. Gate vs soft severity.

Scoring is the process of taking everything an agent did during an eval — every message, tool call, file change, and token spent — and folding it into a single, actionable outcome. niceeval gives you five scoring mechanisms that complement each other: some check values immediately, some assess the whole run after it completes, some ask a language model to judge open-ended quality, some execute tests inside the sandbox, and some measure efficiency. All five produce the same `Assertion` type, and all five feed into the same outcome rules.

## The five scoring mechanisms

<CardGroup cols={2}>
  <Card title="1. Value assertions" icon="equals">
    `t.check(value, matcher)` and `t.require(value, matcher)` evaluate a specific value immediately against a matcher from `niceeval/expect`. Use these for facts you can verify inline.
  </Card>

  <Card title="2. Scoped assertions" icon="crosshairs">
    `t.succeeded()`, `t.calledTool()`, `t.messageIncludes()`, and friends are registered during `test(t)` but evaluated **after** the function returns, against the complete turn data. Use these for whole-run facts.
  </Card>

  <Card title="3. LLM-as-judge" icon="gavel">
    `t.judge.autoevals.factuality()`, `t.judge.autoevals.closedQA()`, and `t.judge.autoevals.summarizes()` ask a separate evaluator model to score open-ended output. The judge model is fully independent from the agent under test.
  </Card>

  <Card title="4. Test-as-scoring" icon="flask">
    For sandbox evals, `EVAL.ts` is a Vitest test file that runs inside the sandbox after the agent finishes. Every `test()` in `EVAL.ts` becomes a gate assertion. Use this for coding tasks where file content and build results are the ground truth.
  </Card>

  <Card title="5. Efficiency assertions" icon="gauge">
    `t.maxTokens()` and `t.maxCost()` turn token usage and estimated cost into scoreable dimensions. An agent that answers correctly but burns ten times the expected tokens should not score the same as one that answers efficiently.
  </Card>
</CardGroup>

## Gate vs soft severity

Every assertion carries a **severity** that determines how it influences the final outcome. There are exactly two severities:

<Tabs>
  <Tab title="gate">
    A gate assertion is a hard requirement. If it fails, the entire eval is immediately classified as `failed` — regardless of how well every other assertion passed. Use gate for facts that must be true: "the agent called the correct tool", "the response parsed as valid JSON", "no shell commands errored."

    Most matchers in `niceeval/expect` (`includes`, `equals`, `matches`, `satisfies`) default to gate. Scoped assertions like `t.succeeded()` and `t.calledTool()` also default to gate.
  </Tab>

  <Tab title="soft">
    A soft assertion is a quality score with a numeric threshold. If the score falls below the threshold, the eval is not classified as `failed`; it is reported as passed with soft failures, a signal that there is a quality regression but not a hard breakage. Soft failures only count as failures when you run with `--strict`.

    Use soft for continuous judgments where the answer is "how good" rather than "correct or not": similarity scoring, LLM-as-judge factuality ratings, cost budgets you want to track without blocking CI.

    Matchers that produce a continuous score (`similarity`) and all judge calls default to soft.
  </Tab>
</Tabs>

You can override the default severity with a chain method on any matcher or assertion:

```ts theme={null}
t.check(t.reply, includes("confirmed"));          // gate (default)
t.check(t.reply, similarity(expected).gate());    // promote to gate
t.judge.autoevals.closedQA("Is the tone professional?").atLeast(0.7);  // soft, threshold 0.7
t.maxTokens(80_000).atLeast(0.7);                       // demote to soft
```

## Outcome rules

Once all assertions are collected, `outcome.ts` folds them into a single outcome in this order:

```
Execution error (timeout / thrown exception / author mistake)  →  failed
Any gate assertion failed                                       →  failed
Explicit t.skip(reason) was called                             →  skipped
All gates passed, but at least one soft is below its threshold →  passed, with soft failures (failed under --strict)
Otherwise                                                       →  passed
```

<CardGroup cols={2}>
  <Card title="passed" icon="circle-check" color="#22c55e">
    No errors, all gate assertions passed, all soft assertions met their thresholds.
  </Card>

  <Card title="failed" icon="circle-xmark" color="#ef4444">
    Execution error **or** at least one gate assertion did not pass. Hard failure.
  </Card>

  <Card title="passed (with soft failures)" icon="chart-bar" color="#f59e0b">
    All gates passed, but at least one soft assertion fell below its threshold. This signals a quality regression: silent by default, red under `--strict`.
  </Card>

  <Card title="skipped" icon="forward" color="#6b7280">
    `t.skip("reason")` was called. Excluded from pass-rate calculations entirely.
  </Card>
</CardGroup>
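
The fold order above can be sketched as a small pure function. This is an illustration under stated assumptions, not niceeval's actual `outcome.ts`: the `RecordedAssertion`, `RunFacts`, and `FoldResult` shapes, the `foldOutcome` name, and the way `strict` is applied are all hypothetical. Soft failures are reported as a count rather than as a named outcome.

```typescript
// Hypothetical shapes for illustration; not niceeval's real types.
type Severity = "gate" | "soft";

interface RecordedAssertion {
  label: string;
  severity: Severity;
  score: number;     // 0..1
  threshold: number; // score >= threshold counts as met
}

interface RunFacts {
  executionError: boolean; // timeout, thrown exception, author mistake
  skipped: boolean;        // t.skip(reason) was called
  assertions: RecordedAssertion[];
}

interface FoldResult {
  verdict: "failed" | "skipped" | "passed";
  softFailures: number; // soft assertions below their threshold
}

function foldOutcome(run: RunFacts, strict = false): FoldResult {
  const below = (a: RecordedAssertion): boolean => a.score < a.threshold;
  const softFailures = run.assertions.filter(
    (a) => a.severity === "soft" && below(a),
  ).length;

  // Rules are checked in the documented order; the first match wins.
  if (run.executionError) return { verdict: "failed", softFailures };
  if (run.assertions.some((a) => a.severity === "gate" && below(a))) {
    return { verdict: "failed", softFailures };
  }
  if (run.skipped) return { verdict: "skipped", softFailures };
  // Assumption: --strict turns any soft failure into a hard failure.
  if (strict && softFailures > 0) return { verdict: "failed", softFailures };
  return { verdict: "passed", softFailures };
}
```

Because the rules short-circuit, a failed gate hides nothing about soft scores: the count is computed up front and travels with every verdict.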

When you run an eval more than once (`runs > 1`), the per-eval summary becomes a **pass rate** (the fraction of runs that produced `passed`) and an average latency, rather than a single outcome.
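
As a rough sketch of that aggregation, assuming a hypothetical `RunResult` shape with a `verdict` and a `latencyMs` field: skipped runs are left out of the pass rate, as described for the `skipped` outcome. Averaging latency over the same non-skipped runs is an assumption of this sketch, not something the docs specify.

```typescript
// Hypothetical per-run record; field names are illustrative only.
interface RunResult {
  verdict: "passed" | "failed" | "skipped";
  latencyMs: number;
}

interface EvalSummary {
  passRate: number | null;     // null when every run was skipped
  avgLatencyMs: number | null;
}

function summarizeRuns(runs: RunResult[]): EvalSummary {
  // Skipped runs are excluded from pass-rate calculations entirely.
  const counted = runs.filter((r) => r.verdict !== "skipped");
  if (counted.length === 0) return { passRate: null, avgLatencyMs: null };

  const passed = counted.filter((r) => r.verdict === "passed").length;
  const totalLatency = counted.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    passRate: passed / counted.length,
    avgLatencyMs: totalLatency / counted.length,
  };
}
```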

## 1. Value assertions — `niceeval/expect` matchers

`t.check(value, assertion)` evaluates the assertion immediately and records the result. `t.require(value, assertion)` does the same but **throws immediately** if the assertion fails, aborting the rest of the test function. Use `t.require` for preconditions: if a required fact is false, there is no point continuing.
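
The difference between the two can be modeled with a toy context. This is a self-contained sketch, not niceeval's implementation: `SketchContext`, the `Scorer` type, and the rule that a score below 1 counts as a failure are assumptions made for illustration.

```typescript
// Hypothetical stand-ins for illustration only.
type Scorer = (value: unknown) => number;

interface RecordedResult {
  label: string;
  score: number;
  passed: boolean;
}

class SketchContext {
  readonly results: RecordedResult[] = [];

  // Evaluate immediately, record the result, and let the caller continue.
  check(value: unknown, label: string, scorer: Scorer): boolean {
    const score = scorer(value);
    const passed = score >= 1; // assumption: anything below 1 is a failure
    this.results.push({ label, score, passed });
    return passed;
  }

  // Same evaluation and recording, but abort the caller on failure.
  require(value: unknown, label: string, scorer: Scorer): void {
    if (!this.check(value, label, scorer)) {
      throw new Error(`required assertion failed: ${label}`);
    }
  }
}
```

In this model a failed `check` leaves a record and execution moves on, while a failed `require` leaves the same record and then throws, so nothing after it runs.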

The matchers available from `niceeval/expect`:

```ts theme={null}
import {
  includes,    // substring or regex match          (default: gate)
  equals,      // deep equality                     (default: gate)
  matches,     // Standard Schema (Zod etc.) check  (default: gate)
  similarity,  // normalized Levenshtein 0–1        (default: soft)
  satisfies,   // custom predicate + label          (default: gate)
} from "niceeval/expect";
```

Usage examples:

```ts theme={null}
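// Assumes this runs inside test(t), with `z` imported from "zod" and
// `turn` obtained earlier via `const turn = await t.send(...)`.

// Check that the agent's reply contains a specific string
t.check(t.reply, includes("order confirmed"));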

// Deep-equal check on structured output
t.check(turn.data, equals({ status: "refund", amount: 42 }));

// Validate structured output against a Zod schema
t.check(turn.data, matches(z.object({ intent: z.enum(["refund", "ship"]) })));

// Similarity with an explicit threshold
t.check(t.reply, similarity("expected answer").atLeast(0.8));

// Custom predicate
t.check(turn.data, satisfies((d) => d.total > 0, "total is positive"));
```

Matchers are pure functions — `(value) => number` — so you can write your own and pass them to `t.check` without any special registration.
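
To make the "pure function" shape concrete, here is a minimal sketch of a similarity-style matcher built on Levenshtein distance. The names `levenshtein` and `editSimilarity` are hypothetical, and the normalization (one minus distance divided by the longer string's length) is an assumption; the built-in `similarity` may normalize differently.

```typescript
// A matcher in the sense used above: a pure function from a value to a score in [0, 1].
type Matcher = (value: unknown) => number;

// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 for identical strings, 0 for nothing in common.
function editSimilarity(expected: string): Matcher {
  return (value) => {
    const actual = String(value);
    const longest = Math.max(actual.length, expected.length);
    return longest === 0 ? 1 : 1 - levenshtein(actual, expected) / longest;
  };
}
```

Because the returned function takes a value and returns a number, it has the shape that `t.check` is described as accepting.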

## 2. Scoped assertions

Scoped assertions are registered during `test(t)` but evaluated **after the function returns**, against the complete accumulated turn data. They read from the standard event stream and its derived facts — so as long as your adapter produces correct events, these assertions work identically for every agent.

<Warning>
  Scoped assertions only appear on `t` if the agent has declared the corresponding capability. Calling `t.calledTool()` when the agent has not declared `toolObservability: true` is a compile error.
</Warning>

### Run / session dimension

```ts theme={null}
t.succeeded();                  // run completed with no failed actions and no unresolved HITL
t.parked();                     // cleanly stopped on a HITL input.requested event
t.messageIncludes("Regards,");  // all message events concatenated contain this string/regex
```

### Tool / action dimension

```ts theme={null}
t.calledTool("bash", { input: { command: /^pwd/ }, count: 1 });
t.notCalledTool("shell", { input: { command: /npm i/ } });
t.toolOrder(["read_file", "write_file"]);   // relative order of tool calls
t.usedNoTools();
t.maxToolCalls(5);
t.loadedSkill("memory-v2");                // sugar for calledTool("load_skill", ...)
t.calledSubagent("researcher", { remoteUrl: /api\.example/ });
t.noFailedActions();                       // no tool, subagent, or skill has status "failed"
```

The `input` argument to `calledTool` and `notCalledTool` supports a small matching language: a plain object performs deep partial matching, a `RegExp` matches against the serialized input, and a predicate function receives the raw input value.
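
One way to picture that matching language is the recursive function below. It is a sketch under assumptions, not the library's matcher: `matchesInput` is a hypothetical name, "serialized" is taken to mean `JSON.stringify`, and nested pattern values (such as the `/^pwd/` in the example above) are matched by applying the same rules to the corresponding field.

```typescript
// Hypothetical sketch of the three pattern kinds: object, RegExp, predicate.
function matchesInput(input: unknown, pattern: unknown): boolean {
  // RegExp: test against the input, serialized if it is not already a string.
  if (pattern instanceof RegExp) {
    const text = typeof input === "string" ? input : String(JSON.stringify(input));
    return pattern.test(text);
  }
  // Predicate: receives the raw input value.
  if (typeof pattern === "function") {
    return Boolean(pattern(input));
  }
  // Plain object: deep partial match; keys absent from the pattern are ignored.
  if (pattern !== null && typeof pattern === "object") {
    if (input === null || typeof input !== "object") return false;
    const actual = input as Record<string, unknown>;
    return Object.entries(pattern).every(([key, expected]) =>
      matchesInput(actual[key], expected),
    );
  }
  // Primitive leaf: strict equality.
  return Object.is(input, pattern);
}
```

Partial matching is what keeps assertions stable when a tool gains extra input fields: only the keys named in the pattern are compared.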

### Event stream dimension (low-level escape hatch)

```ts theme={null}
t.event("input.requested", { count: 1 });
t.notEvent("error");
t.eventOrder(["action.called", "subagent.called"]);
t.eventsSatisfy("read before write", (events) => /* custom predicate */ true);
```

All scoped assertions above are syntactic sugar for these low-level event stream queries. When none of the higher-level assertions fit your use case, you can drop down to `eventsSatisfy` and write an arbitrary predicate over the raw `StreamEvent[]`.
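
As an illustration of such a predicate, the sketch below checks that the first `read_file` call precedes the first `write_file` call. The `SketchEvent` shape, with a `type` and an optional `tool` field, is an assumption made for this example; the real `StreamEvent` type may carry the tool name differently.

```typescript
// Hypothetical, simplified event shape; not the real StreamEvent type.
interface SketchEvent {
  type: string;
  tool?: string;
}

// True when nothing was written, or when a read happened before the first write.
function readBeforeWrite(events: SketchEvent[]): boolean {
  const calls = events.filter((e) => e.type === "action.called");
  const firstWrite = calls.findIndex((e) => e.tool === "write_file");
  if (firstWrite === -1) return true; // no write, so nothing to order
  const firstRead = calls.findIndex((e) => e.tool === "read_file");
  return firstRead !== -1 && firstRead < firstWrite;
}
```

A function of this shape could be passed as the second argument to `t.eventsSatisfy("read before write", ...)`, provided the event fields match what your adapter emits.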

### Structured output (on `turn`, not `t`)

```ts theme={null}
const turn = await t.send("Return the result as JSON");
turn.outputEquals({ status: "ok" });                         // deep equality on turn.data
turn.outputMatches(z.object({ status: z.string() }));        // Standard Schema validation
```

### Workspace dimension (sandbox agents only)

```ts theme={null}
t.fileChanged("src/Button.tsx");
t.fileDeleted("src/old.ts");
t.sandbox.diff.isEmpty();                // no repository files were modified this turn
t.notInDiff(/sk-[A-Za-z0-9]/);           // diff contains no API-key-like secrets
// npm test exited 0
t.check(await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }), commandSucceeded());
// npm run build exited 0
t.check(await t.sandbox.runCommand("npm", ["run", "build"], { cwd: "/workspace" }), commandSucceeded());
t.noFailedShellCommands();
```

`t.sandbox.diff` is a queryable object. `t.sandbox.diff.get("src/Button.tsx")` returns the file's post-change content; `t.sandbox.diff.isEmpty()` checks whether any files changed; `t.sandbox.diff.matches(re)` and `t.notInDiff(re)` run a regex over the full diff text.

## 3. LLM-as-judge

Use judge assertions when correctness cannot be expressed as a rule — for open-ended prose, tone, factual consistency, or summarization quality. The evaluator model is **entirely separate** from the agent under test, preventing self-evaluation bias.

```ts theme={null}
t.judge.autoevals.factuality(reference).atLeast(0.8);   // factual consistency with a reference text
t.judge.autoevals.closedQA("Is the response appropriate for a 10-year-old?");
t.judge.autoevals.summarizes(sourceDocument);            // faithful summarization check
t.judge.autoevals.closedQA("Custom scoring rubric description", { on: t.reply });
```

The `{ on }` option specifies which value to evaluate (defaults to `t.reply`). The `{ model }` option lets you override the judge model for a single call.

### Judge model resolution

The judge model is resolved from most- to least-specific:

```
Per-call { model } option
  ↓
Per-eval judge.model
  ↓
Global config judge.model
```

```ts theme={null}
// niceeval.config.ts — global default
defineConfig({ judge: { model: "anthropic/claude-haiku-4-5" } });

// A specific eval that needs a more capable judge
defineEval({
  judge: { model: "anthropic/claude-opus-4-8" },
  async test(t) {
    t.judge.autoevals.factuality(reference);  // uses claude-opus-4-8 for this eval
  },
});

// A single call override
t.judge.autoevals.closedQA("rubric", { on: t.reply, model: "openai/gpt-4o" });
```
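
The resolution chain amounts to a first-defined-wins lookup. A minimal sketch, assuming hypothetical names (`JudgeLayers`, `resolveJudgeModel`) and assuming that a missing model at every level is an error, which the docs do not state:

```typescript
// Hypothetical container for the three configuration levels.
interface JudgeLayers {
  perCall?: string; // { model } on a single judge call
  perEval?: string; // judge.model in defineEval
  global?: string;  // judge.model in the global config
}

function resolveJudgeModel(layers: JudgeLayers): string {
  // Most specific level wins; fall through only when a level is unset.
  const model = layers.perCall ?? layers.perEval ?? layers.global;
  if (model === undefined) {
    // Assumption: having no judge model anywhere is treated as an error.
    throw new Error("no judge model configured");
  }
  return model;
}
```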

## 4. Test-as-scoring (sandbox evals)

For sandbox coding evals, the `EVAL.ts` file is a Vitest test suite that runs **inside the sandbox** after the agent completes its task. Every `test()` block in `EVAL.ts` becomes a gate assertion — pass means gate passes, fail means gate fails.

```ts theme={null}
// evals/fixtures/button/EVAL.ts — not visible to the agent during its run
import { test, expect } from "vitest";
import { existsSync, readFileSync } from "node:fs";

test("Button component exists", () => {
  expect(existsSync("src/components/Button.tsx")).toBe(true);
});

test("accepts label and onClick props", () => {
  const src = readFileSync("src/components/Button.tsx", "utf-8");
  expect(src).toContain("label");
  expect(src).toContain("onClick");
});
```

You can also assert **agent behavior** (not just output files) by reading the observability summary that niceeval injects into the sandbox:

```ts theme={null}
test("initialized with scaffold, not hand-written", () => {
  const o11y = JSON.parse(readFileSync("__niceeval__/results.json", "utf-8")).o11y;
  const cmds = o11y.shellCommands.map((c: { command: string }) => c.command);
  expect(cmds.some((c) => c.includes("create-next-app"))).toBe(true);
});
```

The `validation` field in the eval controls what gets run: `"vitest"` runs `EVAL.ts` plus any configured npm scripts; `"none"` runs only npm scripts.

## 5. Efficiency / cost assertions

Token usage is a first-class scoring dimension. An agent that answers correctly but burns far more tokens than expected should not score identically to one that answers efficiently.

```ts theme={null}
t.maxTokens(50_000);            // hard token limit for the entire run (gate by default)
t.maxCost(0.5);                 // estimated cost cap in USD (requires a price table in config)
t.maxTokens(80_000).atLeast(0.7);     // soft variant — tracked but only red under --strict
t.check(t.usage.outputTokens, satisfies((n) => n < 10_000, "not verbose"));
```

`t.usage` is available anywhere inside `test(t)` and exposes `{ inputTokens, outputTokens, cacheReadTokens?, … }`. For sandbox agents, token counts are extracted from the transcript by the adapter; for remote agents, they are returned in `Turn.usage`.
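
For intuition about how a cost cap relates to token counts, here is a sketch of a cost estimate from a usage record. The `PriceTable` shape (USD per million tokens), the function names, and the decision to price cache reads separately are assumptions for illustration; niceeval's actual price table format is not shown in this page.

```typescript
// Mirrors the usage fields named above; other fields are omitted.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Hypothetical price table: USD per one million tokens.
interface PriceTable {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadPerMTok?: number;
}

function estimateCostUsd(usage: TokenUsage, prices: PriceTable): number {
  const perToken = (perMTok: number): number => perMTok / 1_000_000;
  return (
    usage.inputTokens * perToken(prices.inputPerMTok) +
    usage.outputTokens * perToken(prices.outputPerMTok) +
    (usage.cacheReadTokens ?? 0) * perToken(prices.cacheReadPerMTok ?? 0)
  );
}

// A cap check in the spirit of t.maxCost(limit).
function withinBudget(usage: TokenUsage, prices: PriceTable, maxUsd: number): boolean {
  return estimateCostUsd(usage, prices) <= maxUsd;
}
```

The estimate is only as good as the price table, which is why a cost cap needs one in config while a token cap does not.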

## Custom scorers

A value assertion is just a function `(value) => number | Promise<number>`. You can write custom matchers using `makeAssertion`:

```ts theme={null}
import { makeAssertion } from "niceeval/expect";
import type { Assertion } from "niceeval/expect";

function jsonValid(): Assertion {
  return makeAssertion({
    name: "jsonValid",
    severity: "gate",
    score: (value) => {
      try { JSON.parse(String(value)); return 1; }
      catch { return 0; }
    },
  });
}

t.check(t.reply, jsonValid());
```

Custom matchers compose with the same chain methods as built-ins: `.gate()` and `.atLeast(threshold)`.

## Related pages

* [Evals](/concepts/evals) — how assertions fold into the eval lifecycle and outcome types.
* [Agents & Adapters](/concepts/agents-adapters) — how the standard event stream is produced, which scoped assertions depend on.
* [Overview](/concepts/overview) — the full architecture and where scoring fits.
