niceeval scoring: assertions, judge calls, and outcomes

Scoring is the process of taking everything an agent did during an eval — every message, tool call, file change, and token spent — and folding it into a single, actionable outcome. niceeval gives you five scoring mechanisms that complement each other: some check values immediately, some assess the whole run after it completes, some ask a language model to judge open-ended quality, some execute tests inside the sandbox, and some measure efficiency. All five produce the same Assertion type, and all five feed into the same outcome rules.

The five scoring mechanisms

1. Value assertions

t.check(value, matcher) and t.require(value, matcher) evaluate a specific value immediately against a matcher from niceeval/expect. Use these for facts you can verify inline.

2. Scoped assertions

t.succeeded(), t.calledTool(), t.messageIncludes(), and friends are registered during test(t) but evaluated after the function returns, against the complete turn data. Use these for whole-run facts.

3. LLM-as-judge

t.judge.autoevals.factuality(), t.judge.autoevals.closedQA(), and t.judge.autoevals.summarizes() ask a separate evaluator model to score open-ended output. The judge model is fully independent from the agent under test.

4. Test-as-scoring

For sandbox evals, EVAL.ts is a Vitest test file that runs inside the sandbox after the agent finishes. Every test() in EVAL.ts becomes a gate assertion. Use this for coding tasks where file content and build results are the ground truth.

5. Efficiency assertions

t.maxTokens() and t.maxCost() turn token usage and estimated cost into scoreable dimensions. An agent that answers correctly but burns ten times the expected tokens should not score the same as one that answers efficiently.

Gate vs soft severity

Every assertion carries a severity that determines how it influences the final outcome. There are exactly two severities:

gate
soft

A gate assertion is a hard requirement. If it fails, the entire eval is classified as failed, regardless of how well every other assertion scored. Use gate for facts that must be true: “the agent called the correct tool”, “the response parsed as valid JSON”, “no shell commands errored.”

Most matchers in niceeval/expect (includes, equals, matches, satisfies) default to gate. Scoped assertions like t.succeeded() and t.calledTool() also default to gate.

A soft assertion is a quality score with a numeric threshold. If the score falls below the threshold, the eval is still reported as passed, but as a pass with a soft failure rather than a clean pass: a signal that there is a quality regression, but not a hard breakage. Soft failures only count as failures when you run with --strict.

Use soft for continuous judgments where the answer is “how good” rather than “correct or not”: similarity scoring, LLM-as-judge factuality ratings, cost budgets you want to track without blocking CI.

Matchers that produce a continuous score (similarity) and all judge calls default to soft.

You can override the default severity with a chain method on any matcher or assertion:

t.check(t.reply, includes("confirmed"));          // gate (default)
t.check(t.reply, similarity(expected).gate());    // promote to gate
t.judge.autoevals.closedQA("Is the tone professional?").atLeast(0.7);  // soft, threshold 0.7
t.maxTokens(80_000).atLeast(0.7);                       // demote to soft

Outcome rules

Once all assertions are collected, outcome.ts folds them into a single outcome in this order:

Execution error (timeout / thrown exception / author mistake)  →  failed
Any gate assertion failed                                       →  failed
Explicit t.skip(reason) was called                             →  skipped
All gates passed, but at least one soft is below its threshold →  passed (soft failure)
Otherwise                                                       →  passed

passed

No errors, all gate assertions passed, all soft assertions met their thresholds.

failed

Execution error or at least one gate assertion did not pass. Hard failure.

passed (soft failure)

All gates passed, but at least one soft assertion fell below its threshold. This is a quality regression: silent by default, red under --strict.

skipped

t.skip("reason") was called. Excluded from pass-rate calculations entirely.

When you run an eval more than once (runs > 1), the per-eval summary becomes a pass rate (the fraction of runs that produced passed) and an average latency, rather than a single outcome.
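The ordering of the outcome rules and the pass-rate summary can be modeled in a few lines. The sketch below is an illustration, not niceeval's outcome.ts: the `RunFacts`, `AssertionResult`, and `Folded` shapes, the `softFailures` list, and the `strict` parameter are assumptions made for the example. It also assumes that a soft failure still counts as passed in the pass rate unless strict mode is on, which the text does not state explicitly.

```typescript
// Illustrative model only: all names and shapes here are assumptions.
type Severity = "gate" | "soft";
type Outcome = "passed" | "failed" | "skipped";

interface AssertionResult {
  name: string;
  severity: Severity;
  score: number; // 0..1
  threshold: number; // minimum score that counts as met
}

interface RunFacts {
  executionError?: string; // timeout, thrown exception, author mistake
  skipReason?: string; // set when t.skip(reason) was called
  assertions: AssertionResult[];
}

interface Folded {
  outcome: Outcome;
  softFailures: string[]; // names of soft assertions below their threshold
}

function foldOutcome(run: RunFacts, strict = false): Folded {
  const below = (a: AssertionResult): boolean => a.score < a.threshold;
  const softFailures = run.assertions
    .filter((a) => a.severity === "soft" && below(a))
    .map((a) => a.name);

  // Rules are checked in the documented order: error, gate, skip, soft.
  if (run.executionError !== undefined) return { outcome: "failed", softFailures };
  if (run.assertions.some((a) => a.severity === "gate" && below(a))) {
    return { outcome: "failed", softFailures };
  }
  if (run.skipReason !== undefined) return { outcome: "skipped", softFailures };
  if (strict && softFailures.length > 0) return { outcome: "failed", softFailures };
  return { outcome: "passed", softFailures };
}

// Fraction of non-skipped runs that passed; returns 0 when every run was skipped.
function passRate(outcomes: Outcome[]): number {
  const counted = outcomes.filter((o) => o !== "skipped");
  if (counted.length === 0) return 0;
  return counted.filter((o) => o === "passed").length / counted.length;
}
```

Keeping the soft failures as a separate list, rather than a fourth outcome value, is one way to let the same run read as a pass by default and as a failure under strict mode.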

1. Value assertions — `niceeval/expect` matchers

t.check(value, assertion) evaluates the assertion immediately and records the result. t.require(value, assertion) does the same but throws immediately if the assertion fails, aborting the rest of the test function. Use t.require for preconditions: if a required fact is false, there is no point continuing. The matchers available from niceeval/expect:

import {
  includes,    // substring or regex match          (default: gate)
  equals,      // deep equality                     (default: gate)
  matches,     // Standard Schema (Zod etc.) check  (default: gate)
  similarity,  // normalized Levenshtein 0–1        (default: soft)
  satisfies,   // custom predicate + label          (default: gate)
} from "niceeval/expect";

Usage examples:

// Check that the agent's reply contains a specific string
t.check(t.reply, includes("order confirmed"));

// Deep-equal check on structured output
t.check(turn.data, equals({ status: "refund", amount: 42 }));

// Validate structured output against a Zod schema
t.check(turn.data, matches(z.object({ intent: z.enum(["refund", "ship"]) })));

// Similarity with an explicit threshold
t.check(t.reply, similarity("expected answer").atLeast(0.8));

// Custom predicate
t.check(turn.data, satisfies((d) => d.total > 0, "total is positive"));

Matchers are pure functions — (value) => number — so you can write your own and pass them to t.check without any special registration.
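To make the similarity score concrete, here is a minimal sketch of a normalized Levenshtein similarity. It assumes the common normalization of one minus the edit distance divided by the longer string's length; the text only says "normalized Levenshtein 0–1", so niceeval's exact formula may differ. The function names are hypothetical.

```typescript
// Classic dynamic-programming edit distance, keeping only one previous row.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
// Lengths are counted in UTF-16 code units, as with any JS string index.
function similarityScore(actual: string, expected: string): number {
  const longest = Math.max(actual.length, expected.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshtein(actual, expected) / longest;
}

// "kitten" -> "sitting" needs 3 edits over a longest length of 7.
const example = similarityScore("kitten", "sitting");
console.log(example.toFixed(3)); // prints "0.571"
```

Under this normalization, a threshold such as atLeast(0.8) tolerates edits to roughly one character in five.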

2. Scoped assertions

Scoped assertions are registered during test(t) but evaluated after the function returns, against the complete accumulated turn data. They read from the standard event stream and its derived facts — so as long as your adapter produces correct events, these assertions work identically for every agent.

Scoped assertions only appear on t if the agent has declared the corresponding capability. Calling t.calledTool() when the agent has not declared toolObservability: true is a compile error.
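One way such capability gating can be expressed in TypeScript is with a conditional type over the declared capabilities. The sketch below is an assumption about the technique, not niceeval's actual type definitions; `Capabilities`, `Context`, and `makeContext` are hypothetical names.

```typescript
// Hypothetical shapes, for illustration only.
interface Capabilities {
  toolObservability?: boolean;
}

interface BaseContext {
  succeeded(): void;
}

interface ToolAssertions {
  calledTool(name: string): void;
}

// Tool assertions are part of the context type only when the capability
// is declared as the literal `true`.
type Context<C extends Capabilities> = BaseContext &
  (C extends { toolObservability: true } ? ToolAssertions : unknown);

function makeContext<C extends Capabilities>(caps: C): Context<C> {
  const registered: string[] = [];
  const base: BaseContext = {
    succeeded: () => {
      registered.push("succeeded");
    },
  };
  const tools: ToolAssertions = {
    calledTool: (name) => {
      registered.push(`calledTool:${name}`);
    },
  };
  const ctx = caps.toolObservability === true ? { ...base, ...tools } : base;
  return ctx as unknown as Context<C>;
}

const withTools = makeContext({ toolObservability: true } as const);
withTools.calledTool("bash"); // type-checks: capability was declared

const plain = makeContext({});
plain.succeeded();
// plain.calledTool("bash"); // would not type-check: property does not exist
```

The benefit of this design is that a missing capability surfaces in the editor while the eval is being written, rather than as a runtime failure halfway through a run.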

Run / session dimension

t.succeeded();                  // run completed with no failed actions and no unresolved HITL
t.parked();                     // cleanly stopped on a HITL input.requested event
t.messageIncludes("Regards,");  // all message events concatenated contain this string/regex

Tool / action dimension

t.calledTool("bash", { input: { command: /^pwd/ }, count: 1 });
t.notCalledTool("shell", { input: { command: /npm i/ } });
t.toolOrder(["read_file", "write_file"]);   // relative order of tool calls
t.usedNoTools();
t.maxToolCalls(5);
t.loadedSkill("memory-v2");                // sugar for calledTool("load_skill", ...)
t.calledSubagent("researcher", { remoteUrl: /api\.example/ });
t.noFailedActions();                       // no tool, subagent, or skill has status "failed"

The input argument to calledTool and notCalledTool supports a small matching language: a plain object performs deep partial matching, a RegExp matches against the serialized input, and a predicate function receives the raw input value.
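The three pattern kinds can be sketched as a single recursive function. This is a simplified model under stated assumptions, not niceeval's matcher: in particular, it assumes that a RegExp nested inside an object pattern is tested against the corresponding field (as in the `{ command: /^pwd/ }` example above), and that non-string values are serialized with JSON.stringify. `matchesInput` is a hypothetical name.

```typescript
// Serialize a value for regex matching; strings are used as-is (assumption).
function serialize(value: unknown): string {
  return typeof value === "string" ? value : String(JSON.stringify(value));
}

function matchesInput(pattern: unknown, input: unknown): boolean {
  // RegExp: match against the serialized input.
  if (pattern instanceof RegExp) return pattern.test(serialize(input));

  // Predicate: receives the raw input value.
  if (typeof pattern === "function") return Boolean(pattern(input));

  // Plain object: deep partial match, extra keys on the input are ignored.
  if (typeof pattern === "object" && pattern !== null) {
    if (typeof input !== "object" || input === null) return false;
    const actual = input as Record<string, unknown>;
    return Object.entries(pattern).every(([key, expected]) =>
      matchesInput(expected, actual[key]),
    );
  }

  // Primitive leaf: strict equality.
  return Object.is(pattern, input);
}

const call = { command: "pwd -P", cwd: "/workspace" };
console.log(matchesInput({ command: /^pwd/ }, call)); // prints true
console.log(matchesInput({ cwd: "/tmp" }, call)); // prints false
```

Note that a RegExp created with the g or y flag keeps state between test() calls, so patterns passed to a matcher like this are best written without those flags.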

Event stream dimension (low-level escape hatch)

t.event("input.requested", { count: 1 });
t.notEvent("error");
t.eventOrder(["action.called", "subagent.called"]);
t.eventsSatisfy("read before write", (events) => /* custom predicate */ true);

All scoped assertions above are syntactic sugar for these low-level event stream queries. When none of the higher-level assertions fit your use case, you can drop down to eventsSatisfy and write an arbitrary predicate over the raw StreamEvent[].
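As an illustration of what an eventsSatisfy predicate might look like, the sketch below implements the "read before write" check and a generic relative-order check over a simplified event shape. The `SketchEvent` type (a `type` plus an optional `tool` name) is an assumption made for the example; the real StreamEvent type is defined by niceeval and is not shown in this page.

```typescript
// Simplified, hypothetical event shape for illustration.
interface SketchEvent {
  type: string; // e.g. "action.called", "input.requested"
  tool?: string; // tool name for action events (assumed field)
}

// True when `expected` appears within `actual` in the same relative order,
// allowing other items in between.
function inRelativeOrder(actual: string[], expected: string[]): boolean {
  let next = 0;
  for (const item of actual) {
    if (next < expected.length && item === expected[next]) next++;
  }
  return next === expected.length;
}

// True when no write_file call happens before the first read_file call.
function readBeforeWrite(events: SketchEvent[]): boolean {
  const tools = events
    .filter((e) => e.type === "action.called")
    .map((e) => e.tool ?? "");
  const firstWrite = tools.indexOf("write_file");
  const firstRead = tools.indexOf("read_file");
  return firstWrite === -1 || (firstRead !== -1 && firstRead < firstWrite);
}

const sampleEvents: SketchEvent[] = [
  { type: "action.called", tool: "read_file" },
  { type: "action.called", tool: "bash" },
  { type: "action.called", tool: "write_file" },
];
console.log(readBeforeWrite(sampleEvents)); // prints true
```

A predicate of this shape could be passed as the second argument to eventsSatisfy, with the label describing what it checks.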

Structured output (on `turn`, not `t`)

const turn = await t.send("Return the result as JSON");
turn.outputEquals({ status: "ok" });                         // deep equality on turn.data
turn.outputMatches(z.object({ status: z.string() }));        // Standard Schema validation

Workspace dimension (sandbox agents only)

t.fileChanged("src/Button.tsx");
t.fileDeleted("src/old.ts");
t.sandbox.diff.isEmpty();                      // no repository files were modified this turn
t.notInDiff(/sk-[A-Za-z0-9]/);                 // diff contains no secret-like strings (e.g. API keys)
t.check(
  await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }),
  commandSucceeded(),
);                                             // npm test exited 0
t.check(
  await t.sandbox.runCommand("npm", ["run", "build"], { cwd: "/workspace" }),
  commandSucceeded(),
);                                             // npm run build exited 0
t.noFailedShellCommands();

t.sandbox.diff is a queryable object. t.sandbox.diff.get("src/Button.tsx") returns the file’s post-change content; t.sandbox.diff.isEmpty() checks whether any files changed; t.sandbox.diff.matches(re) and t.notInDiff(re) run a regex over the full diff text.

3. LLM-as-judge

Use judge assertions when correctness cannot be expressed as a rule — for open-ended prose, tone, factual consistency, or summarization quality. The evaluator model is entirely separate from the agent under test, preventing self-evaluation bias.

t.judge.autoevals.factuality(reference).atLeast(0.8);   // factual consistency with a reference text
t.judge.autoevals.closedQA("Is the response appropriate for a 10-year-old?");
t.judge.autoevals.summarizes(sourceDocument);            // faithful summarization check
t.judge.autoevals.closedQA("Custom scoring rubric description", { on: t.reply });

The { on } option specifies which value to evaluate (defaults to t.reply). The { model } option lets you override the judge model for a single call.

Judge model resolution

The judge model is resolved from most- to least-specific:

Per-call { model } option
  ↓
Per-eval judge.model
  ↓
Global config judge.model

// niceeval.config.ts — global default
defineConfig({ judge: { model: "anthropic/claude-haiku-4-5" } });

// A specific eval that needs a more capable judge
defineEval({
  judge: { model: "anthropic/claude-opus-4-8" },
  async test(t) {
    t.judge.autoevals.factuality(reference);  // uses claude-opus-4-8 for this eval
  },
});

// A single call override
t.judge.autoevals.closedQA("rubric", { on: t.reply, model: "openai/gpt-4o" });
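The resolution chain amounts to a first-defined-wins lookup. The sketch below models it with nullish coalescing; `resolveJudgeModel` and `JudgeOptions` are hypothetical names, and throwing when no level sets a model is an assumption, since the behavior in that case is not described here.

```typescript
// Hypothetical option shape shared by all three levels.
interface JudgeOptions {
  model?: string;
}

function resolveJudgeModel(
  perCall: JudgeOptions | undefined,
  perEval: JudgeOptions | undefined,
  globalConfig: JudgeOptions | undefined,
): string {
  // Most specific first: call, then eval, then global config.
  const model = perCall?.model ?? perEval?.model ?? globalConfig?.model;
  if (model === undefined) {
    // Assumed behavior for the sketch; the real fallback may differ.
    throw new Error("no judge model configured at any level");
  }
  return model;
}

const globalDefault = { model: "anthropic/claude-haiku-4-5" };
console.log(resolveJudgeModel(undefined, undefined, globalDefault));
// prints "anthropic/claude-haiku-4-5"
console.log(resolveJudgeModel({ model: "openai/gpt-4o" }, undefined, globalDefault));
// prints "openai/gpt-4o"
```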

4. Test-as-scoring (sandbox evals)

For sandbox coding evals, the EVAL.ts file is a Vitest test suite that runs inside the sandbox after the agent completes its task. Every test() block in EVAL.ts becomes a gate assertion — pass means gate passes, fail means gate fails.

// evals/fixtures/button/EVAL.ts — not visible to the agent during its run
import { test, expect } from "vitest";
import { existsSync, readFileSync } from "node:fs";

test("Button component exists", () => {
  expect(existsSync("src/components/Button.tsx")).toBe(true);
});

test("accepts label and onClick props", () => {
  const src = readFileSync("src/components/Button.tsx", "utf-8");
  expect(src).toContain("label");
  expect(src).toContain("onClick");
});

You can also assert agent behavior (not just output files) by reading the observability summary that niceeval injects into the sandbox:

test("initialized with scaffold, not hand-written", () => {
  const o11y = JSON.parse(readFileSync("__niceeval__/results.json", "utf-8")).o11y;
  const cmds = o11y.shellCommands.map((c: { command: string }) => c.command);
  expect(cmds.some((c) => c.includes("create-next-app"))).toBe(true);
});

The validation field in the eval controls what gets run: "vitest" runs EVAL.ts plus any configured npm scripts; "none" runs only npm scripts.

5. Efficiency / cost assertions

Token usage is a first-class scoring dimension. An agent that answers correctly but burns far more tokens than expected should not score identically to one that answers efficiently.

t.maxTokens(50_000);            // hard token limit for the entire run (gate by default)
t.maxCost(0.5);                 // estimated cost cap in USD (requires a price table in config)
t.maxTokens(80_000).atLeast(0.7);     // soft variant — tracked but only red under --strict
t.check(t.usage.outputTokens, satisfies((n) => n < 10_000, "not verbose"));

t.usage is available anywhere inside test(t) and exposes { inputTokens, outputTokens, cacheReadTokens?, … }. For sandbox agents, token counts are extracted from the transcript by the adapter; for remote agents, they are returned in Turn.usage.
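To show how a price table can turn token usage into an estimated cost, here is a small arithmetic sketch. The `PricePerMillion` shape and the prices used are placeholders invented for the example, not niceeval's config format or any provider's real rates; it also assumes cache-read tokens are billed separately from input tokens.

```typescript
// Mirrors the usage fields named in the text.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Hypothetical price table entry: USD per one million tokens.
interface PricePerMillion {
  input: number;
  output: number;
  cacheRead?: number;
}

function estimateCostUsd(usage: Usage, price: PricePerMillion): number {
  const perToken = (usdPerMillion: number): number => usdPerMillion / 1_000_000;
  return (
    usage.inputTokens * perToken(price.input) +
    usage.outputTokens * perToken(price.output) +
    (usage.cacheReadTokens ?? 0) * perToken(price.cacheRead ?? 0)
  );
}

// Placeholder prices, chosen only to make the arithmetic easy to follow.
const placeholderPrice: PricePerMillion = { input: 1, output: 5 };
const usage: Usage = { inputTokens: 200_000, outputTokens: 20_000 };

// 200,000 input tokens at $1/M is $0.20; 20,000 output tokens at $5/M is $0.10.
const cost = estimateCostUsd(usage, placeholderPrice);
console.log(cost.toFixed(2)); // prints "0.30"
console.log(cost <= 0.5); // prints true: within a 0.5 USD cap
```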

Custom scorers

A value assertion is just a function (value) => number | Promise<number>. You can write custom matchers using makeAssertion:

import { makeAssertion } from "niceeval/expect";
import type { Assertion } from "niceeval/expect";

function jsonValid(): Assertion {
  return makeAssertion({
    name: "jsonValid",
    severity: "gate",
    score: (value) => {
      try { JSON.parse(String(value)); return 1; }
      catch { return 0; }
    },
  });
}

t.check(t.reply, jsonValid());

Custom matchers compose with the same chain methods as built-ins: .gate() and .atLeast(threshold), for example .atLeast(0.7).
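To illustrate how chain methods can layer severity and threshold on top of a plain scoring function, here is a self-contained sketch. It does not use or reproduce makeAssertion: `chainable`, `ChainableAssertion`, and `evaluate` are hypothetical names, and the assumptions that .atLeast() switches an assertion to soft and that .gate() keeps the current threshold are inferred from the severity examples earlier on this page.

```typescript
type Severity = "gate" | "soft";
type ScoreFn = (value: unknown) => number | Promise<number>;

interface ChainableAssertion {
  readonly name: string;
  readonly severity: Severity;
  readonly threshold: number;
  readonly score: ScoreFn;
  gate(): ChainableAssertion;
  atLeast(threshold: number): ChainableAssertion;
}

// Each chain call returns a new immutable assertion instead of mutating.
function chainable(
  name: string,
  score: ScoreFn,
  severity: Severity,
  threshold = 1,
): ChainableAssertion {
  return {
    name,
    score,
    severity,
    threshold,
    gate: () => chainable(name, score, "gate", threshold),
    atLeast: (min) => chainable(name, score, "soft", min),
  };
}

// Awaiting the score handles both synchronous and asynchronous scorers.
async function evaluate(assertion: ChainableAssertion, value: unknown): Promise<boolean> {
  const score = await assertion.score(value);
  return score >= assertion.threshold;
}

const jsonValidSketch = chainable(
  "jsonValid",
  (value) => {
    try {
      JSON.parse(String(value));
      return 1;
    } catch {
      return 0;
    }
  },
  "gate",
);

const relaxed = jsonValidSketch.atLeast(0.7); // soft, threshold 0.7
const promoted = relaxed.gate(); // gate again, threshold still 0.7
```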

Related pages

Evals: how assertions fold into the eval lifecycle and outcome types.
Agents & Adapters: how the standard event stream is produced, which scoped assertions depend on.
Overview: the full architecture and where scoring fits.
