Assertion type, and all five feed into the same outcome rules.
The five scoring mechanisms
1. Value assertions
t.check(value, matcher) and t.require(value, matcher) evaluate a specific value immediately against a matcher from niceeval/expect. Use these for facts you can verify inline.
2. Scoped assertions
t.succeeded(), t.calledTool(), t.messageIncludes(), and friends are registered during test(t) but evaluated after the function returns, against the complete turn data. Use these for whole-run facts.
3. LLM-as-judge
t.judge.autoevals.factuality(), t.judge.autoevals.closedQA(), and t.judge.autoevals.summarizes() ask a separate evaluator model to score open-ended output. The judge model is fully independent from the agent under test.
4. Test-as-scoring
For sandbox evals, EVAL.ts is a Vitest test file that runs inside the sandbox after the agent finishes. Every test() in EVAL.ts becomes a gate assertion. Use this for coding tasks where file content and build results are the ground truth.
5. Efficiency assertions
t.maxTokens() and t.maxCost() turn token usage and estimated cost into scoreable dimensions. An agent that answers correctly but burns ten times the expected tokens should not score the same as one that answers efficiently.
Gate vs soft severity
Every assertion carries a severity that determines how it influences the final outcome. There are exactly two severities:
- gate
- soft
A gate assertion is a hard requirement. If it fails, the entire eval is immediately classified as failed, regardless of how well every other assertion passed. Use gate for facts that must be true: “the agent called the correct tool”, “the response parsed as valid JSON”, “no shell commands errored.”
Most matchers in niceeval/expect (includes, equals, matches, satisfies) default to gate. Scoped assertions like t.succeeded() and t.calledTool() also default to gate.
Outcome rules
Once all assertions are collected, outcome.ts folds them into a single outcome in this order:
passed
No errors, all gate assertions passed, all soft assertions met their thresholds.
failed
Execution error or at least one gate assertion did not pass. Hard failure.
passed
All gates passed, but at least one soft fell below its threshold. Quality regression: silent by default, red under --strict.
skipped
t.skip("reason") was called. Excluded from pass-rate calculations entirely.
When an eval runs more than once (runs > 1), the per-eval summary becomes a pass rate (the fraction of runs that produced passed) and an average latency, rather than a single outcome.
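The fold described above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of outcome.ts: the `AssertionResult` and `RunRecord` shapes are hypothetical, `degraded` is a placeholder name for the soft-regression outcome, and checking skip before error is a guess about precedence.

```typescript
// Hypothetical shapes for illustration; not niceeval's actual types.
type Severity = "gate" | "soft";

interface AssertionResult {
  label: string;
  severity: Severity;
  score: number;     // 0..1
  threshold: number; // minimum score that counts as met (assumed 1 for gates)
}

// "degraded" is a placeholder label for "all gates passed, a soft fell short".
type Outcome = "passed" | "failed" | "degraded" | "skipped";

interface RunRecord {
  results: AssertionResult[];
  errored: boolean;
  skipped: boolean;
  latencyMs: number;
}

function foldOutcome(run: RunRecord): Outcome {
  if (run.skipped) return "skipped"; // assumed to take precedence
  if (run.errored) return "failed";
  const met = (r: AssertionResult) => r.score >= r.threshold;
  if (run.results.some((r) => r.severity === "gate" && !met(r))) return "failed";
  if (run.results.some((r) => r.severity === "soft" && !met(r))) return "degraded";
  return "passed";
}

// Multi-run summary: pass rate and average latency, ignoring skipped runs.
function summarizeRuns(runs: RunRecord[]): { passRate: number; avgLatencyMs: number } {
  const counted = runs.filter((r) => foldOutcome(r) !== "skipped");
  if (counted.length === 0) return { passRate: 0, avgLatencyMs: 0 };
  const passed = counted.filter((r) => foldOutcome(r) === "passed").length;
  const totalLatency = counted.reduce((sum, r) => sum + r.latencyMs, 0);
  return { passRate: passed / counted.length, avgLatencyMs: totalLatency / counted.length };
}
```

Note that a single failing gate decides the outcome no matter how many soft assertions scored well, which is the property the severity split exists to provide.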
1. Value assertions — niceeval/expect matchers
t.check(value, assertion) evaluates the assertion immediately and records the result. t.require(value, assertion) does the same but throws immediately if the assertion fails, aborting the rest of the test function. Use t.require for preconditions: if a required fact is false, there is no point continuing.
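The difference between the two calls can be sketched with a small stand-in context. `MiniContext` and `isValidJson` are hypothetical names invented for this example, and treating any score below 1 as a failed requirement is an assumption; the real t object records considerably more.

```typescript
// A matcher maps a value to a score; here 1 means pass and 0 means fail.
type Matcher<T> = (value: T) => number;

// Hypothetical stand-in for the test context, for illustration only.
class MiniContext {
  readonly recorded: { label: string; score: number }[] = [];

  // check: evaluate immediately, record the result, keep going.
  check<T>(value: T, matcher: Matcher<T>, label = "check"): number {
    const score = matcher(value);
    this.recorded.push({ label, score });
    return score;
  }

  // require: same as check, but throw on failure to abort the test function.
  require<T>(value: T, matcher: Matcher<T>, label = "require"): number {
    const score = this.check(value, matcher, label);
    if (score < 1) throw new Error(`required assertion failed: ${label}`);
    return score;
  }
}

const isValidJson: Matcher<string> = (text) => {
  try {
    JSON.parse(text);
    return 1;
  } catch {
    return 0;
  }
};

const ctx = new MiniContext();
let aborted = false;
try {
  ctx.require("not json", isValidJson, "reply parses as JSON");
  // Never reached: the failed precondition aborted the rest of the function.
  ctx.check("anything", () => 1, "later check");
} catch {
  aborted = true;
}
```

After this runs, only the failed precondition has been recorded, which is the behaviour that makes require suitable for preconditions.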
The matchers available from niceeval/expect:
Every matcher is a plain function (value) => number, so you can write your own and pass it to t.check without any special registration.
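As a minimal sketch of a hand-written matcher, assuming scores run from 0 to 1 with 1 meaning a full pass, the function below scores a reply by word count. The name `wordCountAtMost` and the partial-credit formula are illustrative choices, not part of niceeval/expect.

```typescript
// The matcher contract described above: a value goes in, a score comes out.
type Matcher<T> = (value: T) => number;

// Hypothetical custom matcher: full score within the limit, then a score
// that shrinks in proportion to how far the text overshoots.
const wordCountAtMost = (limit: number): Matcher<string> => (text) => {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return words <= limit ? 1 : limit / words;
};

const withinLimit = wordCountAtMost(3)("one two three"); // 3 words, limit 3
const overLimit = wordCountAtMost(2)("a b c d");         // 4 words, limit 2
```

Under the stated contract, a function like this could be handed to t.check in place of a built-in matcher, for example t.check(reply, wordCountAtMost(50)), where `reply` stands for whatever string is being evaluated.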
2. Scoped assertions
Scoped assertions are registered during test(t) but evaluated after the function returns, against the complete accumulated turn data. They read from the standard event stream and its derived facts, so as long as your adapter produces correct events, these assertions work identically for every agent.
Run / session dimension
Tool / action dimension
The input argument to calledTool and notCalledTool supports a small matching language: a plain object performs deep partial matching, a RegExp matches against the serialized input, and a predicate function receives the raw input value.
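The three matching modes can be sketched as one dispatch function. This is an illustration of the idea rather than niceeval's implementation: `matchesInput` and `deepPartialMatch` are hypothetical helpers, JSON.stringify is assumed as the serialization, and arrays are assumed to match element by element at equal length.

```typescript
type InputMatcher = Record<string, unknown> | RegExp | ((input: unknown) => boolean);

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v) && !(v instanceof RegExp);
}

// Deep partial match: every key in `expected` must match in `actual`;
// extra keys in `actual` are ignored.
function deepPartialMatch(expected: unknown, actual: unknown): boolean {
  if (isPlainObject(expected)) {
    if (!isPlainObject(actual)) return false;
    return Object.entries(expected).every(([key, want]) => deepPartialMatch(want, actual[key]));
  }
  if (Array.isArray(expected)) {
    return (
      Array.isArray(actual) &&
      expected.length === actual.length &&
      expected.every((want, i) => deepPartialMatch(want, actual[i]))
    );
  }
  return Object.is(expected, actual);
}

function matchesInput(matcher: InputMatcher, input: unknown): boolean {
  // RegExp: test against the serialized input (assumed to be JSON here).
  if (matcher instanceof RegExp) return matcher.test(JSON.stringify(input));
  // Predicate: receives the raw input value.
  if (typeof matcher === "function") return matcher(input);
  // Plain object: deep partial matching.
  return deepPartialMatch(matcher, input);
}

const toolInput = { path: "src/app.ts", options: { mode: "read", retries: 2 } };
const byObject = matchesInput({ options: { mode: "read" } }, toolInput);
const byRegExp = matchesInput(/"path":"src\/.*\.ts"/, toolInput);
const byPredicate = matchesInput(
  (raw) => isPlainObject(raw) && typeof raw.path === "string",
  toolInput,
);
```

The plain-object form is usually the most readable choice; the RegExp and predicate forms cover inputs whose shape varies between runs.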
Event stream dimension (low-level escape hatch)
For anything the built-in assertions do not cover, use eventsSatisfy and write an arbitrary predicate over the raw StreamEvent[].
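A predicate of this kind might look like the sketch below. The `SketchEvent` shape is hypothetical and stands in for the real StreamEvent type, whose fields are not reproduced on this page; only the idea of a boolean function over the whole event array is taken from the text above.

```typescript
// Hypothetical event shape for illustration; the real StreamEvent differs.
type SketchEvent =
  | { kind: "tool_call"; callId: string; tool: string }
  | { kind: "tool_result"; callId: string; ok: boolean }
  | { kind: "message"; text: string };

// True when every tool call in the stream eventually received a result.
function everyToolCallResolved(events: SketchEvent[]): boolean {
  const pending = new Set<string>();
  for (const ev of events) {
    if (ev.kind === "tool_call") pending.add(ev.callId);
    if (ev.kind === "tool_result") pending.delete(ev.callId);
  }
  return pending.size === 0;
}

const completeStream: SketchEvent[] = [
  { kind: "tool_call", callId: "c1", tool: "search" },
  { kind: "tool_result", callId: "c1", ok: true },
  { kind: "message", text: "Done." },
];
const danglingStream: SketchEvent[] = [
  { kind: "tool_call", callId: "c1", tool: "search" },
  { kind: "message", text: "Done." },
];
```

Because a predicate like this depends on event details, it is more brittle than the higher-level scoped assertions and is best kept for checks those cannot express.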
Structured output (on turn, not t)
Workspace dimension (sandbox agents only)
t.sandbox.diff is a queryable object. t.sandbox.diff.get("src/Button.tsx") returns the file’s post-change content; t.sandbox.diff.isEmpty() checks whether any files changed; t.sandbox.diff.matches(re) and t.notInDiff(re) run a regex over the full diff text.
3. LLM-as-judge
Use judge assertions when correctness cannot be expressed as a rule, for example for open-ended prose, tone, factual consistency, or summarization quality. The evaluator model is entirely separate from the agent under test, preventing self-evaluation bias.
The { on } option specifies which value to evaluate (defaults to t.reply). The { model } option lets you override the judge model for a single call.
Judge model resolution
The judge model is resolved from most- to least-specific:
4. Test-as-scoring (sandbox evals)
For sandbox coding evals, the EVAL.ts file is a Vitest test suite that runs inside the sandbox after the agent completes its task. Every test() block in EVAL.ts becomes a gate assertion: a passing test passes the gate, and a failing test fails it.
The validation field in the eval controls what gets run: "vitest" runs EVAL.ts plus any configured npm scripts; "none" runs only npm scripts.
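The effect of the two values can be sketched as a function that builds the list of validation steps. `planValidation`, the step strings, and the ordering of EVAL.ts before the npm scripts are assumptions made for illustration; only the two validation values and what each one includes come from the text above.

```typescript
type Validation = "vitest" | "none";

// Hypothetical planner: returns the commands a runner might execute.
function planValidation(validation: Validation, npmScripts: string[]): string[] {
  const scriptSteps = npmScripts.map((script) => `npm run ${script}`);
  // "vitest" adds the EVAL.ts suite; "none" leaves only the npm scripts.
  return validation === "vitest" ? ["vitest run EVAL.ts", ...scriptSteps] : scriptSteps;
}

const withVitest = planValidation("vitest", ["build", "lint"]);
const scriptsOnly = planValidation("none", ["build", "lint"]);
```

With "none", an eval that configures no npm scripts would run no validation steps at all in this sketch.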
5. Efficiency / cost assertions
Token usage is a first-class scoring dimension. An agent that answers correctly but burns far more tokens than expected should not score identically to one that answers efficiently.
t.usage is available anywhere inside test(t) and exposes { inputTokens, outputTokens, cacheReadTokens?, … }. For sandbox agents, token counts are extracted from the transcript by the adapter; for remote agents, they are returned in Turn.usage.
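To make the idea concrete, here is a minimal sketch of how a usage record could be turned into a budget check and a cost estimate. `UsageSketch` mirrors only the fields named above; `withinTokenBudget`, `estimateCostUsd`, the per-million-token prices, and the choice to count only input and output tokens are all assumptions, not the behaviour of t.maxTokens() or t.maxCost().

```typescript
// Subset of the usage fields named above; the real object may carry more.
interface UsageSketch {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Hypothetical prices in USD per million tokens.
interface PriceSketch {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Assumption: the budget counts input plus output tokens only.
function totalTokens(usage: UsageSketch): number {
  return usage.inputTokens + usage.outputTokens;
}

function withinTokenBudget(usage: UsageSketch, maxTokens: number): boolean {
  return totalTokens(usage) <= maxTokens;
}

function estimateCostUsd(usage: UsageSketch, price: PriceSketch): number {
  return (
    (usage.inputTokens * price.inputPerMillion) / 1_000_000 +
    (usage.outputTokens * price.outputPerMillion) / 1_000_000
  );
}

const sampleUsage: UsageSketch = { inputTokens: 1000, outputTokens: 500 };
const samplePrice: PriceSketch = { inputPerMillion: 3, outputPerMillion: 15 };
const fitsBudget = withinTokenBudget(sampleUsage, 2000);
const sampleCost = estimateCostUsd(sampleUsage, samplePrice);
```

Keeping the budget as an explicit number makes the efficiency expectation reviewable alongside the correctness assertions.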
Custom scorers
A value assertion is just a function (value) => number | Promise<number>. You can write custom matchers using makeAssertion:
.gate(), .atLeast(0.7), .atLeast(threshold).
Related pages
- Evals — how assertions fold into the eval lifecycle and outcome types.
- Agents & Adapters — how the standard event stream is produced, which scoped assertions depend on.
- Overview — the full architecture and where scoring fits.