defineEval is the primary building block for writing evals in niceeval. You call it once per file, pass a configuration object describing what to test, and export the result as the default export. The runner discovers your file, derives its ID from the file path, and executes the test function under the selected experiment’s agent.
You must not provide an id or name field. niceeval derives the eval ID from the file path: evals/weather/brooklyn.eval.ts becomes weather/brooklyn.
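The path-to-ID mapping can be sketched as a small pure function. This is only an illustration of the rule stated above, not niceeval's implementation: the helper name deriveEvalId, the "evals/" root, and the ".eval.ts" suffix handling are assumptions taken from the single example path.

```typescript
// Illustrative helper (hypothetical name), not part of niceeval's API.
// Assumes eval files live under an "evals/" root and end in ".eval.ts".
function deriveEvalId(filePath: string, root: string = "evals/"): string {
  // Normalize Windows separators so the ID is the same on every platform.
  const normalized = filePath.replace(/\\/g, "/");
  // Drop the root directory if the path starts with it.
  const relative = normalized.startsWith(root)
    ? normalized.slice(root.length)
    : normalized;
  // Drop the eval file suffix; what remains is the ID.
  return relative.replace(/\.eval\.ts$/, "");
}

console.log(deriveEvalId("evals/weather/brooklyn.eval.ts")); // weather/brooklyn
```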
Rename the file to rename the ID; it never goes stale.
defineEval options
A human-readable label for this eval. Appears in console output and reports.
Does not affect the ID — that comes from the file path.
An array of tag strings used to filter evals via --tag on the CLI. Tags let you group related evals (e.g. "billing", "regression", "slow") without changing the directory structure.
Overrides the judge model for this specific eval. Takes precedence over the global judge.model set in defineConfig. The model field accepts any model string understood by your judge backend.
A list of reporters applied only to this eval, in addition to (or instead of) the globally configured reporters. Useful when a specific eval needs specialized output formats.
Per-eval timeout in milliseconds. Overrides the global timeoutMs in defineConfig. When the timeout elapses, the eval is marked failed with error: timeout and the runner moves on.
Arbitrary key-value pairs attached to this eval’s result record. Useful for downstream analysis, custom reporters, or dashboard annotations.
The async function that drives the agent and asserts results. Receives the test context t (see below). All assertions, sends, and judge calls live here.
The test context: t
The t object is assembled by the runner based on the capabilities declared by the agent adapter. You get a different set of methods depending on what the agent supports. At the TypeScript level, methods that require a capability your agent hasn’t declared are simply not present on t — so misconfiguration shows up at compile time, not at runtime.
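One way to get this compile-time behavior is with conditional types keyed on the declared capabilities. The sketch below is illustrative only: Capabilities, ContextFor, and buildContext are hypothetical names, and the member signatures are simplified stand-ins rather than niceeval's actual types.

```typescript
// Hypothetical, simplified shapes; the real context has many more members.
type Capabilities = { conversation?: boolean; toolObservability?: boolean };

interface BaseContext {
  send(text: string): string;
}
interface ConversationContext {
  reply: string;
}
interface ToolContext {
  calledTool(name: string): void;
}

// Each capability contributes its members only when declared as `true`.
type ContextFor<C extends Capabilities> = BaseContext &
  (C extends { conversation: true } ? ConversationContext : unknown) &
  (C extends { toolObservability: true } ? ToolContext : unknown);

// Runtime assembly mirrors the type: members are attached per capability.
function buildContext<C extends Capabilities>(caps: C): ContextFor<C> {
  const ctx: Record<string, unknown> = {
    send: (text: string) => `echo: ${text}`,
  };
  if (caps.conversation) ctx.reply = "";
  if (caps.toolObservability) ctx.calledTool = (_name: string) => {};
  return ctx as unknown as ContextFor<C>;
}

const plain = buildContext({} as const);
const observable = buildContext({ toolObservability: true } as const);

observable.calledTool("get_weather"); // compiles: capability declared
// plain.calledTool("get_weather");   // would not compile: no such member
```

Because the missing member is absent from the type rather than present and throwing, a misconfigured eval fails type-checking before any agent run starts.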
Always available
These methods are available regardless of which agent or capabilities are configured.
Sends a message to the agent and waits for the response. Returns a Turn object (see below). Each call to t.send counts as one turn; call it multiple times for multi-turn conversations (requires conversation capability).
Evaluates assertion against value immediately and records the result. On failure, the eval is marked according to the assertion’s severity (gate → failed, soft → passed). Does not throw; execution continues.
Like t.check, but throws immediately if the assertion fails, aborting the rest of the test. Use this for preconditions where continuing would be meaningless.
Writes a diagnostic message to the eval’s output log. Appears in .niceeval/ artifacts and in verbose console output. Useful for debugging flaky evals.
Marks this eval as skipped with the given reason and stops execution. Use when a prerequisite isn’t available (e.g. a required environment variable is missing).
The feature flags passed down from the experiment configuration. Read these inside your test to branch on experiment variables.
The model tier string that was passed to the agent for this run. Read-only. Useful for logging or conditional assertions.
The AbortSignal for this eval’s lifetime. Forwarded from the runner’s timeout and early-exit logic. Pass it into any custom async work you do inside test.
With conversation capability
Available when the agent declares capabilities: { conversation: true }.
The text of the most recent assistant message across all turns. Equivalent to reading the last message event with role: "assistant". Shorthand for the common pattern of checking the final response.
Signals the runner to start a fresh conversation session for subsequent t.send calls. The current session’s history is discarded. Useful when you need to test multiple independent conversation threads within a single eval.
With toolObservability capability
Available when the agent declares capabilities: { toolObservability: true }. All of these are scope-level assertions — they are evaluated after the test function returns, reading from the accumulated standard event stream.
Asserts the agent called the named tool. Optional opts narrow the match:
input: partial/deep match against call arguments (literal, regex against serialized form, or predicate)
count: exact number of times the tool was called
status: filter by call outcome ("completed" | "failed")
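The three input matcher forms can be sketched as follows. This is a rough model of the described behavior, not niceeval's matcher: the names InputMatcher, partialMatch, and matchesInput are hypothetical, and the sketch assumes "serialized form" means JSON.stringify output and that arrays must match element by element.

```typescript
// Hypothetical matcher shape modeled on the three forms listed above.
type InputMatcher =
  | RegExp
  | ((input: unknown) => boolean)
  | { [key: string]: unknown };

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v) && !(v instanceof RegExp);
}

// Partial deep match: every key in `expected` must match in `actual`;
// extra keys in `actual` are ignored.
function partialMatch(actual: unknown, expected: unknown): boolean {
  if (isPlainObject(expected)) {
    if (!isPlainObject(actual)) return false;
    const a = actual;
    const e = expected;
    return Object.keys(e).every((k) => partialMatch(a[k], e[k]));
  }
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual) || actual.length !== expected.length) return false;
    const a = actual;
    return expected.every((item, i) => partialMatch(a[i], item));
  }
  return Object.is(actual, expected);
}

function matchesInput(input: unknown, matcher: InputMatcher): boolean {
  // Regex form: test against the serialized arguments (assumed to be JSON).
  if (matcher instanceof RegExp) return matcher.test(JSON.stringify(input));
  // Predicate form: the caller decides.
  if (typeof matcher === "function") return matcher(input);
  // Literal form: partial deep match.
  return partialMatch(input, matcher);
}

const callInput = { city: "Brooklyn", units: { temp: "F" }, days: 3 };
console.log(matchesInput(callInput, { city: "Brooklyn" }));        // true
console.log(matchesInput(callInput, { units: { temp: "C" } }));    // false
console.log(matchesInput(callInput, /"city":"Brooklyn"/));         // true
console.log(matchesInput(callInput, (i) => isPlainObject(i) && i.days === 3)); // true
```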
Asserts the agent did not call the named tool (with the given input, if provided). Accepts the same opts as t.calledTool.
Asserts the listed tools were called in the given relative order. Other tools may appear between them.
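"Relative order with other tools in between" amounts to a subsequence check over the recorded tool names. A minimal sketch, assuming the calls are available as a flat list of names (calledInOrder is a hypothetical helper, not the library's code):

```typescript
// Returns true when `expected` appears within `calls` as a subsequence:
// same relative order, with any number of other calls in between.
function calledInOrder(calls: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool we are waiting for
  for (const name of calls) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

console.log(calledInOrder(["search", "fetch", "summarize"], ["search", "summarize"])); // true
console.log(calledInOrder(["summarize", "search"], ["search", "summarize"]));          // false
```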
Asserts the agent made zero tool calls during this run. Useful for verifying
lightweight responses that should not invoke any external actions.
Asserts the total number of tool calls was at most n.
Syntactic sugar for t.calledTool("load_skill", { input: { skill: skillName } }). Asserts the agent loaded the named skill.
Asserts the agent delegated to a sub-agent with the given name. opts may include remoteUrl (string or RegExp) and output matchers.
Asserts that none of the tool calls, sub-agent calls, or skill loads ended with status: "failed".
Asserts a specific event type appears in the raw event stream. opts may include count and data matchers.
Asserts that the given event type does not appear anywhere in the event stream.
Asserts event types appear in the given relative order in the stream.
t.eventsSatisfy(label, predicate)
(label: string, predicate: (events: StreamEvent[]) => boolean) => void
Escape hatch for custom event-stream assertions. Receives the full raw event
array and must return a boolean.
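As an illustration of the kind of predicate this accepts, the sketch below checks that a human-input request, if present, is the final event. The StreamEvent shape here is a reduced stand-in with only a type field; the real union is defined in the defineAgent reference, and the predicate name is hypothetical.

```typescript
// Reduced stand-in for the StreamEvent union; only `type` is modeled.
interface StreamEvent {
  type: string;
}

// True when no "input.requested" event exists, or when the first one
// is also the last event in the stream.
const inputRequestedOnlyAtEnd = (events: StreamEvent[]): boolean => {
  const idx = events.findIndex((e) => e.type === "input.requested");
  return idx === -1 || idx === events.length - 1;
};

// Inside a test function this would be passed as:
// t.eventsSatisfy("input requested only at end", inputRequestedOnlyAtEnd);
```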
With workspace (sandbox) capability
Available for sandbox agents (those using defineSandboxAgent).
Asserts the agent modified the file at the given workspace-relative path. Derived from git diff HEAD after the agent run.
Asserts the agent deleted the file at the given path.
A queryable view of all changes the agent made to the workspace:
t.sandbox.diff.get(path): returns the post-run content of the file at path
t.sandbox.diff.isEmpty(): asserts no files were changed
t.sandbox.diff.matches(re): asserts the full diff text matches a regex
t.notInDiff(re): asserts the diff does not match the regex (useful for detecting leaked secrets or banned patterns)
Use this matcher with t.check(await t.sandbox.runCommand(...), commandSucceeded()) to assert that a verification command exited with code 0.
Asserts a specific npm script (e.g. "build", "lint") exited with code 0.
Judge assertions
Available on any eval that has a judge model configured (globally or per-eval).
Uses the judge model to score factual consistency between the agent’s reply and the expected reference text. Returns a JudgeAssertion with .atLeast(threshold) for setting a soft threshold.
Asks the judge model a yes/no question about the reply. Returns a score between 0 and 1. Use .atLeast(threshold) to set a minimum passing score.
Asks the judge whether the agent’s reply faithfully summarizes the source text.
Free-form scoring against a custom rubric string. Useful when none of the built-in judge methods fit your evaluation criteria.
opts accepted by all judge methods:
on: the value to evaluate (defaults to t.reply)
model: overrides the judge model for this single call
Efficiency assertions
Asserts the total token usage (input + output) for this run did not exceed n. Defaults to gate severity. Chain .atLeast(0.7) to downgrade.
Asserts the estimated cost of this run (based on a price table) did not exceed usd dollars.
The Turn return type
await t.send(text) returns a Turn object. It is immutable and carries everything the agent produced for that one round-trip.
The raw standard event stream produced by the agent for this turn. All scope-level assertions read from this. See defineAgent reference for the full StreamEvent union type.
Structured output returned by the agent (e.g. a parsed JSON object). Used by turn.outputEquals() and turn.outputMatches().
The outcome of this turn. "waiting" means the agent stopped at a human-in-the-loop (input.requested) prompt.
Convenience field: the concatenated text of all message events with role: "assistant" in this turn. Derived from events.
Convenience field: the list of tool calls made during this turn, each with name, input, output, and status. Derived from events.
Token usage for this specific turn (if reported by the agent).
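To make "derived from events" concrete, here is a minimal sketch of how the reply text could be computed from a turn's event list. The event shape is a hypothetical reduction (in particular, the text field name and the empty-string join are assumptions); consult the StreamEvent union for the real fields.

```typescript
// Hypothetical, reduced event shape; the actual StreamEvent union is richer.
type StreamEvent =
  | { type: "message"; role: "assistant" | "user"; text: string }
  | { type: "input.requested" };

// Concatenates the text of every assistant message, in stream order.
function replyFrom(events: StreamEvent[]): string {
  let reply = "";
  for (const e of events) {
    if (e.type === "message" && e.role === "assistant") reply += e.text;
  }
  return reply;
}

const turnEvents: StreamEvent[] = [
  { type: "message", role: "user", text: "Weather in Brooklyn?" },
  { type: "message", role: "assistant", text: "Sunny, " },
  { type: "message", role: "assistant", text: "72F." },
];
console.log(replyFrom(turnEvents)); // Sunny, 72F.
```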
Turn methods
Asserts that turn.data deeply equals expected. Equivalent to t.check(turn.data, equals(expected)) but scoped to the turn object.
Validates turn.data against a Standard Schema (e.g. a Zod schema).
defineAgentEval
defineAgentEval is a convenience wrapper for coding-agent evals that you want to define programmatically rather than via a fixture directory. It is equivalent to a fixture (PROMPT.md + EVAL.ts) but expressed entirely in TypeScript.
Human-readable description. Appears in reports.
The task prompt sent to the coding agent. Equivalent to the contents of PROMPT.md in a fixture directory.
Path to a local directory whose contents are uploaded to the sandbox as the agent’s starting workspace. Equivalent to the non-test files in a fixture directory.
The assertion function. Receives the full test context t. In addition to all standard t methods, t.run() is available to trigger the agent run explicitly.
Drives the sandbox agent with the configured prompt and files. You must call this before asserting any workspace state (diff, files, tests).
Dataset export (fan-out)
When a single eval file exports an array as its default export, niceeval fans it out into one eval per element. This is the canonical way to write parameterized test suites.

| Array index | Generated ID |
|---|---|
| 0 | sql/0000 |
| 1 | sql/0001 |
| 12 | sql/0012 |
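The ID pattern in the table (file-derived base ID plus a four-digit, zero-padded array index) can be reproduced with a short helper. This is a sketch inferred from the three rows shown, with a hypothetical function name; behavior beyond index 9999 is not covered by the table.

```typescript
// Builds one ID per array element: "<baseId>/<index padded to 4 digits>".
function fanOutIds(baseId: string, count: number): string[] {
  return Array.from(
    { length: count },
    (_, i) => `${baseId}/${String(i).padStart(4, "0")}`,
  );
}

const ids = fanOutIds("sql", 13);
console.log(ids[0]);  // sql/0000
console.log(ids[1]);  // sql/0001
console.log(ids[12]); // sql/0012
```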
loadJson from niceeval/loaders works the same way as loadYaml.