> ## Documentation Index
> Fetch the complete documentation index at: https://niceeval.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# defineEval: declare, configure, and run evals in niceeval

> Complete reference for defineEval and defineAgentEval. Covers all options, the test context t, Turn return value, and dataset array exports.

`defineEval` is the primary building block for writing evals in niceeval. You call it once per file (or once per row when the file exports a dataset array), pass a configuration object describing what to test, and export the result as the default export. The runner discovers your file, derives its ID from the file path, and executes the `test` function under the selected experiment's agent.

```ts
import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What's the weather like in Brooklyn today?");
    t.succeeded();
    t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
    t.check(t.reply, includes("sunny"));
  },
});
```

You must **not** provide an `id` or `name` field. niceeval derives the eval ID from the file path: `evals/weather/brooklyn.eval.ts` becomes `weather/brooklyn`. To rename the ID, rename the file; the ID can never drift out of sync with the file's location.

***

## defineEval options

A human-readable label for this eval. Appears in console output and reports. Does not affect the ID, which comes from the file path.

An array of tag strings used to filter evals via `--tag` on the CLI. Tags let you group related evals (e.g. `"billing"`, `"regression"`, `"slow"`) without changing the directory structure.

```ts
tags: ["billing", "regression"],
```

Overrides the judge model for this specific eval. Takes precedence over the global `judge.model` set in `defineConfig`. The `model` field accepts any model string understood by your judge backend.
```ts
judge: { model: "anthropic/claude-opus-4-8" },
```

A list of reporters applied only to this eval, in addition to (or instead of) the globally configured reporters. Useful when a specific eval needs specialized output formats.

Per-eval timeout in milliseconds. Overrides the global `timeoutMs` in `defineConfig`. When the timeout elapses, the eval is marked `failed` with `error: timeout` and the runner moves on.

Arbitrary key-value pairs attached to this eval's result record. Useful for downstream analysis, custom reporters, or dashboard annotations.

```ts
metadata: { owner: "platform-team", jira: "PLAT-1234" },
```

The async function that drives the agent and asserts results. Receives the test context `t` (see below). All assertions, sends, and judge calls live here.

```ts
async test(t) {
  const turn = await t.send("Summarize this document.");
  // `document` holds the source text the agent was asked to summarize.
  t.judge.autoevals.summarizes(document).atLeast(0.8);
  t.check(turn.message, includes("key finding"));
},
```

***

## The test context: `t`

The `t` object is assembled by the runner based on the **capabilities** declared by the agent adapter. You get a different set of methods depending on what the agent supports. At the TypeScript level, methods that require a capability your agent hasn't declared are simply not present on `t`, so misconfiguration shows up at compile time, not at runtime.

### Always available

These methods are available regardless of which agent or capabilities are configured.

Sends a message to the agent and waits for the response. Returns a `Turn` object (see below). Each call to `t.send` counts as one turn; call it multiple times for multi-turn conversations (requires the `conversation` capability).

```ts
const turn = await t.send("What is the capital of France?");
```

Evaluates `assertion` against `value` immediately and records the result. If the assertion fails, the eval's status follows the assertion's severity: a `gate` failure marks the eval `failed`, while a `soft` failure leaves it `passed` (`gate` → `failed`, `soft` → `passed`).
Does **not** throw; execution continues.

```ts
t.check(t.reply, includes("Paris"));
t.check(turn.data, equals({ intent: "refund" }));
```

Like `t.check`, but **throws immediately** if the assertion fails, aborting the rest of the test. Use this for preconditions where continuing would be meaningless.

```ts
t.require(turn.status, equals("completed")); // abort if the agent errored
```

Writes a diagnostic message to the eval's output log. Appears in `.niceeval/` artifacts and in verbose console output. Useful for debugging flaky evals.

Marks this eval as `skipped` with the given reason and stops execution. Use when a prerequisite isn't available (e.g. a required environment variable is missing).

```ts
if (!process.env.OPENAI_API_KEY) t.skip("OPENAI_API_KEY not set");
```

The feature flags passed down from the experiment configuration. Read these inside your `test` to branch on experiment variables.

```ts
if (t.flags.useExtendedPrompt) { /* ... */ }
```

The model tier string that was passed to the agent for this run. Read-only. Useful for logging or conditional assertions.

The `AbortSignal` for this eval's lifetime. Forwarded from the runner's timeout and early-exit logic. Pass it into any custom async work you do inside `test`.

***

### With `conversation` capability

Available when the agent declares `capabilities: { conversation: true }`.

The text of the most recent assistant message across all turns. Equivalent to reading the last `message` event with `role: "assistant"`. Shorthand for the common pattern of checking the final response.

```ts
t.check(t.reply, includes("confirmed"));
```

Signals the runner to start a fresh conversation session for subsequent `t.send` calls. The current session's history is discarded. Useful when you need to test multiple independent conversation threads within a single eval.
```ts
await t.send("Hello, remember my name is Alice.");
t.newSession();
const turn = await t.send("What is my name?");
// The agent should not remember: it's a new session.
```

***

### With `toolObservability` capability

Available when the agent declares `capabilities: { toolObservability: true }`. All of these are **scope-level assertions**: they are evaluated after the `test` function returns, reading from the accumulated standard event stream.

Asserts the agent called the named tool. Optional `opts` narrow the match:

* `input`: partial/deep match against call arguments (literal, regex against serialized form, or predicate)
* `count`: exact number of times the tool was called
* `status`: filter by call outcome (`"completed"` | `"failed"`)

```ts
t.calledTool("get_weather", { input: { city: "Brooklyn" }, count: 1 });
```

Asserts the agent did **not** call the named tool (with the given input, if provided). Accepts the same `opts` as `t.calledTool`.

```ts
t.notCalledTool("shell", { input: { command: /npm i/ } });
```

Asserts the listed tools were called in the given relative order. Other tools may appear between them.

```ts
t.toolOrder(["read_file", "write_file"]);
```

Asserts the agent made zero tool calls during this run. Useful for verifying lightweight responses that should not invoke any external actions.

Asserts the total number of tool calls was at most `n`.

```ts
t.maxToolCalls(5);
```

Syntactic sugar for `t.calledTool("load_skill", { input: { skill: skillName } })`. Asserts the agent loaded the named skill.

```ts
t.loadedSkill("memory-v2");
```

Asserts the agent delegated to a sub-agent with the given name. `opts` may include `remoteUrl` (string or RegExp) and output matchers.

```ts
t.calledSubagent("researcher", { remoteUrl: /api\.example/ });
```

Asserts that none of the tool calls, sub-agent calls, or skill loads ended with `status: "failed"`.
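To make the matching rules for `t.calledTool` and `t.toolOrder` more concrete, the sketch below re-implements them over a plain array. It is only an illustration: the `ToolCall` shape and the helpers `matchesInput`, `countCalls`, and `inRelativeOrder` are hypothetical names, not part of niceeval's API, and the real matcher may compare nested values differently.

```ts
// Hypothetical shapes for illustration; niceeval's real event types may differ.
type ToolCall = {
  name: string;
  input: Record<string, unknown>;
  status: "completed" | "failed";
};

type InputMatcher =
  | RegExp
  | ((input: Record<string, unknown>) => boolean)
  | Record<string, unknown>;

// Partial deep match: every key in `expected` must match the same key in `actual`.
// Extra keys in `actual` are ignored; a nested RegExp is tested against string values.
function partialMatch(actual: unknown, expected: unknown): boolean {
  if (expected instanceof RegExp) {
    return typeof actual === "string" && expected.test(actual);
  }
  if (typeof expected === "object" && expected !== null) {
    if (typeof actual !== "object" || actual === null) return false;
    const actualRecord = actual as Record<string, unknown>;
    return Object.entries(expected).every(([key, value]) =>
      partialMatch(actualRecord[key], value),
    );
  }
  return Object.is(actual, expected);
}

// Literal, regex-against-serialized-form, or predicate, as listed for `input`.
function matchesInput(call: ToolCall, matcher?: InputMatcher): boolean {
  if (matcher === undefined) return true;
  if (typeof matcher === "function") return matcher(call.input);
  if (matcher instanceof RegExp) return matcher.test(JSON.stringify(call.input));
  return partialMatch(call.input, matcher);
}

// Number of calls to `name` whose input satisfies the matcher (compare with `count`).
function countCalls(calls: ToolCall[], name: string, input?: InputMatcher): number {
  return calls.filter((c) => c.name === name && matchesInput(c, input)).length;
}

// Relative order: `names` must appear as a subsequence of the call names,
// so other tools may appear in between.
function inRelativeOrder(calls: ToolCall[], names: string[]): boolean {
  let next = 0;
  for (const call of calls) {
    if (next < names.length && call.name === names[next]) next++;
  }
  return next === names.length;
}

const calls: ToolCall[] = [
  { name: "read_file", input: { path: "a.ts" }, status: "completed" },
  { name: "shell", input: { command: "npm test" }, status: "completed" },
  { name: "write_file", input: { path: "a.ts", content: "x" }, status: "completed" },
];
```

With this sample data, `countCalls(calls, "read_file", { path: "a.ts" })` counts one matching call, and `inRelativeOrder(calls, ["read_file", "write_file"])` holds even though `shell` runs in between.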
Asserts a specific event type appears in the raw event stream. `opts` may include `count` and data matchers.

```ts
t.event("input.requested", { count: 1 });
```

Asserts that the given event type does **not** appear anywhere in the event stream.

```ts
t.notEvent("error");
```

Asserts event types appear in the given relative order in the stream.

```ts
t.eventOrder(["action.called", "subagent.called"]);
```

Escape hatch for custom event-stream assertions. Receives the full raw event array and must return a boolean.

```ts
t.eventsSatisfy("reads before writes", (events) => {
  const readIdx = events.findIndex(e => e.type === "action.called" && e.name === "read_file");
  const writeIdx = events.findIndex(e => e.type === "action.called" && e.name === "write_file");
  // findIndex returns -1 when nothing matches, so require both events to exist.
  return readIdx !== -1 && writeIdx !== -1 && readIdx < writeIdx;
});
```

***

### With `workspace` (sandbox) capability

Available for sandbox agents (those using `defineSandboxAgent`).

Asserts the agent modified the file at the given workspace-relative path. Derived from `git diff HEAD` after the agent run.

Asserts the agent deleted the file at the given path.

A queryable view of all changes the agent made to the workspace:

* `t.sandbox.diff.get(path)`: returns the post-run content of the file at `path`
* `t.sandbox.diff.isEmpty()`: asserts no files were changed
* `t.sandbox.diff.matches(re)`: asserts the full diff text matches a regex
* `t.notInDiff(re)`: asserts the diff does **not** match the regex (useful for detecting leaked secrets or banned patterns)

```ts
t.check(t.sandbox.diff.get("src/Button.tsx"), includes("onClick"));
t.notInDiff(/sk-[A-Za-z0-9]/); // no API keys in diff
```

Use this matcher with `t.check(await t.sandbox.runCommand(...), commandSucceeded())` to assert that a verification command exited with code 0.

Asserts a specific npm script (e.g. `"build"`, `"lint"`) exited with code 0.
```ts
t.check(
  await t.sandbox.runCommand("npm", ["run", "build"], { cwd: "/workspace" }),
  commandSucceeded(),
);
```

***

### Judge assertions

Available on any eval that has a judge model configured (globally or per-eval).

Uses the judge model to score factual consistency between the agent's reply and the `expected` reference text. Returns a `JudgeAssertion` with `.atLeast(threshold)` for setting a soft threshold.

```ts
t.judge.autoevals.factuality("Paris is the capital of France.").atLeast(0.8);
```

Asks the judge model a yes/no question about the reply. Returns a score between 0 and 1. Use `.atLeast(threshold)` to set a minimum passing score.

```ts
t.judge.autoevals.closedQA("Is the tone professional and on-topic?").atLeast(0.7);
```

Asks the judge whether the agent's reply faithfully summarizes the `source` text.

Free-form scoring against a custom rubric string. Useful when none of the built-in judge methods fit your evaluation criteria.

```ts
t.judge.autoevals.closedQA("Does the answer use simple language suitable for a 10-year-old?", {
  on: turn.message,
}).atLeast(0.6);
```

**`opts` accepted by all judge methods:**

* `on`: the value to evaluate (defaults to `t.reply`)
* `model`: overrides the judge model for this single call

***

### Efficiency assertions

Asserts the total token usage (input + output) for this run did not exceed `n`. Defaults to `gate` severity. Chain `.atLeast(0.7)` to downgrade it to `soft`.

```ts
t.maxTokens(50_000);               // gate: fails if over
t.maxTokens(80_000).atLeast(0.7);  // soft: going over does not fail the eval
```

Asserts the estimated cost of this run (based on a price table) did not exceed `usd` dollars.

```ts
t.maxCost(0.50); // fail if over $0.50
```

The accumulated token usage for the current run. It can be read at any point inside `test`.
```ts
t.check(t.usage.outputTokens, satisfies(n => n < 10_000, "output not verbose"));
```

**Fields:**

* Total prompt tokens consumed.
* Total completion tokens produced.
* Tokens served from prompt cache (if available).

***

## The Turn return type

`await t.send(text)` returns a `Turn` object. It is immutable and carries everything the agent produced for that one round-trip.

The raw standard event stream produced by the agent for this turn. All scope-level assertions read from this. See the `defineAgent` reference for the full `StreamEvent` union type.

Structured output returned by the agent (e.g. a parsed JSON object). Used by `turn.outputEquals()` and `turn.outputMatches()`.

The outcome of this turn. `"waiting"` means the agent stopped at a human-in-the-loop (`input.requested`) prompt.

Convenience field: the concatenated text of all `message` events with `role: "assistant"` in this turn. Derived from `events`.

Convenience field: the list of tool calls made during this turn, each with `name`, `input`, `output`, and `status`. Derived from `events`.

Token usage for this specific turn (if reported by the agent).

### Turn methods

Asserts that `turn.data` deeply equals `expected`. Equivalent to `t.check(turn.data, equals(expected))` but scoped to the turn object.

```ts
const turn = await t.send("Return the intent as JSON.");
turn.outputEquals({ intent: "refund" });
```

Validates `turn.data` against a Standard Schema (e.g. a Zod schema).

```ts
import { z } from "zod";

turn.outputMatches(z.object({ intent: z.enum(["refund", "ship"]) }));
```

***

## defineAgentEval

`defineAgentEval` is a convenience wrapper for **coding-agent evals** that you want to define programmatically rather than via a fixture directory. It is equivalent to a fixture (`PROMPT.md` + `EVAL.ts`) but expressed entirely in TypeScript.
```ts
import { defineAgentEval } from "niceeval";
// Assumes commandSucceeded is exported alongside the other matchers.
import { commandSucceeded, includes } from "niceeval/expect";

export default defineAgentEval({
  description: "Rewrite callbacks to async/await",
  prompt: "Rewrite all callbacks in src/legacy.js to async/await, preserving behavior.",
  files: "./fixtures/legacy-callbacks", // workspace starter files
  async test(t) {
    await t.run(); // drives the sandbox agent
    t.fileChanged("src/legacy.js");
    t.check(t.sandbox.diff.get("src/legacy.js"), includes("await"));
    t.check(
      await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }),
      commandSucceeded(),
    );
  },
});
```

Human-readable description. Appears in reports.

The task prompt sent to the coding agent. Equivalent to the contents of `PROMPT.md` in a fixture directory.

Path to a local directory whose contents are uploaded to the sandbox as the agent's starting workspace. Equivalent to the non-test files in a fixture directory.

The assertion function. Receives the full test context `t`. In addition to all standard `t` methods, `t.run()` is available to trigger the agent run explicitly.

Drives the sandbox agent with the configured `prompt` and `files`. You must call this before asserting any workspace state (diff, files, tests).

***

## Dataset export (fan-out)

When a single eval file exports an **array** as its default export, niceeval fans it out into one eval per element. This is the canonical way to write parameterized test suites.
```ts
// evals/sql.eval.ts
import { defineEval } from "niceeval";
import { loadYaml } from "niceeval/loaders";
import { equals } from "niceeval/expect";

const doc = await loadYaml("evals/data/sql-cases.yaml");
const rows = doc.cases as { task: string; prompt: string; sql: string }[];

export default rows.map((row) =>
  defineEval({
    description: row.task,
    async test(t) {
      await t.send(row.prompt);
      t.succeeded();
      t.check(t.reply, equals(row.sql));
    },
  }),
);
```

**Generated IDs** use the file path as a prefix plus a zero-padded 4-digit index:

| Array index | Generated ID |
| ----------- | ------------ |
| 0           | `sql/0000`   |
| 1           | `sql/0001`   |
| 12          | `sql/0012`   |

IDs are stable as long as the array order doesn't change, making them safe to reference in CI history, dashboards, and issue trackers.

`loadJson` from `niceeval/loaders` works the same way as `loadYaml`.
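The ID rules described on this page (path-derived ID, then a zero-padded index for array exports) can be sketched in a few lines. This is a rough model for reasoning about ID stability, not niceeval's actual implementation: `evalIdFromPath` and `fanOutIds` are hypothetical helper names, and the `evals/` root and `.eval.ts` suffix are taken from the examples above.

```ts
// Hypothetical helpers; they mirror the documented ID rules, not niceeval's internals.

// "evals/weather/brooklyn.eval.ts" -> "weather/brooklyn"
function evalIdFromPath(filePath: string, root = "evals/"): string {
  const relative = filePath.startsWith(root) ? filePath.slice(root.length) : filePath;
  return relative.replace(/\.eval\.ts$/, "");
}

// One ID per array element: file-path prefix plus a zero-padded 4-digit index.
function fanOutIds(filePath: string, count: number): string[] {
  const prefix = evalIdFromPath(filePath);
  return Array.from(
    { length: count },
    (_, index) => `${prefix}/${String(index).padStart(4, "0")}`,
  );
}

const ids = fanOutIds("evals/sql.eval.ts", 13);
console.log(ids[0], ids[12]);
```

Because the index is purely positional, appending rows to the end of the dataset leaves existing IDs untouched, while inserting or reordering rows shifts every ID after the change.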