# Evals in niceeval: lifecycle, outcomes, and eval files

> An eval is a single test case: a description, an agent reference, and a test function. Learn how evals are discovered, scheduled, scored, and reported.

An eval is the atomic unit of everything niceeval does. At its simplest, it is a TypeScript file that says: "send this input to this agent, and verify that what comes back meets these conditions." Everything else — discovery, concurrency, caching, reporting, artifact persistence — exists to run evals reliably and give you clear answers about whether your agent is behaving correctly.

## Anatomy of an eval

Every eval is created with `defineEval` and has three essential parts: a human-readable **description** that appears in reports, an **agent** reference that names the subject under test (optional in the eval file itself, since normal runs select the agent via the experiment), and an `async test(t)` **function** where you express what success looks like.

```ts theme={null}
// evals/weather/brooklyn.eval.ts
import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What is the weather in Brooklyn today?");
    t.succeeded();                                          // scoped assertion
    t.calledTool("get_weather", { input: { city: "Brooklyn" } });
    t.check(t.reply, includes("sunny"));                   // value assertion
  },
});
```

The full set of fields `defineEval` accepts:

```ts theme={null}
defineEval({
  description?: string;            // shown in reports
  agent?: string;                  // optional eval-local default; normal runs select agent via experiment
  tags?: string[];                 // for --tag filtering on the CLI
  judge?: JudgeConfig;             // override the default judge model for this eval
  reporters?: Reporter[];          // eval-specific reporters
  timeoutMs?: number;              // override the default timeout
  metadata?: Record<string, unknown>;
  async test(t) { /* interactions and assertions */ },
});
```

<Note>
  You must **not** provide an `id` or `name` field. niceeval derives the ID automatically from the file path.
</Note>

## Path as identity

The file path of an eval **is** its ID. niceeval strips the `evals/` prefix and the `.eval.ts` suffix to produce a stable, human-readable identifier:

| File path                                    | Eval ID            |
| -------------------------------------------- | ------------------ |
| `evals/weather/brooklyn.eval.ts`             | `weather/brooklyn` |
| `evals/sql.eval.ts`                          | `sql`              |
| `evals/fixtures/button/` (fixture directory) | `fixtures/button`  |

This convention has an important consequence: **renaming the file changes the eval's ID**. There is no hidden registry to keep in sync. The path is always the truth, and cached results, CI history, and experiment records all use the same path-derived key.
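The path-to-ID rule in the table above can be sketched as a small pure function. This is an illustration only: `evalIdFromPath` is a hypothetical helper name, not part of niceeval's API, and the separator normalization is an assumption.

```typescript
// Hypothetical helper: sketches the documented rule, not niceeval's actual code.
function evalIdFromPath(filePath: string): string {
  // Assumption: Windows separators are normalized so IDs match across platforms.
  let id = filePath.replace(/\\/g, "/");

  // Strip the leading "evals/" prefix.
  if (id.startsWith("evals/")) {
    id = id.slice("evals/".length);
  }

  // Strip the ".eval.ts" suffix (eval files only; fixture directories have none).
  if (id.endsWith(".eval.ts")) {
    id = id.slice(0, -".eval.ts".length);
  }

  // Fixture directories may be written with a trailing slash.
  return id.replace(/\/+$/, "");
}

console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts"));
console.log(evalIdFromPath("evals/fixtures/button/"));
```

Because the function depends only on the path, moving `evals/sql.eval.ts` to `evals/db/sql.eval.ts` yields a different ID (`db/sql`), which is the renaming consequence described above.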

You use the ID to filter which evals to run from the CLI:

```shell theme={null}
npx niceeval exp local weather          # runs all evals whose ID starts with "weather"
npx niceeval exp local weather/brooklyn # runs exactly this one eval
```

## The eval lifecycle

<Steps>
  <Step title="Discovery">
    When you run `npx niceeval exp ...`, the runner recursively scans the `evals/` directory for files ending in `.eval.ts` and for fixture directories (those containing a `PROMPT.md`). Default exports of `defineEval(...)` or an array of `defineEval(...)` calls are registered as individual evals.
  </Step>

  <Step title="Scheduling">
    The runner dispatches evals up to `maxConcurrency` at a time. Before dispatching an eval, it checks the fingerprint cache: if the eval's source, its inputs, and the active agent haven't changed since the last passing run, the cached result is replayed and the eval is skipped — saving both time and API cost.
  </Step>

  <Step title="agent.send">
    For each eval, the runner calls `agent.send(input, ctx)` with the text from your first `t.send(...)` call. The adapter drives the subject under test and returns a `Turn` containing the standard event stream. Multi-turn evals call `agent.send` once per `await t.send(...)`.
  </Step>

  <Step title="Scoring">
    Once your `test(t)` function completes, the core evaluates every registered assertion — value assertions (`t.check`, `t.require`), scoped assertions (`t.succeeded()`, `t.calledTool()`, etc.), and LLM-as-judge calls — against the collected turn data.
  </Step>

  <Step title="Outcome">
    All assertion results are folded into a single outcome by `outcome.ts`. The rules are deterministic and described in full in the [Scoring](/concepts/scoring) page.
  </Step>

  <Step title="Report">
    The outcome and all assertion details stream to the console in real time. When the full run finishes, reporters write structured output (JUnit, JSON, etc.) and the `.niceeval/<run>/` directory is populated with artifacts: `summary.json`, per-eval results, the event stream, transcript, generated-file diffs, and test output.
  </Step>
</Steps>
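As a rough illustration of the cache check in the Scheduling step, the sketch below builds a fingerprint from the eval's source, its inputs, and the active agent, then replays only when the cached result was a pass. Every name here (`FingerprintInput`, `fingerprint`, `canReplay`) is hypothetical, and the FNV-1a hash is a dependency-free stand-in; this page does not document niceeval's real cache key or storage format.

```typescript
// Hypothetical shapes: assumptions for illustration, not niceeval's cache format.
interface FingerprintInput {
  source: string;   // contents of the eval file
  inputs: string[]; // e.g. dataset rows or fixture files the eval reads
  agent: string;    // name of the active agent
}

type CachedOutcome = "passed" | "failed" | "scored" | "skipped";

// 32-bit FNV-1a over UTF-16 code units; a stand-in for a real content hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function fingerprint(input: FingerprintInput): string {
  // Serializing a fixed-shape array keeps field boundaries unambiguous.
  return fnv1a(JSON.stringify([input.source, input.inputs, input.agent]));
}

// True when a cached passing result exists for exactly this source/inputs/agent.
function canReplay(cache: Map<string, CachedOutcome>, input: FingerprintInput): boolean {
  return cache.get(fingerprint(input)) === "passed";
}
```

Changing any one of the three ingredients produces a different key, so the lookup misses and the eval is dispatched again.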

## Outcome types

Each eval ends with exactly one outcome. Understanding what each outcome means helps you interpret run output and configure CI thresholds correctly.

<CardGroup cols={2}>
  <Card title="passed" icon="circle-check" color="#22c55e">
    No execution errors. All gate assertions passed. All soft assertions met their thresholds. This is the unambiguous success state.
  </Card>

  <Card title="failed" icon="circle-xmark" color="#ef4444">
    An execution error occurred (timeout, thrown exception, author mistake), **or** at least one gate assertion did not pass. A failed eval is a hard signal that something is broken.
  </Card>

  <Card title="scored" icon="chart-bar" color="#f59e0b">
    All gate assertions passed, but at least one soft assertion fell below its threshold. This means "usable but there is a quality regression." Scored evals do not fail the run by default — only under `--strict`.
  </Card>

  <Card title="skipped" icon="forward" color="#6b7280">
    Your `test(t)` function called `t.skip("reason")`, signaling that a prerequisite was missing or the eval does not apply to the current configuration. Skipped evals are excluded from pass-rate calculations.
  </Card>
</CardGroup>

When you run an eval more than once (`runs > 1`), the summary for that eval becomes a **pass rate** (percentage of runs that produced `passed`) and an average latency, rather than a single outcome.
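A minimal sketch of that per-eval summary, assuming each run records an outcome and a latency. `RunResult` and `summarizeRuns` are hypothetical names, and excluding skipped runs from the denominator is an assumption carried over from the `skipped` card above.

```typescript
// Hypothetical record of one run of one eval; field names are assumptions.
interface RunResult {
  outcome: "passed" | "failed" | "scored" | "skipped";
  latencyMs: number;
}

interface RunSummary {
  passRate: number;     // percentage of counted runs that produced "passed"
  avgLatencyMs: number; // mean latency over counted runs
}

function summarizeRuns(runs: RunResult[]): RunSummary {
  // Assumption: skipped runs are left out of the calculation entirely.
  const counted = runs.filter((r) => r.outcome !== "skipped");
  if (counted.length === 0) {
    return { passRate: 0, avgLatencyMs: 0 };
  }
  const passed = counted.filter((r) => r.outcome === "passed").length;
  const totalLatency = counted.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    passRate: (passed / counted.length) * 100,
    avgLatencyMs: totalLatency / counted.length,
  };
}
```

Note that under this reading a `scored` run lowers the pass rate, since only `passed` counts toward the numerator.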

## Gate vs soft assertions (brief introduction)

Every assertion carries a **severity** that determines how it affects the outcome:

* **Gate** assertions are hard requirements. If a gate assertion fails, the entire eval is `failed`. Use gate for facts that must be true: "the agent called the right tool", "the output parsed as valid JSON", "no shell commands failed."
* **Soft** assertions are quality scores with a numeric threshold. If a soft assertion's score falls below its threshold, the eval becomes `scored` rather than `passed`. Use soft for continuous judgments: similarity scoring, LLM-as-judge factuality, efficiency budgets you want to track without blocking CI.
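The way the two severities fold into one outcome can be sketched as follows. This is a simplified model built only from the descriptions on this page: the types, `foldOutcome`, and `runFails` are hypothetical, and the precedence of skip over error is an assumption. The authoritative rules are on the [Scoring](/concepts/scoring) page.

```typescript
type Outcome = "passed" | "failed" | "scored" | "skipped";

// Hypothetical assertion result shapes, discriminated by severity.
type AssertionResult =
  | { severity: "gate"; passed: boolean }
  | { severity: "soft"; score: number; threshold: number };

interface EvalRun {
  skipped: boolean;        // test(t) called t.skip(...)
  executionError: boolean; // timeout, thrown exception, author mistake
  assertions: AssertionResult[];
}

function foldOutcome(run: EvalRun): Outcome {
  if (run.skipped) return "skipped"; // assumption: skip is checked first
  if (run.executionError) return "failed";
  // Any failing gate assertion fails the whole eval.
  if (run.assertions.some((a) => a.severity === "gate" && !a.passed)) return "failed";
  // Gates passed, but a soft score fell below its threshold.
  if (run.assertions.some((a) => a.severity === "soft" && a.score < a.threshold)) return "scored";
  return "passed";
}

// "scored" fails the run only when strict mode is on.
function runFails(outcomes: Outcome[], strict: boolean): boolean {
  return outcomes.some((o) => o === "failed" || (strict && o === "scored"));
}
```

For example, a run with a passing gate and a soft score of 0.7 against a threshold of 0.8 folds to `scored`, which leaves the overall run green unless `--strict` is set.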

Matchers from `niceeval/expect` carry sensible defaults (`includes` and `equals` default to gate; `similarity` and judge calls default to soft), and you can override the severity with a chain method:

```ts theme={null}
t.check(t.reply, includes("confirmed"));          // gate by default
t.check(t.reply, similarity(expected).gate());    // promote similarity to gate
t.judge.autoevals.factuality(reference).atLeast(0.8);       // soft with explicit threshold
```

Full details on all matchers, scoped assertions, LLM-as-judge, and the outcome folding rules are on the [Scoring](/concepts/scoring) page.

## The `*.eval.ts` naming convention

niceeval discovers evals by scanning for files that match the `*.eval.ts` glob. A few conventions help keep a large eval suite organized:

* Files must end in `.eval.ts` to be discovered. Any other `.ts` file in `evals/` is ignored.
* Use subdirectories to group related evals. `evals/billing/refund.eval.ts` produces ID `billing/refund`.
* Dataset files and helper utilities live alongside eval files but do not match `*.eval.ts`, so they are never mistakenly treated as evals.
* Fixture directories are discovered separately by the presence of `PROMPT.md`, not by filename pattern.
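The discovery rules above can be modeled as a pure function over a list of relative paths. This is a sketch only: `discover` and `Discovered` are hypothetical names, and the real runner walks the filesystem rather than taking a path list.

```typescript
// Hypothetical result shape for illustration.
interface Discovered {
  evalFiles: string[];   // files matching *.eval.ts
  fixtureDirs: string[]; // directories that contain a PROMPT.md
}

function discover(paths: string[]): Discovered {
  const evalFiles: string[] = [];
  const fixtureDirs: string[] = [];

  for (const p of paths) {
    // Only files under evals/ are considered.
    if (!p.startsWith("evals/")) continue;

    if (p.endsWith(".eval.ts")) {
      evalFiles.push(p);
    } else if (p.endsWith("/PROMPT.md")) {
      // The fixture is the directory holding PROMPT.md, not the file itself.
      fixtureDirs.push(p.slice(0, -"/PROMPT.md".length));
    }
    // Anything else (helpers, datasets) is ignored.
  }

  return { evalFiles: evalFiles.sort(), fixtureDirs: fixtureDirs.sort() };
}

const found = discover([
  "evals/billing/refund.eval.ts",
  "evals/billing/helpers.ts",
  "evals/data/sql-cases.yaml",
  "evals/fixtures/button/PROMPT.md",
]);
console.log(found);
```

Here `helpers.ts` and `sql-cases.yaml` are dropped, while the `button` fixture is picked up through its `PROMPT.md` rather than through any filename pattern.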

## Array exports and dataset fan-out

When a `*.eval.ts` file's default export is an **array** of `defineEval(...)` calls, niceeval registers each element as a separate eval. This is the canonical way to evaluate an agent against a dataset:

```ts theme={null}
// evals/sql.eval.ts
import { defineEval } from "niceeval";
import { loadYaml } from "niceeval/loaders";
import { equals } from "niceeval/expect";

const doc = await loadYaml("evals/data/sql-cases.yaml");
const rows = doc.cases as { task: string; prompt: string; sql: string }[];

export default rows.map((row) =>
  defineEval({
    description: row.task,
    async test(t) {
      await t.send(row.prompt);
      t.succeeded();
      t.check(t.reply, equals(row.sql));
    },
  }),
);
```

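IDs for array-exported evals are generated as `<file-id>/<zero-padded-index>`, for example `sql/0000` and `sql/0001`, so they sort in dataset order. Because each ID is derived from the element's position in the array, reordering or inserting rows changes which case a given ID refers to; append new rows at the end if you want existing IDs to keep pointing at the same cases.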
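The ID scheme can be sketched in a few lines. `arrayEvalIds` is a hypothetical helper, and the padding width of 4 is inferred from the `sql/0000` example rather than stated by the source.

```typescript
// Hypothetical helper: builds "<file-id>/<zero-padded-index>" for each array element.
// Assumption: indices are padded to 4 digits, as in the "sql/0000" example.
function arrayEvalIds(fileId: string, count: number, width = 4): string[] {
  return Array.from(
    { length: count },
    (_, index) => `${fileId}/${String(index).padStart(width, "0")}`,
  );
}

const ids = arrayEvalIds("sql", 3);
console.log(ids);
```

Zero-padding means a plain string sort matches numeric order, and each ID depends only on the element's position in the exported array.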

<Tip>
  Dataset fan-out is the fastest way to go from a spreadsheet of expected inputs and outputs to a full eval suite. One `.map()` call can produce dozens or hundreds of test cases that run concurrently and report individually.
</Tip>

## Related pages

* [Agents & Adapters](/concepts/agents-adapters) — how the `agent` field connects your eval to a subject under test.
* [Scoring](/concepts/scoring) — the complete assertion vocabulary and outcome rules.
* [Overview](/concepts/overview) — the full architecture diagram and layer responsibilities.