# niceeval: TypeScript eval framework for AI agents and LLMs

> niceeval is a TypeScript eval library for AI agents. Evaluate coding agents, HTTP services, and in-process functions with one unified API.

niceeval is a lightweight TypeScript library that brings structured, repeatable evaluation to AI agents. Instead of manually poking at your agent and hoping for the best, you write declarative eval files that describe what a good result looks like, and niceeval runs them, scores the output, and produces readable reports. Whether you're shipping a Claude Code plugin, a customer-facing HTTP service, or an internal function that wraps an LLM call, niceeval handles all three with the same `defineEval` API.

## What you can evaluate

niceeval covers three categories of agent under test, all through a unified interface:

<CardGroup cols={3}>
  <Card title="Coding agents" icon="code">
    Drop a CLI agent — Claude Code, Codex, bub, or any compatible tool — into an isolated Docker sandbox, give it a task, and verify the result with real tests and file assertions.
  </Card>

  <Card title="HTTP services & deployed agents" icon="globe">
    Point niceeval at any running HTTP endpoint or deployed agent. Assert on replies, tool calls, and structured output without touching the deployment.
  </Card>

  <Card title="In-process functions" icon="function">
    Call your own TypeScript functions directly inside the eval process. Treat evals as semantic unit tests and run them in CI with zero network overhead.
  </Card>
</CardGroup>

## Why "fast"

niceeval is built around three distinct kinds of speed that matter when you're iterating on agent behavior.

**Fast to author.** Each eval lives in a single file, and its ID is derived automatically from the file path — `evals/weather/brooklyn.eval.ts` becomes `weather/brooklyn`. You write assertions inline in a linear `async test(t)` function with no callbacks and no boilerplate. A one-line `.map` fans a dataset out into dozens of cases.
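
The path-to-ID rule and the `.map` fan-out described above can be sketched in plain TypeScript. This is only an illustration: `evalIdFromPath`, the `evals` root default, and the case fields (`name`, `prompt`, `mustMention`) are hypothetical names, not part of niceeval's API, which derives IDs for you.

```typescript
// Hypothetical helper that mirrors the documented mapping from file path to
// eval ID. niceeval does this internally; the function name is an assumption.
function evalIdFromPath(filePath: string, root = "evals"): string {
  // Normalize Windows separators so the same rule applies on every platform.
  const normalized = filePath.replace(/\\/g, "/");
  const prefix = root + "/";
  const relative = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;
  // Drop the `.eval.ts` suffix to obtain the ID.
  return relative.replace(/\.eval\.ts$/, "");
}

// A one-line `.map` turns a small dataset into one case per entry.
// The case shape below is illustrative only.
const cities = ["Brooklyn", "Queens", "Harlem"];
const weatherCases = cities.map((city) => ({
  name: city.toLowerCase(),
  prompt: `What is the weather in ${city}?`,
  mustMention: city,
}));

console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts")); // weather/brooklyn
console.log(weatherCases.length); // 3
```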

**Fast to run.** The runner uses bounded concurrency so evals execute in parallel without overwhelming your agent. A fingerprint-based result cache skips cases that already passed. Sandboxes can be reused and pre-warmed between runs. The `earlyExit` flag stops retrying a task the moment one attempt succeeds.
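
To make two of these ideas concrete, here is a minimal sketch of bounded concurrency and early exit. It is not niceeval's runner: `runBounded` and `runAttempts` are hypothetical names, and the behavior without `earlyExit` (running every attempt) is an assumption made for the example.

```typescript
// Run `tasks` with at most `limit` in flight at once; results keep input order.
async function runBounded<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unclaimed index until none remain.
  // `next++` is safe here because JavaScript only switches tasks at `await`.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      const task = tasks[index];
      if (task) results[index] = await task();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Try a task up to `maxAttempts` times. With `earlyExit`, stop at the first
// success; without it (assumed here), every attempt still runs.
async function runAttempts(
  attempt: (n: number) => Promise<boolean>,
  maxAttempts: number,
  earlyExit: boolean,
): Promise<{ passed: boolean; attemptsUsed: number }> {
  let passed = false;
  let attemptsUsed = 0;
  for (let n = 1; n <= maxAttempts; n++) {
    attemptsUsed = n;
    if (await attempt(n)) {
      passed = true;
      if (earlyExit) break;
    }
  }
  return { passed, attemptsUsed };
}
```

A fixed pool of workers pulling from a shared index keeps the number of in-flight calls at or below `limit` without needing a queue library, which is why the sketch uses it.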

**Fast to read.** Console output streams in real time so you see failures as they happen. Every run produces structured artifacts — an event stream, full transcript, file diffs, and assertion results — in a `.niceeval/<timestamp>/` directory. A unified trace makes it easy to reconstruct exactly what the agent did.

## How niceeval is structured

At a high level, your `evals/` directory is the input and `.niceeval/` artifacts are the output. Everything in between is owned by three collaborating pieces:

```
   your evals/ directory            niceeval core               agent adapter
   ─────────────────────           ──────────────             ─────────────────
   weather.eval.ts   ──discover──>  Runner  ──send──>  Agent  ─── in-process
   sql.eval.ts                        │                        ─── remote HTTP
   fixtures/button/  ──fixture──>     │                        ─── sandbox ──> Docker
     PROMPT.md                        ▼
     EVAL.ts                       Scorers ──> Reporters ──> .niceeval/<run>/
                                   (expect / scoped /         (summary.json /
                                    judge / tests)             event stream /
                                                               transcript / diff)
```

* **niceeval core** owns everything that is the same regardless of what you're evaluating: eval discovery, assertion collection, scoring, concurrency scheduling, caching, reporting, and artifact persistence.
* **Agent adapters** are the open boundary between core and your system under test. Official adapters are included for Claude Code, Codex, and bub; you write your own adapter for any other agent or service.
* **Sandbox** owns the details of running coding agents in isolation — Docker by default, with support for third-party sandbox providers.

<Note>
  niceeval never hardcodes agent-specific logic in its core. It dispatches entirely through the adapter interface, which means adding a new agent type never requires changes to the runner or scorers.
</Note>
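
The adapter boundary can be pictured with a small sketch. Everything here is an assumption made for illustration: `AgentAdapter`, `AgentReply`, `inProcessAdapter`, and `runCase` are hypothetical names, and niceeval's real adapter interface may look different. The point is only that the caller depends on the interface and never on a specific agent type.

```typescript
// Hypothetical reply and adapter shapes; not niceeval's actual types.
interface AgentReply {
  text: string;
}

interface AgentAdapter {
  name: string;
  send(prompt: string): Promise<AgentReply>;
}

// An in-process adapter that wraps a plain (sync or async) function.
function inProcessAdapter(
  name: string,
  fn: (prompt: string) => string | Promise<string>,
): AgentAdapter {
  return {
    name,
    async send(prompt: string): Promise<AgentReply> {
      return { text: await fn(prompt) };
    },
  };
}

// Core-side code sees only the interface, so a new agent type means a new
// adapter rather than a change to this function.
async function runCase(
  adapter: AgentAdapter,
  prompt: string,
  check: (reply: AgentReply) => boolean,
): Promise<{ agent: string; passed: boolean }> {
  const reply = await adapter.send(prompt);
  return { agent: adapter.name, passed: check(reply) };
}
```

For example, `runCase(inProcessAdapter("shout", (p) => p.toUpperCase()), "hi", (r) => r.text === "HI")` resolves to a passing result for the `shout` agent.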

## Two integration modes

niceeval supports two top-level integration modes depending on whether the agent under test needs an isolated workspace.

**Sandbox mode** — for coding agents like Codex and Claude Code that must operate on a real filesystem:

```
   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │    niceeval core    │
   │ discover·schedule·  │
   │    score·report     │
   └─────────────────────┘
        │  Agent adapter (official)
        ▼
   ┌──────────────────────────────┐
   │         Docker Sandbox        │
   │   ┌────────────────────────┐  │
   │   │  Codex / Claude Code / │  │
   │   │  apps needing isolation│  │
   │   └────────────────────────┘  │
   └──────────────────────────────┘
```

**Direct mode** — for HTTP services and in-process functions that don't need Docker:

```
   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │    niceeval core    │
   │ discover·schedule·  │
   │    score·report     │
   └─────────────────────┘
        │  Agent adapter (official or custom)
        ▼
   ┌──────────────────────────────┐
   │       your own Web Agent      │
   │  (HTTP / AI SDK · LangGraph · │
   │   and other frameworks —      │
   │        no Docker needed)      │
   └──────────────────────────────┘
```
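
As a rough sketch of what talking to a deployed agent in direct mode might involve, the function below posts a prompt to an HTTP endpoint and returns the reply text. The wire format (`{ prompt }` in, `{ reply }` out), the `sendOverHttp` name, and the `FetchLike` type are all assumptions made for this example; your service and niceeval's own HTTP support may use a different shape.

```typescript
// Minimal fetch-shaped type so the sketch does not depend on DOM typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical wire format: POST { prompt } and read back { reply }.
async function sendOverHttp(
  fetchImpl: FetchLike,
  url: string,
  prompt: string,
): Promise<string> {
  const response = await fetchImpl(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!response.ok) {
    throw new Error(`Agent endpoint returned HTTP ${response.status}`);
  }
  const data = await response.json();
  // Validate the payload instead of trusting the remote service.
  const reply =
    typeof data === "object" && data !== null
      ? (data as { reply?: unknown }).reply
      : undefined;
  if (typeof reply !== "string") {
    throw new Error("Agent response is missing a string `reply` field");
  }
  return reply;
}
```

Taking the fetch implementation as a parameter keeps the function testable with a stub; in Node 18 or later the global `fetch` can be passed in directly.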

## What comes next

The [Quickstart](/quickstart) walks you through installing niceeval, scaffolding your project with `npx niceeval init`, and running all three eval types — function, conversational, and coding-agent — in under ten minutes.

If you want to go deeper right away, the [Installation](/installation) page covers prerequisites, configuration options, and environment variables in full detail.
