
# niceeval quickstart: run your first eval in 10 minutes

> Install niceeval, scaffold your project, and run your first three evals — function, conversational, and coding-agent — in under 10 minutes.

This guide takes you from a blank project to three working evals: one that calls an in-process function, one that drives a conversational agent over HTTP, and one that drops a coding agent into a Docker sandbox to write real code. By the end you'll have a runnable eval suite with CI-ready output.

If you already know what you want to evaluate, jump straight to the relevant example:

<CardGroup cols={3}>
  <Card title="Eval a Claude Code / Codex plugin" icon="plug" href="/example/claude-code-codex-plugin">
    For plugins, MCP servers, and project-level coding-agent extensions.
  </Card>

  <Card title="Eval a Claude Code / Codex Skill" icon="wand-magic-sparkles" href="/example/claude-code-codex-skill">
    Verify that a Skill is triggered, follows its prescribed flow, and actually improves task success rate.
  </Card>

  <Card title="Eval an AI agent application" icon="globe" href="/example/ai-agent-application">
    For HTTP agents, AI SDK, LangGraph, or any custom agent service.
  </Card>
</CardGroup>

<Steps>
  ### Install niceeval

  Add niceeval as a dev dependency, then run the scaffold command:

  ```shell theme={null}
  npm install -D niceeval
  npx niceeval init
  ```

  `npx niceeval init` reads your project layout and generates everything you need to run your first eval immediately.

  ### Explore the generated files

  After `init` completes, your project contains the following new entries:

  ```
  your-project/
  ├─ niceeval.config.ts
  └─ evals/
     ├─ hello.eval.ts            # example: conversational eval
     └─ fixtures/
        └─ button/               # example: sandbox coding-agent eval
           ├─ PROMPT.md
           ├─ EVAL.ts
           └─ package.json
  ```

  `niceeval.config.ts` is your central configuration: it sets the judge model, reporters, concurrency, timeout, and sandbox backend. `evals/` is where all your eval files live. Each eval's ID is derived from its file path; for example, a file at `evals/weather/brooklyn.eval.ts` gets the ID `weather/brooklyn`.

  <Note>
    You can safely delete or replace the generated example files once you've read through them. They're there to illustrate the shape of each eval type, not to run against a real agent.
  </Note>

  ### Configure your project

  Open `niceeval.config.ts` and review the defaults:

  ```ts theme={null}
  // niceeval.config.ts
  import { defineConfig } from "niceeval";
  import { Console, JUnit } from "niceeval/reporters";

  export default defineConfig({
    judge: { model: "anthropic/claude-haiku-4-5" },
    reporters: [Console(), JUnit(".niceeval/junit.xml")],
    maxConcurrency: 8,
    timeoutMs: 300_000,
    sandbox: "auto",  // uses cloud token if available, otherwise Docker
  });
  ```

  Before proceeding, set `ANTHROPIC_API_KEY` in your environment if you use Claude Code or the default Anthropic judge model, and `OPENAI_API_KEY` if you use Codex.
</Steps>

## Eval 1 — an in-process function

The fastest way to start is evaluating a TypeScript function that lives in your own codebase. You define an agent adapter that calls your function directly, then write an eval that checks the output. No network, no Docker.

First, create the agent adapter that wraps your function:

```ts theme={null}
// agents/classify.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js"; // your own code

export default defineAgent({
  name: "classify",
  async send(input) {
    return { data: await classifyIntent(input.text), status: "completed" };
  },
});
```
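If you don't yet have a function to evaluate, a toy stand-in is enough to try the flow end to end. The sketch below is a hypothetical `src/agent.ts`: the keyword rules and the `Intent` type are assumptions made for illustration, not part of niceeval. It returns synchronously, which still works with the `await` in the adapter above.

```typescript
// src/agent.ts (hypothetical stand-in for "your own code")
export type Intent = "refund" | "shipping" | "other";

// Naive keyword classifier. A real implementation would likely call a model.
export function classifyIntent(text: string): { intent: Intent } {
  const lower = text.toLowerCase();
  if (/\b(refund|return|money back)\b/.test(lower)) return { intent: "refund" };
  if (/\b(ship|shipping|delivery|track)\b/.test(lower)) return { intent: "shipping" };
  return { intent: "other" };
}
```

With this stand-in, the refund prompt used in the eval file produces `{ intent: "refund" }`.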

Then write the eval file:

```ts theme={null}
// evals/classify.eval.ts
import { defineEval } from "niceeval";
import { equals } from "niceeval/expect";

export default defineEval({
  description: "Intent classification: refund request",
  async test(t) {
    const turn = await t.send("I'd like to return my order and get a refund.");
    t.check(turn.data, equals({ intent: "refund" }));
  },
});
```

Add the `classify` agent to an experiment file such as `experiments/local.ts`, then run:

```shell theme={null}
npx niceeval exp local classify
```

<Tip>
  In-process evals are ideal for semantic regression testing. Because they call your code directly, they run in milliseconds and slot naturally into a standard CI pipeline alongside unit tests.
</Tip>

## Eval 2 — a conversational agent

To eval an agent that lives behind an HTTP endpoint, you write an adapter that handles the request/response cycle. The URL and any credentials are the adapter's concern — niceeval has no `--url` flag and imposes no protocol requirements.

Define the remote agent adapter:

```ts theme={null}
// agents/weather-bot.ts
import { defineAgent } from "niceeval/adapter";

export default defineAgent({
  name: "weather-bot",
  capabilities: { conversation: true, toolObservability: true },
  async send(input, ctx) {
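    const r = await fetch(`${process.env.AGENT_URL}/chat`, {
      method: "POST",
      // Declare the JSON body so the service parses it as JSON.
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.text }),
      signal: ctx.signal,
    });
    // Fail loudly on HTTP errors instead of parsing an error page as JSON.
    if (!r.ok) throw new Error(`Agent request failed with HTTP ${r.status}`);
    const body = await r.json();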
    return { message: body.reply, toolCalls: body.tools, status: "completed" };
  },
});
```
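The adapter above trusts the shape of the service's response. A more defensive version could validate the body before handing it back. The helper below is a sketch, not part of niceeval: `toTurn`, `Turn`, and `ToolCall` are hypothetical names, and the `{ name, input }` tool-call shape is an assumption based on how `calledTool` is used in the eval that follows.

```typescript
// Assumed tool-call shape; adjust it to whatever your service actually returns.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Mirrors the object the adapter returns from `send`.
interface Turn {
  message: string;
  toolCalls: ToolCall[];
  status: "completed";
}

// Hypothetical helper: converts a parsed `{ reply, tools }` body into a Turn.
function toTurn(body: unknown): Turn {
  if (typeof body !== "object" || body === null) {
    throw new Error("Agent response is not a JSON object");
  }
  const { reply, tools } = body as { reply?: unknown; tools?: unknown };
  if (typeof reply !== "string") {
    throw new Error("Agent response is missing a string `reply` field");
  }
  // Treat a missing or malformed `tools` field as "no tools were called".
  const toolCalls = Array.isArray(tools) ? (tools as ToolCall[]) : [];
  return { message: reply, toolCalls, status: "completed" };
}
```

Inside `send`, the final line would then become `return toTurn(await r.json());`.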

Write the eval:

```ts theme={null}
// evals/weather/brooklyn.eval.ts
import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What's the weather like in Brooklyn today?");
    t.succeeded();
    t.calledTool("get_weather", { input: { city: "Brooklyn" } });
    t.check(t.reply, includes("sunny"));
    t.judge.autoevals.closedQA("Is the answer polite and on-topic?").atLeast(0.7);
  },
});
```

Run it by passing your service URL as an environment variable:

```shell theme={null}
AGENT_URL=https://my-agent.example.com npx niceeval exp local weather
```

The four assertions above illustrate the main scoring tools available in the `test(t)` body:

| Assertion                                             | What it checks                                      |
| ----------------------------------------------------- | --------------------------------------------------- |
| `t.succeeded()`                                       | The agent completed without an error                |
| `t.calledTool(name, args)`                            | A specific tool was invoked with matching arguments |
| `t.check(value, matcher)`                             | A value checked against a matcher, exact or partial |
| `t.judge.autoevals.closedQA(question).atLeast(score)` | An LLM judge grades the response on a 0–1 scale     |
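To make the difference between these checks concrete, here is a sketch of how value matchers and a score threshold could behave. These are stand-ins written for illustration and deliberately named differently from niceeval's `equals` and `includes`; the library's actual implementations may differ.

```typescript
type Matcher = (actual: unknown) => boolean;

// Structural equality for plain JSON-like values (objects, arrays, primitives).
function deepEqual(a: unknown, b: unknown): boolean {
  if (Object.is(a, b)) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  );
}

// Exact match: every field must be present and equal.
const sketchEquals = (expected: unknown): Matcher => (actual) => deepEqual(actual, expected);

// Partial match: the reply only needs to contain the substring.
const sketchIncludes = (needle: string): Matcher => (actual) =>
  typeof actual === "string" && actual.includes(needle);

// Threshold match: a judge score on a 0 to 1 scale must reach the minimum.
const sketchAtLeast = (threshold: number): Matcher => (actual) =>
  typeof actual === "number" && actual >= threshold;

console.log(sketchEquals({ intent: "refund" })({ intent: "refund" }));
console.log(sketchIncludes("sunny")("It is sunny in Brooklyn today."));
console.log(sketchAtLeast(0.7)(0.65));
```

The first two calls print `true`; the last prints `false` because 0.65 is below the 0.7 threshold.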

## Eval 3 — a coding agent in a sandbox

For coding agents like Claude Code, Codex, or bub that need to write real files, niceeval uses a **fixture** — a small directory containing a prompt, hidden validation tests, and a minimal project to work in.

<CodeGroup>
  ```md PROMPT.md theme={null}
  <!-- evals/fixtures/button/PROMPT.md -->
  Using the project's existing styling system, export a Button component
  from src/components/Button.tsx that accepts `label` and `onClick` props
  and implements a hover state.
  ```

  ```ts EVAL.ts theme={null}
  // evals/fixtures/button/EVAL.ts — validation tests, hidden from the agent
  import { test, expect } from "vitest";
  import { existsSync, readFileSync } from "node:fs";

  test("Button component exists", () => {
    expect(existsSync("src/components/Button.tsx")).toBe(true);
  });

  test("accepts label and onClick props", () => {
    const src = readFileSync("src/components/Button.tsx", "utf-8");
    expect(src).toContain("label");
    expect(src).toContain("onClick");
  });

  test("no destructive shell commands", () => {
    const o11y = JSON.parse(
      readFileSync("__niceeval__/results.json", "utf-8")
    ).o11y;
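    // Check each command as a substring; toContain on an array only matches whole elements.
    const commands: string[] = o11y.shellCommands.map((c: { command: string }) => c.command);
    expect(commands.some((cmd) => cmd.includes("rm -rf"))).toBe(false);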
  });
  ```

  ```json package.json theme={null}
  {
    "name": "button-fixture",
    "type": "module",
    "scripts": { "build": "tsc --noEmit" },
    "devDependencies": { "vitest": "^2.0.0" }
  }
  ```
</CodeGroup>

The fixture works as follows: `PROMPT.md` is sent to the agent as its task. `EVAL.ts` is a Vitest test file that runs *after* the agent finishes — the agent never sees it. `package.json` defines the minimal project environment.
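The "no destructive shell commands" test in `EVAL.ts` looks for the literal string `rm -rf`. A slightly more thorough check, sketched below as a hypothetical helper (not part of niceeval or Vitest), also catches reordered or split flags such as `rm -fr` or `rm -r -f`. It is a heuristic and will not catch every destructive command.

```typescript
// Hypothetical helper: flags `rm` invocations that combine recursive and force options.
function isDestructive(command: string): boolean {
  // Split on common shell separators so `cd x && rm -rf y` is inspected per command.
  const parts = command.split(/&&|\|\||;|\|/).map((p) => p.trim());
  return parts.some((part) => {
    const tokens = part.split(/\s+/);
    const rmIndex = tokens.findIndex((t) => t === "rm" || t.endsWith("/rm"));
    if (rmIndex === -1) return false;
    const flags = tokens.slice(rmIndex + 1).filter((t) => t.startsWith("-"));
    // Merge short flags (-r, -f, -rf) into one string; long flags are checked by name.
    const short = flags.filter((f) => !f.startsWith("--")).join("");
    const recursive = /[rR]/.test(short) || flags.includes("--recursive");
    const force = /f/.test(short) || flags.includes("--force");
    return recursive && force;
  });
}
```

In `EVAL.ts`, the assertion could then read `expect(commands.filter(isDestructive)).toEqual([])`, where `commands` is the list of command strings from the results file.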

Run the coding-agent eval with Docker as the sandbox:

```shell theme={null}
export ANTHROPIC_API_KEY=sk-ant-...
npx niceeval exp local fixtures/button --sandbox docker
```

To give the agent up to 10 attempts and stop at the first success (drop `--early-exit` to run all 10 and measure the pass rate):

```shell theme={null}
npx niceeval exp local fixtures/button --runs 10 --early-exit
```
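Repeated runs can be summarized in more than one way. The sketch below assumes each attempt yields a pass/fail boolean. `passRate` is only meaningful when every attempt runs, whereas a run that stops at the first success tells you only whether at least one attempt passed. `passAtK` uses the standard unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k). These helpers are illustrative and not part of niceeval.

```typescript
// Fraction of attempts that passed.
function passRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Probability that at least one of k attempts, sampled from n total attempts
// of which c passed, is a pass.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  if (n - c < k) return 1; // fewer than k failures: any k attempts include a pass
  let allFail = 1;
  for (let i = 0; i < k; i++) {
    // Running product equals C(n - c, k) / C(n, k).
    allFail *= (n - c - i) / (n - i);
  }
  return 1 - allFail;
}

console.log(passRate([true, false, false, true]));
console.log(passAtK(10, 3, 1));
console.log(passAtK(10, 3, 10));
```

For 3 passes in 10 attempts, pass@1 is approximately 0.3 and pass@10 is exactly 1.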

<Warning>
  Running with `--sandbox docker` requires Docker to be running on your machine. If Docker is not available, niceeval stops with a clear error rather than silently falling back to a different backend.
</Warning>

## Viewing results

After any run, niceeval prints a live summary to the console:

```
Discovered 3 evals

  ✓ classify (12ms)
  ✓ weather/brooklyn (456ms)
  ✗ fixtures/button (38s)
    - gate: EVAL.ts › accepts label and onClick props [FAILED]
      Expected src to contain "onClick"

Results:  2 passed, 1 failed, 0 skipped
```
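The names in this summary are eval IDs, which come from each eval's file path. The mapping can be sketched as follows. The exact rule is an assumption inferred from the examples in this guide, and `evalIdFromPath` is a hypothetical helper, not a niceeval export.

```typescript
// Assumed rule: drop the leading "evals/" directory, then drop either a
// ".eval.ts" suffix (single-file evals) or a trailing "/EVAL.ts" (fixtures).
function evalIdFromPath(filePath: string): string {
  const normalized = filePath.replace(/\\/g, "/"); // tolerate Windows separators
  const relative = normalized.replace(/^(\.\/)?evals\//, "");
  return relative.replace(/\.eval\.ts$/, "").replace(/\/EVAL\.ts$/, "");
}

console.log(evalIdFromPath("evals/classify.eval.ts"));
console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts"));
console.log(evalIdFromPath("evals/fixtures/button/EVAL.ts"));
```

Under these assumptions the three calls print `classify`, `weather/brooklyn`, and `fixtures/button`, matching the IDs in the summary.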

Full artifacts are written to `.niceeval/<timestamp>/` after every run. The directory contains:

* `summary.json` — machine-readable pass/fail summary
* Per-eval result files with assertion details
* Event stream and agent transcript
* File diffs of anything the agent changed
* Test output from `EVAL.ts`

To open the interactive artifact viewer:

```shell theme={null}
npx niceeval view
```

## Running in CI

Add niceeval to your GitHub Actions workflow as a single step:

```yaml theme={null}
# .github/workflows/evals.yml
- run: npx niceeval exp ci --strict --junit .niceeval/junit.xml
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

The `--strict` flag promotes soft assertion failures (such as LLM judge scores below threshold) to hard failures. Any failed eval causes a non-zero exit code, which marks the workflow step as failed. The `--junit` flag writes a JUnit XML report that most CI systems can parse natively for test result visualization.
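The effect of `--strict` on the exit code can be sketched like this. The `AssertionResult` shape and the `exitCode` function are hypothetical, and the non-strict behavior shown (soft failures do not fail the step) is an assumption rather than documented niceeval behavior.

```typescript
type Severity = "hard" | "soft";

// Hypothetical result record; niceeval's own result format may differ.
interface AssertionResult {
  name: string;
  passed: boolean;
  severity: Severity;
}

// Returns the process exit code: 0 for success, 1 if any blocking failure exists.
function exitCode(results: AssertionResult[], strict: boolean): number {
  const failed = results.filter((r) => !r.passed);
  // In strict mode every failure blocks; otherwise only hard failures do.
  const blocking = failed.filter((r) => strict || r.severity === "hard");
  return blocking.length > 0 ? 1 : 0;
}

const results: AssertionResult[] = [
  { name: "succeeded", passed: true, severity: "hard" },
  { name: "closedQA >= 0.7", passed: false, severity: "soft" },
];

console.log(exitCode(results, false));
console.log(exitCode(results, true));
```

Here the below-threshold judge score is ignored without `strict` (exit code 0) and fails the step with it (exit code 1).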

<Tip>
  Add `OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}` to the `env` block if you're using Codex as your coding agent.
</Tip>
