This guide takes you from a blank project to three working evals: one that calls an in-process function, one that drives a conversational agent over HTTP, and one that drops a coding agent into a Docker sandbox to write real code. By the end you’ll have a runnable eval suite with CI-ready output. If you already know what you want to evaluate, jump straight to the relevant example:

Eval a Claude Code / Codex plugin

For plugins, MCP servers, and project-level coding-agent extensions.

Eval a Claude Code / Codex Skill

Verify that a Skill is triggered, follows its prescribed flow, and actually improves task success rate.

Eval an AI agent application

For HTTP agents, AI SDK, LangGraph, or any custom agent service.
Install niceeval
Add niceeval as a dev dependency, then run the scaffold command:
npm install -D niceeval
npx niceeval init
npx niceeval init reads your project layout and generates everything you need to run your first eval immediately.
Explore the generated files
After init completes, your project contains the following new entries:
your-project/
├─ niceeval.config.ts
└─ evals/
   ├─ hello.eval.ts            # example: conversational eval
   └─ fixtures/
      └─ button/               # example: sandbox coding-agent eval
         ├─ PROMPT.md
         ├─ EVAL.ts
         └─ package.json
niceeval.config.ts is your central configuration — it sets the judge model, reporters, concurrency, timeout, and sandbox backend. evals/ is where all your eval files live; the file path automatically becomes the eval’s ID.
You can safely delete or replace the generated example files once you’ve read through them. They’re there to illustrate the shape of each eval type, not to run against a real agent.
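To make the path-to-ID rule concrete, here is a minimal sketch of how an ID could be derived from a file path. It is inferred from the IDs that appear in this guide's sample output (`classify`, `weather/brooklyn`, `fixtures/button`); the helper name `evalIdFromPath` and its exact rules are assumptions for illustration, not niceeval's own code.

```typescript
// Hypothetical helper: illustrates the path-to-ID convention only.
function evalIdFromPath(filePath: string, root = "evals"): string {
  // Normalize Windows separators and drop a leading "./".
  const normalized = filePath.replace(/\\/g, "/").replace(/^\.\//, "");
  // Strip the root directory prefix if it is present.
  const prefix = root + "/";
  const relative = normalized.startsWith(prefix)
    ? normalized.slice(prefix.length)
    : normalized;
  // File-based evals: "weather/brooklyn.eval.ts" -> "weather/brooklyn".
  if (relative.endsWith(".eval.ts")) {
    return relative.slice(0, -".eval.ts".length);
  }
  // Fixture directories: drop a trailing slash or a trailing fixture file name.
  return relative.replace(/\/(EVAL\.ts|PROMPT\.md)?$/, "");
}
```

Under these assumptions, `evals/weather/brooklyn.eval.ts` maps to `weather/brooklyn`, which is the name you would pass on the command line.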
Configure your project
Open niceeval.config.ts and review the defaults:
// niceeval.config.ts
import { defineConfig } from "niceeval";
import { Console, JUnit } from "niceeval/reporters";

export default defineConfig({
  judge: { model: "anthropic/claude-haiku-4-5" },
  reporters: [Console(), JUnit(".niceeval/junit.xml")],
  maxConcurrency: 8,
  timeoutMs: 300_000,
  sandbox: "auto",  // uses cloud token if available, otherwise Docker
});
Set ANTHROPIC_API_KEY (for Claude Code or the judge model) or OPENAI_API_KEY (for Codex) in your environment before proceeding.

Eval 1 — an in-process function

The fastest way to start is evaluating a TypeScript function that lives in your own codebase. You define an agent adapter that calls your function directly, then write an eval that checks the output. No network, no Docker. First, create the agent adapter that wraps your function:
// agents/classify.ts
import { defineAgent } from "niceeval/adapter";
import { classifyIntent } from "../src/agent.js"; // your own code

export default defineAgent({
  name: "classify",
  async send(input) {
    return { data: await classifyIntent(input.text), status: "completed" };
  },
});
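The adapter above imports `classifyIntent` from your own code, which this guide does not show. If you want something to run against while trying the eval, a stand-in could look like the sketch below. The keyword rules, the `Intent` union, and the `RULES` table are all hypothetical; a real classifier would more likely call a model.

```typescript
// src/agent.ts (hypothetical stand-in for your own code)
type Intent = "refund" | "shipping" | "other";

// Keyword rules are purely illustrative.
const RULES: Array<{ intent: Intent; pattern: RegExp }> = [
  { intent: "refund", pattern: /\b(refund|return|money back)\b/i },
  { intent: "shipping", pattern: /\b(ship|shipping|delivery|track)\b/i },
];

// Returns the first matching intent, or "other" when no rule matches.
export function classifyIntent(text: string): { intent: Intent } {
  for (const rule of RULES) {
    if (rule.pattern.test(text)) return { intent: rule.intent };
  }
  return { intent: "other" };
}
```

This version is synchronous; the adapter's `await` works unchanged on a plain return value, so you can later swap in an async implementation without touching the adapter.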
Then write the eval file:
// evals/classify.eval.ts
import { defineEval } from "niceeval";
import { equals } from "niceeval/expect";

export default defineEval({
  description: "Intent classification: refund request",
  async test(t) {
    const turn = await t.send("I'd like to return my order and get a refund.");
    t.check(turn.data, equals({ intent: "refund" }));
  },
});
Add the classify agent to an experiment file such as experiments/local.ts, then run:
npx niceeval exp local classify
In-process evals are ideal for semantic regression testing. Because they call your code directly, they run in milliseconds and slot naturally into a standard CI pipeline alongside unit tests.

Eval 2 — a conversational agent

To eval an agent that lives behind an HTTP endpoint, you write an adapter that handles the request/response cycle. The URL and any credentials are the adapter’s concern — niceeval has no --url flag and imposes no protocol requirements. Define the remote agent adapter:
// agents/weather-bot.ts
import { defineAgent } from "niceeval/adapter";

export default defineAgent({
  name: "weather-bot",
  capabilities: { conversation: true, toolObservability: true },
  async send(input, ctx) {
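    const r = await fetch(`${process.env.AGENT_URL}/chat`, {
      method: "POST",
      // Declare the JSON body so the server parses it correctly.
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.text }),
      signal: ctx.signal,
    });
    // Surface HTTP-level failures instead of parsing an error page as JSON.
    if (!r.ok) throw new Error(`Agent responded with HTTP ${r.status}`);
    const body = await r.json();
    return { message: body.reply, toolCalls: body.tools, status: "completed" };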
  },
});
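The adapter above trusts that the service returns JSON with `reply` and `tools` fields. If you would rather fail early on a malformed response, a small validation step can help. The sketch below is one way to do it; only the `reply` and `tools` field names come from the adapter, while the `ToolCall` shape and the `parseChatResponse` helper are assumptions for illustration.

```typescript
// Assumed shape of a tool call; adjust to whatever your service actually returns.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

interface ChatResponse {
  reply: string;
  tools: ToolCall[];
}

// Validate an unknown JSON payload before handing it back to the eval.
function parseChatResponse(body: unknown): ChatResponse {
  if (typeof body !== "object" || body === null) {
    throw new Error("Agent response is not a JSON object");
  }
  const { reply, tools } = body as { reply?: unknown; tools?: unknown };
  if (typeof reply !== "string") {
    throw new Error("Agent response is missing a string `reply`");
  }
  // Treat a missing `tools` field as "no tools were called".
  const calls = Array.isArray(tools) ? (tools as ToolCall[]) : [];
  return { reply, tools: calls };
}
```

Inside the adapter you could then write `const body = parseChatResponse(await r.json());` and keep the return statement as it is.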
Write the eval:
// evals/weather/brooklyn.eval.ts
import { defineEval } from "niceeval";
import { includes } from "niceeval/expect";

export default defineEval({
  description: "Brooklyn weather query",
  async test(t) {
    await t.send("What's the weather like in Brooklyn today?");
    t.succeeded();
    t.calledTool("get_weather", { input: { city: "Brooklyn" } });
    t.check(t.reply, includes("sunny"));
    t.judge.autoevals.closedQA("Is the answer polite and on-topic?").atLeast(0.7);
  },
});
Run it by passing your service URL as an environment variable:
AGENT_URL=https://my-agent.example.com npx niceeval exp local weather
The four assertions above illustrate the main scoring tools available in the test(t) body:
| Assertion | What it checks |
| --- | --- |
| t.succeeded() | The agent completed without an error |
| t.calledTool(name, args) | A specific tool was invoked with matching arguments |
| t.check(value, matcher) | A deterministic value-level comparison against a matcher such as equals or includes |
| t.judge.autoevals.closedQA(question).atLeast(score) | An LLM judge grades the response on a 0–1 scale and the score must meet the threshold |
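If the idea of a matcher is unfamiliar, it can be thought of as a function that takes the actual value and returns pass or fail. The standalone sketch below models that idea. These are not niceeval's implementations: the names `sketchEquals`, `sketchIncludes`, and `sketchAtLeast` are hypothetical, and the real matchers may compare values differently.

```typescript
// Standalone illustration of the matcher concept; NOT niceeval's code.
type Matcher<T> = (actual: T) => boolean;

// Structural equality via JSON serialization.
// Fine for plain data, but sensitive to key order.
function sketchEquals<T>(expected: T): Matcher<T> {
  const want = JSON.stringify(expected);
  return (actual) => JSON.stringify(actual) === want;
}

// Substring check on a text reply.
function sketchIncludes(needle: string): Matcher<string> {
  return (actual) => actual.includes(needle);
}

// A judge score passes when it meets or exceeds the threshold.
function sketchAtLeast(threshold: number): Matcher<number> {
  return (score) => score >= threshold;
}
```

The first two are deterministic, so they give the same verdict on every run. A judge score can vary between runs, which is why it is compared against a threshold rather than an exact value.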

Eval 3 — a coding agent in a sandbox

For coding agents like Claude Code, Codex, or bub that need to write real files, niceeval uses a fixture — a small directory containing a prompt, hidden validation tests, and a minimal project to work in.
<!-- evals/fixtures/button/PROMPT.md -->
Using the project's existing styling system, export a Button component
from src/components/Button.tsx that accepts `label` and `onClick` props
and implements a hover state.
The fixture works as follows:
  • PROMPT.md is sent to the agent as its task.
  • EVAL.ts is a Vitest test file that runs after the agent finishes; the agent never sees it.
  • package.json defines the minimal project environment.
Run the coding-agent eval with Docker as the sandbox:
export ANTHROPIC_API_KEY=sk-ant-...
npx niceeval exp local fixtures/button --sandbox docker
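This guide does not show the contents of EVAL.ts. As a rough idea of the kind of hidden check it might contain, the sketch below inspects the component source for the required export and props. It is a dependency-free approximation: in a real EVAL.ts these checks would sit inside Vitest test blocks and read src/components/Button.tsx from disk. The helper name `checkButtonSource` and the exact rules are hypothetical.

```typescript
// Hypothetical static checks for the Button fixture.
// Returns a list of failure messages; an empty list means every check passed.
function checkButtonSource(src: string): string[] {
  const failures: string[] = [];
  // The prompt asks for an exported Button component.
  if (!/export\s+(default\s+)?(function|const)\s+Button\b/.test(src)) {
    failures.push('Expected src to export "Button"');
  }
  // The prompt asks for `label` and `onClick` props.
  for (const prop of ["label", "onClick"]) {
    if (!src.includes(prop)) {
      failures.push(`Expected src to contain "${prop}"`);
    }
  }
  return failures;
}
```

String checks like these cannot verify the hover state or runtime behavior, so a real validation suite would likely render the component as well.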
To measure pass rate over multiple attempts with early stopping on first success:
npx niceeval exp local fixtures/button --runs 10 --early-exit
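One plausible reading of combining `--runs` with `--early-exit` is a loop that stops at the first passing attempt. The sketch below models that reading. The `runAttempts` function and the `RunSummary` fields are hypothetical, and niceeval's actual scheduling and reporting may differ.

```typescript
// Illustrative model only; not niceeval's implementation.
interface RunSummary {
  attempts: number; // how many runs were actually executed
  passes: number;   // how many of those passed
  passRate: number; // passes divided by attempts
}

function runAttempts(
  attempt: (index: number) => boolean,
  maxRuns: number,
  earlyExit: boolean,
): RunSummary {
  let attempts = 0;
  let passes = 0;
  for (let i = 0; i < maxRuns; i++) {
    attempts++;
    if (attempt(i)) {
      passes++;
      if (earlyExit) break; // stop at the first success
    }
  }
  return { attempts, passes, passRate: attempts === 0 ? 0 : passes / attempts };
}
```

Under this model, early exit saves time and tokens when you only care whether the agent can succeed at all. Leave it off when you want a pass rate measured over the full number of runs.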
Sandbox evals run with --sandbox docker require Docker to be running on your machine. If Docker is not available, niceeval will stop with a clear error rather than silently falling back to a different backend.

Viewing results

After any run, niceeval prints a live summary to the console:
Discovered 3 evals

  ✓ classify (12ms)
  ✓ weather/brooklyn (456ms)
  ✗ fixtures/button (38s)
    - gate: EVAL.ts › accepts label / onClick [FAILED]
      Expected src to contain "onClick"

Results:  2 passed, 1 failed, 0 skipped
Full artifacts are written to .niceeval/<timestamp>/ after every run. The directory contains:
  • summary.json — machine-readable pass/fail summary
  • Per-eval result files with assertion details
  • Event stream and agent transcript
  • File diffs of anything the agent changed
  • Test output from EVAL.ts
To open the interactive artifact viewer:
npx niceeval view

Running in CI

Add niceeval to your GitHub Actions workflow as a single step:
# .github/workflows/evals.yml
- run: npx niceeval exp ci --strict --junit .niceeval/junit.xml
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
The --strict flag promotes soft assertion failures (such as LLM judge scores below threshold) to hard failures. Any failed eval causes a non-zero exit code, which marks the workflow step as failed. The --junit flag writes a JUnit XML report that most CI systems can parse natively for test result visualization.
Add OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} to the env block if you’re using Codex as your coding agent.
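The effect of `--strict` on the exit code can be modeled in a few lines. The sketch below assumes three outcome labels (`pass`, `soft-fail`, `hard-fail`) that are hypothetical names chosen for illustration. It reflects the behavior described above rather than niceeval's internals.

```typescript
// Illustrative model; the outcome names are assumptions.
type Outcome = "pass" | "soft-fail" | "hard-fail";

// Returns the process exit code for a set of eval outcomes.
function exitCodeFor(outcomes: Outcome[], strict: boolean): number {
  const failed = outcomes.some(
    // In strict mode a soft failure counts the same as a hard failure.
    (o) => o === "hard-fail" || (strict && o === "soft-fail"),
  );
  return failed ? 1 : 0;
}
```

Under this model, a judge score that falls below its threshold only breaks the build when `--strict` is set, whereas a hard failure breaks it either way.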