niceeval architecture: evals, agents, and sandboxes

niceeval is organized around three concerns that stay permanently separated: what to test (your evals/ directory), how to run and score it (the niceeval core), and how to reach the thing being tested (the agent adapter and sandbox). Understanding where each boundary falls makes it much easier to write evals, build adapters, and interpret results — so this page walks the full architecture before you write a single line of eval code.

The four-layer architecture

   Your evals/ directory            niceeval core              Connect to AI (self-authored adapters)
   ----------------------           --------------             ------------------------------------
   weather.eval.ts  --discover-->   Runner  --send-->  Agent   ┬─ in-process adapter  (your agent)
   sql.eval.ts                        │                        ├─ remote adapter      (your service)
   fixtures/button/ --fixture-->      │                        └─ sandbox adapter ─── Sandbox
     PROMPT.md                        │                           (claude-code          (docker /
     EVAL.ts                          ▼                            codex / bub …)        third-party)
                                  Scorers ── Reporters ── .niceeval/<run>/
                                  (expect / scoped /     (summary.json / event stream /
                                   judge / tests)         transcript / diff)

Each layer owns a distinct slice of the problem. The core never reaches through the adapter boundary: it dispatches against interfaces, and the adapter decides how to fulfill them. The rest of the design depends on that boundary holding.

What the core owns

The niceeval core is everything that looks the same regardless of which AI you’re testing:

Eval discovery

Scans your evals/ directory for *.eval.ts files and fixture directories, derives each eval’s ID from its file path, and builds the run queue.

Concurrency scheduling

Dispatches evals up to maxConcurrency at a time, respects timeoutMs, and manages earlyExit across retries of the same task.
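As a rough illustration of what bounded dispatch looks like, here is a minimal sketch. Only the names maxConcurrency and timeoutMs come from this page; runPool, withTimeout, and the result shape are hypothetical, and earlyExit is not modeled.

```typescript
// Hypothetical sketch of bounded-concurrency dispatch with a per-task timeout.
type Task<T> = () => Promise<T>;
type Settled<T> = { ok: true; value: T } | { ok: false; error: string };

// Reject if the task does not settle within timeoutMs.
// Note: this stops waiting for the task; it does not cancel the task itself.
function withTimeout<T>(task: Task<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    Promise.resolve().then(task).then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Run tasks with at most maxConcurrency in flight; results keep input order.
async function runPool<T>(
  tasks: Task<T>[],
  maxConcurrency: number,
  timeoutMs: number,
): Promise<Settled<T>[]> {
  const results: Settled<T>[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // no await between check and increment, so no race
      try {
        const value = await withTimeout(tasks[index], timeoutMs);
        results[index] = { ok: true, value };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        results[index] = { ok: false, error: message };
      }
    }
  }
  const workerCount = Math.max(1, Math.min(maxConcurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Each worker pulls the next task as soon as it finishes its current one, so a slow eval occupies one slot without blocking the others.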

Assertion collection & scoring

Gathers every assertion registered by t.check, t.require, scoped assertions, and judge calls, then folds them into a single outcome per eval.

Caching

Fingerprints each eval. If the eval, its inputs, and the agent haven’t changed since the last run, the cached result is returned and the eval is skipped.
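The idea can be sketched as a cache keyed by a hash of the three things named above. This is an assumption-heavy sketch: FingerprintParts, fingerprint, and ResultCache are hypothetical names, and FNV-1a is used only to keep the example dependency-free, not because niceeval is known to use it.

```typescript
// Hypothetical fingerprint cache. The hash function is a stand-in.
interface FingerprintParts {
  evalSource: string; // text of the eval file
  inputs: string[];   // contents of fixtures or other inputs
  agentId: string;    // identity or version of the agent under test
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function fingerprint(parts: FingerprintParts): string {
  // JSON.stringify keeps field boundaries unambiguous before hashing.
  return fnv1a(JSON.stringify([parts.evalSource, parts.inputs, parts.agentId]));
}

class ResultCache<R> {
  private readonly store = new Map<string, R>();

  // Return the cached result when the fingerprint matches, else run and store.
  async getOrRun(
    parts: FingerprintParts,
    run: () => Promise<R>,
  ): Promise<{ result: R; cached: boolean }> {
    const key = fingerprint(parts);
    const hit = this.store.get(key);
    if (hit !== undefined) return { result: hit, cached: true };
    const result = await run();
    this.store.set(key, result);
    return { result, cached: false };
  }
}
```

Changing any one part (for example the agent identity) produces a different key, so the eval runs again instead of returning a stale result.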

Reporting

Streams live output to the console and hands off to configured reporters (JUnit, JSON, custom) once a run completes.

Artifact persistence

Writes the full run record to .niceeval/<run>/ — summary.json, per-eval results, event streams, transcripts, diffs, and test output.

What the Agent / Adapter boundary means

niceeval does not define a universal agent protocol. There is no --url flag, no shared wire format, and no assumption that your AI can speak a particular API shape.

Every system under test — your own agent, a deployed HTTP service, Claude Code, Codex, bub — is connected through a self-authored adapter. Two terms name two sides of the same concept:

Agent is the abstraction. From the core’s perspective an agent is a named object with a set of capabilities and a single send method that accepts a TurnInput and returns a Turn. The runner only ever calls this interface.
Adapter is the concrete implementation you write (or use from niceeval’s built-ins). It knows how to authenticate, how to call your service, and — critically — how to map whatever your AI returns into the standard event stream the core expects.

Experiments reference agents directly. The URL (or API key, or CLI invocation) is the adapter’s private configuration, invisible to the core and to your evals.
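To make the Agent/Adapter split concrete, here is a minimal sketch. The names Agent, TurnInput, Turn, and send come from this page, but every field shape below, along with makeAdapter and MyServiceReply, is a hypothetical illustration and not niceeval's published types.

```typescript
// Hypothetical shapes: field names are illustrative assumptions.
type AgentEvent =
  | { type: "message"; role: "assistant"; text: string }
  | { type: "tool_call"; name: string; args: unknown };

interface TurnInput { prompt: string }
interface Turn { events: AgentEvent[]; status: "ok" | "error"; data?: unknown }

// The abstraction the runner sees.
interface Agent {
  name: string;
  capabilities: string[];
  send(input: TurnInput): Promise<Turn>;
}

// Whatever the system under test natively returns (assumed shape).
interface MyServiceReply {
  answer: string;
  toolsUsed: { tool: string; input: unknown }[];
}

// The adapter: wraps a service call and maps its reply onto the event stream.
function makeAdapter(
  name: string,
  callService: (prompt: string) => Promise<MyServiceReply>,
): Agent {
  return {
    name,
    capabilities: ["tools"],
    async send(input) {
      try {
        const reply = await callService(input.prompt);
        const events: AgentEvent[] = [
          ...reply.toolsUsed.map(
            (t): AgentEvent => ({ type: "tool_call", name: t.tool, args: t.input }),
          ),
          { type: "message", role: "assistant", text: reply.answer },
        ];
        return { events, status: "ok" };
      } catch {
        // Transport or service failure: surface it as a status, not a throw.
        return { events: [], status: "error" };
      }
    },
  };
}
```

The callService argument is where a URL, API key, or CLI invocation would live; nothing outside the adapter needs to see it.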

Why there is no `--url` flag

Some eval frameworks define a protocol and let you point at any compatible endpoint with --url. niceeval explicitly rejects this model: there is no single protocol that all AI agents speak. Rather than forcing you to wrap your agent in a compatibility shim, niceeval puts the adaptation work where it belongs — inside the adapter — and keeps the core protocol-agnostic. The result is that any AI, any transport, and any framework can be evaluated with the same defineEval + assertion vocabulary.

What the Sandbox owns

A sandbox is where a sandbox-type agent runs: the isolated execution environment that provides the filesystem, the network policy, and the process boundary. Sandboxes are completely separate from agents, and the backend can be Docker, Vercel Sandbox, or a third-party implementation.

Docker

The default local sandbox. niceeval spins up a container, uploads the workspace fixture, runs the agent CLI inside it, then reads back the transcript and diff. No cloud credentials required beyond the agent’s own API key.

Third-party

Any backend that implements the Sandbox interface can be plugged in. The agent adapter receives a ctx.sandbox handle and interacts with it through that interface, so swapping sandbox backends requires no changes to the agent or your evals.

Remote agents (those using defineAgent) ignore --sandbox entirely — they have no need for an isolated workspace. Only sandbox agents (those using defineSandboxAgent) require a sandbox backend. The runner selects a sandbox with --sandbox <backend> and passes a prepared Sandbox handle through ctx.sandbox. The agent and sandbox are orthogonal: claude-code can run in Docker or Vercel; the same Docker sandbox can run claude-code or bub. Neither side needs to know anything about the other.
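The orthogonality described above can be sketched with an interface and a stand-in backend. Only the names Sandbox and ctx.sandbox come from this page; the method names, MemorySandbox, runSandboxAgent, and the agent-cli command are hypothetical.

```typescript
// Hypothetical Sandbox interface; method names are assumptions for this sketch.
interface Sandbox {
  writeFile(path: string, content: string): Promise<void>;
  readFile(path: string): Promise<string>;
  exec(command: string): Promise<{ exitCode: number; stdout: string }>;
}

// An in-memory stand-in backend, useful only to show the seam.
class MemorySandbox implements Sandbox {
  private readonly files = new Map<string, string>();
  readonly label: string;

  constructor(label: string) {
    this.label = label;
  }

  async writeFile(path: string, content: string): Promise<void> {
    this.files.set(path, content);
  }

  async readFile(path: string): Promise<string> {
    const content = this.files.get(path);
    if (content === undefined) throw new Error(`${path} not found in ${this.label}`);
    return content;
  }

  async exec(command: string): Promise<{ exitCode: number; stdout: string }> {
    // Pretend the agent CLI ran and produced a file in the workspace.
    await this.writeFile("out.txt", `ran: ${command}`);
    return { exitCode: 0, stdout: `[${this.label}] ${command}` };
  }
}

// A sandbox agent only talks to ctx.sandbox, never to a concrete backend.
async function runSandboxAgent(
  prompt: string,
  ctx: { sandbox: Sandbox },
): Promise<string> {
  await ctx.sandbox.writeFile("PROMPT.md", prompt);
  // "agent-cli" is a made-up command standing in for a real agent CLI.
  const { exitCode } = await ctx.sandbox.exec("agent-cli --prompt-file PROMPT.md");
  if (exitCode !== 0) throw new Error("agent CLI failed");
  return ctx.sandbox.readFile("out.txt");
}
```

Because runSandboxAgent depends only on the interface, passing a different backend object changes where the work happens without touching the agent code.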

Key terminology

Eval

A single test case: one description, one agent reference, and one async test(t) function. Evals are the unit of discovery, scheduling, scoring, and reporting. Each eval produces exactly one outcome (or a pass-rate summary when runs > 1).

Agent

The abstraction for a system under test. An agent has a name, a set of capability flags, and a send method. The runner only sees this interface.

Adapter

The concrete implementation of an agent. An adapter knows how to connect to a specific AI — your service, Claude Code, a local function — and normalizes its output into the standard event stream.

Sandbox

The isolated execution environment for sandbox-type agents: a Docker container, Vercel Sandbox, or other backend that provides a filesystem and process boundary.

Turn

The result of one t.send(...) call: a standard event stream, an optional structured-output field (data), a status, and optional token usage. All scoped assertions read from Turn.events.

Artifact

Any file written to .niceeval/<run>/ after a run: summary.json, per-eval results, event streams, transcripts, generated-file diffs, and test output. Artifacts are the source of truth for debugging and regression analysis.

Experiment

A named configuration matrix — a combination of evals, agent, sandbox, model, feature flags, run count, and budget — that produces a comparable, replayable run. Experiments are how you compare agent A vs agent B, or model X vs model Y, over the same eval suite.
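One way to picture the matrix is to expand it into individual planned runs. This is a sketch under assumptions: the Experiment and PlannedRun shapes and the expand function are hypothetical, agents and models are modeled as lists for illustration, and sandbox, feature flags, and budget are left out.

```typescript
// Hypothetical experiment shape: field names are assumptions for illustration.
interface Experiment {
  name: string;
  evals: string[];  // eval IDs
  agents: string[];
  models: string[];
  runs: number;     // repetitions per matrix cell
}

interface PlannedRun {
  evalId: string;
  agent: string;
  model: string;
  attempt: number;
}

// Expand the matrix into one planned run per (eval, agent, model, attempt).
function expand(experiment: Experiment): PlannedRun[] {
  const planned: PlannedRun[] = [];
  for (const evalId of experiment.evals) {
    for (const agent of experiment.agents) {
      for (const model of experiment.models) {
        for (let attempt = 1; attempt <= experiment.runs; attempt++) {
          planned.push({ evalId, agent, model, attempt });
        }
      }
    }
  }
  return planned;
}

const plan = expand({
  name: "agent-a-vs-agent-b",
  evals: ["weather/brooklyn", "sql"],
  agents: ["agent-a", "agent-b"],
  models: ["model-x"],
  runs: 3,
});
console.log(plan.length); // prints 12 (2 evals x 2 agents x 1 model x 3 runs)
```

Because both agents see exactly the same evals and run counts, their results can be compared cell by cell.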

How the parts connect end-to-end

Discovery

The runner scans evals/ for *.eval.ts files and fixture directories (PROMPT.md present). Each file path becomes an eval ID — evals/weather/brooklyn.eval.ts becomes weather/brooklyn.
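The path-to-ID rule for *.eval.ts files can be sketched as a small pure function. This assumes the ID is the path relative to evals/ with the .eval.ts suffix removed, as in the example above; evalIdFromPath is a hypothetical name, and fixture-directory IDs are not covered.

```typescript
// Hypothetical helper: derive an eval ID from a file path under evals/.
function evalIdFromPath(filePath: string, root = "evals"): string {
  const suffix = ".eval.ts";
  // Normalize Windows separators so the ID always uses forward slashes.
  const normalized = filePath.replace(/\\/g, "/");
  const prefix = root.endsWith("/") ? root : `${root}/`;
  if (!normalized.startsWith(prefix)) {
    throw new Error(`${filePath} is not inside ${prefix}`);
  }
  const relative = normalized.slice(prefix.length);
  if (!relative.endsWith(suffix)) {
    throw new Error(`${filePath} is not a *${suffix} file`);
  }
  return relative.slice(0, -suffix.length);
}

console.log(evalIdFromPath("evals/weather/brooklyn.eval.ts")); // prints "weather/brooklyn"
```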

Scheduling

Evals are dispatched up to maxConcurrency at a time. Fingerprint caching skips evals that haven’t changed since their last passing run.

Agent send

For each eval, the runner calls agent.send(input, ctx). The adapter drives the subject under test and returns a Turn containing the standard event stream.

Scoring

The core evaluates all registered assertions against the Turn. Any gate assertion that fails marks the eval failed. Soft assertions that fall below their threshold are recorded but do not fail the eval by themselves.
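Read literally, the rule above amounts to a fold over the collected assertions. The sketch below is one interpretation, not niceeval's implementation: the Assertion and Outcome shapes and foldOutcome are hypothetical, and the treatment of soft assertions (reported, but not failing) is an assumption drawn from the wording of this step.

```typescript
// Hypothetical assertion records; names and fields are assumptions.
type Assertion =
  | { kind: "gate"; label: string; passed: boolean }
  | { kind: "soft"; label: string; score: number; threshold: number };

interface Outcome {
  status: "passed" | "failed";
  failedGates: string[];
  belowThreshold: string[];
}

// Fold all assertions into one outcome: any failed gate fails the eval;
// soft assertions under their threshold are reported but do not fail it.
function foldOutcome(assertions: Assertion[]): Outcome {
  const failedGates: string[] = [];
  const belowThreshold: string[] = [];
  for (const a of assertions) {
    if (a.kind === "gate" && !a.passed) failedGates.push(a.label);
    if (a.kind === "soft" && a.score < a.threshold) belowThreshold.push(a.label);
  }
  return {
    status: failedGates.length > 0 ? "failed" : "passed",
    failedGates,
    belowThreshold,
  };
}
```

Keeping the below-threshold labels in the outcome means a passing eval can still show which soft checks were weak.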

Outcome

One outcome per eval: passed, failed, or skipped. When runs > 1, the summary is a pass-rate and average latency.
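For the runs > 1 case, the summary is simple arithmetic over the individual runs. A minimal sketch, assuming each run records a status and a latency in milliseconds; RunResult and summarize are hypothetical names, and skipped runs are left out of the model.

```typescript
// Hypothetical per-run record.
interface RunResult {
  status: "passed" | "failed";
  latencyMs: number;
}

// Pass rate = passed runs / total runs; latency is a plain arithmetic mean.
function summarize(results: RunResult[]): { passRate: number; avgLatencyMs: number } {
  if (results.length === 0) throw new Error("no runs to summarize");
  const passed = results.filter((r) => r.status === "passed").length;
  const totalLatency = results.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    passRate: passed / results.length,
    avgLatencyMs: totalLatency / results.length,
  };
}

const summary = summarize([
  { status: "passed", latencyMs: 1200 },
  { status: "failed", latencyMs: 1800 },
  { status: "passed", latencyMs: 900 },
  { status: "passed", latencyMs: 1300 },
]);
console.log(summary); // { passRate: 0.75, avgLatencyMs: 1300 }
```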

Reporting & artifacts

Results stream to the console in real time. Once complete, reporters write structured output and .niceeval/<run>/ is populated with the full artifact set.

Related pages

Evals: what an eval is and how the lifecycle works in detail.
Agents & Adapters: how to write an adapter and reference it from experiments.
Scoring: the full assertion vocabulary and outcome rules.
