evals/ directory), how to run and score it (the niceeval core), and how to reach the thing being tested (the agent adapter and sandbox). Understanding where each boundary falls makes it much easier to write evals, build adapters, and interpret results — so this page walks the full architecture before you write a single line of eval code.
The four-layer architecture
What the core owns
The niceeval core is everything that looks the same regardless of which AI you’re testing:
Eval discovery
Scans your evals/ directory for *.eval.ts files and fixture directories, derives each eval’s ID from its file path, and builds the run queue.
Concurrency scheduling
Dispatches evals up to maxConcurrency at a time, respects timeoutMs, and manages earlyExit across retries of the same task.
Assertion collection & scoring
Gathers every assertion registered by t.check, t.require, scoped assertions, and judge calls, then folds them into a single outcome per eval.
Caching
Fingerprints each eval. If the eval, its inputs, and the agent haven’t changed since the last run, the cached result is returned and the eval is skipped.
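As a rough illustration of the discovery and caching responsibilities described above, the sketch below derives an eval ID from a file path and computes a fingerprint over the eval, its inputs, and the agent. The helper names (deriveEvalId, fingerprint, isCacheHit), the FingerprintInput shape, and the choice of SHA-256 are assumptions made for this example; niceeval's actual implementation may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: turn an eval file path into an eval ID by dropping
// the leading "evals/" directory and the ".eval.ts" suffix.
function deriveEvalId(filePath: string): string {
  return filePath
    .replace(/\\/g, "/") // normalize Windows path separators
    .replace(/^(\.\/)?evals\//, "")
    .replace(/\.eval\.ts$/, "");
}

// Assumed fingerprint inputs; the real fingerprint may cover more or less.
interface FingerprintInput {
  evalSource: string; // contents of the eval file
  inputs: string[]; // contents of fixtures or other inputs
  agent: string; // agent name plus any version or config string
}

// Hash the three parts together. Serializing a fixed-order array keeps
// the field boundaries unambiguous.
function fingerprint(input: FingerprintInput): string {
  const hash = createHash("sha256");
  hash.update(JSON.stringify([input.evalSource, input.inputs, input.agent]));
  return hash.digest("hex");
}

// An eval is skipped only when its current fingerprint matches the stored one.
function isCacheHit(
  cache: Map<string, string>,
  evalId: string,
  current: string,
): boolean {
  return cache.get(evalId) === current;
}
```

Under these assumptions, changing any one of the three parts (for example, switching the agent) yields a different fingerprint, so the eval runs again instead of returning the cached result.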
Reporting
Streams live output to the console and hands off to configured reporters (JUnit, JSON, custom) once a run completes.
Artifact persistence
Writes the full run record to .niceeval/<run>/: summary.json, per-eval results, event streams, transcripts, diffs, and test output.
What the Agent / Adapter boundary means
niceeval does not define a universal agent protocol. There is no --url flag, no shared wire format, and no assumption that your AI can speak a particular API shape.
- Agent is the abstraction. From the core’s perspective an agent is a named object with a set of capabilities and a single send method that accepts a TurnInput and returns a Turn. The runner only ever calls this interface.
- Adapter is the concrete implementation you write (or use from niceeval’s built-ins). It knows how to authenticate, how to call your service, and, critically, how to map whatever your AI returns into the standard event stream the core expects.
Why there is no --url flag
Some eval frameworks define a protocol and let you point at any compatible endpoint with --url. niceeval explicitly rejects this model: there is no single protocol that all AI agents speak. Rather than forcing you to wrap your agent in a compatibility shim, niceeval puts the adaptation work where it belongs — inside the adapter — and keeps the core protocol-agnostic. The result is that any AI, any transport, and any framework can be evaluated with the same defineEval + assertion vocabulary.
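To make the agent/adapter split concrete, here is a minimal sketch of an adapter that maps a service's native reply into a standard event stream. Every type shape here (TurnEvent, TurnInput, RunContext, Turn, Agent), the weather-bot service, and the callWeatherBot function are assumptions invented for illustration; they are not niceeval's real definitions.

```typescript
// Assumed shapes for illustration only.
type TurnEvent =
  | { type: "tool_call"; name: string; args: unknown }
  | { type: "message"; text: string };

interface TurnInput { prompt: string }
interface RunContext { sandbox?: unknown }
interface Turn { events: TurnEvent[]; data?: unknown; status: "ok" | "error" }

// The only surface the runner would call.
interface Agent {
  name: string;
  capabilities: string[];
  send(input: TurnInput, ctx: RunContext): Promise<Turn>;
}

// Hypothetical native reply format of the service under test.
interface WeatherBotReply {
  answer: string;
  toolsUsed: { tool: string; params: unknown }[];
}

// Stand-in for a real HTTP, SDK, or CLI call.
async function callWeatherBot(prompt: string): Promise<WeatherBotReply> {
  return {
    answer: `Forecast for: ${prompt}`,
    toolsUsed: [{ tool: "get_weather", params: { query: prompt } }],
  };
}

// The adapter: calls the service, then normalizes its reply into events.
const weatherBotAdapter: Agent = {
  name: "weather-bot",
  capabilities: ["tools"],
  async send(input, _ctx) {
    try {
      const reply = await callWeatherBot(input.prompt);
      const events: TurnEvent[] = [
        ...reply.toolsUsed.map(
          (u): TurnEvent => ({ type: "tool_call", name: u.tool, args: u.params }),
        ),
        { type: "message", text: reply.answer },
      ];
      return { events, status: "ok" };
    } catch {
      // Transport or service failures surface as an error status.
      return { events: [], status: "error" };
    }
  },
};
```

In this sketch the service-specific knowledge lives entirely inside send, so swapping the transport means rewriting the adapter while the caller keeps using the same interface.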
What the Sandbox owns
A sandbox is where a sandbox-type agent runs. It is the isolated execution environment that provides the filesystem, the network policy, and the process boundary. Sandboxes are completely separate from agents:
- Docker
- Vercel Sandbox
- Third-party
Docker is the default local sandbox. niceeval spins up a container, uploads the workspace fixture, runs the agent CLI inside it, then reads back the transcript and diff. No cloud credentials are required beyond the agent’s own API key.
Agents defined with defineAgent ignore --sandbox entirely because they have no need for an isolated workspace. Only sandbox agents (those using defineSandboxAgent) require a sandbox backend.
The runner selects a sandbox with --sandbox <backend> and passes a prepared Sandbox handle through ctx.sandbox. The agent and sandbox are orthogonal: claude-code can run in Docker or Vercel; the same Docker sandbox can run claude-code or bub. Neither side needs to know anything about the other.
Key terminology
Eval
A single test case: one description, one agent reference, and one async test(t) function. Evals are the unit of discovery, scheduling, scoring, and reporting. Each eval produces exactly one outcome (or a pass-rate summary when runs > 1).
Agent
The abstraction for a system under test. An agent has a name, a set of capability flags, and a send method. The runner only sees this interface.
Adapter
The concrete implementation of an agent. An adapter knows how to connect to a specific AI (your service, Claude Code, a local function) and normalizes its output into the standard event stream.
Sandbox
The isolated execution environment for sandbox-type agents: a Docker container, Vercel Sandbox, or other backend that provides a filesystem and process boundary.
Turn
The result of one t.send(...) call: a standard event stream, an optional structured-output field (data), a status, and optional token usage. All scoped assertions read from Turn.events.
Artifact
Any file written to .niceeval/<run>/ after a run: summary.json, per-eval results, event streams, transcripts, generated-file diffs, and test output. Artifacts are the source of truth for debugging and regression analysis.
Experiment
A named configuration matrix — a combination of evals, agent, sandbox, model, feature flags, run count, and budget — that produces a comparable, replayable run. Experiments are how you compare agent A vs agent B, or model X vs model Y, over the same eval suite.
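One way to picture an experiment is as a matrix that expands into individual planned runs. The sketch below is a simplified assumption: the ExperimentMatrix and PlannedRun shapes, the expandMatrix helper, and all agent, model, and eval names are hypothetical, and sandbox, feature flags, and budget are left out for brevity.

```typescript
// Assumed, simplified experiment shape for illustration.
interface ExperimentMatrix {
  name: string;
  evals: string[]; // eval IDs
  agents: string[];
  models: string[];
  runs: number; // repetitions per (eval, agent, model) cell
}

interface PlannedRun {
  experiment: string;
  evalId: string;
  agent: string;
  model: string;
  runIndex: number;
}

// Expand the matrix into one planned run per combination and repetition.
function expandMatrix(matrix: ExperimentMatrix): PlannedRun[] {
  const planned: PlannedRun[] = [];
  for (const agent of matrix.agents) {
    for (const model of matrix.models) {
      for (const evalId of matrix.evals) {
        for (let runIndex = 0; runIndex < matrix.runs; runIndex++) {
          planned.push({ experiment: matrix.name, evalId, agent, model, runIndex });
        }
      }
    }
  }
  return planned;
}

// Comparing two hypothetical agents over the same two evals, three runs each.
const comparison: ExperimentMatrix = {
  name: "agent-a-vs-agent-b",
  evals: ["weather/brooklyn", "weather/queens"],
  agents: ["agent-a", "agent-b"],
  models: ["model-x"],
  runs: 3,
};

const plan = expandMatrix(comparison);
console.log(plan.length); // 2 agents x 1 model x 2 evals x 3 runs = 12
```

Because both agents receive exactly the same evals and run counts, their results can be compared cell by cell.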
How the parts connect end-to-end
Discovery
The runner scans evals/ for *.eval.ts files and fixture directories (PROMPT.md present). Each file path becomes an eval ID: evals/weather/brooklyn.eval.ts becomes weather/brooklyn.
Scheduling
Evals are dispatched up to maxConcurrency at a time. Fingerprint caching skips evals that haven’t changed since their last passing run.
Agent send
For each eval, the runner calls agent.send(input, ctx). The adapter drives the subject under test and returns a Turn containing the standard event stream.
Scoring
The core evaluates all registered assertions against the Turn. Gate assertions that fail mark the eval failed. Soft assertions below their threshold mark it passed.
Outcome
One outcome per eval: passed, failed, or skipped. When runs > 1, the summary is a pass-rate and average latency.
Related pages
- Evals — what an eval is and how the lifecycle works in detail.
- Agents & Adapters — how to write an adapter and reference it from experiments.
- Scoring — the full assertion vocabulary and outcome rules.