Sandbox fixtures: evaluate coding agents with tasks

When you evaluate a coding agent — one that reads files, writes code, and runs shell commands — you need more than a chat assertion. You need to give the agent a real workspace, let it do its work in isolation, and then inspect the result. niceeval handles this through sandbox fixtures: directories on disk that describe the task, provide starting files, and contain hidden validation tests. The framework discovers them automatically, runs the agent in a fresh sandbox, and grades the output.

What a fixture is

A fixture is a directory that contains at least two files: PROMPT.md and EVAL.ts. Every file other than EVAL.ts is uploaded as the agent’s starting workspace.

evals/fixtures/create-button/
├─ PROMPT.md          # The task prompt sent to the agent (required)
├─ EVAL.ts            # Validation tests, hidden from the agent (required)
├─ package.json       # Must include "type": "module"
├─ src/               # Any starting workspace files the agent can see
└─ tsconfig.json

niceeval discovers fixtures by looking for any directory containing a PROMPT.md. You don’t need to write a .eval.ts wrapper or register the fixture anywhere. Nested directories work fine: fixtures/api/auth/ is discovered and gets the ID fixtures/api/auth.

PROMPT.md

PROMPT.md is the task description sent to the coding agent. Write it as you would any prompt: be specific about what the agent should produce, which files to touch, and any constraints to respect.

<!-- evals/fixtures/button/PROMPT.md -->
Using the project's existing style system, export a Button component from
src/components/Button.tsx that accepts `label` and `onClick` props and
implements a hover state.

The agent receives the full contents of PROMPT.md as its initial message. Keep it self-contained so the task is unambiguous without extra context.

EVAL.ts

EVAL.ts contains your validation logic written in Vitest style. Each test() block becomes a gate assertion in the eval result: if the test fails, the eval fails.

// evals/fixtures/button/EVAL.ts
import { test, expect } from "vitest";
import { existsSync, readFileSync } from "node:fs";

test("Button file exists", () => {
  expect(existsSync("src/components/Button.tsx")).toBe(true);
});

test("Accepts label and onClick props", () => {
  const src = readFileSync("src/components/Button.tsx", "utf-8");
  expect(src).toContain("label");
  expect(src).toContain("onClick");
});

EVAL.ts is hidden from the agent during execution. niceeval only uploads it to the sandbox after the agent has finished running. This prevents the agent from reading the expected answers and writing code that trivially passes without solving the actual task.

Workspace files

Every file in the fixture directory other than EVAL.ts is part of the agent’s visible workspace. The agent can read, edit, and delete them freely. Common things to include:

A package.json with "type": "module" and any project dependencies
Starter source files the agent should build on or refactor
A tsconfig.json if the project uses TypeScript
Any configuration files the agent might need (eslint.config.js, .prettierrc, etc.)

package.json must include "type": "module" for niceeval’s module loading to work correctly inside the sandbox.
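
The layout rules above (PROMPT.md and EVAL.ts are required, and package.json must declare "type": "module") can be checked before a run. The following is a minimal sketch and not part of niceeval: `checkFixtureFiles` is a hypothetical helper that takes a map from relative file path to file contents and returns a list of problems. Treating a missing package.json as a problem is an assumption.

```typescript
// Hypothetical preflight check; not a niceeval API.
// Input: relative path -> file contents for a single fixture directory.
type FixtureFiles = Record<string, string>;

function checkFixtureFiles(files: FixtureFiles): string[] {
  const problems: string[] = [];

  // PROMPT.md and EVAL.ts are the two required files.
  for (const required of ["PROMPT.md", "EVAL.ts"]) {
    if (files[required] === undefined) problems.push(`missing ${required}`);
  }

  // package.json must declare "type": "module".
  // Assumption: a fixture without package.json is also reported.
  const pkgText = files["package.json"];
  if (pkgText === undefined) {
    problems.push("missing package.json");
    return problems;
  }
  try {
    const pkg: unknown = JSON.parse(pkgText);
    const moduleType =
      typeof pkg === "object" && pkg !== null
        ? (pkg as { type?: unknown }).type
        : undefined;
    if (moduleType !== "module") {
      problems.push('package.json lacks "type": "module"');
    }
  } catch {
    problems.push("package.json is not valid JSON");
  }
  return problems;
}
```

An empty result means the fixture satisfies the three rules checked here; it says nothing about whether the tests in EVAL.ts are meaningful.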

Auto-discovery

niceeval scans your eval directory for any subdirectory that contains a PROMPT.md file. There is no registration step. The fixture’s ID is derived from its path relative to the eval root, the same way .eval.ts IDs are derived:

Fixture path	Eval ID
`evals/fixtures/button/`	`fixtures/button`
`evals/fixtures/api/auth/`	`fixtures/api/auth`

You can filter to a specific fixture using its ID prefix after the experiment selector: npx niceeval exp local fixtures/button.
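
To make the ID rule concrete, here is a small sketch of how IDs could be derived from a list of file paths relative to the eval root. It is an illustration, not niceeval's discovery code: `fixtureIdsFromPaths` and `filterByIdPrefix` are hypothetical names, and matching the prefix on whole path segments is an assumption about how the CLI selector behaves.

```typescript
// Sketch of ID derivation; the real discovery logic may differ.
// `paths` are file paths relative to the eval root, using "/" separators.
function fixtureIdsFromPaths(paths: string[]): string[] {
  const ids = new Set<string>();
  for (const p of paths) {
    const parts = p.split("/");
    // A directory counts as a fixture when it directly contains PROMPT.md.
    if (parts.length > 1 && parts[parts.length - 1] === "PROMPT.md") {
      ids.add(parts.slice(0, -1).join("/"));
    }
  }
  return Array.from(ids).sort();
}

// Assumption: a prefix matches an ID exactly or at a "/" boundary.
function filterByIdPrefix(ids: string[], prefix: string): string[] {
  return ids.filter((id) => id === prefix || id.startsWith(prefix + "/"));
}

const discoveredIds = fixtureIdsFromPaths([
  "fixtures/button/PROMPT.md",
  "fixtures/button/EVAL.ts",
  "fixtures/api/auth/PROMPT.md",
  "README.md",
]);
// discoveredIds is ["fixtures/api/auth", "fixtures/button"]
```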

Running a fixture

Set your API key

Export the API key for the coding agent you want to evaluate.

export ANTHROPIC_API_KEY=sk-ant-...

Run with a sandbox backend

Select the coding agent in your experiment file and use --sandbox only when you need to override the isolation backend.

npx niceeval exp local fixtures/button --sandbox docker

niceeval will start a fresh Docker container, upload the workspace files (excluding EVAL.ts), run the agent, upload EVAL.ts, execute the Vitest tests, collect the diff, and tear down the container.
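
The ordering in that sequence is what keeps EVAL.ts hidden from the agent. The sketch below models it with an in-memory map standing in for the container. All names here (`SketchWorkspace`, `SketchAgent`, `runFixtureSketch`) are hypothetical and only illustrate the order of steps; the real runner also executes the tests, collects the diff, and tears the container down.

```typescript
// In-memory stand-in for a sandbox; the real backend is a container.
type SketchWorkspace = Map<string, string>;
type SketchAgent = (prompt: string, workspace: SketchWorkspace) => void;

function runFixtureSketch(
  fixture: Record<string, string>,
  agent: SketchAgent
): SketchWorkspace {
  const workspace: SketchWorkspace = new Map<string, string>();

  // 1. Upload the workspace files, excluding EVAL.ts.
  for (const filePath of Object.keys(fixture)) {
    const content = fixture[filePath];
    if (filePath !== "EVAL.ts" && content !== undefined) {
      workspace.set(filePath, content);
    }
  }

  // 2. Run the agent with the contents of PROMPT.md as its initial message.
  agent(fixture["PROMPT.md"] ?? "", workspace);

  // 3. Upload EVAL.ts only after the agent has finished.
  const evalSource = fixture["EVAL.ts"];
  if (evalSource !== undefined) workspace.set("EVAL.ts", evalSource);

  // 4. A real runner would now execute the tests, collect the diff,
  //    and tear down the sandbox.
  return workspace;
}
```

Because step 3 happens after the agent callback returns, nothing the agent does during step 2 can read the validation tests.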

Run multiple times for a pass rate

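Use --runs to repeat the eval and measure reliability. Add --early-exit to stop as soon as one run passes, which is useful when you only need to know whether the agent can solve the task at all. Keep in mind that stopping early leaves fewer runs to compute a pass rate from.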

npx niceeval exp local fixtures/button --runs 10 --early-exit
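
The interaction between repeated runs and early exit can be sketched as follows. This is an illustration of the idea, not niceeval's scheduler: `summarizeRuns` and the `attempt` callback are hypothetical, and how niceeval itself reports a truncated series is not covered here.

```typescript
// Sketch: run an attempt up to `runs` times and summarize the outcomes.
// `attempt` is a hypothetical callback that returns true when a run passes.
interface RunSummary {
  completed: number;
  passed: number;
  passRate: number;
}

function summarizeRuns(
  attempt: (runIndex: number) => boolean,
  runs: number,
  earlyExit: boolean
): RunSummary {
  let completed = 0;
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    completed++;
    if (attempt(i)) {
      passed++;
      if (earlyExit) break; // Stop at the first passing run.
    }
  }
  const passRate = completed === 0 ? 0 : passed / completed;
  return { completed, passed, passRate };
}
```

With early exit enabled, `completed` can be much smaller than `runs`, so the resulting rate answers "did it pass at least once" rather than "how often does it pass".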

Asserting agent behavior with o11y

Beyond asserting file contents, you can assert what the agent did — which shell commands it ran, which tools it called, and how it navigated the task. After the agent finishes, niceeval injects an observability summary into the sandbox at __niceeval__/results.json. Your EVAL.ts can read this file.

import { test, expect } from "vitest";
import { readFileSync } from "node:fs";

test("Used the scaffold command instead of writing files by hand", () => {
  const o11y = JSON.parse(
    readFileSync("__niceeval__/results.json", "utf-8")
  ).o11y;
  const cmds = o11y.shellCommands.map((c: { command: string }) => c.command);
  expect(cmds.some((c) => c.includes("create-next-app"))).toBe(true);
});

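test("Did not run a destructive command", () => {
  const o11y = JSON.parse(
    readFileSync("__niceeval__/results.json", "utf-8")
  ).o11y;
  const cmds: string[] = o11y.shellCommands.map(
    (c: { command: string }) => c.command
  );
  // toContain on an array only matches a whole element, so check substrings.
  expect(cmds.some((c) => c.includes("rm -rf"))).toBe(false);
});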

The o11y object includes fields like shellCommands (with each command’s text and exit code), tool calls, and subagent invocations. This lets you gate on how the agent achieved its result, not only what it produced.
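
Checks like the ones above are easier to reuse when the parsed summary has a type. The sketch below assumes a shape based only on the fields named in this section; `exitCode` is a guessed field name, and `ranCommandContaining` and `failedCommands` are hypothetical helpers, not niceeval exports.

```typescript
// Assumed shape of the summary; only `shellCommands` and `command`
// appear in the examples above. `exitCode` is a guessed field name.
interface ShellCommandRecord {
  command: string;
  exitCode?: number;
}

interface O11ySummary {
  shellCommands: ShellCommandRecord[];
}

// True when any recorded command contains the given fragment.
function ranCommandContaining(o11y: O11ySummary, fragment: string): boolean {
  return o11y.shellCommands.some((c) => c.command.includes(fragment));
}

// Commands that reported a non-zero exit code.
function failedCommands(o11y: O11ySummary): string[] {
  return o11y.shellCommands
    .filter((c) => c.exitCode !== undefined && c.exitCode !== 0)
    .map((c) => c.command);
}
```

Inside EVAL.ts these would be wrapped in ordinary Vitest assertions, for example `expect(ranCommandContaining(o11y, "rm -rf")).toBe(false)`.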

The `defineAgentEval` alternative

If you prefer to define fixture-style evals in code — for example, to share assertion logic across multiple tasks or to control the execution flow programmatically — use defineAgentEval:

// evals/refactor.eval.ts
import { defineAgentEval } from "niceeval";
// commandSucceeded is assumed to be exported from the same module as includes.
import { includes, commandSucceeded } from "niceeval/expect";

export default defineAgentEval({
  description: "Rewrite callbacks to async/await",
  prompt: "Rewrite all callbacks in src/legacy.js to async/await, preserving behavior.",
  files: "./fixtures/legacy-callbacks",     // Starting workspace files
  async test(t) {
    await t.run();                          // Drive the agent
    t.fileChanged("src/legacy.js");
    t.check(t.sandbox.diff.get("src/legacy.js"), includes("await"));
    await t.script("test");                 // Run npm run test
    t.check(await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }), commandSucceeded());
  },
});

Fixture vs defineAgentEval — when to use which

	Fixture (directory)	`defineAgentEval`
Discovery	Automatic	Requires a `.eval.ts` file
Validation	Vitest tests in `EVAL.ts`	Programmatic assertions in `test(t)`
Best for	Large suites, multi-language projects	Fine-grained control, shared assertion logic
Assertion style	Vitest `expect`	niceeval `t.*` methods

Both approaches share the same scoring, running, and reporting pipeline.

Workspace assertions in `defineAgentEval`

When you use defineAgentEval, the t context exposes workspace-level assertions you can call directly:

t.fileChanged("src/Button.tsx");        // Assert the file was modified
t.fileDeleted("src/old.ts");            // Assert the file was removed
// Assert npm test exits 0
t.check(await t.sandbox.runCommand("npm", ["test"], { cwd: "/workspace" }), commandSucceeded());
// Assert npm run build exits 0
t.check(await t.sandbox.runCommand("npm", ["run", "build"], { cwd: "/workspace" }), commandSucceeded());
t.sandbox.diff.isEmpty();                       // Assert no repository files were changed
t.notInDiff(/sk-[A-Za-z0-9]/);         // Assert no secrets appear in the diff

t.sandbox.diff is a queryable object: t.sandbox.diff.get(path) returns the post-change contents of a file, t.sandbox.diff.isEmpty() checks for no changes, and t.sandbox.diff.matches(re) / t.notInDiff(re) test the full diff text against a regular expression.
