# Sandbox agents: evaluate Claude Code, Codex, and bub

> Use niceeval's built-in agents (claude-code, codex, bub) or write a custom adapter to run a coding-agent CLI in an isolated Docker or cloud sandbox.

A sandbox agent spawns a coding-agent CLI inside an isolated environment (a Docker container or cloud micro-VM), gives it a workspace, lets it run freely, then reads back the transcript and validates the result. This is how you evaluate tools like Claude Code, Codex, and bub: they need a real filesystem to write code, execute builds, and call tools, so niceeval provides one in a throwaway environment that is isolated from your host machine.

## Built-in sandbox agents

niceeval ships three sandbox adapters out of the box. Import one into an experiment file and run that experiment.

<CardGroup cols={3}>
  <Card title="claude-code" icon="code">
    Runs Anthropic's Claude Code CLI. Requires `ANTHROPIC_API_KEY`.
  </Card>

  <Card title="codex" icon="terminal">
    Runs OpenAI's Codex CLI. Requires `OPENAI_API_KEY`.
  </Card>

  <Card title="bub" icon="robot">
    Runs the bub coding agent. Authentication follows the bub CLI conventions.
  </Card>
</CardGroup>

## Running a sandbox agent

Select the agent in an experiment file, then run that experiment. The optional `--sandbox` flag only overrides where niceeval spins up the isolated environment (see [Sandbox Backends](/guides/sandbox-backends) for the full list of options).

```shell theme={null}
# Evaluate the button fixture with Claude Code in a local Docker container
export ANTHROPIC_API_KEY=sk-ant-...
npx niceeval exp local fixtures/button --sandbox docker

# Run 10 times and stop as soon as one pass is recorded
npx niceeval exp local fixtures/button --runs 10 --early-exit
```

<Note>
  The `--sandbox docker` flag is optional if Docker is your default backend. Keep the agent choice in `experiments/local.ts` or another checked-in experiment file.
</Note>

### Environment variables by agent

| Agent         | Required variable                    |
| ------------- | ------------------------------------ |
| `claude-code` | `ANTHROPIC_API_KEY`                  |
| `codex`       | `OPENAI_API_KEY`                     |
| `bub`         | *(follows bub CLI auth conventions)* |
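
The table above can be turned into a pre-flight check that fails fast before any sandbox is created. The following is a minimal sketch and not part of niceeval: `REQUIRED_ENV` and `missingEnv` are hypothetical names, and the `bub` entry is left empty because this page does not name its variables.

```typescript
// Hypothetical pre-flight helper; not part of niceeval's API.
type BuiltInAgent = "claude-code" | "codex" | "bub";

// Mirrors the table above. bub is empty because its auth follows
// the bub CLI's own conventions, which this page does not list.
const REQUIRED_ENV: Record<BuiltInAgent, readonly string[]> = {
  "claude-code": ["ANTHROPIC_API_KEY"],
  codex: ["OPENAI_API_KEY"],
  bub: [],
};

// Returns the names of required variables that are unset or empty.
function missingEnv(
  agent: BuiltInAgent,
  env: Record<string, string | undefined>,
): string[] {
  return REQUIRED_ENV[agent].filter((name) => !env[name]);
}
```

In a Node script you would likely pass `process.env` as the second argument and abort if the returned list is non-empty.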

***

## How a sandbox agent works

The runner creates the sandbox and commits a baseline, then your eval and adapter decide what to do. Starter files are uploaded explicitly from `test(t)`; validation commands are ordinary `t.sandbox.runCommand(...)` calls.

```
createSandbox(backend, timeout)
  → git init && git commit             # baseline for later diff
  → test(t): uploadDirectory/writeFiles and run setup commands
  → adapter.send(input, ctx)           # ← the only step the adapter owns
  → test(t): run validation commands and record assertions
  → collectGeneratedFiles()            # git diff HEAD
  → sandbox.stop()                     # destroy the environment
```

***

## The `defineSandboxAgent` shape

A sandbox adapter receives a `ctx` whose `ctx.sandbox` is the live `Sandbox` handle for the current isolated environment. Your `send` function uses that handle to install the CLI, authenticate, run the agent, and read back the transcript.

```ts theme={null}
import { defineSandboxAgent } from "niceeval/adapter";

defineSandboxAgent({
  name: string;
  async send(input: TurnInput, ctx: AgentContext): Promise<Turn>;
  //                                 ↑ ctx.sandbox is the Sandbox handle
});
```

The five things that differ between coding-agent adapters are:

1. **Install the CLI** — e.g. `npm install -g @anthropic-ai/claude-code`
2. **Authenticate** — read the API key from the environment and pass it to the command
3. **Build the command** — construct the argument list, including the prompt
4. **Pass the model flag** — forward `ctx.model` to the CLI if the experiment specifies one
5. **Read and parse the transcript** — locate the native JSONL output and convert it to `StreamEvent[]`

***

## The built-in `claude-code` adapter (full example)

The source for the built-in Claude Code adapter illustrates all five steps and how the shared helpers fit in:

```ts theme={null}
// agents/claude-code.ts  (built-in; custom agents follow the same shape)
import { defineSandboxAgent, shared } from "niceeval/adapter";
import { requireEnv } from "niceeval";

// Authentication is the adapter's private business — never passed through ctx
const auth = () => ({ ANTHROPIC_API_KEY: requireEnv("ANTHROPIC_API_KEY") });

export default defineSandboxAgent({
  name: "claude-code",
  async send(input, ctx) {
    const sb = ctx.sandbox!;

    // Step 1: install the CLI
    await sb.runCommand("npm", ["install", "-g", "@anthropic-ai/claude-code"]);

    // Step 3 & 4: build the command, forwarding model and feature flags
    const args = ["--print", "--dangerously-skip-permissions"];
    if (ctx.model) args.push("--model", ctx.model);   // only when experiment sets it
    if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
    if (!ctx.session.isNew && ctx.session.id) args.push("--resume", ctx.session.id);
    args.push(input.text);

    // Step 2: authenticate via env, run the agent
    const res = await sb.runCommand("claude", args, { env: auth() });

    // Step 5: read the transcript, parse it into StreamEvent[]
    const raw = await shared.captureLatestJsonl(sb, "~/.claude/projects");
    ctx.session.id = shared.sessionIdFromClaudeTranscript(raw);  // enables multi-turn resume

    return {
      // parseClaudeCode is the parser shown under "Transcript parsing" below
      events: parseClaudeCode(raw),   // native JSONL → standard StreamEvent[]
      status: res.exitCode === 0 ? "completed" : "failed",
    };
  },
});
```

***

## Shared helpers

niceeval provides helpers that all sandbox adapters can reuse. Using them keeps workspace preparation, diff collection, and validation identical across agents, so results from different agents are directly comparable.

| Helper                                         | What it does                                                                                                                                    |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `shared.prepareWorkspace(sandbox, fixture)`    | Uploads workspace files (hiding `EVAL.ts`), runs `git init` and commits a baseline                                                              |
| `shared.captureLatestJsonl(sandbox, dir)`      | Finds and reads the most recent JSONL transcript in the given directory                                                                         |
| `shared.runValidation(sandbox, scripts, mode)` | Uploads test files and runs `EVAL.ts` (Vitest) plus any npm scripts                                                                             |
| `shared.injectO11yContext(sandbox, events)`    | Derives the o11y summary from the event stream and writes it to `__niceeval__/results.json` so `EVAL.ts` can assert on agent behavior           |
| `shared.captureGeneratedFiles(sandbox)`        | Runs `git diff HEAD` and returns `{ generated, deleted }` — the file-level diff used for `t.fileChanged`, `t.fileDeleted`, and `t.sandbox.diff` |
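
For intuition about the last helper, the sketch below shows how a file-level diff could be split into `generated` and `deleted` lists. It assumes input in the format of `git diff --name-status HEAD` (this page only says the helper runs `git diff HEAD`), and `parseNameStatus` is a hypothetical name. The real helper may return file contents rather than bare paths.

```typescript
// Hypothetical sketch; niceeval's real captureGeneratedFiles may differ.
interface FileDiff {
  generated: string[]; // added, modified, or renamed-to paths
  deleted: string[];   // paths that no longer exist relative to the baseline
}

// Parses `git diff --name-status HEAD` output: one tab-separated
// "<status>\t<path>" line per file, or "<status>\t<old>\t<new>" for renames.
function parseNameStatus(output: string): FileDiff {
  const diff: FileDiff = { generated: [], deleted: [] };
  for (const line of output.split("\n")) {
    if (!line.trim()) continue; // ignore the trailing newline
    const [status, ...paths] = line.split("\t");
    if (status.startsWith("D")) {
      diff.deleted.push(paths[0]);
    } else if (status.startsWith("R")) {
      // Assumption: a rename counts as deleting the old path and generating the new one.
      diff.deleted.push(paths[0]);
      diff.generated.push(paths[1]);
    } else {
      diff.generated.push(paths[paths.length - 1]);
    }
  }
  return diff;
}
```

Note that plain `git diff HEAD` does not list untracked files, so a helper built this way would need to stage new files first.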

***

## Transcript parsing: JSONL → `StreamEvent[]`

Each coding agent writes its own native transcript format. Your adapter's fifth step is converting that format into the standard `StreamEvent[]` vocabulary that all niceeval assertions understand.

```ts theme={null}
// Minimal transcript parser skeleton
import type { StreamEvent } from "niceeval";

function parseClaudeCode(rawJsonl: string): StreamEvent[] {
  const events: StreamEvent[] = [];

  for (const line of rawJsonl.split("\n")) {
    if (!line.trim()) continue; // skip blank lines so JSON.parse never sees ""
    const entry = JSON.parse(line);

    if (entry.type === "assistant" && entry.message?.content) {
      for (const block of entry.message.content) {
        if (block.type === "text") {
          events.push({ type: "message", role: "assistant", text: block.text });
        }
        if (block.type === "tool_use") {
          events.push({ type: "action.called", callId: block.id, name: block.name, input: block.input });
        }
      }
    }

    if (entry.type === "tool_result") {
      events.push({
        type: "action.result",
        callId: entry.tool_use_id,
        output: entry.content,
        status: "completed",
      });
    }
  }

  return events;
}
```

<Tip>
  Once you normalize the transcript into `StreamEvent[]`, the entire suite of niceeval assertions becomes available: `t.calledTool`, `t.toolOrder`, `t.noFailedActions`, `t.messageIncludes`, and more — no extra work required.
</Tip>
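
A quick way to sanity-check a parser is to feed it a small hand-written transcript. The sketch below is self-contained and makes several assumptions: `StreamEvent` is a simplified local stand-in for niceeval's type, `parseAssistantEvents` is a hypothetical name that covers only the assistant blocks from the skeleton above, and the sample lines are written by hand rather than captured from a real run. It also skips unparseable lines, which is a design choice for tolerating a transcript cut off mid-write.

```typescript
// Simplified local stand-in for niceeval's StreamEvent (assumption, not the real type).
type StreamEvent =
  | { type: "message"; role: "assistant"; text: string }
  | { type: "action.called"; callId: string; name: string; input: unknown };

// Same assistant-block mapping as the skeleton, but blank or malformed lines are skipped.
function parseAssistantEvents(rawJsonl: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of rawJsonl.split("\n")) {
    if (!line.trim()) continue;
    let entry: any;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // e.g. a final line truncated when the run was interrupted
    }
    if (entry?.type !== "assistant" || !Array.isArray(entry.message?.content)) continue;
    for (const block of entry.message.content) {
      if (block.type === "text") {
        events.push({ type: "message", role: "assistant", text: block.text });
      } else if (block.type === "tool_use") {
        events.push({ type: "action.called", callId: block.id, name: block.name, input: block.input });
      }
    }
  }
  return events;
}

// Hand-written sample transcript; the last line is deliberately truncated.
const sample = [
  JSON.stringify({ type: "assistant", message: { content: [{ type: "text", text: "Editing Button.tsx" }] } }),
  JSON.stringify({ type: "assistant", message: { content: [{ type: "tool_use", id: "call_1", name: "Edit", input: { path: "Button.tsx" } }] } }),
  '{"type":"assistant","message":',
].join("\n");

const events = parseAssistantEvents(sample);
```

Here `events` holds one assistant message followed by one `action.called` event for the `Edit` tool, and the truncated third line is ignored.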

***

## `ctx.model` and `ctx.flags`

Experiments pass a model tier and feature flags through `ctx`. Your adapter should forward them rather than hardcoding values — this allows the same adapter to serve multiple experiments without modification.

```ts theme={null}
// Forward the experiment's model tier to the CLI (omit if not set → CLI uses its default)
if (ctx.model) args.push("--model", ctx.model);

// Read a feature flag to conditionally enable a tool
if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
```

<Note>
  `ctx.model` is set by the experiment configuration, not by the adapter. If an experiment doesn't specify a model, `ctx.model` is `undefined` and the agent CLI uses its built-in default.
</Note>
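
One way to keep this forwarding logic testable is to move it into a pure function that takes only the fields it reads. This is a sketch under assumptions: `ArgContext` and `buildClaudeArgs` are hypothetical names, and `ArgContext` is a narrowed view of `ctx` rather than niceeval's actual type. The CLI flags are the ones used in the built-in `claude-code` adapter on this page.

```typescript
// Hypothetical narrowed view of ctx: only the fields the adapter on this page reads.
interface ArgContext {
  model?: string;
  flags: { webResearch?: boolean };
  session: { isNew: boolean; id?: string };
}

// Builds the argument list for the `claude` CLI without touching a sandbox.
function buildClaudeArgs(prompt: string, ctx: ArgContext): string[] {
  const args = ["--print", "--dangerously-skip-permissions"];
  if (ctx.model) args.push("--model", ctx.model); // omitted → CLI default
  if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
  if (!ctx.session.isNew && ctx.session.id) args.push("--resume", ctx.session.id);
  args.push(prompt); // the prompt goes last
  return args;
}
```

Inside `send`, the adapter would then call something like `sb.runCommand("claude", buildClaudeArgs(input.text, ctx), { env: auth() })`, and the function itself can be unit-tested with plain objects.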

***

## Registering a custom sandbox agent

Built-in agents (`claude-code`, `codex`, `bub`) can be imported into experiment files. Custom adapters follow the same pattern:

```ts theme={null}
// experiments/local.ts
import { defineExperiment } from "niceeval";
import myCustomAgent from "./agents/my-custom-agent.js";

export default defineExperiment({
  agent: myCustomAgent,
  runs: 1,
});
```

Then run that experiment:

```shell theme={null}
npx niceeval exp local fixtures/my-task --sandbox docker
```

<Warning>
  Never read `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or other secrets through `ctx`. Authentication is the adapter's private responsibility. Read secrets inside your adapter definition, either directly from `process.env` or with `requireEnv` as the built-in `claude-code` adapter does. They should never be visible to the experiment or the eval author.
</Warning>
