Eval Your Claude Code / Codex Plugin

If you’re building a plugin for Claude Code or Codex, the right eval approach is to drop it into a real workspace, let the coding agent execute a set of tasks, and measure pass rate, latency, and cost against a known baseline. The comparison gives you data on how much your plugin changes outcomes. This kind of eval is a natural fit for Sandbox mode: the agent runs inside a Docker or cloud sandbox where it can read and write files, execute commands, and call your plugin.

Recommended directory structure

evals/fixtures/plugin/create-button/
├─ PROMPT.md
├─ EVAL.ts
├─ package.json
├─ tsconfig.json
└─ src/

PROMPT.md is the task description the agent reads. EVAL.ts is the verification script niceeval runs after the agent finishes. Plugin installation, configuration, or tokens can live in the fixture files, in a sandbox lifecycle hook, or in your agent adapter’s setup().
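
As a rough illustration of the setup step, the sketch below writes a plugin config file into the workspace before the agent starts. Everything in it is an assumption made for illustration: the `PluginSetupOptions` shape, the `.plugin/plugin.config.json` path, and the function names are hypothetical and not part of niceeval. The actual lifecycle hook or `setup()` signature may look different, so treat this as a starting point only.

```typescript
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical options; the real hook or adapter signature may differ.
interface PluginSetupOptions {
  workspaceDir: string; // root of the fixture copy inside the sandbox
  pluginName: string;   // illustrative plugin identifier
  enabled: boolean;     // lets one fixture serve both "plugin on" and "plugin off"
}

interface PluginConfig {
  name: string;
  enabled: boolean;
}

// Writes a placeholder config file for the plugin to read at startup.
// The `.plugin/plugin.config.json` location is an assumption, not a convention.
export function writePluginConfig(opts: PluginSetupOptions): string {
  const configDir = join(opts.workspaceDir, ".plugin");
  mkdirSync(configDir, { recursive: true });
  const configPath = join(configDir, "plugin.config.json");
  const config: PluginConfig = { name: opts.pluginName, enabled: opts.enabled };
  writeFileSync(configPath, JSON.stringify(config, null, 2) + "\n", "utf-8");
  return configPath;
}

// Reads the config back, e.g. to confirm the setup step ran.
export function readPluginConfig(configPath: string): PluginConfig {
  return JSON.parse(readFileSync(configPath, "utf-8")) as PluginConfig;
}
```

Keeping an `enabled` flag in one place makes it easier to run the same fixture with and without the plugin.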

Write the task

PROMPT.md

Use the project plugin to create `src/components/Button.tsx`.

The component must:
- accept `label: string`
- accept `onClick: () => void`
- render a native `<button>`
- pass the existing test suite

Tasks should read like real user requests, so don’t give away the answer. To verify that the plugin was actually used, assert on the outcome in EVAL.ts or inspect the o11y summary for tool calls or shell commands.

Write the verification

EVAL.ts

import { test, expect } from "vitest";
import { existsSync, readFileSync } from "node:fs";

test("created Button component", () => {
  expect(existsSync("src/components/Button.tsx")).toBe(true);
});

test("supports required props", () => {
  const src = readFileSync("src/components/Button.tsx", "utf-8");
  expect(src).toContain("label");
  expect(src).toContain("onClick");
  // Match the opening tag rather than any occurrence of the word "button".
  expect(src).toContain("<button");
});
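
Substring checks like these are loose: a file that only mentions `label` in a comment would still pass. One possible tightening, sketched here as a heuristic, is a small helper that matches the prop signatures from PROMPT.md with regular expressions. The helper name `missingButtonRequirements` is hypothetical, and the patterns assume the prop types are written inline in the same file.

```typescript
// Returns the requirements from PROMPT.md that the source text fails to meet.
// Regex matching is a heuristic, not a type check.
export function missingButtonRequirements(src: string): string[] {
  const checks: Array<[string, RegExp]> = [
    ["label: string prop", /\blabel\s*:\s*string\b/],
    ["onClick: () => void prop", /\bonClick\s*:\s*\(\s*\)\s*=>\s*void\b/],
    ["native <button> element", /<button[\s>]/],
  ];
  return checks
    .filter(([, pattern]) => !pattern.test(src))
    .map(([name]) => name);
}
```

In EVAL.ts this could be used as `expect(missingButtonRequirements(src)).toEqual([])`, so a failure lists exactly which requirement is missing.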

If your plugin triggers a specific tool, you can also inspect __niceeval__/results.json:

import { test, expect } from "vitest";
import { readFileSync } from "node:fs";

test("plugin tool was used", () => {
  const result = JSON.parse(readFileSync("__niceeval__/results.json", "utf-8"));
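  // totalToolCalls appears to count every tool call, not only the plugin's,
  // so treat this as a smoke check rather than proof the plugin ran.
  expect(result.o11y.totalToolCalls).toBeGreaterThan(0);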
});
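
If the results file is missing or malformed, the test above fails with a parse error rather than a clear assertion. A defensive reader is one way around that; the sketch below relies only on the `o11y.totalToolCalls` field shown above and makes no other assumptions about the file's schema. The function names are hypothetical.

```typescript
import { existsSync, readFileSync } from "node:fs";

// Pulls `o11y.totalToolCalls` out of an already-parsed results object.
// Returns null when the field is absent or not a finite number.
export function extractToolCallCount(parsed: unknown): number | null {
  if (typeof parsed !== "object" || parsed === null) return null;
  const o11y = (parsed as Record<string, unknown>).o11y;
  if (typeof o11y !== "object" || o11y === null) return null;
  const total = (o11y as Record<string, unknown>).totalToolCalls;
  return typeof total === "number" && Number.isFinite(total) ? total : null;
}

// Reads the results file and returns the tool call count,
// or null if the file does not exist or is not valid JSON.
export function readToolCallCount(
  path = "__niceeval__/results.json",
): number | null {
  if (!existsSync(path)) return null;
  try {
    return extractToolCallCount(JSON.parse(readFileSync(path, "utf-8")));
  } catch {
    return null; // malformed JSON
  }
}
```

A test can then assert `expect(readToolCallCount()).not.toBeNull()` before checking the count, which separates "no results file" from "no tool calls".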

Run

npx niceeval exp plugin-regression fixtures/plugin/create-button

The experiment group selects the agent configs. The trailing fixtures/plugin/create-button argument is only an eval ID prefix filter.

Compare plugin impact

Model “plugin on” and “plugin off” as two agents or two experiment cells in the same experiment group, then run that group against the fixture:

npx niceeval exp plugin-regression fixtures/plugin/create-button

Key things to look at:

Whether pass@N improves.
Whether average latency and token usage are acceptable.
Whether the agent actually invoked the plugin in failing transcripts.
Whether the diff only touched task-relevant files.
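
To make the first two checks concrete, here is a minimal sketch of summarizing each cell and computing the difference. The `RunResult` shape is hypothetical and would need to be mapped from whatever your results actually contain. The `passAtK` helper assumes pass@N means "at least one of N sampled attempts passes" and uses the standard combinatorial estimator 1 - C(n-c, k) / C(n, k); niceeval may define the metric differently.

```typescript
// Hypothetical per-run record; map your real results into this shape.
interface RunResult {
  passed: boolean;
  latencyMs: number;
  totalTokens: number;
}

interface CellSummary {
  runs: number;
  passRate: number;
  meanLatencyMs: number;
  meanTokens: number;
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

export function summarize(runs: RunResult[]): CellSummary {
  if (runs.length === 0) throw new Error("need at least one run");
  return {
    runs: runs.length,
    passRate: runs.filter((r) => r.passed).length / runs.length,
    meanLatencyMs: mean(runs.map((r) => r.latencyMs)),
    meanTokens: mean(runs.map((r) => r.totalTokens)),
  };
}

// Estimates pass@k from n attempts with c passes: 1 - C(n-c, k) / C(n, k).
export function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  if (n - c < k) return 1; // too few failures to fill k samples
  let allFail = 1;
  for (let i = 0; i < k; i++) allFail *= (n - c - i) / (n - i);
  return 1 - allFail;
}

// Positive passRate delta means the plugin helped; positive latency or
// token deltas mean the plugin cost more.
export function compareCells(off: RunResult[], on: RunResult[]) {
  const a = summarize(off);
  const b = summarize(on);
  return {
    passRateDelta: b.passRate - a.passRate,
    latencyDeltaMs: b.meanLatencyMs - a.meanLatencyMs,
    tokenDelta: b.meanTokens - a.meanTokens,
  };
}
```

With only a handful of runs per cell the deltas are noisy, so it helps to repeat each cell several times before drawing conclusions.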

Copy to your agent

READ docs-site/example/claude-code-codex-plugin.mdx and install niceeval for this repo.
Create the first fixture from one real plugin workflow, then run it with both claude-code and codex.

Next steps

Fixtures — full reference for the fixture directory layout and EVAL.ts.
Sandbox Agent — built-in claude-code, codex, and custom sandbox agents.
Viewing Results — inspect transcripts, diffs, and the event stream.
