Experiments: compare agents and models with run matrices

A single eval run tells you whether an agent passed today. An experiment tells you which agent passes most reliably, at what cost, and how performance changes when you swap models or toggle feature flags. Each experiment file is one run configuration; experiment folders group comparable configs side by side.

What an experiment is

An experiment is a defineExperiment configuration that describes one agent, one model, how many times to run each eval, and which evals to include. When you run npx niceeval exp <group>, the runner executes each config in that group under the same bounded-concurrency scheduler and produces a comparative report.

import { defineExperiment } from "niceeval";
import { codexAgent } from "niceeval/adapter";

export default defineExperiment({
  agent: codexAgent(),                  // one agent per experiment file
  model: "gpt-5.4",                    // one model per experiment file
  runs: 5,                              // 5 attempts per eval
  earlyExit: false,                     // run all 5 to get the full distribution
  evals: "memory/",                     // only evals whose ID starts with "memory/"
  budget: 20.00,                        // stop dispatching after ~$20
  flags: {
    webResearch: false,                 // feature flags injected via ctx.flags / t.flags
  },
});

You don’t supply an id or name in defineExperiment. The experiment’s ID is derived from its file path, exactly like evals.

The `defineExperiment` shape

agent

The agent adapter instance to run, such as codexAgent() or your own defineAgent(...) result.

model

A single model identifier (e.g. "gpt-5.4" or "anthropic/claude-opus-4-8"). The model is passed to the agent as ctx.model. To compare models, create multiple experiment files in the same group.

runs

How many times to run each (agent, model, eval) cell. With runs: 5, each cell produces 5 attempts. Results are aggregated into a pass rate rather than a single outcome.

earlyExit

Whether to stop remaining retries for an eval as soon as one attempt passes. Defaults to true. Set to false to collect the full pass-rate distribution — useful for nightly stability runs.

evals

A string prefix (or array of prefixes) filtering which evals are included in this experiment. Works the same as the positional argument to npx niceeval. Omit to include all discovered evals.

flags

A record of feature flags injected into every attempt as ctx.flags (agent side) and t.flags (eval side). Use flags to toggle features — memory backends, tool allowlists, effort settings — without creating separate agent definitions.

budget

An estimated cost ceiling in USD. The runner stops dispatching new attempts once accumulated cost exceeds this value. Can be overridden at runtime with --budget.

Keep environment preparation in ordinary code. Upload starter files and run setup commands inside test(t), or put adapter-specific preparation in the agent adapter’s setup.

Running experiments

# Run all experiments in experiments/compare/
npx niceeval exp compare

# Run a single experiment file
npx niceeval exp compare/codex-gpt-5.4

# Override budget at runtime
npx niceeval exp compare --budget 5.00

# Collect full distribution (disable early-exit)
npx niceeval exp compare --no-early-exit

Organizing experiments: directory as group

The directory an experiment file lives in determines its comparison group. Files in the same directory are treated as peer configurations — running niceeval exp compare runs all of them and places their results side by side in the report. The coding-agent-memory-evals project uses exactly this pattern:

experiments/
└─ compare/
   ├─ bub-gpt-5.4.ts       # bub agent, gpt-5.4
   └─ codex-gpt-5.4.ts     # codex agent, gpt-5.4

Each file is an independent defineExperiment configuration. Running niceeval exp compare executes both and renders a side-by-side comparison report.

import { defineExperiment } from "niceeval";
import { bubAgent } from "niceeval/adapter";

export default defineExperiment({
  agent: bubAgent(),
  model: "gpt-5.4",
  runs: 5,
  evals: "memory/",
});

Matrix expansion

The runner expands selected experiment files × evals × runs. For example: 2 experiment files × runs: 5 × 3 evals = 30 attempts. All 30 run through the same bounded-concurrency pool.

Pass rate reporting

Results are aggregated per (agent, model, eval) cell into a pass rate rather than a single pass/fail outcome:

fixtures/button   claude-code   pass@5 = 4/5 (80%)   mean 34s · 58k tok · $0.44
fixtures/button   codex         pass@5 = 3/5 (60%)   mean 41s · 72k tok · $0.39

Each row shows:

pass@k — how many of the k attempts passed (e.g. pass@5 = 4/5)
mean time — average wall-clock duration per attempt
token usage — average tokens consumed
estimated cost — average USD cost per attempt

Why pass rate matters

A single passing run does not mean an agent is reliable. An agent that passes 1 out of 5 attempts at the same task is fundamentally different from one that passes 5 out of 5 — even though both have at least one passing run. Pass rate measures stability, not luck. This is particularly important for:

Coding agents that interact with real file systems and have inherent non-determinism
Evaluating whether a new model tier or feature flag genuinely improves reliability, or just got lucky once

Model and feature flag injection

The experiment runner injects ctx.model and ctx.flags into every attempt. Your agent reads these to configure itself:

// Inside your sandbox agent's send()
const args = ["--print", "--dangerously-skip-permissions"];
if (ctx.model) args.push("--model", ctx.model);
if (ctx.flags.webResearch) args.push("--allowedTools", "WebSearch,WebFetch");
args.push(input.text);

Your evals can also read flags through t.flags:

export default defineEval({
  async test(t) {
    // t.flags contains the experiment's flags object
    await t.send("Complete the task.");
    t.succeeded();
  },
});

This means the same agent definition and the same eval definition can be reused across every cell of the matrix. You change the experiment configuration, not the agent or eval code.

Budget and full-distribution runs

For production stability measurements you typically want two things together: a meaningful budget ceiling to prevent runaway costs, and earlyExit: false to collect the full distribution rather than stopping after the first pass:

export default defineExperiment({
  agent: codexAgent(),
  model: "gpt-5.4",
  runs: 10,
  earlyExit: false,    // collect all 10 results per cell
  budget: 50.00,       // but stop if accumulated cost hits $50
  evals: "memory/",
});

This gives you true pass@10 distributions while bounding your maximum spend.

​What an experiment is

​The defineExperiment shape

​Running experiments

​Organizing experiments: directory as group

​Matrix expansion

​Pass rate reporting

​Why pass rate matters

​Model and feature flag injection

​Budget and full-distribution runs

What an experiment is

The `defineExperiment` shape

Running experiments

Organizing experiments: directory as group

Matrix expansion

Pass rate reporting

Why pass rate matters

Model and feature flag injection

Budget and full-distribution runs