send verb and never branches on agent names or sandbox backends.
What the runner does and doesn’t own
The runner does own: discovering evals, computing fingerprints to decide what to skip, building the attempt list, bounded-concurrency scheduling, retrying suspicious fast failures, early-exit after a first pass, handing results to reporters, writing artifacts to disk, and setting the process exit code. The runner does not own: how to drive an agent (that belongs to the Agent/Adapter), how to score a response (that belongs to Scorers), or the fine-grained format of stored artifacts (that belongs to Reporters). The runner is a coordinator, not an executor.
Eval discovery
The runner scans your evals/ directory at startup:
- *.eval.ts files: Each file is imported and its default export is inspected. A single defineEval export gets the file’s path as its ID. An array export is fanned out into multiple evals with zero-padded indices appended (sql/0000, sql/0001, …).
- Directories containing PROMPT.md: These are fixture evals. The runner derives the eval ID from the directory’s relative path.
- Stable ordering: All discovered evals are sorted by relative path so IDs are stable and output is diffable across runs.
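The ID rules above can be sketched as a pair of pure functions. This is a minimal illustration, not the runner's actual code: the EvalSource shape and the function names are hypothetical, and stripping the .eval.ts suffix is an assumption inferred from the sql/0000 example.

```typescript
// Hypothetical record of one discovered eval source.
interface EvalSource {
  relativePath: string; // e.g. "sql.eval.ts" or "fixtures/login"
  exportCount: number | null; // array length for array exports, null otherwise
}

// Derive the eval IDs for one source.
// Assumption: the ".eval.ts" suffix is dropped from the ID.
function deriveEvalIds(source: EvalSource): string[] {
  const base = source.relativePath.replace(/\.eval\.ts$/, "");
  if (source.exportCount === null) return [base];
  // Array exports fan out with a zero-padded, four-digit index.
  return Array.from(
    { length: source.exportCount },
    (_, i) => `${base}/${String(i).padStart(4, "0")}`,
  );
}

// Sort by relative path first so IDs come out in a stable order.
function discoverIds(sources: EvalSource[]): string[] {
  const sorted = [...sources].sort((a, b) =>
    a.relativePath < b.relativePath ? -1 : a.relativePath > b.relativePath ? 1 : 0,
  );
  return sorted.flatMap((source) => deriveEvalIds(source));
}
```

A plain code-unit comparison is used instead of localeCompare so that the order does not depend on the machine's locale.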
Bounded concurrency
The runner’s core loop keeps at most maxConcurrency attempts in flight at once. When the pool is full it waits for any one attempt to complete before dispatching the next (Promise.race). Reporting callbacks run on a separate serial queue so they never consume execution slots.
The concurrency limit is resolved in order of precedence: --max-concurrency flag → config maxConcurrency → built-in default.
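A bounded pool built on Promise.race might look like the sketch below. The names runBounded and resolveMaxConcurrency are hypothetical, the default of 4 is a placeholder (the real built-in default is not stated here), and tasks are assumed to resolve with a result object rather than reject.

```typescript
type Task<T> = () => Promise<T>;

// Placeholder value: the real built-in default is not documented here.
const DEFAULT_MAX_CONCURRENCY = 4;

// Precedence: CLI flag, then config, then built-in default.
function resolveMaxConcurrency(flag?: number, config?: number): number {
  return flag ?? config ?? DEFAULT_MAX_CONCURRENCY;
}

// Run tasks with at most `maxConcurrency` in flight at once.
// Assumption: tasks capture their own errors and never reject.
async function runBounded<T>(tasks: Task<T>[], maxConcurrency: number): Promise<T[]> {
  const limit = Math.max(1, Math.floor(maxConcurrency));
  const results: T[] = new Array(tasks.length);
  const inFlight = new Set<Promise<void>>();

  for (let i = 0; i < tasks.length; i++) {
    // Pool is full: wait for any one attempt to settle.
    while (inFlight.size >= limit) {
      await Promise.race(inFlight);
    }
    const slot: Promise<void> = tasks[i]()
      .then((value) => {
        results[i] = value;
      })
      .finally(() => {
        inFlight.delete(slot); // free the slot before the race resolves
      });
    inFlight.add(slot);
  }

  await Promise.all(inFlight);
  return results;
}
```

Results are stored by index, so the output order matches the input order even when attempts finish out of order.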
Result caching
The runner computes a fingerprint (a hash of each eval’s fixture content and relevant configuration) before dispatching it. If the previous run produced a passed result and the fingerprint is unchanged, the runner skips that eval and reuses the cached result.
The cache is intentionally conservative:
- Change a fixture file → cache miss, re-run.
- Change relevant config (e.g. judge.model) → cache miss, re-run.
- A previously failed result is never cached: failures always re-run.
- Pass --force to bypass the cache entirely and re-run everything.
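The skip decision can be sketched as follows. Everything here is illustrative: the FingerprintInput and CachedResult shapes are hypothetical, and FNV-1a stands in for whatever hash the runner really uses, chosen only so the example needs no imports.

```typescript
// Hypothetical inputs to the fingerprint.
interface FingerprintInput {
  files: Record<string, string>; // fixture path -> file content
  config: Record<string, string>; // relevant config only, e.g. "judge.model"
}

interface CachedResult {
  status: "passed" | "failed";
  fingerprint: string;
}

// 32-bit FNV-1a, used here as a stand-in hash (an assumption, not the real one).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Sort keys so the fingerprint does not depend on object insertion order.
function fingerprint(input: FingerprintInput): string {
  const canonical = JSON.stringify([
    Object.keys(input.files).sort().map((k) => [k, input.files[k]]),
    Object.keys(input.config).sort().map((k) => [k, input.config[k]]),
  ]);
  return fnv1a(canonical);
}

// Skip only when not forced, the previous result passed, and nothing changed.
function shouldSkip(previous: CachedResult | undefined, current: string, force: boolean): boolean {
  if (force) return false;
  if (previous === undefined || previous.status !== "passed") return false;
  return previous.fingerprint === current;
}
```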
Early-exit
When you run an eval multiple times (e.g. --runs 5 for pass-rate measurement), earlyExit stops the remaining attempts for an individual eval as soon as one attempt passes:
- Each eval gets its own AbortController.
- When an attempt passes and earlyExit is enabled, the runner calls abort() on that eval’s controller, cancelling its remaining queued attempts.
- Aborted attempts are not counted in the denominator; they are treated as if they never ran.
earlyExit is on by default. Use --no-early-exit when you want the full pass-rate distribution (e.g. for nightly stability runs).
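The early-exit behaviour can be illustrated with one AbortController per eval. This is a simplified sketch: the names runEval and passRate are hypothetical, and attempts run sequentially here for brevity, whereas the runner schedules them through its concurrency pool.

```typescript
type Outcome = "passed" | "failed" | "aborted";

// Run up to `runs` attempts for a single eval.
// `attempt` is a hypothetical callback that resolves to true on a pass.
async function runEval(
  runs: number,
  attempt: (signal: AbortSignal, index: number) => Promise<boolean>,
  earlyExit: boolean = true,
): Promise<Outcome[]> {
  const controller = new AbortController(); // one controller per eval
  const outcomes: Outcome[] = [];
  for (let i = 0; i < runs; i++) {
    if (controller.signal.aborted) {
      outcomes.push("aborted"); // queued attempt cancelled, never started
      continue;
    }
    const passed = await attempt(controller.signal, i);
    outcomes.push(passed ? "passed" : "failed");
    if (passed && earlyExit) controller.abort();
  }
  return outcomes;
}

// Aborted attempts are left out of the denominator.
function passRate(outcomes: Outcome[]): number {
  const counted = outcomes.filter((o) => o !== "aborted");
  if (counted.length === 0) return 0;
  return counted.filter((o) => o === "passed").length / counted.length;
}
```

With earlyExit disabled, every attempt runs and the pass rate reflects the full distribution.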
Budget guardrails
You can set a budget (an estimated cost ceiling in USD) either in your experiment config or via --budget. Before dispatching each new attempt, the runner totals the cost of completed attempts, computed as token usage × the price table. Once the accumulated cost exceeds the budget:
- The runner stops dispatching new attempts.
- Any in-flight attempts are allowed to finish.
- The run concludes early and emits a run:budgetExceeded event.
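The guardrail can be sketched as a small accumulator that the scheduler consults before each dispatch. The createBudgetGuard name, the Usage and Price shapes, and the per-million-token pricing are all assumptions made for illustration; only the run:budgetExceeded event name comes from the text above.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical price table entry, in USD per million tokens.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

function createBudgetGuard(budgetUsd: number, price: Price, emit: (event: string) => void) {
  let spentUsd = 0;
  let exceededEmitted = false;
  return {
    // Add the cost of one completed attempt.
    record(usage: Usage): void {
      spentUsd +=
        (usage.inputTokens / 1e6) * price.inputPerMTok +
        (usage.outputTokens / 1e6) * price.outputPerMTok;
    },
    // Called before each dispatch; in-flight attempts are not interrupted.
    canDispatch(): boolean {
      if (spentUsd <= budgetUsd) return true;
      if (!exceededEmitted) {
        exceededEmitted = true;
        emit("run:budgetExceeded"); // emitted once per run
      }
      return false;
    },
    spent(): number {
      return spentUsd;
    },
  };
}
```

Because the check runs before dispatch and only counts completed attempts, the final spend can overshoot the budget by the cost of whatever was still in flight.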
Retry for infrastructure flaps
Infrastructure flaps (a sandbox that fails to start, a momentary network error, a rate-limit spike) produce failures that have nothing to do with model quality. The runner distinguishes these from genuine model failures using a simple heuristic:
- Automatically retried: a failure where the attempt ran for less than 5 seconds and the error is not a timeout. These “instant crashes” are almost always infrastructure noise. The runner retries them with exponential backoff and jitter, up to 5 times.
- Not retried: failures where the attempt ran for a meaningful amount of time, or where tests simply didn’t pass. These are valid signal; retrying them would mask real regressions.
This rule separates infrastructure noise from model capability signal. A coding agent that runs for 40 seconds and produces broken tests has genuinely failed; a sandbox that exits in 300 ms has not.
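The heuristic and the backoff schedule can be written as two pure functions. The 5-second threshold and the limit of 5 retries come from the text; the FailureKind classification, the 500 ms base delay, the 30-second cap, and the use of "full jitter" (a random delay between zero and the exponential ceiling) are assumptions for the sake of the sketch.

```typescript
// Hypothetical classification of a failed attempt.
type FailureKind = "error" | "timeout" | "tests-failed";

interface Failure {
  kind: FailureKind;
  durationMs: number;
}

const FAST_FAILURE_MS = 5000; // from the text: under 5 seconds
const MAX_RETRIES = 5; // from the text: up to 5 times
const BASE_DELAY_MS = 500; // assumption
const MAX_DELAY_MS = 30000; // assumption

// Retry only instant crashes that are neither timeouts nor test failures.
function shouldRetry(failure: Failure, retriesSoFar: number): boolean {
  return (
    failure.kind === "error" &&
    failure.durationMs < FAST_FAILURE_MS &&
    retriesSoFar < MAX_RETRIES
  );
}

// Exponential backoff with full jitter; `random` is injectable for testing.
function backoffDelayMs(retry: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** retry);
  return Math.floor(random() * ceiling);
}
```

Jitter spreads retries out in time, which matters most for rate-limit spikes where many attempts fail at the same moment.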
Timeout: double-layer protection
Timeouts are enforced at two levels to ensure a stuck eval can never hang the entire run:
- Adapter inner timeout: the agent CLI’s own timeout, managed inside the adapter.
- Runner outer timeout: a Promise.race against AbortSignal.timeout. Even if the agent process freezes completely, the runner forcibly terminates the attempt, marks it failed with error: timeout, and triggers abort cleanup.
Set the timeout with --timeout <ms> or timeoutMs in config.
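The outer layer can be sketched as a race between the attempt and a timeout signal. The withOuterTimeout name and the TimedOut result shape are hypothetical; AbortSignal.timeout is a standard API available in recent Node.js releases and modern browsers.

```typescript
interface TimedOut {
  status: "failed";
  error: "timeout";
}

// Race the attempt against a timeout signal.
// The same signal is passed to the attempt so it can run its abort cleanup.
async function withOuterTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T | TimedOut> {
  const signal = AbortSignal.timeout(timeoutMs);
  const timedOut = new Promise<TimedOut>((resolve) => {
    signal.addEventListener(
      "abort",
      () => resolve({ status: "failed", error: "timeout" }),
      { once: true },
    );
  });
  // Whichever settles first wins; a frozen attempt cannot block the run.
  return Promise.race([run(signal), timedOut]);
}
```

The race only stops the runner from waiting. Actually killing a frozen agent process still depends on the attempt reacting to the signal it was given.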
Lifecycle events
The runner emits a structured event stream as it executes, consumed by the CLI dashboard, reporters, and any external integrations.
test(t), and keep agent-specific installation or authentication in the adapter’s setup.
Exit codes
| Condition | Exit code |
|---|---|
| All evals passed or passed (without --strict) | 0 |
| Any eval failed | Non-zero |
| Any eval passed with --strict enabled | Non-zero |