eval
January 6, 2026
Evaluate a directory of agent inference results.
Quick Start
```shell
# Evaluate all problems in a run directory
slop-code eval outputs/my_run

# Evaluate specific problems
slop-code eval outputs/my_run --problem file_backup --problem trajectory_api

# Evaluate with parallel workers
slop-code eval outputs/my_run --num-workers 4
```
Usage
```shell
slop-code eval [OPTIONS] RUN_DIR
```
Arguments
| Argument | Required | Description |
|---|---|---|
| `RUN_DIR` | Yes | Path to the run directory (`outputs/<model_name>/<run_name>`) |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| `--problem` | string | (all) | Name of a problem to evaluate (repeatable) |
| `--pass-policy` | enum | `ALL_CASES` | Policy that determines whether a checkpoint passed |
| `-e`, `--env-config` | path | `<run>/environment.yaml` | Path to the environment configuration |
| `--live-progress/--no-live-progress` | flag | false | Enable the live progress display |
| `-proc`, `--num-workers` | int | 1 | Number of parallel evaluation workers |
| `--overwrite` | flag | false | Re-evaluate problems with existing results |
Pass Policy Values
| Value | Description |
|---|---|
| `any` | Pass if at least one case passes |
| `any-case` | Same as `any` |
| `all-cases` | Pass only if all test cases pass |
| `all-non-error-cases` | Pass if all non-error cases pass |
| `core-cases` | Pass if all core cases pass |
| `any-core-cases` | Pass if any core case passes |
| `all-core-cases` | Same as `core-cases` |
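
The policy names above can be read as simple `any`/`all` reductions over per-case results. The following is a minimal sketch of that reading, not the tool's actual implementation: the `CaseResult` record, its `group` labels (`"core"`, `"error"`), and the `checkpoint_passed` function are hypothetical names introduced for illustration, and the real result schema may differ.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case record; the real result schema may differ."""
    passed: bool
    group: str  # assumed labels: "core", "error", or any other group name


def checkpoint_passed(cases: list[CaseResult], policy: str) -> bool:
    """Apply a pass policy name from the table to a list of case results."""
    everything = [c.passed for c in cases]
    core = [c.passed for c in cases if c.group == "core"]
    non_error = [c.passed for c in cases if c.group != "error"]

    if policy in ("any", "any-case"):
        return any(everything)
    if policy == "all-cases":
        return all(everything)
    if policy == "all-non-error-cases":
        return all(non_error)
    if policy in ("core-cases", "all-core-cases"):
        return all(core)
    if policy == "any-core-cases":
        return any(core)
    raise ValueError(f"unknown pass policy: {policy}")


# Example: one failing error-handling case, everything else passing.
cases = [
    CaseResult(passed=True, group="core"),
    CaseResult(passed=True, group="functionality"),
    CaseResult(passed=False, group="error"),
]
for name in ("any", "all-cases", "all-non-error-cases", "core-cases"):
    print(name, checkpoint_passed(cases, name))
```

In this sketch an `all`-style policy over an empty group is vacuously true, because Python's `all([])` returns `True`. Whether the real command treats an empty group the same way is not stated here.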
Behavior
The `eval` command:
- Discovers all problem directories within `RUN_DIR`
- Skips problems that already have `evaluation.json` files (unless `--overwrite` is set)
- Re-evaluates a problem if its configuration has changed since the last evaluation
- Writes evaluation results to each checkpoint directory
- Generates a `checkpoint_results.jsonl` report at the run level
Auto-Skip Logic
When no `--problem` flags are specified and `--overwrite` is not set, the command automatically skips problems where:
- All checkpoints have `evaluation.json` files
- The problem configuration hasn't changed since evaluation
To force re-evaluation, use `--overwrite` or specify the problem explicitly with `--problem`.
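
The skip rule described above can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not the command's source: `should_skip` and its parameters are hypothetical names, how a configuration change is detected is not documented here and is passed in as a plain boolean, and treating a problem with no checkpoint directories as "not skippable" is an assumption of this sketch.

```python
from pathlib import Path


def should_skip(
    checkpoint_dirs: list[Path],
    *,
    problem_requested: bool,
    overwrite: bool,
    config_changed: bool,
) -> bool:
    """Return True if a problem would be auto-skipped under the documented rule."""
    # --overwrite or an explicit --problem always forces re-evaluation.
    if overwrite or problem_requested:
        return False
    # A changed problem configuration invalidates earlier results.
    if config_changed:
        return False
    # Assumption: a problem with no checkpoints is never skipped.
    if not checkpoint_dirs:
        return False
    # Skip only when every checkpoint already has an evaluation.json file.
    return all((d / "evaluation.json").is_file() for d in checkpoint_dirs)
```

For example, a problem with two checkpoints where only one contains `evaluation.json` would be evaluated again, while the same problem with both files present and an unchanged configuration would be skipped.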
Output Files
After evaluation, each checkpoint directory contains:
- `evaluation.json` - Detailed evaluation results
- Test case reports
At the run level:
- `checkpoint_results.jsonl` - Consolidated report with one line per checkpoint
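
Because the run-level report holds one line per checkpoint, it can be loaded with a few lines of standard-library Python. The sketch below assumes only that each non-empty line is a standalone JSON value, as the `.jsonl` extension suggests. The field names inside each record are not documented here, so the code makes no assumptions about them. `load_checkpoint_results` is a hypothetical helper name, not part of the tool.

```python
import json
from pathlib import Path


def load_checkpoint_results(run_dir: str) -> list:
    """Parse checkpoint_results.jsonl in run_dir into a list of records."""
    report = Path(run_dir) / "checkpoint_results.jsonl"
    records = []
    with report.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

Calling `load_checkpoint_results("outputs/my_run")` after an evaluation would return one parsed record per checkpoint, which can then be inspected or passed to other tooling.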
Examples
Basic evaluation:
```shell
slop-code eval outputs/claude_code_run_20251217
```
Evaluate with custom environment:
```shell
slop-code eval outputs/my_run -e configs/environments/docker-python3.12-uv.yaml
```
Force re-evaluation of all problems:
```shell
slop-code eval outputs/my_run --overwrite
```
Parallel evaluation with progress:
```shell
slop-code eval outputs/my_run --num-workers 8 --live-progress
```
See Also
- eval-problem - Evaluate a single problem
- eval-snapshot - Evaluate a single snapshot
- run - Run agents (includes automatic evaluation)