eval

January 6, 2026

Evaluate a directory of agent inference results.

Quick Start

# Evaluate all problems in a run directory
slop-code eval outputs/my_run

# Evaluate specific problems
slop-code eval outputs/my_run --problem file_backup --problem trajectory_api

# Evaluate with parallel workers
slop-code eval outputs/my_run --num-workers 4

Usage

slop-code eval [OPTIONS] RUN_DIR

Arguments

Argument	Required	Description
`RUN_DIR`	Yes	Path to the run directory (outputs/<model_name>/<run_name>)

Options

Option	Type	Default	Description
`--problem`	string	(all)	Name of a specific problem to evaluate (repeatable)
`--pass-policy`	enum	`all-cases`	Policy used to decide whether a checkpoint passed (see Pass Policy Values)
`-e, --env-config`	path	`<run>/environment.yaml`	Path to environment configuration
`--live-progress/--no-live-progress`	flag	false	Enable live progress display
`-proc, --num-workers`	int	1	Number of parallel evaluation workers
`--overwrite`	flag	false	Re-evaluate problems with existing results
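
When the command is launched from a script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of slop-code itself: `build_eval_command` is a hypothetical helper, and it uses only the flags documented in the table.

```python
def build_eval_command(run_dir, problems=(), num_workers=1,
                       overwrite=False, live_progress=False):
    """Assemble an argv list for `slop-code eval` (hypothetical helper)."""
    cmd = ["slop-code", "eval", str(run_dir)]
    # --problem is repeatable: emit one flag per problem name.
    for name in problems:
        cmd += ["--problem", name]
    # Only pass --num-workers when it differs from the documented default of 1.
    if num_workers != 1:
        cmd += ["--num-workers", str(num_workers)]
    if overwrite:
        cmd.append("--overwrite")
    if live_progress:
        cmd.append("--live-progress")
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `slop-code` is installed.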

Pass Policy Values

Value	Description
`any`	Pass if at least one case passes
`any-case`	Same as `any`
`all-cases`	Pass only if all test cases pass
`all-non-error-cases`	Pass if all non-error cases pass
`core-cases`	Pass if all core cases pass
`any-core-cases`	Pass if any core case passes
`all-core-cases`	Same as `core-cases`
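
The policies above can be read as simple predicates over a checkpoint's test cases. The sketch below illustrates that reading and is not slop-code's implementation: the `CaseResult` record and its `is_core` / `is_error` fields are assumptions made for the example, as is the treatment of empty case sets.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical record; field names are illustrative, not slop-code's schema.
    passed: bool
    is_core: bool = False
    is_error: bool = False


def checkpoint_passed(policy: str, cases: list[CaseResult]) -> bool:
    """Apply a pass policy name from the table to a list of case results."""
    everything = [c.passed for c in cases]
    core = [c.passed for c in cases if c.is_core]
    non_error = [c.passed for c in cases if not c.is_error]
    # Note: all() over an empty list is True in this sketch;
    # the real tool may handle empty sets differently.
    if policy in ("any", "any-case"):
        return any(everything)
    if policy == "all-cases":
        return all(everything)
    if policy == "all-non-error-cases":
        return all(non_error)
    if policy in ("core-cases", "all-core-cases"):
        return all(core)
    if policy == "any-core-cases":
        return any(core)
    raise ValueError(f"unknown pass policy: {policy}")
```

For example, a checkpoint with one passing core case, one failing error case, and one other passing case would pass under `any`, `all-non-error-cases`, and `core-cases`, but fail under `all-cases`.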

Behavior

The eval command:

Discovers all problem directories within RUN_DIR
Skips problems that already have evaluation.json files (unless --overwrite is set)
Re-evaluates a problem if its configuration has changed since the last evaluation
Writes evaluation results to each checkpoint directory
Generates a checkpoint_results.jsonl report at the run level

Auto-Skip Logic

When no --problem flags are specified and --overwrite is not set, the command automatically skips problems where:

All checkpoints have evaluation.json files
The problem configuration hasn't changed since the last evaluation

To force re-evaluation, use --overwrite or specify the problem explicitly with --problem.
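
The skip decision described above can be sketched as a single predicate. This is an illustration under stated assumptions, not the tool's code: `should_skip` is a hypothetical function, the caller is assumed to know whether the configuration changed, and a problem with no checkpoint directories is assumed to need evaluation.

```python
from pathlib import Path


def should_skip(checkpoint_dirs, config_changed,
                overwrite=False, explicitly_requested=False):
    """Return True if a problem would be auto-skipped (hypothetical sketch)."""
    # --overwrite or an explicit --problem always forces re-evaluation.
    if overwrite or explicitly_requested:
        return False
    # A changed problem configuration invalidates earlier results.
    if config_changed:
        return False
    dirs = [Path(d) for d in checkpoint_dirs]
    # Assumption: a problem with no checkpoints is never skipped.
    if not dirs:
        return False
    # Skip only when every checkpoint already has an evaluation.json.
    return all((d / "evaluation.json").is_file() for d in dirs)
```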

Output Files

After evaluation, each checkpoint directory contains:

evaluation.json - Detailed evaluation results
Test case reports

At the run level:

checkpoint_results.jsonl - Consolidated report with one line per checkpoint
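
Because the run-level report holds one record per line, it can be loaded with a few lines of Python. The sketch below assumes each line is a standalone JSON object, as the .jsonl extension suggests; the record fields are not documented here, so the code makes no assumptions about them. `load_checkpoint_results` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_checkpoint_results(run_dir):
    """Read checkpoint_results.jsonl from a run directory into a list of records."""
    path = Path(run_dir) / "checkpoint_results.jsonl"
    records = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records
```

Calling `len(load_checkpoint_results("outputs/my_run"))` would then give the number of checkpoints in the report.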

Examples

Basic evaluation:

slop-code eval outputs/claude_code_run_20251217

Evaluate with custom environment:

slop-code eval outputs/my_run -e configs/environments/docker-python3.12-uv.yaml

Force re-evaluation of all problems:

slop-code eval outputs/my_run --overwrite

Parallel evaluation with progress:

slop-code eval outputs/my_run --num-workers 8 --live-progress