eval

January 6, 2026

Evaluate a directory of agent inference results.

Quick Start

# Evaluate all problems in a run directory
slop-code eval outputs/my_run

# Evaluate specific problems
slop-code eval outputs/my_run --problem file_backup --problem trajectory_api

# Evaluate with parallel workers
slop-code eval outputs/my_run --num-workers 4

Usage

slop-code eval [OPTIONS] RUN_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RUN_DIR | Yes | Path to the run directory (outputs/<model_name>/<run_name>) |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --problem | string | (all) | Name of a problem to evaluate (repeatable) |
| --pass-policy | enum | ALL_CASES | Policy that determines whether a checkpoint passed |
| -e, --env-config | path | <run>/environment.yaml | Path to the environment configuration |
| --live-progress/--no-live-progress | flag | false | Enable the live progress display |
| -proc, --num-workers | int | 1 | Number of parallel evaluation workers |
| --overwrite | flag | false | Re-evaluate problems that already have results |

Pass Policy Values

| Value | Description |
| --- | --- |
| any | Pass if at least one case passes |
| any-case | Same as any |
| all-cases | Pass only if all test cases pass |
| all-non-error-cases | Pass if all non-error cases pass |
| core-cases | Pass if all core cases pass |
| any-core-cases | Pass if any core case passes |
| all-core-cases | Same as core-cases |
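
The policy values above can be read as simple predicates over a checkpoint's test cases. The following is a minimal sketch of that reading, not the tool's actual implementation. The `CaseResult` record, the `group` labels ("core", "error"), and the `checkpoint_passed` function are all hypothetical names introduced for illustration; the real result schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    # Hypothetical record; the real evaluation schema may differ.
    name: str
    group: str    # assumed labels: "core", "error", or anything else
    passed: bool


# Aliases as listed in the pass policy table.
ALIASES = {"any-case": "any", "all-core-cases": "core-cases"}


def checkpoint_passed(policy: str, cases: list[CaseResult]) -> bool:
    """Return True if the checkpoint passes under the given policy."""
    policy = ALIASES.get(policy, policy)
    core = [c for c in cases if c.group == "core"]
    non_error = [c for c in cases if c.group != "error"]

    if policy == "any":
        return any(c.passed for c in cases)
    if policy == "all-cases":
        return all(c.passed for c in cases)
    if policy == "all-non-error-cases":
        return all(c.passed for c in non_error)
    if policy == "core-cases":
        return all(c.passed for c in core)
    if policy == "any-core-cases":
        return any(c.passed for c in core)
    raise ValueError(f"unknown pass policy: {policy}")
```

Note that in this sketch the "all" policies pass vacuously when the relevant group is empty (Python's `all` returns True for an empty sequence), and the "any" policies fail. Whether the actual command treats empty groups the same way is not stated in this page.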

Behavior

The eval command:

  1. Discovers all problem directories within RUN_DIR
  2. Skips problems whose checkpoints already have evaluation.json files (unless --overwrite is set)
  3. Re-evaluates a problem if its configuration has changed since the last evaluation
  4. Writes evaluation results to each checkpoint directory
  5. Generates a checkpoint_results.jsonl report at the run level

Auto-Skip Logic

When no --problem flags are specified and --overwrite is not set, the command automatically skips problems where:

  • All checkpoints have evaluation.json files
  • The problem configuration hasn't changed since evaluation

To force re-evaluation, use --overwrite or specify the problem explicitly with --problem.
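
The skip decision described in this section could be sketched roughly as follows. This is an illustration under stated assumptions, not the command's real code: `should_skip` is a hypothetical helper, every immediate subdirectory of a problem directory is assumed to be a checkpoint, and the configuration-change check is passed in as a boolean because this page does not describe how the change is detected.

```python
from pathlib import Path


def should_skip(
    problem_dir: Path,
    *,
    explicit: bool,        # problem was named with --problem
    overwrite: bool,       # --overwrite was given
    config_changed: bool,  # assumed to be computed elsewhere
) -> bool:
    """Return True if the problem can be skipped without re-evaluation."""
    # Explicit selection, --overwrite, or a changed config forces evaluation.
    if explicit or overwrite or config_changed:
        return False

    # Assumption: each immediate subdirectory is one checkpoint.
    checkpoints = [d for d in problem_dir.iterdir() if d.is_dir()]

    # Skip only when at least one checkpoint exists and all are evaluated.
    return bool(checkpoints) and all(
        (d / "evaluation.json").is_file() for d in checkpoints
    )
```

In this sketch a problem with no checkpoint directories is never skipped, which is a design choice of the example rather than documented behavior.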

Output Files

After evaluation, each checkpoint directory contains:

  • evaluation.json - Detailed evaluation results
  • Test case reports

At the run level:

  • checkpoint_results.jsonl - Consolidated report with one line per checkpoint
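
Because the run-level report has one line per checkpoint, it can be loaded with a few lines of Python. The sketch below assumes each line is a standalone JSON object (the usual JSON Lines convention) and makes no assumption about field names, which this page does not list. `load_checkpoint_results` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_checkpoint_results(run_dir: Path) -> list[dict]:
    """Parse checkpoint_results.jsonl into a list of records."""
    report = run_dir / "checkpoint_results.jsonl"
    records = []
    with report.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, `len(load_checkpoint_results(Path("outputs/my_run")))` would give the number of checkpoints in the report once an evaluation has been run.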

Examples

Basic evaluation:

slop-code eval outputs/claude_code_run_20251217

Evaluate with custom environment:

slop-code eval outputs/my_run -e configs/environments/docker-python3.12-uv.yaml

Force re-evaluation of all problems:

slop-code eval outputs/my_run --overwrite

Parallel evaluation with progress:

slop-code eval outputs/my_run --num-workers 8 --live-progress

See Also