eval-problem

January 6, 2026

Evaluate a single problem directory containing multiple checkpoints.

Quick Start

# Evaluate a problem directory
slop-code eval-problem outputs/my_run/file_backup

# With custom environment config
slop-code eval-problem outputs/my_run/file_backup -e configs/environments/docker-python3.12-uv.yaml

# With rubric grading
slop-code eval-problem outputs/my_run/file_backup \
  --rubric configs/rubrics/slop.jsonl \
  --rubric-model anthropic/sonnet-4.5

Usage

slop-code eval-problem [OPTIONS] SUBMISSION_PATH

Arguments

| Argument | Required | Description |
|---|---|---|
| SUBMISSION_PATH | Yes | Path to the problem directory |

Options

| Option | Type | Default | Description |
|---|---|---|---|
| -p, --problem-name | string | (dir name) | Name of the problem |
| -e, --env-config | path | ../environment.yaml | Path to environment configuration |
| --snapshot-dir | string | snapshot | Name of snapshot directory in checkpoints |
| --rubric | path | - | Path to rubric JSONL file for code quality grading |
| --rubric-model | string | - | Model ID for rubric grading (required if --rubric is set) |
| --rubric-temperature | float | 0.0 | Sampling temperature for rubric grading |
| --rubric-provider | enum | OPENROUTER | LLM provider for grading |
Rubric Provider Values

| Value | Description |
|---|---|
| OPENROUTER | Use OpenRouter API (default) |
| BEDROCK | Use AWS Bedrock API |
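
When the command is launched from a script, the option rules above can be checked before the process starts. The following is a minimal sketch, not part of slop-code: `build_eval_problem_cmd` is a hypothetical helper, and it assumes only the flags and provider values listed in the tables above.

```python
from typing import Optional

# Provider names copied from the "Rubric Provider Values" table.
VALID_PROVIDERS = {"OPENROUTER", "BEDROCK"}


def build_eval_problem_cmd(
    submission_path: str,
    env_config: Optional[str] = None,
    rubric: Optional[str] = None,
    rubric_model: Optional[str] = None,
    rubric_provider: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list for `slop-code eval-problem` (hypothetical helper)."""
    # The options table states that --rubric-model is required with --rubric.
    if rubric and not rubric_model:
        raise ValueError("--rubric-model is required when --rubric is set")
    if rubric_provider is not None and rubric_provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown rubric provider: {rubric_provider}")

    cmd = ["slop-code", "eval-problem", submission_path]
    if env_config:
        cmd += ["-e", env_config]
    if rubric:
        cmd += ["--rubric", rubric, "--rubric-model", rubric_model]
    if rubric_provider:
        cmd += ["--rubric-provider", rubric_provider]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a single string avoids shell quoting issues with paths that contain spaces.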

Behavior

The command:

  1. Loads the problem configuration from the problems directory
  2. Iterates through all checkpoint directories (checkpoint_1, checkpoint_2, etc.)
  3. Evaluates each checkpoint's snapshot against test cases
  4. Optionally runs rubric-based code quality grading
  5. Updates the problem-level report

Expected Directory Structure

SUBMISSION_PATH/
├── checkpoint_1/
│   └── snapshot/
│       └── <agent code>
├── checkpoint_2/
│   └── snapshot/
│       └── <agent code>
└── ...
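
Before starting an evaluation, it can be useful to confirm that a submission matches this layout. The sketch below is a hypothetical pre-flight check and not part of slop-code; its `snapshot_dir` parameter mirrors the `--snapshot-dir` option and its default value.

```python
from pathlib import Path


def missing_snapshots(submission_path: str, snapshot_dir: str = "snapshot") -> list[str]:
    """List checkpoint directories that lack the expected snapshot subdirectory."""
    root = Path(submission_path)
    missing = []
    for checkpoint in sorted(root.glob("checkpoint_*")):
        # Only directories count as checkpoints; stray files are ignored.
        if checkpoint.is_dir() and not (checkpoint / snapshot_dir).is_dir():
            missing.append(checkpoint.name)
    return missing
```

An empty result means every checkpoint directory contains the snapshot directory; otherwise the returned names point to the checkpoints that need attention, or suggest that a different `--snapshot-dir` value is required.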

Examples

Basic evaluation:

slop-code eval-problem outputs/my_run/file_backup

Specify problem name explicitly:

slop-code eval-problem outputs/renamed_dir -p file_backup

With rubric grading:

slop-code eval-problem outputs/my_run/file_backup \
  --rubric configs/rubrics/code_quality.jsonl \
  --rubric-model anthropic/sonnet-4.5 \
  --rubric-provider OPENROUTER

Custom snapshot directory:

slop-code eval-problem outputs/my_run/file_backup --snapshot-dir code

See Also