eval-problem
January 6, 2026 · View on GitHub
Evaluate a single problem directory containing multiple checkpoints.
Quick Start
# Evaluate a problem directory
slop-code eval-problem outputs/my_run/file_backup
# With custom environment config
slop-code eval-problem outputs/my_run/file_backup -e configs/environments/docker-python3.12-uv.yaml
# With rubric grading
slop-code eval-problem outputs/my_run/file_backup \
--rubric configs/rubrics/slop.jsonl \
--rubric-model anthropic/sonnet-4.5
Usage
slop-code eval-problem [OPTIONS] SUBMISSION_PATH
Arguments
| Argument | Required | Description |
|---|---|---|
SUBMISSION_PATH | Yes | Path to the problem directory |
Options
| Option | Type | Default | Description |
|---|---|---|---|
-p, --problem-name | string | (dir name) | Name of the problem |
-e, --env-config | path | ../environment.yaml | Path to environment configuration |
--snapshot-dir | string | snapshot | Name of snapshot directory in checkpoints |
--rubric | path | - | Path to rubric JSONL file for code quality grading |
--rubric-model | string | - | Model ID for rubric grading (required if --rubric is set) |
--rubric-temperature | float | 0.0 | Sampling temperature for rubric grading |
--rubric-provider | enum | OPENROUTER | LLM provider for grading |
Rubric Provider Values
| Value | Description |
|---|---|
OPENROUTER | Use OpenRouter API (default) |
BEDROCK | Use AWS Bedrock API |
Behavior
The command:
- Loads the problem configuration from the problems directory
- Iterates through all checkpoint directories (
checkpoint_1,checkpoint_2, etc.) - Evaluates each checkpoint's snapshot against test cases
- Optionally runs rubric-based code quality grading
- Updates the problem-level report
Directory Structure Expected
SUBMISSION_PATH/
├── checkpoint_1/
│ └── snapshot/
│ └── <agent code>
├── checkpoint_2/
│ └── snapshot/
│ └── <agent code>
└── ...
Examples
Basic evaluation:
slop-code eval-problem outputs/my_run/file_backup
Specify problem name explicitly:
slop-code eval-problem outputs/renamed_dir -p file_backup
With rubric grading:
slop-code eval-problem outputs/my_run/file_backup \
--rubric configs/rubrics/code_quality.jsonl \
--rubric-model claude-sonnet-4-20250514 \
--rubric-provider ANTHROPIC
Custom snapshot directory:
slop-code eval-problem outputs/my_run/file_backup --snapshot-dir code
See Also
- eval - Evaluate a full run directory
- eval-snapshot - Evaluate a single checkpoint snapshot
- metrics judge - Batch rubric grading