utils
April 24, 2026
Utility commands for maintenance and data processing.
Subcommands
| Command | Description |
|---|---|
| repopulate-diffs | Regenerate diff.json files |
| backfill-reports | Backfill checkpoint reports |
| backfill-categories | Backfill rubric categories |
| compress-artifacts | Compress agent artifacts |
| combine-results | Combine results from multiple runs |
| render-prompts | Render prompt templates |
| migrate-eval-format | Migrate evaluation.json format |
| make-registry | Generate problem registry JSONL |
repopulate-diffs
Regenerate diff.json files for checkpoint snapshots.
Usage
slop-code utils repopulate-diffs [OPTIONS] RUN_DIR
Arguments
| Argument | Required | Description |
|---|---|---|
| RUN_DIR | Yes | Path to the run directory |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -p, --problem-name | string | - | Filter to a specific problem name |
Behavior
For each checkpoint, generates a diff between its snapshot directory and the previous checkpoint's snapshot directory. The first checkpoint is compared against an empty snapshot, so all of its files are marked as created.
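The comparison described above can be illustrated with a minimal sketch. It is not the tool's implementation: the function names `diff_snapshots` and `_read_files` are hypothetical, files are compared byte for byte, and the line-level added/removed counts are left out.

```python
from pathlib import Path
from typing import Optional


def _read_files(root: Optional[Path]) -> dict[str, bytes]:
    """Map each file path (relative to root) to its bytes; None means an empty snapshot."""
    if root is None:
        return {}
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(previous: Optional[Path], current: Path) -> dict[str, list[str]]:
    """Classify files as created, modified, or deleted between two snapshots (hypothetical helper)."""
    old, new = _read_files(previous), _read_files(current)
    return {
        "created": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }
```

Passing `None` as `previous` models the first checkpoint, where every file in the snapshot ends up under `created`.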
Output
Displays a summary table with:
- Problem name
- Checkpoint name
- Files created/modified/deleted
- Lines added/removed
Examples
# Regenerate diffs for all problems in a run
slop-code utils repopulate-diffs outputs/my_run
# Regenerate diffs for a specific problem
slop-code utils repopulate-diffs outputs/my_run -p file_backup
backfill-reports
Generate checkpoint reports for all problems in a results directory.
Quick Start
# Single run
slop-code utils backfill-reports outputs/my_run
# Collection of runs
slop-code utils backfill-reports outputs/all_runs --type collection
Usage
slop-code utils backfill-reports [OPTIONS] RESULTS_DIR
Arguments
| Argument | Required | Description |
|---|---|---|
| RESULTS_DIR | Yes | Path to results directory or collection |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -t, --type | enum | run | Path type: run or collection |
Behavior
- Scans results directory for problem runs
- Loads problem configuration for each
- Generates report entries for each checkpoint
- Updates checkpoint_results.jsonl
- Backfills AST-grep category data
- Generates result.json summary
Examples
# Backfill single run
slop-code utils backfill-reports outputs/my_run
# Backfill all runs in collection
slop-code utils backfill-reports outputs/all_runs --type collection
backfill-categories
Backfill category information into existing rubric.jsonl files.
Usage
slop-code utils backfill-categories [OPTIONS] RESULTS_DIR
Arguments
| Argument | Required | Description |
|---|---|---|
| RESULTS_DIR | Yes | Path to run directory or collection directory |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -r, --rubric | path | required | Path to rubric JSONL file with category definitions |
| -t, --type | enum | run | Path type: run or collection |
Behavior
Reads rubric definitions from the specified JSONL file and adds the category
field to each grade in rubric.jsonl files based on matching the criteria
field to the rubric item name. Grades that already have a category field
are not modified.
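As a rough illustration of the matching rule, the sketch below works on in-memory data. The function name `backfill_categories` and the `name_to_category` mapping (rubric item name to category) are assumptions made for the example; only the `criteria` and `category` field names come from the description above.

```python
def backfill_categories(grades: list[dict], name_to_category: dict[str, str]) -> int:
    """Add a category to each grade that lacks one; return how many grades were updated."""
    updated = 0
    for grade in grades:
        if "category" in grade:
            continue  # grades that already have a category are left untouched
        category = name_to_category.get(grade.get("criteria"))
        if category is not None:
            grade["category"] = category
            updated += 1
    return updated
```

Grades whose `criteria` value matches no rubric item are skipped in this sketch; how the real command reports such cases is not covered here.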
Examples
# Backfill categories for a single run
slop-code utils backfill-categories \
--rubric configs/rubrics/llm_judge.jsonl \
outputs/my_run
# Backfill categories for a collection of runs
slop-code utils backfill-categories \
--rubric configs/rubrics/llm_judge.jsonl \
--type collection \
outputs/
compress-artifacts
Compress agent artifact directories into tar.gz files.
Usage
slop-code utils compress-artifacts [OPTIONS] RESULTS_DIR
Arguments
| Argument | Required | Description |
|---|---|---|
| RESULTS_DIR | Yes | Path to results directory or collection directory |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -t, --type | enum | run | Path type: run or collection |
| -n, --dry-run | flag | false | Show what would be compressed without doing it |
Behavior
Finds all uncompressed agent directories (named agent) within checkpoint
directories, compresses them into agent.tar.gz, and removes the original
directories.
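The following sketch shows one way the find, compress, and remove steps could look using only the Python standard library. The function name `compress_agent_dirs` and the recursive search are assumptions for illustration, not the command's actual code.

```python
import shutil
import tarfile
from pathlib import Path


def compress_agent_dirs(root: Path, dry_run: bool = False) -> list[Path]:
    """Compress every directory named `agent` under root into a sibling agent.tar.gz."""
    found = sorted(p for p in root.rglob("agent") if p.is_dir())
    for agent_dir in found:
        # Skip in dry-run mode, or if a parent `agent` directory was already removed.
        if dry_run or not agent_dir.is_dir():
            continue
        archive = agent_dir.with_name("agent.tar.gz")
        with tarfile.open(str(archive), "w:gz") as tar:
            tar.add(str(agent_dir), arcname="agent")
        shutil.rmtree(agent_dir)  # remove the original only after the archive is written
    return found
```

With `dry_run=True` the function only reports what it found, which mirrors the intent of the --dry-run flag.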
Examples
# Compress artifacts in a single run
slop-code utils compress-artifacts outputs/my_run
# Compress artifacts across all runs in a collection
slop-code utils compress-artifacts outputs --type collection
# Preview what would be compressed
slop-code utils compress-artifacts outputs/my_run --dry-run
combine-results
Combine checkpoint_results.jsonl from multiple runs into one JSONL file with run metadata.
Quick Start
slop-code utils combine-results outputs/runs -o outputs/combined.jsonl
Usage
slop-code utils combine-results [OPTIONS] RUNS_DIR
Arguments
| Argument | Required | Description |
|---|---|---|
| RUNS_DIR | Yes | Path to run directory or parent of runs |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -o, --output | path | <RUNS_DIR>/combined_checkpoint_results.jsonl | Output file path |
| -w, --overwrite | flag | false | Overwrite if output exists |
Behavior
- Discovers all run directories
- Loads checkpoint_results.jsonl from each
- Attaches run-level metadata to each record:
  - run_name, run_dir, run_relative_path
  - model_name, model_provider
  - prompt_name, thinking_level
  - agent_type, agent_version
  - environment_name, environment_type
- Writes combined JSONL file
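The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a run directory is taken to be any directory containing checkpoint_results.jsonl, the function name `combine_results` is hypothetical, and only the three path-derived fields are attached. The model, prompt, agent, and environment fields would come from each run's configuration, which is not shown here.

```python
import json
from pathlib import Path


def combine_results(runs_dir: Path, output: Path, overwrite: bool = False) -> int:
    """Merge every checkpoint_results.jsonl under runs_dir into one JSONL file."""
    if output.exists() and not overwrite:
        raise FileExistsError(output)
    result_files = sorted(runs_dir.rglob("checkpoint_results.jsonl"))
    count = 0
    with output.open("w", encoding="utf-8") as out:
        for results_file in result_files:
            run_dir = results_file.parent
            meta = {
                "run_name": run_dir.name,
                "run_dir": str(run_dir),
                "run_relative_path": run_dir.relative_to(runs_dir).as_posix(),
            }
            for line in results_file.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    # Run-level metadata is merged into each checkpoint record.
                    out.write(json.dumps({**json.loads(line), **meta}) + "\n")
                    count += 1
    return count
```

Refusing to write when the output exists and `overwrite` is false mirrors the default of the --overwrite flag.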
Examples
# Combine with default output
slop-code utils combine-results outputs/runs
# Custom output location
slop-code utils combine-results outputs/runs -o analysis/all_results.jsonl
# Overwrite existing
slop-code utils combine-results outputs/runs -o outputs/combined.jsonl --overwrite
render-prompts
Render prompt templates for problems to an output directory.
Usage
slop-code utils render-prompts [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -p, --prompt | string | required | Prompt template name or path |
| -e, --environment | string | required | Environment config name or path |
| -o, --output-dir | path | required | Output directory for rendered prompts |
| --problem | string | - | Problem name(s) to render (repeatable) |
Behavior
For each problem, iterates through all checkpoints and renders the prompt template with the checkpoint's spec text and appropriate context.
Output structure:
output_dir/
  problem_name/
    part_1.md
    part_2.md
    ...
Examples
# Render prompts for all problems
slop-code utils render-prompts \
-p just-solve \
-e docker-python3.12-uv \
-o outputs/rendered_prompts
# Render for specific problems
slop-code utils render-prompts \
-p just-solve \
-e docker-python3.12-uv \
-o outputs/rendered_prompts \
--problem file_backup \
--problem etl_pipeline
migrate-eval-format
Migrate evaluation.json files from the old list-based tests format to the new grouped format.
Usage
slop-code utils migrate-eval-format [OPTIONS] RESULTS_DIR
Arguments
| Argument | Required | Description |
|---|---|---|
| RESULTS_DIR | Yes | Path to results directory or collection directory |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -t, --type | enum | run | Path type: run for single run, collection for multiple runs |
| -n, --dry-run | flag | false | Show what would be migrated without making changes |
Behavior
Converts the old tests format (list of test objects) to the new grouped format (dict with checkpoint-GroupType keys):
Old format:
"tests": [
  {"id": "test_foo", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "passed"},
  {"id": "test_bar", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "failed"}
]
New format:
"tests": {
  "checkpoint_1-Core": {
    "passed": ["test_foo"],
    "failed": ["test_bar"]
  }
}
When run on a collection, processes all discovered run directories in sequence.
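The conversion between the two formats can be sketched in a few lines. The function name `migrate_tests` is hypothetical, and treating a value that is already a dict as migrated is an assumption made for the example.

```python
def migrate_tests(tests):
    """Convert the old list-based `tests` value into the grouped dict format."""
    if isinstance(tests, dict):
        return tests  # assumed to be in the new format already
    grouped: dict[str, dict[str, list[str]]] = {}
    for test in tests:
        # Keys follow the checkpoint-GroupType pattern, e.g. "checkpoint_1-Core".
        key = f"{test['checkpoint']}-{test['group_type']}"
        grouped.setdefault(key, {}).setdefault(test["status"], []).append(test["id"])
    return grouped
```

Applied to the old-format example above, this yields a single `checkpoint_1-Core` group with `test_foo` under `passed` and `test_bar` under `failed`.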
Output
Displays a summary table with:
- Number of files found
- Number of files migrated
- Number of files already migrated
- Number of failed migrations
Examples
# Migrate a single run directory
slop-code utils migrate-eval-format outputs/my_run
# Preview migration on a collection (dry run)
slop-code utils migrate-eval-format --type collection --dry-run outputs/
# Migrate all runs in a collection
slop-code utils migrate-eval-format --type collection outputs/
make-registry
Generate a registry JSONL file containing metadata for all problems. This is used by the slopcodebench.github.io website.
Usage
slop-code utils make-registry [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| -o, --output | path | required | Output JSONL file path |
| --problem | string | - | Problem name(s) to include (repeatable) |
Behavior
Scans the problems directory and generates a registry entry for each problem. Each line in the output file is a complete JSON object containing:
- Problem name, tags, version, author, category
- Difficulty, description, entry point, adapter type
- List of all checkpoints with their specs
When --problem is specified, only those problems are included in the registry. This is useful for testing specific problems before generating the full registry.
Output Format
Each line is a JSON object with structure:
{
  "problem_name": "file_backup",
  "tags": ["filesystem", "backup"],
  "version": 1,
  "author": "author_name",
  "category": "category_name",
  "entry_point": "main.py",
  "adapter_type": "cli",
  "difficulty": "medium",
  "description": "Problem description",
  "checkpoints": [
    {
      "checkpoint_name": "checkpoint_1",
      "spec": "Full spec text...",
      "version": 1,
      "state": "Core Tests"
    }
  ]
}
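A consumer of the registry might read it as in the sketch below, which assumes one JSON object per line as described above. The helper names `load_registry` and `checkpoints_per_problem` are hypothetical; the field names are the ones shown in the structure above.

```python
import json
from pathlib import Path


def load_registry(path: Path) -> list[dict]:
    """Parse a registry JSONL file: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def checkpoints_per_problem(entries: list[dict]) -> dict[str, int]:
    """Map each problem name to its number of checkpoints."""
    return {entry["problem_name"]: len(entry["checkpoints"]) for entry in entries}
```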
Examples
# Generate registry for all problems
slop-code utils make-registry -o registry.jsonl
# Generate registry for specific problems only
slop-code utils make-registry -o registry.jsonl --problem file_backup --problem code_search
# Generate to custom location
slop-code utils make-registry --output /path/to/registry.jsonl