utils

April 24, 2026

Utility commands for maintenance and data processing.

Subcommands

Command	Description
`repopulate-diffs`	Regenerate diff.json files
`backfill-reports`	Backfill checkpoint reports
`backfill-categories`	Backfill rubric categories
`compress-artifacts`	Compress agent artifacts
`combine-results`	Combine results from multiple runs
`render-prompts`	Render prompt templates
`migrate-eval-format`	Migrate evaluation.json format
`make-registry`	Generate problem registry JSONL

repopulate-diffs

Regenerate diff.json files for checkpoint snapshots.

Usage

slop-code utils repopulate-diffs [OPTIONS] RUN_DIR

Arguments

Argument	Required	Description
`RUN_DIR`	Yes	Path to the run directory

Options

Option	Type	Default	Description
`-p, --problem-name`	string	-	Filter to a specific problem name

Behavior

For each checkpoint, generates a diff between its snapshot directory and the previous checkpoint's snapshot directory. The first checkpoint is compared against an empty snapshot, so all of its files are marked as created.
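
As a rough illustration of the comparison described above, the following Python sketch diffs two snapshot directories and treats a missing previous snapshot as empty. The function names (`read_snapshot`, `line_changes`, `diff_snapshots`) and the returned dictionary layout are hypothetical. This is not the tool's actual implementation, and the real diff.json schema is not specified here.

```python
from difflib import SequenceMatcher
from pathlib import Path
from typing import Optional


def read_snapshot(root: Optional[Path]) -> dict:
    """Map relative file paths to their lines; None stands for the empty snapshot."""
    if root is None:
        return {}
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            files[path.relative_to(root).as_posix()] = text.splitlines()
    return files


def line_changes(old: list, new: list) -> tuple:
    """Return (lines_added, lines_removed) between two versions of a file."""
    added = removed = 0
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return added, removed


def diff_snapshots(prev: Optional[Path], curr: Path) -> dict:
    """Summarize file-level and line-level changes from prev to curr."""
    old, new = read_snapshot(prev), read_snapshot(curr)
    summary = {"created": [], "modified": [], "deleted": [],
               "lines_added": 0, "lines_removed": 0}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue  # unchanged file
        if before is None:
            summary["created"].append(name)
        elif after is None:
            summary["deleted"].append(name)
        else:
            summary["modified"].append(name)
        added, removed = line_changes(before or [], after or [])
        summary["lines_added"] += added
        summary["lines_removed"] += removed
    return summary
```

Calling `diff_snapshots(None, first_snapshot)` models the first-checkpoint case: every file lands in `created` and every line counts as added.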

Output

Displays a summary table with:

- Problem name
- Checkpoint name
- Files created/modified/deleted
- Lines added/removed

Examples

# Regenerate diffs for all problems in a run
slop-code utils repopulate-diffs outputs/my_run

# Regenerate diffs for a specific problem
slop-code utils repopulate-diffs outputs/my_run -p file_backup

backfill-reports

Generate checkpoint reports for all problems in a results directory.

Quick Start

# Single run
slop-code utils backfill-reports outputs/my_run

# Collection of runs
slop-code utils backfill-reports outputs/all_runs --type collection

Usage

slop-code utils backfill-reports [OPTIONS] RESULTS_DIR

Arguments

Argument	Required	Description
`RESULTS_DIR`	Yes	Path to results directory or collection

Options

Option	Type	Default	Description
`-t, --type`	enum	`run`	Path type: `run` or `collection`

Behavior

1. Scans the results directory for problem runs
2. Loads the problem configuration for each
3. Generates report entries for each checkpoint
4. Updates checkpoint_results.jsonl
5. Backfills AST-grep category data
6. Generates the result.json summary

Examples

# Backfill single run
slop-code utils backfill-reports outputs/my_run

# Backfill all runs in collection
slop-code utils backfill-reports outputs/all_runs --type collection

backfill-categories

Backfill category information into existing rubric.jsonl files.

Usage

slop-code utils backfill-categories [OPTIONS] RESULTS_DIR

Arguments

Argument	Required	Description
`RESULTS_DIR`	Yes	Path to run directory or collection directory

Options

Option	Type	Default	Description
`-r, --rubric`	path	required	Path to rubric JSONL file with category definitions
`-t, --type`	enum	`run`	Path type: `run` or `collection`

Reads rubric definitions from the specified JSONL file and adds a category field to each grade in the rubric.jsonl files by matching the grade's criteria field against the rubric item name. Grades that already have a category field are left unchanged.
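
The matching rule above can be sketched in Python as follows. The `criteria` and `category` field names come from the description. The rubric key `name`, the function names, and the choice to leave unmatched grades untouched are assumptions made for illustration, not the tool's confirmed behavior.

```python
import json
from pathlib import Path


def load_categories(rubric_path: Path) -> dict:
    """Build a lookup from rubric item name to category (key names assumed)."""
    categories = {}
    with rubric_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                item = json.loads(line)
                categories[item["name"]] = item["category"]
    return categories


def backfill_categories(grades: list, categories: dict) -> int:
    """Add a category to each grade that lacks one; return how many were updated."""
    updated = 0
    for grade in grades:
        if "category" in grade:
            continue  # existing categories are never overwritten
        category = categories.get(grade.get("criteria"))
        if category is not None:
            grade["category"] = category
            updated += 1
    return updated
```

Returning the number of updated grades makes it easy to report how much of a file actually changed.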

Examples

# Backfill categories for a single run
slop-code utils backfill-categories \
  --rubric configs/rubrics/llm_judge.jsonl \
  outputs/my_run

# Backfill categories for a collection of runs
slop-code utils backfill-categories \
  --rubric configs/rubrics/llm_judge.jsonl \
  --type collection \
  outputs/

compress-artifacts

Compress agent artifact directories into tar.gz files.

Usage

slop-code utils compress-artifacts [OPTIONS] RESULTS_DIR

Arguments

Argument	Required	Description
`RESULTS_DIR`	Yes	Path to results directory or collection directory

Options

Option	Type	Default	Description
`-t, --type`	enum	`run`	Path type: `run` or `collection`
`-n, --dry-run`	flag	false	Show what would be compressed without doing it

Behavior

Finds all uncompressed agent directories (named agent) within checkpoint directories, compresses them into agent.tar.gz, and removes the original directories.
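
A minimal Python sketch of this find, compress, and remove cycle is shown below, using only the standard library. The function name `compress_agent_dirs` and the recursive search strategy are hypothetical. The actual command may locate checkpoint directories differently.

```python
import shutil
import tarfile
from pathlib import Path


def compress_agent_dirs(results_dir: Path, dry_run: bool = False) -> list:
    """Compress every directory named 'agent' into a sibling agent.tar.gz.

    Returns the directories that were (or, in a dry run, would be) compressed.
    """
    found = sorted(
        p for p in results_dir.rglob("agent")
        # Skip 'agent' directories nested inside another 'agent' directory;
        # they are already captured by the outer archive.
        if p.is_dir() and "agent" not in p.relative_to(results_dir).parts[:-1]
    )
    if dry_run:
        return found
    for agent_dir in found:
        archive = agent_dir.with_name("agent.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(agent_dir, arcname="agent")
        # Remove the original only after the archive has been written and closed.
        shutil.rmtree(agent_dir)
    return found
```

Deleting the directory only after the archive is closed means a failure during compression leaves the original data in place.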

Examples

# Compress artifacts in a single run
slop-code utils compress-artifacts outputs/my_run

# Compress artifacts across all runs in a collection
slop-code utils compress-artifacts outputs --type collection

# Preview what would be compressed
slop-code utils compress-artifacts outputs/my_run --dry-run

combine-results

Combine checkpoint_results.jsonl from multiple runs into one JSONL file with run metadata.

Quick Start

slop-code utils combine-results outputs/runs -o outputs/combined.jsonl

Usage

slop-code utils combine-results [OPTIONS] RUNS_DIR

Arguments

Argument	Required	Description
`RUNS_DIR`	Yes	Path to run directory or parent of runs

Options

Option	Type	Default	Description
`-o, --output`	path	`<RUNS_DIR>/combined_checkpoint_results.jsonl`	Output file path
`-w, --overwrite`	flag	false	Overwrite if output exists

Behavior

1. Discovers all run directories
2. Loads checkpoint_results.jsonl from each
3. Attaches run-level metadata to each record:
   - run_name, run_dir, run_relative_path
   - model_name, model_provider
   - prompt_name, thinking_level
   - agent_type, agent_version
   - environment_name, environment_type
4. Writes the combined JSONL file
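
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the command's implementation: the function name `combine_results` is hypothetical, run directories are assumed to be any directories containing a checkpoint_results.jsonl, and only the path-derived metadata (`run_name`, `run_dir`, `run_relative_path`) is attached, because where the model, prompt, agent, and environment fields are stored is not documented here.

```python
import json
from pathlib import Path


def combine_results(runs_dir: Path, output: Path, overwrite: bool = False) -> int:
    """Merge per-run checkpoint_results.jsonl files; return the record count."""
    if output.exists() and not overwrite:
        raise FileExistsError(f"{output} exists; pass overwrite=True to replace it")
    records = []
    for results_file in sorted(runs_dir.rglob("checkpoint_results.jsonl")):
        run_dir = results_file.parent
        # Path-derived metadata only; other run-level fields are omitted (assumption).
        meta = {
            "run_name": run_dir.name,
            "run_dir": str(run_dir),
            "run_relative_path": run_dir.relative_to(runs_dir).as_posix(),
        }
        with results_file.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    records.append({**json.loads(line), **meta})
    with output.open("w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    return len(records)
```

Refusing to write when the output exists mirrors the documented `--overwrite` flag.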

Examples

# Combine with default output
slop-code utils combine-results outputs/runs

# Custom output location
slop-code utils combine-results outputs/runs -o analysis/all_results.jsonl

# Overwrite existing
slop-code utils combine-results outputs/runs -o outputs/combined.jsonl --overwrite

render-prompts

Render prompt templates for problems to an output directory.

Usage

slop-code utils render-prompts [OPTIONS]

Options

Option	Type	Default	Description
`-p, --prompt`	string	required	Prompt template name or path
`-e, --environment`	string	required	Environment config name or path
`-o, --output-dir`	path	required	Output directory for rendered prompts
`--problem`	string	-	Problem name(s) to render (repeatable)

Behavior

For each problem, iterates through all checkpoints and renders the prompt template with the checkpoint's spec text and appropriate context.

Output structure:

output_dir/
    problem_name/
        part_1.md
        part_2.md
        ...

Examples

# Render prompts for all problems
slop-code utils render-prompts \
  -p just-solve \
  -e docker-python3.12-uv \
  -o outputs/rendered_prompts

# Render for specific problems
slop-code utils render-prompts \
  -p just-solve \
  -e docker-python3.12-uv \
  -o outputs/rendered_prompts \
  --problem file_backup \
  --problem etl_pipeline

migrate-eval-format

Migrate evaluation.json files from the old list-based tests format to the new grouped format.

Usage

slop-code utils migrate-eval-format [OPTIONS] RESULTS_DIR

Arguments

Argument	Required	Description
`RESULTS_DIR`	Yes	Path to results directory or collection directory

Options

Option	Type	Default	Description
`-t, --type`	enum	`run`	Path type: `run` for single run, `collection` for multiple runs
`-n, --dry-run`	flag	false	Show what would be migrated without making changes

Behavior

Converts the old tests format (list of test objects) to the new grouped format (dict with checkpoint-GroupType keys):

Old format:

"tests": [
  {"id": "test_foo", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "passed"},
  {"id": "test_bar", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "failed"}
]

New format:

"tests": {
  "checkpoint_1-Core": {
    "passed": ["test_foo"],
    "failed": ["test_bar"]
  }
}

When run on a collection, processes all discovered run directories in sequence.
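
The conversion can be illustrated with a short Python sketch. The function name `migrate_tests` is hypothetical, and treating an existing dict as already migrated is an assumption. Whether the real tool also emits empty status lists for statuses with no tests is not stated, so this sketch creates a status key only when at least one test has that status.

```python
def migrate_tests(tests):
    """Convert the old list-based tests value to the grouped dict format."""
    if isinstance(tests, dict):
        return tests  # assumed to be in the new format already
    grouped = {}
    for test in tests:
        # Keys follow the documented "<checkpoint>-<GroupType>" pattern.
        key = f"{test['checkpoint']}-{test['group_type']}"
        grouped.setdefault(key, {}).setdefault(test["status"], []).append(test["id"])
    return grouped


old_tests = [
    {"id": "test_foo", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "passed"},
    {"id": "test_bar", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "failed"},
]
new_tests = migrate_tests(old_tests)
# new_tests == {"checkpoint_1-Core": {"passed": ["test_foo"], "failed": ["test_bar"]}}
```

Because already-grouped input is returned unchanged, running the function twice on the same data is harmless.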

Output

Displays a summary table with:

- Number of files found
- Number of files migrated
- Number of files already migrated
- Number of failed migrations

Examples

# Migrate a single run directory
slop-code utils migrate-eval-format outputs/my_run

# Preview migration on a collection (dry run)
slop-code utils migrate-eval-format --type collection --dry-run outputs/

# Migrate all runs in a collection
slop-code utils migrate-eval-format --type collection outputs/

make-registry

Generate a registry JSONL file containing metadata for all problems. This is used by the slopcodebench.github.io website.

Usage

slop-code utils make-registry [OPTIONS]

Options

Option	Type	Default	Description
`-o, --output`	path	required	Output JSONL file path
`--problem`	string	-	Problem name(s) to include (repeatable)

Behavior

Scans the problems directory and generates a registry entry for each problem. Each line in the output file is a complete JSON object containing:

- Problem name, tags, version, author, category
- Difficulty, description, entry point, adapter type
- List of all checkpoints with their specs

When --problem is specified, only those problems are included in the registry. This is useful for testing specific problems before generating the full registry.

Output Format

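Each line is a JSON object with the following structure (pretty-printed here for readability; in the file, each object occupies a single line):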

{
  "problem_name": "file_backup",
  "tags": ["filesystem", "backup"],
  "version": 1,
  "author": "author_name",
  "category": "category_name",
  "entry_point": "main.py",
  "adapter_type": "cli",
  "difficulty": "medium",
  "description": "Problem description",
  "checkpoints": [
    {
      "checkpoint_name": "checkpoint_1",
      "spec": "Full spec text...",
      "version": 1,
      "state": "Core Tests"
    }
  ]
}
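
For consumers of the registry, a minimal Python sketch for reading the file back is shown below. The helper names `load_registry` and `checkpoints_by_problem` are hypothetical. The field names are taken from the structure above.

```python
import json
from pathlib import Path


def load_registry(path: Path) -> list:
    """Parse a registry JSONL file into a list of problem entries."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def checkpoints_by_problem(entries: list) -> dict:
    """Map each problem name to the names of its checkpoints, in file order."""
    return {
        entry["problem_name"]: [cp["checkpoint_name"] for cp in entry["checkpoints"]]
        for entry in entries
    }
```

For example, `checkpoints_by_problem(load_registry(Path("registry.jsonl")))` gives a quick overview of what a generated registry contains.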

Examples

# Generate registry for all problems
slop-code utils make-registry -o registry.jsonl

# Generate registry for specific problems only
slop-code utils make-registry -o registry.jsonl --problem file_backup --problem code_search

# Generate to custom location
slop-code utils make-registry --output /path/to/registry.jsonl