utils

April 24, 2026

Utility commands for maintenance and data processing.

Subcommands

| Command | Description |
| --- | --- |
| repopulate-diffs | Regenerate diff.json files |
| backfill-reports | Backfill checkpoint reports |
| backfill-categories | Backfill rubric categories |
| compress-artifacts | Compress agent artifacts |
| combine-results | Combine results from multiple runs |
| render-prompts | Render prompt templates |
| migrate-eval-format | Migrate evaluation.json format |
| make-registry | Generate problem registry JSONL |

repopulate-diffs

Regenerate diff.json files for checkpoint snapshots.

Usage

slop-code utils repopulate-diffs [OPTIONS] RUN_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RUN_DIR | Yes | Path to the run directory |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -p, --problem-name | string | - | Filter to a specific problem name |

Behavior

For each checkpoint, the command generates a diff against the previous checkpoint's snapshot directory. The first checkpoint is compared against an empty snapshot, so all of its files are marked as created.
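As a rough illustration of the comparison described above, the following Python sketch diffs two snapshot directories and treats a missing previous snapshot as empty. It is not the tool's actual implementation: the function names `read_snapshot` and `diff_snapshots`, the summary keys, and the use of `difflib` are assumptions made for this example, and the real diff.json schema may differ.

```python
import difflib
from pathlib import Path
from typing import Optional


def read_snapshot(root: Optional[Path]) -> dict:
    """Map each relative file path to its lines; None means an empty snapshot."""
    if root is None:
        return {}
    return {
        p.relative_to(root).as_posix(): p.read_text(errors="replace").splitlines()
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_snapshots(prev: Optional[Path], curr: Path) -> dict:
    """Summarize file and line changes between two snapshot directories."""
    old, new = read_snapshot(prev), read_snapshot(curr)
    # Hypothetical summary layout, chosen to mirror the summary table columns.
    summary = {"created": [], "modified": [], "deleted": [],
               "lines_added": 0, "lines_removed": 0}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name, []), new.get(name, [])
        if name not in old:
            summary["created"].append(name)
        elif name not in new:
            summary["deleted"].append(name)
        elif before != after:
            summary["modified"].append(name)
        else:
            continue  # unchanged file
        # ndiff prefixes added lines with "+ " and removed lines with "- ".
        for line in difflib.ndiff(before, after):
            if line.startswith("+ "):
                summary["lines_added"] += 1
            elif line.startswith("- "):
                summary["lines_removed"] += 1
    return summary
```

Calling `diff_snapshots(None, first_snapshot)` reports every file as created, which matches the behavior described for the first checkpoint.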

Output

Displays a summary table with:

  • Problem name
  • Checkpoint name
  • Files created/modified/deleted
  • Lines added/removed

Examples

# Regenerate diffs for all problems in a run
slop-code utils repopulate-diffs outputs/my_run

# Regenerate diffs for a specific problem
slop-code utils repopulate-diffs outputs/my_run -p file_backup

backfill-reports

Generate checkpoint reports for all problems in a results directory.

Quick Start

# Single run
slop-code utils backfill-reports outputs/my_run

# Collection of runs
slop-code utils backfill-reports outputs/all_runs --type collection

Usage

slop-code utils backfill-reports [OPTIONS] RESULTS_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RESULTS_DIR | Yes | Path to results directory or collection |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -t, --type | enum | run | Path type: run or collection |

Behavior

  1. Scans results directory for problem runs
  2. Loads problem configuration for each
  3. Generates report entries for each checkpoint
  4. Updates checkpoint_results.jsonl
  5. Backfills AST-grep category data
  6. Generates result.json summary

Examples

# Backfill single run
slop-code utils backfill-reports outputs/my_run

# Backfill all runs in collection
slop-code utils backfill-reports outputs/all_runs --type collection

backfill-categories

Backfill category information into existing rubric.jsonl files.

Usage

slop-code utils backfill-categories [OPTIONS] RESULTS_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RESULTS_DIR | Yes | Path to run directory or collection directory |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -r, --rubric | path | required | Path to rubric JSONL file with category definitions |
| -t, --type | enum | run | Path type: run or collection |

Behavior

Reads rubric definitions from the specified JSONL file and adds a category field to each grade in the rubric.jsonl files. Each grade's criteria field is matched against the rubric item name to find its category. Grades that already have a category field are not modified.
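The matching rule can be sketched in a few lines of Python. This is only an illustration: the `criteria` and `category` fields come from the description above, while the `name` key on rubric items and the helper names `load_categories` and `backfill_categories` are assumptions, not the tool's confirmed schema or API.

```python
import json


def load_categories(rubric_lines):
    """Map rubric item name -> category from the lines of a rubric JSONL file."""
    categories = {}
    for line in rubric_lines:
        if line.strip():
            item = json.loads(line)
            # Assumption: each rubric item stores its name under "name".
            categories[item["name"]] = item["category"]
    return categories


def backfill_categories(grades, categories):
    """Add a category to grades that lack one; return how many were updated."""
    updated = 0
    for grade in grades:
        if "category" in grade:
            continue  # existing categories are left untouched
        category = categories.get(grade.get("criteria"))
        if category is not None:
            grade["category"] = category
            updated += 1
    return updated
```

In this sketch, grades whose criteria match no rubric item are left without a category rather than given a placeholder value. How the real command treats them is not stated here.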

Examples

# Backfill categories for a single run
slop-code utils backfill-categories \
  --rubric configs/rubrics/llm_judge.jsonl \
  outputs/my_run

# Backfill categories for a collection of runs
slop-code utils backfill-categories \
  --rubric configs/rubrics/llm_judge.jsonl \
  --type collection \
  outputs/

compress-artifacts

Compress agent artifact directories into tar.gz files.

Usage

slop-code utils compress-artifacts [OPTIONS] RESULTS_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RESULTS_DIR | Yes | Path to results directory or collection directory |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -t, --type | enum | run | Path type: run or collection |
| -n, --dry-run | flag | false | Show what would be compressed without doing it |

Behavior

Finds all uncompressed agent directories (named agent) within checkpoint directories, compresses them into agent.tar.gz, and removes the original directories.
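A minimal Python sketch of this behavior, using only the standard library, might look like the following. The function name `compress_agent_dirs` is hypothetical. Skipping directories that already have an `agent.tar.gz` next to them is also an assumption about how "uncompressed" is determined.

```python
import shutil
import tarfile
from pathlib import Path


def compress_agent_dirs(results_dir: Path, dry_run: bool = False) -> list:
    """Compress every directory named "agent" into a sibling agent.tar.gz."""
    handled = []
    # sorted() materializes the matches before any directory is removed.
    for agent_dir in sorted(results_dir.rglob("agent")):
        if not agent_dir.is_dir():
            continue
        archive = agent_dir.with_name("agent.tar.gz")
        if archive.exists():
            continue  # assumption: an existing archive means already compressed
        handled.append(agent_dir)
        if dry_run:
            continue  # report only, leave the directory in place
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(agent_dir, arcname="agent")
        shutil.rmtree(agent_dir)  # remove the original only after archiving
    return handled
```

With `dry_run=True` the function returns the directories it would compress and changes nothing on disk, mirroring the `--dry-run` flag.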

Examples

# Compress artifacts in a single run
slop-code utils compress-artifacts outputs/my_run

# Compress artifacts across all runs in a collection
slop-code utils compress-artifacts outputs --type collection

# Preview what would be compressed
slop-code utils compress-artifacts outputs/my_run --dry-run

combine-results

Combine checkpoint_results.jsonl from multiple runs into one JSONL file with run metadata.

Quick Start

slop-code utils combine-results outputs/runs -o outputs/combined.jsonl

Usage

slop-code utils combine-results [OPTIONS] RUNS_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RUNS_DIR | Yes | Path to run directory or parent of runs |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -o, --output | path | <RUNS_DIR>/combined_checkpoint_results.jsonl | Output file path |
| -w, --overwrite | flag | false | Overwrite if output exists |

Behavior

  1. Discovers all run directories
  2. Loads checkpoint_results.jsonl from each
  3. Attaches run-level metadata to each record:
    • run_name, run_dir, run_relative_path
    • model_name, model_provider
    • prompt_name, thinking_level
    • agent_type, agent_version
    • environment_name, environment_type
  4. Writes combined JSONL file
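The steps above can be approximated with the Python sketch below. It attaches only the path-derived fields (`run_name`, `run_dir`, `run_relative_path`). The model, prompt, agent, and environment fields come from run configuration whose layout is not described here, so they are left out. The function name `combine_results` and the assumption that `checkpoint_results.jsonl` sits directly inside each run directory are illustrative only.

```python
import json
from pathlib import Path


def combine_results(runs_dir: Path, output: Path, overwrite: bool = False) -> int:
    """Merge checkpoint_results.jsonl files under runs_dir into one JSONL file."""
    if output.exists() and not overwrite:
        raise FileExistsError(f"{output} exists; pass overwrite=True to replace it")
    records = []
    # Assumption: each run directory directly contains checkpoint_results.jsonl.
    for results_file in sorted(runs_dir.rglob("checkpoint_results.jsonl")):
        run_dir = results_file.parent
        metadata = {
            "run_name": run_dir.name,
            "run_dir": str(run_dir),
            "run_relative_path": run_dir.relative_to(runs_dir).as_posix(),
        }
        for line in results_file.read_text().splitlines():
            if line.strip():
                # Run-level metadata is merged into every checkpoint record.
                records.append({**json.loads(line), **metadata})
    output.write_text("".join(json.dumps(record) + "\n" for record in records))
    return len(records)
```

Refusing to replace an existing output file unless asked corresponds to the `--overwrite` flag.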

Examples

# Combine with default output
slop-code utils combine-results outputs/runs

# Custom output location
slop-code utils combine-results outputs/runs -o analysis/all_results.jsonl

# Overwrite existing
slop-code utils combine-results outputs/runs -o outputs/combined.jsonl --overwrite

render-prompts

Render prompt templates for problems to an output directory.

Usage

slop-code utils render-prompts [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -p, --prompt | string | required | Prompt template name or path |
| -e, --environment | string | required | Environment config name or path |
| -o, --output-dir | path | required | Output directory for rendered prompts |
| --problem | string | - | Problem name(s) to render (repeatable) |

Behavior

For each problem, iterates through all checkpoints and renders the prompt template with the checkpoint's spec text and appropriate context.

Output structure:

output_dir/
    problem_name/
        part_1.md
        part_2.md
        ...

Examples

# Render prompts for all problems
slop-code utils render-prompts \
  -p just-solve \
  -e docker-python3.12-uv \
  -o outputs/rendered_prompts

# Render for specific problems
slop-code utils render-prompts \
  -p just-solve \
  -e docker-python3.12-uv \
  -o outputs/rendered_prompts \
  --problem file_backup \
  --problem etl_pipeline

migrate-eval-format

Migrate evaluation.json files from the old list-based tests format to the new grouped format.

Usage

slop-code utils migrate-eval-format [OPTIONS] RESULTS_DIR

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| RESULTS_DIR | Yes | Path to results directory or collection directory |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -t, --type | enum | run | Path type: run for single run, collection for multiple runs |
| -n, --dry-run | flag | false | Show what would be migrated without making changes |

Behavior

Converts the old tests format (list of test objects) to the new grouped format (dict with checkpoint-GroupType keys):

Old format:

"tests": [
  {"id": "test_foo", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "passed"},
  {"id": "test_bar", "checkpoint": "checkpoint_1", "group_type": "Core", "status": "failed"}
]

New format:

"tests": {
  "checkpoint_1-Core": {
    "passed": ["test_foo"],
    "failed": ["test_bar"]
  }
}

When run on a collection, processes all discovered run directories in sequence.
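The conversion between the two formats shown above can be sketched in Python as follows. The function name `migrate_tests` is hypothetical. Treating a dict value as already migrated is an assumption based on the summary's "already migrated" count, not a confirmed detail of the implementation.

```python
def migrate_tests(tests):
    """Convert the old list-based tests value into the grouped dict format."""
    if isinstance(tests, dict):
        return tests  # assumption: a dict means the file is already migrated
    grouped = {}
    for test in tests:
        # Keys combine checkpoint and group type, e.g. "checkpoint_1-Core".
        key = f"{test['checkpoint']}-{test['group_type']}"
        # Test ids are collected per status ("passed", "failed", ...).
        grouped.setdefault(key, {}).setdefault(test["status"], []).append(test["id"])
    return grouped
```

Applied to the old-format example above, this produces the grouped structure shown under "New format".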

Output

Displays a summary table with:

  • Number of files found
  • Number of files migrated
  • Number of files already migrated
  • Number of failed migrations

Examples

# Migrate a single run directory
slop-code utils migrate-eval-format outputs/my_run

# Preview migration on a collection (dry run)
slop-code utils migrate-eval-format --type collection --dry-run outputs/

# Migrate all runs in a collection
slop-code utils migrate-eval-format --type collection outputs/

make-registry

Generate a registry JSONL file containing metadata for all problems. This is used by the slopcodebench.github.io website.

Usage

slop-code utils make-registry [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -o, --output | path | required | Output JSONL file path |
| --problem | string | - | Problem name(s) to include (repeatable) |

Behavior

Scans the problems directory and generates a registry entry for each problem. Each line in the output file is a complete JSON object containing:

  • Problem name, tags, version, author, category
  • Difficulty, description, entry point
  • List of all checkpoints with their specs

When --problem is specified, only those problems are included in the registry. This is useful for testing specific problems before generating the full registry.

Output Format

Each line is a JSON object with structure:

{
  "problem_name": "file_backup",
  "tags": ["filesystem", "backup"],
  "version": 1,
  "author": "author_name",
  "category": "category_name",
  "entry_point": "main.py",
  "adapter_type": "cli",
  "difficulty": "medium",
  "description": "Problem description",
  "checkpoints": [
    {
      "checkpoint_name": "checkpoint_1",
      "spec": "Full spec text...",
      "version": 1,
      "state": "Core Tests"
    }
  ]
}
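Because each line is an independent JSON object, a consumer can read the registry one line at a time. The Python sketch below shows one way to do that. The helper names `load_registry` and `checkpoint_names` are illustrative and not part of the tool. The field names are taken from the structure above.

```python
import json


def load_registry(lines):
    """Parse registry JSONL lines into a list of problem entries."""
    return [json.loads(line) for line in lines if line.strip()]


def checkpoint_names(registry):
    """Map each problem name to the ordered names of its checkpoints."""
    return {
        entry["problem_name"]: [cp["checkpoint_name"] for cp in entry["checkpoints"]]
        for entry in registry
    }
```

A file object can be passed directly as `lines`, for example `load_registry(open("registry.jsonl"))`, since iterating over a file yields one line per entry.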

Examples

# Generate registry for all problems
slop-code utils make-registry -o registry.jsonl

# Generate registry for specific problems only
slop-code utils make-registry -o registry.jsonl --problem file_backup --problem code_search

# Generate to custom location
slop-code utils make-registry --output /path/to/registry.jsonl

See Also

  • metrics - Calculate metrics on submissions
  • eval - Evaluate agent results