CLI Commands Reference

April 24, 2026

This section documents all commands available in the slop-code CLI.

Quick Reference

| Command | Description |
| --- | --- |
| run | Run agents on benchmarks with unified config system |
| sync | Install/update the managed problem catalog |
| eval | Evaluate a directory of agent inference results |
| eval-problem | Evaluate a single problem directory |
| eval-snapshot | Evaluate a single snapshot directory |
| infer-problem | Run inference on a single problem |
| metrics | Calculate metrics (static, judge, carry-forward, variance) |
| utils | Utility commands for maintenance and data processing |
| docker | Docker image building utilities |
| problems | Problem inspection and registry commands |
| tools | Interactive tools and case runners |
| viz | Visualization tools (diff viewer) |

Global Options

These options are available on all commands:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -v, --verbose | flag | 0 | Increase verbosity (repeatable) |
| --seed | int | 42 | Random seed |
| --overwrite | flag | false | Overwrite existing output directory |
| --debug | flag | false | Enable debugging mode |
| --snapshot-dir-name | string | snapshot | Name of the snapshot directory |

Problem catalog location is controlled by the SCBENCH_HOME environment variable. If unset, it defaults to ~/.cache/scbench.

Set SCBENCH_PROBLEMS_PATH to point at a flat local problems directory (each direct child must contain config.yaml) to bypass the managed release catalog for problem-loading commands.
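
The two environment variables above can be checked before running anything. The following is a minimal sketch, not part of slop-code: `resolve_catalog_home` and `check_flat_problems_dir` are hypothetical helper names, and the sketch assumes that an empty SCBENCH_HOME is treated like an unset one and that only subdirectories (not stray files) count as "direct children".

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; slop-code does not ship these.

# Print the catalog location: SCBENCH_HOME if set, else ~/.cache/scbench.
# Assumption: an empty SCBENCH_HOME falls back to the default as well.
resolve_catalog_home() {
  printf '%s\n' "${SCBENCH_HOME:-$HOME/.cache/scbench}"
}

# Return 0 only if every direct child directory of $1 contains config.yaml.
# Offending directories are reported on stderr.
check_flat_problems_dir() {
  local root="$1"
  local child
  local status=0
  for child in "$root"/*/; do
    # An unmatched glob stays literal; skip it.
    [ -d "$child" ] || continue
    if [ ! -f "${child}config.yaml" ]; then
      echo "missing config.yaml: ${child%/}" >&2
      status=1
    fi
  done
  return "$status"
}
```

A typical use would be to run `check_flat_problems_dir ./my_problems` and, if it succeeds, `export SCBENCH_PROBLEMS_PATH=./my_problems` before invoking a problem-loading command such as `slop-code run`. The directory name `./my_problems` is only an example.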

Problem catalog behavior:

  • The first command that uses problems bootstraps the latest release if no catalog is installed yet.
  • Once a catalog is installed, commands do not auto-update it; run slop-code sync explicitly.
  • Resume requires the installed catalog commit to match the run's saved problem_catalog.json metadata.

Command Categories

Core Workflow

Running agents:

# Install/update the managed problem catalog
slop-code sync

# Run with config file
slop-code run --config my_run.yaml --problem file_backup

# Run with CLI flags
slop-code run --model anthropic/sonnet-4.5 --problem file_backup

Evaluating results:

# Evaluate all problems in a run
slop-code eval outputs/my_run

# Evaluate a single problem
slop-code eval-problem outputs/my_run/file_backup

# Evaluate a single snapshot
slop-code eval-snapshot outputs/my_run/file_backup/checkpoint_1/snapshot \
  -o outputs/eval -p file_backup -c 1 -e configs/environments/docker-python3.12-uv.yaml
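
The sync, run, and eval steps above can be chained so that a failure stops the sequence. This is a rough sketch under assumptions: `run_and_eval` and the `SLOP_CODE` override variable are hypothetical names introduced here, and the run directory passed as the second argument is assumed to be wherever my_run.yaml writes its output (the example commands use outputs/my_run).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for illustration; not a slop-code feature.
# SLOP_CODE overrides the executable name, which allows a dry run
# with SLOP_CODE=echo.
run_and_eval() {
  local problem="$1"
  local run_dir="$2"
  local cli="${SLOP_CODE:-slop-code}"

  # Install or update the managed problem catalog.
  "$cli" sync || return
  # Run the agent on one problem using the config file.
  "$cli" run --config my_run.yaml --problem "$problem" || return
  # Evaluate every problem found in the run directory.
  "$cli" eval "$run_dir"
}
```

Calling `run_and_eval file_backup outputs/my_run` runs the three commands in order and returns the first non-zero exit status. With `SLOP_CODE=echo` in front, the function only prints the arguments each step would receive.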

Metrics and Analysis

# Calculate static code quality metrics
slop-code metrics static outputs/my_run

# Run LLM judge evaluation
slop-code metrics judge outputs/my_run -r configs/rubrics/slop.jsonl -m anthropic/sonnet-4.5

# Compute variance across runs
slop-code metrics variance base outputs/runs -o outputs/variance

Utilities

# Backfill reports for existing runs
slop-code utils backfill-reports outputs/my_run

# Combine results from multiple runs
slop-code utils combine-results outputs/all_runs -o outputs/combined.jsonl

Docker Management

# Build base image
slop-code docker build-base configs/environments/docker-python3.12-uv.yaml

# Build agent image
slop-code docker build-agent configs/agents/claude_code-2.0.51.yaml configs/environments/docker-python3.12-uv.yaml

Problem Inspection

# List all problems
slop-code problems ls

# Check problem conversion status
slop-code problems status file_backup

Test Case Runner

# Run pytest tests for a snapshot
slop-code tools run-case -s outputs/snapshot -p file_backup -c 1 -e configs/environments/docker-python3.12-uv.yaml

Visualization

# Launch diff viewer for a run
slop-code viz diff outputs/my_run

Documentation Index

| Document | Description |
| --- | --- |
| run.md | Comprehensive guide to slop-code run with configuration system |
| sync.md | Managing the external problem catalog |
| eval.md | Evaluating run directories |
| eval-problem.md | Evaluating single problems |
| eval-snapshot.md | Evaluating single snapshots |
| infer-problem.md | Running inference on single problems |
| metrics.md | All metrics subcommands |
| utils.md | All utility subcommands |
| docker.md | Docker image management |
| problems.md | Problem inspection commands |
| tools.md | Interactive tools |
| viz.md | Visualization tools |

See Also