CLI Commands Reference

April 24, 2026 · View on GitHub

This section documents all commands available in the slop-code CLI.

Quick Reference

Command	Description
`run`	Run agents on benchmarks with unified config system
`sync`	Install/update the managed problem catalog
`eval`	Evaluate a directory of agent inference results
`eval-problem`	Evaluate a single problem directory
`eval-snapshot`	Evaluate a single snapshot directory
`infer-problem`	Run inference on a single problem
`metrics`	Calculate metrics (static, judge, carry-forward, variance)
`utils`	Utility commands for maintenance and data processing
`docker`	Docker image building utilities
`problems`	Problem inspection and registry commands
`tools`	Interactive tools and case runners
`viz`	Visualization tools (diff viewer)

Global Options

These options are available on all commands:

Option	Type	Default	Description
`-v, --verbose`	flag	0	Increase verbosity (repeatable)
`--seed`	int	42	Random seed
`--overwrite`	flag	false	Overwrite existing output directory
`--debug`	flag	false	Enable debugging mode
`--snapshot-dir-name`	string	`snapshot`	Name of the snapshot directory

Problem catalog location is controlled by the SCBENCH_HOME environment variable. If unset, it defaults to ~/.cache/scbench.

Set SCBENCH_PROBLEMS_PATH to point at a flat local problems directory (each direct child must contain config.yaml) to bypass the managed release catalog for problem-loading commands.

Problem catalog behavior:

First problem-using command bootstraps the latest release if no catalog is installed yet.
Commands do not auto-update once installed; run slop-code sync explicitly.
Resume requires the installed catalog commit to match the run's saved problem_catalog.json metadata.

Command Categories

Core Workflow

Running agents:

# Install/update the managed problem catalog
slop-code sync

# Run with config file
slop-code run --config my_run.yaml --problem file_backup

# Run with CLI flags
slop-code run --model anthropic/sonnet-4.5 --problem file_backup

Evaluating results:

# Evaluate all problems in a run
slop-code eval outputs/my_run

# Evaluate a single problem
slop-code eval-problem outputs/my_run/file_backup

# Evaluate a single snapshot
slop-code eval-snapshot outputs/my_run/file_backup/checkpoint_1/snapshot \
  -o outputs/eval -p file_backup -c 1 -e configs/environments/docker-python3.12-uv.yaml

Metrics and Analysis

# Calculate static code quality metrics
slop-code metrics static outputs/my_run

# Run LLM judge evaluation
slop-code metrics judge outputs/my_run -r configs/rubrics/slop.jsonl -m anthropic/sonnet-4.5

# Compute variance across runs
slop-code metrics variance base outputs/runs -o outputs/variance

Utilities

# Backfill reports for existing runs
slop-code utils backfill-reports outputs/my_run

# Combine results from multiple runs
slop-code utils combine-results outputs/all_runs -o outputs/combined.jsonl

Docker Management

# Build base image
slop-code docker build-base configs/environments/docker-python3.12-uv.yaml

# Build agent image
slop-code docker build-agent configs/agents/claude_code-2.0.51.yaml configs/environments/docker-python3.12-uv.yaml

Problem Inspection

# List all problems
slop-code problems ls

# Check problem conversion status
slop-code problems status file_backup

Test Case Runner

# Run pytest tests for a snapshot
slop-code tools run-case -s outputs/snapshot -p file_backup -c 1 -e configs/environments/docker-python3.12-uv.yaml

Visualization

# Launch diff viewer for a run
slop-code viz diff outputs/my_run

Documentation Index

Document	Description
run.md	Comprehensive guide to `slop-code run` with configuration system
sync.md	Managing the external problem catalog
eval.md	Evaluating run directories
eval-problem.md	Evaluating single problems
eval-snapshot.md	Evaluating single snapshots
infer-problem.md	Running inference on single problems
metrics.md	All metrics subcommands
utils.md	All utility subcommands
docker.md	Docker image management
problems.md	Problem inspection commands
tools.md	Interactive tools
viz.md	Visualization tools