April 24, 2026

SlopCodeBench (SCBench)

SlopCodeBench evaluates coding agents under iterative specification refinement: the agent implements a spec, then extends its own code as the spec changes. This exposes behaviors that single-shot benchmarks cannot measure, including path dependence, non-convergence, and trade-offs between explicit handling and structural stability. We release SCBench as an open, community-driven evaluation primitive rather than a finalized benchmark.

Problem definitions now live in the separate scb-problems repository and are also available as a Harbor dataset. We actively want more problems: follow the guide to creating a problem and open a PR there.

Note

This is an initial release. We're actively developing and welcome feedback via GitHub Issues.

Prerequisites

Before installing, ensure you have:

Python 3.12+ installed
Docker installed and running (Get Docker)
An API key for your chosen agent (e.g., Anthropic, OpenAI, Google)
8GB+ RAM recommended for running evaluations
10GB+ disk space for Docker images and workspaces
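
The checklist above can be verified before installing with a short script. This is a minimal sketch, not part of SCBench: the function name `check_prerequisites` and its defaults are hypothetical, it only confirms that a `docker` executable is on the PATH (not that the daemon is running), and it skips the RAM check because the standard library has no portable call for it.

```python
import os
import shutil
import sys


def check_prerequisites(env=None, path=".", min_python=(3, 12),
                        min_free_gb=10, key_name="ANTHROPIC_API_KEY"):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: the thresholds mirror the prerequisites list above.
    """
    env = os.environ if env is None else env
    problems = []

    # Python 3.12+ is required.
    if sys.version_info[:2] < min_python:
        problems.append("Python %d.%d+ is required" % min_python)

    # Only checks that the docker CLI is installed; use `docker ps` to
    # confirm that the daemon is actually running.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")

    # The key name depends on the agent; Anthropic is just the default here.
    if not env.get(key_name):
        problems.append(f"{key_name} is not set")

    # Free space on the filesystem that will hold the workspaces.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_free_gb} GB recommended")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the file prints one warning per failed check and nothing when all checks pass.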

🚀 Install

curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/SprocketLab/slop-code-bench.git && cd slop-code-bench && uv sync
export ANTHROPIC_API_KEY="your-key"

# Run!
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  --problem execution_server \
  thinking=low \
  version=2.0.51

Parameter Reference:

thinking=none|low|medium|high - Sets the extended thinking budget; how each level is applied depends on the agent.
version=X.Y.Z - Version of the agent to run (2.0.51 in the example above).
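
When launching several runs from a script, it can help to assemble the command as a list instead of a shell string. The sketch below uses only the flags and `key=value` overrides shown in the example above; the helper name `build_run_command` is hypothetical and is not part of the SCBench CLI or package.

```python
import shlex

# Levels taken from the parameter reference above.
THINKING_LEVELS = ("none", "low", "medium", "high")


def build_run_command(agent, model, environment, prompt, problems,
                      thinking=None, version=None):
    """Return the argv list for one `slop-code run` invocation (hypothetical helper)."""
    if thinking is not None and thinking not in THINKING_LEVELS:
        raise ValueError(f"thinking must be one of {THINKING_LEVELS}, got {thinking!r}")

    argv = [
        "uv", "run", "slop-code", "run",
        "--agent", agent,
        "--model", model,
        "--environment", environment,
        "--prompt", prompt,
    ]
    for problem in problems:
        argv += ["--problem", problem]  # the flag is repeated once per problem

    # key=value overrides follow the flags, as in the example command.
    if thinking is not None:
        argv.append(f"thinking={thinking}")
    if version is not None:
        argv.append(f"version={version}")
    return argv


command = build_run_command(
    agent="claude_code",
    model="anthropic/opus-4.5",
    environment="configs/environments/docker-python3.12-uv.yaml",
    prompt="configs/prompts/just-solve.jinja",
    problems=["file_backup", "execution_server"],
    thinking="low",
    version="2.0.51",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.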

Results are saved to:

outputs/opus-4.5/claude_code-just-solve_low_{timestamp}/
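
To locate the most recent run from a script, the path above can be turned into a glob pattern. The layout assumed here (model name after the slash, agent, prompt file stem, thinking level, then a timestamp) is inferred from this single example path and is not a documented contract; `run_dir_pattern` and `latest_run` are hypothetical helpers.

```python
from pathlib import Path


def run_dir_pattern(model, agent, prompt, thinking):
    """Build a glob pattern for run directories (layout inferred from the example path)."""
    model_name = model.split("/")[-1]   # "anthropic/opus-4.5" -> "opus-4.5"
    prompt_name = Path(prompt).stem     # ".../just-solve.jinja" -> "just-solve"
    return f"outputs/{model_name}/{agent}-{prompt_name}_{thinking}_*"


def latest_run(pattern, root="."):
    """Return the most recently modified directory matching pattern, or None."""
    matches = [p for p in Path(root).glob(pattern) if p.is_dir()]
    return max(matches, key=lambda p: p.stat().st_mtime, default=None)


pattern = run_dir_pattern(
    model="anthropic/opus-4.5",
    agent="claude_code",
    prompt="configs/prompts/just-solve.jinja",
    thinking="low",
)
print(pattern)
print(latest_run(pattern))
```

If no run directory matches, `latest_run` returns `None` rather than raising.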

First run: Docker images are built automatically for the requested agent version, which takes 5-10 minutes. Subsequent runs are faster.

Troubleshooting

Docker not found:

# Check Docker is running
docker ps
# If not running, start Docker Desktop or daemon

API key not found:

# Verify your environment variable is set
echo $ANTHROPIC_API_KEY
# Or pass it directly
ANTHROPIC_API_KEY="your-key" uv run slop-code run ...

Out of disk space:

# Clean up old Docker images
docker system prune -a

For other problems, see GitHub Issues.

📊 Evaluation

Evaluate a run:

uv run slop-code eval outputs/your-run-directory/

Grade code quality with an LLM judge:

uv run slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
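
The rubric is passed as a `.jsonl` file, so an edited rubric can be sanity-checked before spending judge calls. The sketch below only confirms that each non-empty line parses as JSON; it assumes nothing about the rubric's fields, which are not described here, and `invalid_jsonl_lines` is a hypothetical helper.

```python
import json


def invalid_jsonl_lines(text):
    """Return the 1-based numbers of non-empty lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than flagged
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


sample = '{"a": 1}\n\nnot json\n{"b": 2}'
print(invalid_jsonl_lines(sample))
```

For a real file, read it with `Path("configs/rubrics/llm_judge.jsonl").read_text()` and pass the text to the function.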

Contributing

We welcome contributions. Two ways to help:

Add problems — Expand the benchmark with new evaluation scenarios in the scb-problems repository, also published as a Harbor dataset. See the Problem Tutorial and Contributing Guide.
Add agents — Integrate new coding agents. See the Agent Guide and Contributing Guide.

This is early-stage software. Your contributions will shape its direction.

Documentation

Guide	Description
❓ FAQ	Frequently asked questions
📖 Problem Tutorial	Create your first problem (30 min hands-on)
📋 Quick Reference	One-page cheat sheet for problem authoring
🤖 Agent Guide	Configure agents, models, and credentials
🏗️ Architecture	How sessions, workspaces, and runtimes work
✅ Evaluation System	Test cases, adapters, loaders, and verifiers
💡 Problem Design	What makes a good evaluation problem
⚠️ Known Issues	Current limitations and workarounds
📊 Commands	CLI command reference (run, eval, metrics, viz, etc.)

Citing Us

If you find SlopCodeBench useful, please cite:

@article{Orlanski2025SlopCodeBench,
  author = {Orlanski, Gabriel and Roy, Devjeet and Yun, Alexander and Shin, Changho and Gu, Alex and Ge, Albert and Adila, Dyah and Albarghouthi, Aws and Sala, Frederic},
  title = {{SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement}},
  journal = {arXiv preprint arXiv:2603.24755},
  year = {2026},
  url = {https://arxiv.org/abs/2603.24755}
}