README.md
April 24, 2026
SlopCodeBench evaluates coding agents under iterative specification refinement: the agent implements a spec, then extends its own code as the spec changes. This exposes behaviors that single-shot benchmarks cannot measure, including path dependence, non-convergence, and trade-offs between explicit handling and structural stability. We release SCBench as an open, community-driven evaluation primitive rather than a finalized benchmark.
Problem definitions now live in the separate scb-problems repository and are also available as a Harbor dataset. We actively want more problems: follow the guide to creating a problem and open a PR there.
Note
This is an initial release. We're actively developing and welcome feedback via GitHub Issues.
Prerequisites
Before installing, ensure you have:
- Python 3.12+ installed
- Docker installed and running (Get Docker)
- An API key for your chosen agent (e.g., Anthropic, OpenAI, Google)
- 8GB+ RAM recommended for running evaluations
- 10GB+ disk space for Docker images and workspaces
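The prerequisites above can be checked before a first run. The following is a minimal sketch, not part of SlopCodeBench: the `preflight` helper and its parameters are hypothetical names chosen for illustration, and it assumes the Anthropic key variable used in the install example. It does not check RAM, and it only confirms that the `docker` binary is on `PATH`, not that the daemon is running.

```python
import os
import shutil
import sys


def preflight(env=None, path=".", min_python=(3, 12), min_free_gb=10,
              key_name="ANTHROPIC_API_KEY"):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration; not part of the slop-code CLI.
    """
    env = os.environ if env is None else env
    problems = []

    # Python 3.12+ is listed as a prerequisite.
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )

    # Only checks that the binary exists; `docker ps` tells you about the daemon.
    if shutil.which("docker") is None:
        problems.append("docker not found on PATH")

    # The variable name depends on the chosen agent; this default is an assumption.
    if not env.get(key_name):
        problems.append(f"{key_name} is not set")

    # Free space on the filesystem that holds `path`, in decimal gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_free_gb} GB+ recommended")

    return problems


for problem in preflight():
    print("WARN:", problem)
```

Running the script from the directory where you plan to clone the repository prints one warning per unmet prerequisite and nothing when all checks pass.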
Install
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/SprocketLab/slop-code-bench.git && cd slop-code-bench && uv sync
export ANTHROPIC_API_KEY="your-key"
# Run!
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  --problem execution_server \
  thinking=low \
  version=2.0.51
Parameter Reference:
- thinking=none|low|medium|high: Controls the extended thinking budget; the actual budget depends on the agent.
- version=X.Y.Z: Agent version to use.
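To compare thinking levels, the same run command can be generated once per level. The sketch below only assembles the command line shown in the install example and prints it; `build_run_command` is a hypothetical helper, and the default agent, model, paths, and version are copied from that example rather than taken from any documented API.

```python
import shlex

# The four values listed in the parameter reference.
THINKING_LEVELS = ("none", "low", "medium", "high")


def build_run_command(problems, thinking="low", version="2.0.51",
                      agent="claude_code", model="anthropic/opus-4.5"):
    """Build the argument list for one `slop-code run` invocation.

    Hypothetical helper; flags and paths mirror the README example.
    """
    if thinking not in THINKING_LEVELS:
        raise ValueError(f"thinking must be one of {THINKING_LEVELS}, got {thinking!r}")
    cmd = [
        "uv", "run", "slop-code", "run",
        "--agent", agent,
        "--model", model,
        "--environment", "configs/environments/docker-python3.12-uv.yaml",
        "--prompt", "configs/prompts/just-solve.jinja",
    ]
    # One --problem flag per problem, as in the example.
    for name in problems:
        cmd += ["--problem", name]
    # Agent parameters are passed as trailing key=value pairs.
    cmd += [f"thinking={thinking}", f"version={version}"]
    return cmd


# Dry run: print the commands instead of executing them.
for level in ("low", "high"):
    print(shlex.join(build_run_command(["file_backup"], thinking=level)))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)` from the repository root.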
Results are saved to:
outputs/opus-4.5/claude_code-just-solve_low_{timestamp}/
First run: Docker images are built automatically for the selected agent version (5-10 minutes). Subsequent runs with the same version reuse those images and start faster.
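Because each run directory name ends in a timestamp, it is handy to look up the newest run programmatically. This is a minimal sketch under two assumptions: run directories sit two levels below `outputs/`, as in the path above, and modification time is a good enough proxy for recency (the timestamp format is not parsed). `latest_run_dir` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path


def latest_run_dir(outputs_root="outputs"):
    """Return the most recently modified run directory, or None if there is none.

    Assumes the layout outputs/<model>/<run-name>/ shown in the README.
    """
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    run_dirs = [p for p in root.glob("*/*") if p.is_dir()]
    # Compare by modification time rather than parsing the timestamp suffix.
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


run_dir = latest_run_dir()
if run_dir is None:
    print("No runs found under outputs/")
else:
    # Print the evaluation command for the newest run.
    print(f"slop-code eval {run_dir.as_posix()}/")
```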
Troubleshooting
Docker not found:
# Check Docker is running
docker ps
# If not running, start Docker Desktop or daemon
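The same check can be scripted to tell a missing Docker installation apart from a daemon that is not responding. A minimal sketch, assuming that a non-zero exit from `docker ps` means the daemon could not be reached (it can also indicate a permissions problem); `docker_status` is a hypothetical helper name.

```python
import subprocess


def docker_status(cmd=("docker", "ps")):
    """Classify Docker availability as 'missing', 'unreachable', or 'ok'."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, timeout=30)
    except FileNotFoundError:
        # The binary itself is not installed or not on PATH.
        return "missing"
    except subprocess.TimeoutExpired:
        # The client started but never got an answer.
        return "unreachable"
    # Non-zero exit: the client ran but could not talk to the daemon
    # (or lacked permission to do so).
    return "ok" if result.returncode == 0 else "unreachable"


print("Docker:", docker_status())
```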
API key not found:
# Verify your environment variable is set
echo $ANTHROPIC_API_KEY
# Or pass it directly
ANTHROPIC_API_KEY="your-key" uv run slop-code run ...
Out of disk space:
# Clean up old Docker images
docker system prune -a
For other problems, see GitHub Issues.
Evaluation
Evaluate a run:
slop-code eval outputs/your-run-directory/
Grade code quality with LLM judge:
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
Contributing
We welcome contributions. Two ways to help:
- Add problems: Expand the benchmark with new evaluation scenarios in the scb-problems repository, also published as a Harbor dataset. See the Problem Tutorial and Contributing Guide.
- Add agents: Integrate new coding agents. See the Agent Guide and Contributing Guide.
This is early-stage software. Your contributions will shape its direction.
Documentation
| Guide | Description |
|---|---|
| FAQ | Frequently asked questions |
| Problem Tutorial | Create your first problem (30 min hands-on) |
| Quick Reference | One-page cheat sheet for problem authoring |
| Agent Guide | Configure agents, models, and credentials |
| Architecture | How sessions, workspaces, and runtimes work |
| Evaluation System | Test cases, adapters, loaders, and verifiers |
| Problem Design | What makes a good evaluation problem |
| Known Issues | Current limitations and workarounds |
| Commands | CLI command reference (run, eval, metrics, viz, etc.) |
Citing Us
If you found this useful, please cite us as:
@article{Orlanski2025SlopCodeBench,
author = {Orlanski, Gabriel and Roy, Devjeet and Yun, Alexander and Shin, Changho and Gu, Alex and Ge, Albert and Adila, Dyah and Albarghouthi, Aws and Sala, Frederic},
title = {{SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement}},
journal = {arXiv preprint arXiv:2603.24755},
year = {2025},
url = {https://arxiv.org/abs/2603.24755}
}
