Configuration Guide

December 22, 2025

This guide covers problem and checkpoint configuration for the pytest-based evaluation system.

Overview

Configuration is defined in a single config.yaml at the problem root. The pytest runner uses:

ProblemConfig: Entry file, static assets, custom markers, test dependencies
CheckpointConfig: Timeout, environment variables, test inclusion settings

Tests are categorized with pytest markers in the test files; the configuration only declares custom markers and does not assign tests to categories.

Problem Configuration

Basic Structure

# problems/{problem}/config.yaml
name: file_backup
version: 1
description: "Implement an incremental file backup system"
tags: ["file-system", "cli"]
entry_file: main.py

checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    timeout: 30
  checkpoint_2:
    version: 1
    order: 2
    timeout: 60

ProblemConfig Fields

Field	Type	Required	Description
`name`	string	Yes	Human-friendly problem name
`version`	int	Yes	Version number (increment when tests change)
`description`	string	Yes	Short problem summary
`tags`	list[string]	Yes	Categorization tags (min 1)
`entry_file`	string	Yes	Entry point for running submission (e.g., "main.py")
`author`	string	No	Problem author
`category`	string	No	Problem category
`difficulty`	string	No	"Easy", "Medium", or "Hard"
`static_assets`	dict	No	Named assets for tests
`markers`	dict	No	Custom pytest markers
`test_dependencies`	list	No	Additional packages for tests
`checkpoints`	dict	Yes	Checkpoint configurations

Static Assets

Static assets are files made available to tests during execution:

static_assets:
  sample_data:
    path: ./assets/sample.json
  large_file:
    path: ./assets/large_input.txt

Assets are materialized to tests/assets/ in the workspace and are accessible through two environment variables:

Environment variable: SCBENCH_ASSET_{NAME} (e.g., SCBENCH_ASSET_SAMPLE_DATA)
Environment variable: SCBENCH_ASSETS_DIR (directory containing all assets)
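
As a rough illustration of the naming rule above, a test helper could look up an asset by its configured name. This is a minimal sketch: the helper names `asset_env_var` and `asset_path` are hypothetical, and it assumes each SCBENCH_ASSET_{NAME} variable holds a filesystem path to the materialized asset.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def asset_env_var(name: str) -> str:
    """Map an asset name from config.yaml to its environment variable name."""
    return f"SCBENCH_ASSET_{name.upper()}"


def asset_path(name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the path stored in the asset's environment variable.

    Assumption: the variable holds a filesystem path to the asset.
    """
    env = os.environ if env is None else env
    var = asset_env_var(name)
    if var not in env:
        raise KeyError(f"{var} is not set; is '{name}' declared in static_assets?")
    return Path(env[var])


# Stand-in environment for illustration; the runner would set the real value.
fake_env = {"SCBENCH_ASSET_SAMPLE_DATA": "tests/assets/sample.json"}
print(asset_path("sample_data", fake_env))
```

Passing the environment as a parameter keeps the helper easy to exercise without touching os.environ.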

Custom Markers

Define custom pytest markers beyond the built-ins (error, functionality, regression):

markers:
  performance:
    description: "Performance and load tests"
    group: Functionality
  integration:
    description: "Integration tests with external services"
    group: Core

MarkerConfig Fields:

Field	Type	Description
`description`	string	Marker description for pytest.ini
`group`	string	GroupType mapping: "Core", "Functionality", "Error", "Regression"
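
The sketch below shows one way the two MarkerConfig fields could be consumed: the group is checked against the four documented values, and each marker is rendered in the `name: description` form that pytest accepts under the `markers` ini option. The function name `marker_ini_lines` is hypothetical, and the runner's actual rendering may differ.

```python
from typing import Dict, List

# Group names listed in the MarkerConfig table.
VALID_GROUPS = {"Core", "Functionality", "Error", "Regression"}


def marker_ini_lines(markers: Dict[str, Dict[str, str]]) -> List[str]:
    """Validate marker groups and render 'name: description' lines."""
    lines = []
    for name, cfg in markers.items():
        group = cfg["group"]
        if group not in VALID_GROUPS:
            raise ValueError(f"marker '{name}' has unknown group '{group}'")
        lines.append(f"{name}: {cfg['description']}")
    return lines


markers = {
    "performance": {
        "description": "Performance and load tests",
        "group": "Functionality",
    },
    "integration": {
        "description": "Integration tests with external services",
        "group": "Core",
    },
}
for line in marker_ini_lines(markers):
    print(line)
```

A test would then opt in with a decorator such as `@pytest.mark.performance`.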

Test Dependencies

Additional packages needed by tests (beyond the standard set):

test_dependencies:
  - "requests>=2.28"
  - "httpx"
  - "pyyaml"

These are installed via uvx --with=... during test execution.
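
To make the `--with` mechanism concrete, here is a minimal sketch that turns a test_dependencies list into an argument list. The function name `build_uvx_command` is hypothetical, and running `pytest` as the tool is an assumption; the real runner's invocation may carry additional arguments.

```python
from typing import List


def build_uvx_command(test_dependencies: List[str]) -> List[str]:
    """Build an argument list with one --with flag per extra dependency."""
    cmd = ["uvx"]
    for dep in test_dependencies:
        cmd.append(f"--with={dep}")
    # Assumed tool name; the actual runner may add more arguments here.
    cmd.append("pytest")
    return cmd


print(build_uvx_command(["requests>=2.28", "httpx", "pyyaml"]))
```

Keeping the command as a list, rather than a single string, avoids shell interpretation of characters such as `>` in version specifiers.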

Standard dependencies (always available):

pytest
pytest-json-ctrf
pytest-json-report
pytest-timeout
jsonschema
deepdiff

Checkpoint Configuration

Basic Structure

checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    timeout: 30
    env:
      DEBUG: "true"
    include_prior_tests: true

CheckpointConfig Fields

Field	Type	Default	Description
`version`	int	Required	Version number (increment when tests change)
`order`	int	Auto	Ordering index (1-indexed, auto-increments)
`timeout`	float	None	Session-level pytest timeout in seconds
`env`	dict	{}	Environment variables for test execution
`include_prior_tests`	bool	true	Whether to run tests from prior checkpoints
`state`	string	"Draft"	Development state: "Draft", "Core Tests", "Full Tests", "Verified"

Environment Variables

Environment variables are merged from problem and checkpoint levels:

# Problem level (inherited by all checkpoints)
env:
  PYTHONPATH: "."
  LOG_LEVEL: "INFO"

checkpoints:
  checkpoint_1:
    env:
      DEBUG: "true"  # Adds to problem-level env
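
The merge can be pictured as a dictionary union. This sketch uses a hypothetical `merge_env` helper and assumes that a checkpoint value wins when both levels define the same key; the guide only states that the two levels are merged.

```python
from typing import Dict


def merge_env(
    problem_env: Dict[str, str], checkpoint_env: Dict[str, str]
) -> Dict[str, str]:
    """Combine both levels; checkpoint values win on conflicts (assumed)."""
    return {**problem_env, **checkpoint_env}


problem_env = {"PYTHONPATH": ".", "LOG_LEVEL": "INFO"}
checkpoint_env = {"DEBUG": "true"}
print(merge_env(problem_env, checkpoint_env))
```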

Test Inclusion

The include_prior_tests setting controls which test files are copied to the workspace:

checkpoints:
  checkpoint_1:
    include_prior_tests: true   # Default: runs test_checkpoint_1.py
  checkpoint_2:
    include_prior_tests: true   # Runs test_checkpoint_1.py AND test_checkpoint_2.py
  checkpoint_3:
    include_prior_tests: false  # Only runs test_checkpoint_3.py

When include_prior_tests: true:

Test files for checkpoints 1..N are copied (checkpoint order is 1-indexed)
Tests from prior checkpoints become REGRESSION type automatically
Ensures solutions don't break earlier functionality

When include_prior_tests: false:

Only the current checkpoint's test file is copied
Useful for independent checkpoints
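
The selection rules above can be sketched as a small function. The name `select_test_files` is hypothetical; the sketch assumes test files follow the test_checkpoint_N.py pattern used in the example and that each checkpoint consults only its own include_prior_tests flag.

```python
from typing import List, Tuple


def select_test_files(
    current: int, include_prior_tests: bool = True
) -> List[Tuple[str, bool]]:
    """Return (filename, is_regression) pairs for checkpoint `current`.

    Files from earlier checkpoints are flagged as regression tests.
    """
    first = 1 if include_prior_tests else current
    return [
        (f"test_checkpoint_{n}.py", n < current)
        for n in range(first, current + 1)
    ]


print(select_test_files(2))         # prior tests included
print(select_test_files(3, False))  # current checkpoint only
```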

Configuration Inheritance

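Checkpoint settings inherit from the problem-level settings: env entries are merged, and a checkpoint timeout overrides the problem timeout: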

ProblemConfig
├── env: {"PYTHONPATH": "."}
├── timeout: 60
│
└── CheckpointConfig (inherits env, timeout)
    ├── env: {"DEBUG": "true"}  # Merged: {"PYTHONPATH": ".", "DEBUG": "true"}
    └── timeout: 30             # Overrides problem timeout
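
The timeout half of the diagram amounts to a fallback rule. A minimal sketch, using a hypothetical `effective_timeout` helper and assuming that an unset checkpoint timeout (None) falls back to the problem-level value:

```python
from typing import Optional


def effective_timeout(
    problem_timeout: Optional[float], checkpoint_timeout: Optional[float]
) -> Optional[float]:
    """Prefer the checkpoint timeout; fall back to the problem-level value."""
    if checkpoint_timeout is not None:
        return checkpoint_timeout
    return problem_timeout


print(effective_timeout(60, 30))    # checkpoint overrides problem
print(effective_timeout(60, None))  # checkpoint inherits from problem
```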

Complete Example

# problems/file_backup/config.yaml
name: file_backup
version: 2
description: "Build an incremental file backup system with change detection"
tags: ["file-system", "cli", "hashing"]
author: "SCBench Team"
category: "File Processing"
difficulty: "Medium"
entry_file: main.py

env:
  PYTHONPATH: "."

static_assets:
  test_files:
    path: ./tests/assets/files

markers:
  hidden:
    description: "Hidden test cases not shown to agent"
    group: Functionality

test_dependencies:
  - "pyyaml>=6.0"

checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    state: "Full Tests"
    timeout: 30
    env:
      LOG_LEVEL: "DEBUG"

  checkpoint_2:
    version: 1
    order: 2
    state: "Full Tests"
    timeout: 45
    include_prior_tests: true

  checkpoint_3:
    version: 1
    order: 3
    state: "Core Tests"
    timeout: 60

  checkpoint_4:
    version: 1
    order: 4
    state: "Draft"
    timeout: 90

Environment Configuration (Runtime Parameter)

Environment configuration is NOT part of ProblemConfig. It is specified at execution time:

slop-code run \
  --agent configs/agents/claude_code/config.yaml \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --problem file_backup

Environment specs live in configs/environments/ and define Docker/local execution settings.

Environment Structure

# configs/environments/docker-python3.12-uv.yaml
type: docker
name: python3.12
docker:
  image: ghcr.io/astral-sh/uv:python3.12-trixie-slim
  workdir: /workspace
  mount_workspace: true

environment:
  env:
    UV_CACHE_DIR: /tmp/uv-cache
  include_os_env: false

setup:
  commands:
    - apt-get update
  eval_commands:
    - uv init

commands:
  entry_file: "{entry_file}.py"
  command: uv run
  agent_command: python

Validation

Configuration is validated using Pydantic models:

Required fields must be present
Types are enforced (string, int, list, dict)
Enum values are validated (GroupType, difficulty, state)
Custom markers must specify valid GroupType

Invalid configurations raise ConfigError with descriptive messages.
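
For readers who want a feel for these checks without the Pydantic models, the following standard-library sketch mirrors a few of them on a raw dictionary. It is not the project's validator: `ConfigError` here is a local stand-in, `check_problem_config` is a hypothetical name, and the error messages are illustrative.

```python
from typing import Any, Dict


class ConfigError(Exception):
    """Local stand-in for the ConfigError named above."""


# Required ProblemConfig fields and their expected Python types.
REQUIRED = {
    "name": str,
    "version": int,
    "description": str,
    "tags": list,
    "entry_file": str,
    "checkpoints": dict,
}
DIFFICULTIES = {"Easy", "Medium", "Hard"}


def check_problem_config(raw: Dict[str, Any]) -> None:
    """Raise ConfigError on the first violated rule; return None if valid."""
    for field, expected in REQUIRED.items():
        if field not in raw:
            raise ConfigError(f"missing required field: {field}")
        if not isinstance(raw[field], expected):
            raise ConfigError(f"{field} must be of type {expected.__name__}")
    if not raw["tags"]:
        raise ConfigError("tags must contain at least one entry")
    difficulty = raw.get("difficulty")
    if difficulty is not None and difficulty not in DIFFICULTIES:
        raise ConfigError(f"invalid difficulty: {difficulty}")


valid = {
    "name": "file_backup",
    "version": 1,
    "description": "Implement an incremental file backup system",
    "tags": ["file-system", "cli"],
    "entry_file": "main.py",
    "checkpoints": {"checkpoint_1": {"version": 1}},
}
check_problem_config(valid)  # passes silently
```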

Loading Configuration

from pathlib import Path

from slop_code.evaluation import ProblemConfig

# Load from directory
problem = ProblemConfig.from_yaml(Path("problems/file_backup"))

# Access problem fields
print(problem.name)           # "file_backup"
print(problem.entry_file)     # "main.py"
print(problem.markers)        # {"hidden": MarkerConfig(...)}

# Access checkpoints
for name, checkpoint in problem.iterate_checkpoint_items():
    print(f"{name}: timeout={checkpoint.timeout}")

# Get specific checkpoint
cp1 = problem.checkpoints["checkpoint_1"]
print(cp1.timeout)            # 30
print(cp1.include_prior_tests)  # True

Next Steps

Understand architecture: Architecture Guide
Interpret results: Reporting Guide
Debug failures: Troubleshooting Guide