Evaluation System Architecture
January 6, 2026
This guide provides a detailed overview of the pytest-based evaluation system architecture.
System Overview
The evaluation system uses PytestRunner to execute standard pytest tests against agent submissions. Tests run in isolated environments via uvx, ensuring no interference with the solution's dependencies.
Problem
├── config.yaml                  # Problem + checkpoint configuration
└── tests/
    ├── conftest.py              # Required fixtures
    ├── test_checkpoint_1.py     # Tests for checkpoint 1
    ├── test_checkpoint_2.py     # Tests for checkpoint 2
    └── ...
Execution Flow
High-Level Flow
┌──────────────────────┐
│ ProblemConfig        │
│ + Checkpoint         │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ PytestRunner         │
│                      │
│ 1. Copy tests        │
│ 2. Gen pytest.ini    │
│ 3. Build command     │
│ 4. Execute uvx       │
│ 5. Parse reports     │
│ 6. Categorize        │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ CorrectnessResults   │
│ + TestResult[]       │
└──────────────────────┘
Detailed Execution Steps
- Test Preparation
  - Copy test files from problems/{problem}/tests/ to the workspace
  - Only copy tests for checkpoints 0..N based on the include_prior_tests setting
  - Generate pytest.ini with marker registration
- Command Construction
  - Build the uvx command with test dependencies
  - Pass --entrypoint and --checkpoint options to pytest
  - Configure CTRF and pytest-json-report output paths
- Isolated Execution
  - Execute via the Docker runtime
  - uvx installs test dependencies in an ephemeral environment
  - The solution's virtualenv/dependencies are not affected
- Report Parsing
  - Parse the CTRF JSON report (pytest-json-ctrf plugin)
  - Parse pytest-json-report for detailed failure messages
  - Prefer pytest-json-report (expands parametrized tests)
- Test Categorization
  - Determine GroupType for each test based on:
    - Pytest markers (@pytest.mark.error, @pytest.mark.functionality)
    - Which checkpoint the test file belongs to
  - Aggregate counts by GroupType
- Result Generation
  - Create a TestResult for each test
  - Build CorrectnessResults with aggregated statistics
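The report-parsing step above can be sketched as follows. This is a minimal illustration, not the runner's actual code: parse_json_report is a hypothetical helper, and the field names it reads (tests, nodeid, outcome, and per-phase duration and longrepr) are an assumption based on the pytest-json-report output format.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def parse_json_report(report_path: Path) -> list[dict]:
    """Flatten a pytest-json-report file into one record per test (hypothetical helper)."""
    report = json.loads(report_path.read_text())
    records = []
    for test in report.get("tests", []):
        phases = [test.get(name, {}) for name in ("setup", "call", "teardown")]
        # Assumption: durations are reported per phase in seconds; sum and convert to ms.
        duration_ms = sum(phase.get("duration", 0.0) for phase in phases) * 1000
        # Use the first failure text found in any phase, if there is one.
        failure = next((p["longrepr"] for p in phases if p.get("longrepr")), None)
        records.append(
            {
                "id": test["nodeid"],
                "status": test["outcome"],
                "duration_ms": duration_ms,
                "failure_message": failure,
            }
        )
    return records


# Illustrative report containing a single failing, parametrized test.
SAMPLE = {
    "tests": [
        {
            "nodeid": "tests/test_checkpoint_1.py::test_add[1-2]",
            "outcome": "failed",
            "setup": {"duration": 0.01, "outcome": "passed"},
            "call": {"duration": 0.02, "outcome": "failed", "longrepr": "assert 1 == 2"},
            "teardown": {"duration": 0.0, "outcome": "passed"},
        }
    ]
}

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "pytest-report.json"
    path.write_text(json.dumps(SAMPLE))
    parsed = parse_json_report(path)
```

The record keys mirror the TestResult fields described later in this guide, so each dictionary could be passed on to the categorization step.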
Core Components
PytestRunner (pytest_runner.py)
The main orchestrator class that manages the full evaluation lifecycle:
runner = PytestRunner(
    problem=problem_config,
    checkpoint=checkpoint_config,
    environment=env_spec,
    submission_path=Path("outputs/submission/checkpoint_1"),
)
results = runner.run()
Key Methods:
- _copy_tests_from_problem(): Selectively copy test files
- _generate_pytest_ini(): Create pytest.ini with markers
- _build_pytest_command(): Construct uvx + pytest command
- _parse_ctrf_report(): Parse CTRF JSON output
- _determine_group_type(): Categorize tests by markers/checkpoint
- run(): Main orchestration method
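To make the marker-registration step concrete, here is a minimal sketch of what generating pytest.ini could look like. The function name generate_pytest_ini and the marker descriptions are assumptions for illustration; only the [pytest] section and the markers key are standard pytest configuration.

```python
def generate_pytest_ini(markers: dict[str, str]) -> str:
    """Render pytest.ini content that registers each marker (hypothetical helper)."""
    lines = ["[pytest]", "markers ="]
    for name, description in markers.items():
        # Continuation lines of a multi-line ini value must be indented.
        lines.append(f"    {name}: {description}")
    return "\n".join(lines) + "\n"


# Marker descriptions below are illustrative, not taken from the project.
ini_text = generate_pytest_ini(
    {
        "error": "error-handling tests",
        "functionality": "optional functionality tests",
        "regression": "tests carried over from prior checkpoints",
    }
)
print(ini_text)
```

Registering markers this way keeps pytest from warning about unknown marks when tests use @pytest.mark.error or a custom marker from the problem config.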
Test Categorization Logic
The _determine_group_type() method implements categorization rules:
def _determine_group_type(test_checkpoint, markers, current_checkpoint):
    is_current = test_checkpoint == current_checkpoint

    # Rule 1: Prior checkpoint tests ALWAYS become regression (regardless of markers)
    if not is_current:
        return GroupType.REGRESSION

    # Rule 2: Error marker wins for current checkpoint
    if "error" in markers:
        return GroupType.ERROR

    # Rule 3: Explicit regression marker
    if "regression" in markers:
        return GroupType.REGRESSION

    # Rule 4: Check custom markers from problem config
    for marker in markers:
        if marker in problem.markers:
            return problem.markers[marker].group

    # Rule 5: Functionality marker
    if "functionality" in markers:
        return GroupType.FUNCTIONALITY

    # Rule 6: Default to CORE
    return GroupType.CORE
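The rules are easier to check with a standalone version that can be run directly. The sketch below restates them under a few assumptions: the GroupType enum values are invented for illustration, and custom markers are passed in as a plain dictionary (custom_markers) rather than read from the problem config.

```python
from enum import Enum


class GroupType(Enum):
    # Member names come from this guide; the string values are assumed.
    CORE = "core"
    FUNCTIONALITY = "functionality"
    ERROR = "error"
    REGRESSION = "regression"


def determine_group_type(test_checkpoint, markers, current_checkpoint, custom_markers=None):
    """Standalone restatement of the six rules; custom_markers maps marker name -> GroupType."""
    custom_markers = custom_markers or {}
    if test_checkpoint != current_checkpoint:
        return GroupType.REGRESSION  # Rule 1: prior checkpoint, markers ignored
    if "error" in markers:
        return GroupType.ERROR  # Rule 2
    if "regression" in markers:
        return GroupType.REGRESSION  # Rule 3
    for marker in markers:
        if marker in custom_markers:
            return custom_markers[marker]  # Rule 4
    if "functionality" in markers:
        return GroupType.FUNCTIONALITY  # Rule 5
    return GroupType.CORE  # Rule 6


# A prior-checkpoint test stays a regression test even with an error marker.
prior = determine_group_type("checkpoint_1", ["error"], "checkpoint_2")
# On the current checkpoint, error takes precedence over functionality.
both = determine_group_type("checkpoint_2", ["error", "functionality"], "checkpoint_2")
# A custom marker (hypothetical name "slow") is consulted before functionality.
custom = determine_group_type(
    "checkpoint_2", ["slow", "functionality"], "checkpoint_2", {"slow": GroupType.CORE}
)
unmarked = determine_group_type("checkpoint_2", [], "checkpoint_2")
```

Rule order matters: because the checkpoint check comes first, no marker on an older test can move it out of the regression group.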
Result Models (report.py)
TestResult: Individual test outcome
class TestResult(BaseModel):
    id: str                    # Pytest nodeid
    checkpoint: str            # "checkpoint_1", etc.
    group_type: GroupType      # CORE, FUNCTIONALITY, etc.
    status: Literal["passed", "failed", "skipped", "error"]
    duration_ms: float
    file_path: str
    markers: list[str]
    failure_message: str | None
CorrectnessResults: Aggregated checkpoint results
class CorrectnessResults(BaseModel):
    problem_name: str
    checkpoint_name: str
    duration: float
    tests: list[TestResult]
    pass_counts: dict[GroupType, int]
    total_counts: dict[GroupType, int]
    pytest_exit_code: int
    infrastructure_failure: bool
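The relationship between tests, pass_counts, and total_counts can be sketched as a simple aggregation. This is an illustration only: Outcome is a stand-in dataclass for TestResult, group types are plain strings, and counting skipped tests toward the totals is an assumption rather than documented behavior.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Outcome:
    """Stand-in for TestResult carrying only the fields used here."""

    group_type: str
    status: str


def aggregate_counts(tests):
    """Return (pass_counts, total_counts) keyed by group type."""
    pass_counts = Counter()
    total_counts = Counter()
    for test in tests:
        # Assumption: every collected test counts toward the total, including skips.
        total_counts[test.group_type] += 1
        if test.status == "passed":
            pass_counts[test.group_type] += 1
    return dict(pass_counts), dict(total_counts)


outcomes = [
    Outcome("CORE", "passed"),
    Outcome("CORE", "failed"),
    Outcome("REGRESSION", "passed"),
    Outcome("ERROR", "skipped"),
]
passed, totals = aggregate_counts(outcomes)
```

A group with no passing tests is simply absent from the pass dictionary in this sketch, so callers would read it with passed.get(group, 0).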
Pytest Command Structure
The runner builds a command like:
uvx \
  --with=pytest \
  --with=pytest-json-ctrf \
  --with=pytest-json-report \
  --with=pytest-timeout \
  --with=jsonschema \
  --with=deepdiff \
  pytest \
  --timeout=30 \
  --entrypoint='python main.py' \
  --checkpoint='checkpoint_1' \
  --ctrf=.scbench/ctrf-report.json \
  --json-report \
  --json-report-file=.scbench/pytest-report.json \
  -vv \
  .evaluation_tests
Key Options:
- --entrypoint: Command to run the submission (passed to test fixtures)
- --checkpoint: Current checkpoint name
- --timeout: Timeout in seconds from checkpoint config (pytest-timeout applies it to each test)
- --ctrf: CTRF JSON output path
- --json-report: Enable detailed failure reports
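Assembling this command programmatically might look like the sketch below. The function build_pytest_command and its parameters are hypothetical; the flags and paths are copied from the command shown above, and extra_dependencies stands in for a problem's test_dependencies.

```python
import shlex

BASE_DEPENDENCIES = [
    "pytest",
    "pytest-json-ctrf",
    "pytest-json-report",
    "pytest-timeout",
    "jsonschema",
    "deepdiff",
]


def build_pytest_command(entrypoint, checkpoint, timeout, extra_dependencies=()):
    """Build the uvx + pytest argument list (hypothetical helper)."""
    command = ["uvx"]
    # Every test dependency becomes a --with option ahead of the tool name.
    for package in [*BASE_DEPENDENCIES, *extra_dependencies]:
        command.append(f"--with={package}")
    command += [
        "pytest",
        f"--timeout={timeout}",
        f"--entrypoint={entrypoint}",
        f"--checkpoint={checkpoint}",
        "--ctrf=.scbench/ctrf-report.json",
        "--json-report",
        "--json-report-file=.scbench/pytest-report.json",
        "-vv",
        ".evaluation_tests",
    ]
    return command


command = build_pytest_command("python main.py", "checkpoint_1", 30)
# shlex.join quotes arguments containing spaces, such as the entrypoint.
print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting mistakes; it only needs to be joined into a string for logging or for a runtime that expects a single shell line.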
Test Dependencies
Tests run with these packages installed via uvx:
| Package | Purpose |
|---|---|
| pytest | Test framework |
| pytest-json-ctrf | CTRF JSON report generation |
| pytest-json-report | Detailed failure messages |
| pytest-timeout | Test timeout enforcement |
| jsonschema | JSON schema validation |
| deepdiff | Deep comparison utilities |
Additional dependencies can be specified per-problem via test_dependencies in config.
Exit Codes
| Code | Meaning | Infrastructure Failure? |
|---|---|---|
| 0 | All tests passed | No |
| 1 | Some tests failed | No |
| 2 | Interrupted | Yes |
| 3 | Internal error | Yes |
| 4 | Usage error | Yes |
| 5 | No tests collected | Yes |
Infrastructure failures set results.infrastructure_failure = True and cause all pass policies to fail.
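The table reduces to a small classification function. The sketch below uses hypothetical names (PytestExit, is_infrastructure_failure) and additionally assumes that any code outside the table, such as one produced by a killed process, is also treated as an infrastructure failure.

```python
from enum import IntEnum


class PytestExit(IntEnum):
    """Exit codes from the table above."""

    OK = 0
    TESTS_FAILED = 1
    INTERRUPTED = 2
    INTERNAL_ERROR = 3
    USAGE_ERROR = 4
    NO_TESTS_COLLECTED = 5


def is_infrastructure_failure(exit_code: int) -> bool:
    """Only 0 and 1 mean the test session itself ran to completion."""
    # Assumption: unknown codes (not in the table) count as infrastructure failures.
    return exit_code not in (PytestExit.OK, PytestExit.TESTS_FAILED)


flags = {code.name: is_infrastructure_failure(code) for code in PytestExit}
```

Treating code 5 as an infrastructure failure means a misconfigured test directory is reported as a harness problem instead of being counted against the submission.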
Design Patterns
uvx Isolation
Tests run via uvx to ensure complete isolation:
- Solution's dependencies don't interfere with test environment
- Works with any solution package manager (pip, uv, poetry)
- Test dependencies installed fresh each run
Marker-Based Categorization
Instead of directory-based grouping, tests use pytest markers:
- Built-in: @pytest.mark.error, @pytest.mark.functionality, @pytest.mark.regression
- Custom markers defined in problem config
- Automatic regression detection for prior checkpoint tests
Checkpoint Test Selection
When include_prior_tests=True (default):
- Copy test files for checkpoints 0..N (current)
- Prior checkpoint tests automatically become REGRESSION type
- Ensures solutions don't break earlier functionality
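A minimal sketch of this selection, assuming test files follow the test_checkpoint_N.py naming shown in the directory layout. The helper select_test_files is hypothetical and handles only checkpoint test files; shared files such as conftest.py are outside its scope.

```python
import re

CHECKPOINT_FILE = re.compile(r"^test_checkpoint_(\d+)\.py$")


def select_test_files(filenames, current, include_prior_tests=True):
    """Pick test files for the current checkpoint and, optionally, earlier ones."""
    selected = []
    for name in filenames:
        match = CHECKPOINT_FILE.match(name)
        if match is None:
            continue  # Not a checkpoint test file.
        number = int(match.group(1))
        if number == current or (include_prior_tests and number < current):
            selected.append((number, name))
    # Sort numerically so checkpoint 10 comes after checkpoint 2.
    return [name for _, name in sorted(selected)]


files = [
    "conftest.py",
    "test_checkpoint_3.py",
    "test_checkpoint_1.py",
    "test_checkpoint_2.py",
]
with_prior = select_test_files(files, current=2)
current_only = select_test_files(files, current=2, include_prior_tests=False)
```

Tests for later checkpoints are never selected, so a submission for checkpoint 2 is not judged against requirements it has not yet been asked to meet.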
Integration Points
- Execution Module: Session and EnvironmentSpec for Docker runtime
- Configuration: ProblemConfig and CheckpointConfig from config.py
- CLI: slop-code eval command uses run_checkpoint_pytest()
Next Steps
- Configure problems: Configuration Guide
- Understand results: Reporting Guide
- Debug failures: Troubleshooting Guide