Evaluation System Documentation

December 22, 2025 · View on GitHub

The evaluation module provides a pytest-based framework for testing agent submissions. It executes pytest tests against submission code, categorizes results by type, and generates structured reports.

30-Second Overview

Tests are standard pytest files in problems/{problem}/tests/. The PytestRunner executes them via uvx (for isolation), parses results, and categorizes tests using pytest markers:

Unmarked tests in current checkpoint = CORE (must pass)
@pytest.mark.functionality = FUNCTIONALITY (nice-to-have)
@pytest.mark.error = ERROR (edge cases)
Tests from prior checkpoints = REGRESSION (prevent breakage)

Documentation Guide

Getting Started

New to the evaluation system? Start with Architecture Overview
Need to configure a problem? See Configuration Guide

Implementation

Understanding results? See Reporting Guide
Debugging failures? Try Troubleshooting Guide

Core Concepts at a Glance

Concept	Description
Problem	Top-level benchmark containing checkpoints and test files
Checkpoint	A milestone with associated pytest tests
PytestRunner	Orchestrates pytest execution via uvx
TestResult	Individual test outcome with categorization
CorrectnessResults	Aggregated results for a checkpoint
GroupType	Test category: CORE, FUNCTIONALITY, REGRESSION, ERROR
PassPolicy	Criteria for checkpoint success (e.g., "core-cases")
Marker	Pytest decorator for test categorization

Test File Structure

problems/{problem}/
├── config.yaml              # Problem configuration
└── tests/
    ├── conftest.py          # Shared fixtures (entrypoint, checkpoint)
    ├── test_checkpoint_1.py # Tests for checkpoint 1
    ├── test_checkpoint_2.py # Tests for checkpoint 2
    ├── data/                # Test case data (YAML/JSON)
    └── assets/              # Static test files

Common Workflows

Evaluating a Submission

from slop_code.evaluation import run_checkpoint_pytest
from slop_code.evaluation import ProblemConfig
from slop_code.execution import EnvironmentSpec

problem = ProblemConfig.from_yaml(Path("problems/file_backup"))
checkpoint = problem.checkpoints["checkpoint_1"]
env = EnvironmentSpec.from_yaml(Path("configs/environments/docker-python3.12-uv.yaml"))

results = run_checkpoint_pytest(
    submission_path=Path("outputs/submission/checkpoint_1"),
    problem=problem,
    checkpoint=checkpoint,
    env_spec=env,
)

print(f"Passed: {results.passes_policy('core-cases')}")

Checking Results

# Check if all CORE tests passed
if results.passes_policy("core-cases"):
    print("Checkpoint passed!")

# Inspect individual test results
for test in results.tests:
    print(f"{test.id}: {test.status} ({test.group_type})")

# Get counts by group type
print(f"Core: {results.pass_counts.get(GroupType.CORE, 0)}/{results.total_counts.get(GroupType.CORE, 0)}")

Key Exports

from slop_code.evaluation import (
    run_checkpoint_pytest,  # Main entry point
    ProblemConfig,          # Problem configuration
    CheckpointConfig,       # Checkpoint configuration
    GroupType,              # CORE, FUNCTIONALITY, REGRESSION, ERROR
    PassPolicy,             # Pass criteria enum
    CorrectnessResults,     # Aggregated results
    TestResult,             # Individual test result
)

Additional Resources

Code Location: src/slop_code/evaluation/
Example Problem: problems/file_backup/ (good reference for test structure)
Main Entry Point: slop_code.evaluation.pytest_runner.run_checkpoint_pytest()

Version History

v2.0 (2025-12-22): Complete overhaul for pytest-based evaluation system
v1.4 (2025-12-10): Previous Adapter/Loader/Verifier system (deprecated)