Metrics System Documentation

March 24, 2026 · View on GitHub

The metrics system automatically measures code quality for agent submissions, tracking everything from lines of code to cyclomatic complexity to code clones.

30-Second Overview

When an agent completes a checkpoint, the metrics system analyzes the submitted code and generates:

  • Line metrics: LOC, comments, total lines
  • Lint metrics: Ruff errors and violations
  • Complexity metrics: Cyclomatic complexity (A-F ratings), nesting depth
  • Pattern violations: AST-grep rule violations across 7 categories
  • Code quality: Waste detection (trivial wrappers, single-use functions), code clones
  • Dependencies: Graph metrics for import relationships

Results are saved to JSON/JSONL files in each checkpoint's quality_analysis/ directory.

Documentation Guide

Understanding Results at Different Levels

Checkpoint-level results (individual checkpoint):

  • Want details on test results and quality metrics for a single checkpoint? See Checkpoint Results - covers evaluation.json and quality_analysis/
  • Contains correctness (test results) and quality (code metrics) for that checkpoint
  • Located in: checkpoint_N/evaluation.json and checkpoint_N/quality_analysis/

Run-level results (aggregated across all checkpoints):

  • Comparing runs or analyzing trends across checkpoints? See Run-Level Results - aggregated statistics
  • Contains solve rates, average costs, efficiency metrics, quality trends
  • Located in: checkpoint_results.jsonl and result.json at run root

All Metrics

Configuration

Core Concepts

ConceptDescription
Checkpoint ResultsCorrectness (tests) + Quality (code metrics) for a single checkpoint
Run SummaryAggregated statistics across all checkpoints and problems
LOCLines of code (source lines, excluding blanks)
Cyclomatic Complexity (CC)Number of independent paths through code (A=1-5, F=41+)
Maintainability Index (MI)Composite score of code maintainability (A >= 19)
AST-grep ViolationsPattern-based slop-rule violations from configs/slop_rules.yaml
WasteAbstraction inefficiencies (trivial wrappers, single-use functions)
ClonesDuplicate code blocks detected via AST hashing
Delta MetricsPercentage changes between checkpoints
Pass RatePercentage of tests passing by category (CORE, FUNCTIONALITY, etc.)
Solve RatePercentage of checkpoints/problems meeting success criteria

Common Questions

What metrics indicate good code quality?

  • CC ratings: More A/B ratings, fewer D/E/F
  • Lint errors: Lower is better
  • AST-grep violations: Lower is better (especially safety/complexity categories)
  • Waste metrics: Fewer trivial wrappers and single-use functions
  • Clone ratio: Lower percentage means less duplication

Where do I find metrics for my run?

Metrics are saved at two levels:

Checkpoint-level (detailed for single checkpoint):

outputs/run_name/problem_name/checkpoint_N/
├── evaluation.json                       # Test results
├── quality_analysis/
│   ├── overall_quality.json              # Aggregated snapshot metrics
│   ├── files.jsonl                       # Per-file metrics
│   ├── symbols.jsonl                     # Per-function/class metrics
│   └── ast_grep.jsonl                    # Pattern violations
└── evaluation/
    ├── stdout.txt, stderr.txt, report.json  # Test artifacts

Run-level (aggregated across all checkpoints):

outputs/run_name/
├── checkpoint_results.jsonl       # All checkpoint metrics in one file
└── result.json                    # Aggregated statistics and summaries

Use checkpoint-level files for detailed analysis of a specific checkpoint. Use run-level files for comparing runs or identifying trends.

How do I compare checkpoints?

Delta metrics (prefixed with delta.) show percentage changes:

  • delta.loc: Lines of code change
  • delta.lint_errors: Lint error change
  • delta.ast_grep_violations: Violation change
  • delta.churn_ratio: Code churn (lines added + removed / prior total)

Code Location

  • Quality metrics computation: src/slop_code/metrics/
    • driver.py: Main entry point for measuring quality
    • languages/: Language-specific parsers (Python, JavaScript, etc.)
    • checkpoint/: Checkpoint-level metrics extraction and delta computation
    • summary/: Run-level aggregation and summary statistics
  • Evaluation (test results): src/slop_code/evaluation/report.py
    • CorrectnessResults: Test result model
    • GroupType: Test categorization (CORE, FUNCTIONALITY, REGRESSION, ERROR)
    • PassPolicy: Success criteria
  • AST-grep rules: configs/slop_rules.yaml
  • Main entry points:
    • Snapshot quality: slop_code.metrics.driver.measure_snapshot_quality()
    • Checkpoint metrics: slop_code.metrics.checkpoint.driver.get_checkpoint_metrics()
    • Run summary: slop_code.metrics.summary.aggregators module

Version History

  • v1.1 (2025-12-26): Added checkpoint-results.md and run-results.md documentation for two-level metrics hierarchy
  • v1.0 (2025-12-17): Initial metrics documentation