PluginEval: Quality Evaluation Framework

March 26, 2026 · View on GitHub

PluginEval is a three-layer quality evaluation framework for Claude Code plugins and skills. It combines deterministic static analysis, LLM-based semantic judging, and Monte Carlo simulation to produce calibrated quality scores with confidence intervals.

Overview

PluginEval answers the question: "How good is this plugin or skill?" It evaluates across 10 quality dimensions, detects anti-patterns, assigns letter grades, and awards quality badges (Bronze through Platinum).

Architecture

┌─────────────────────────────────────────────────┐
│                   CLI / Commands                │
│       score · certify · compare · init          │
├─────────────────────────────────────────────────┤
│                   Eval Engine                   │
│         Composite scoring, layer blending       │
├────────────┬────────────────┬───────────────────┤
│  Layer 1   │    Layer 2     │     Layer 3       │
│  Static    │   LLM Judge    │   Monte Carlo     │
│  Analysis  │   (Semantic)   │   (Statistical)   │
│  <2s, free │  ~30s, 4 calls │  ~2min, 50 calls  │
├────────────┴────────────────┴───────────────────┤
│                  Parser Layer                   │
│       SKILL.md, agents/*.md, plugin.json        │
├─────────────────────────────────────────────────┤
│              Statistical Methods                │
│    Wilson CI · Bootstrap CI · Clopper-Pearson   │
│    Cohen's κ · Coefficient of Variation         │
├─────────────────────────────────────────────────┤
│              Corpus & Elo Ranking               │
│    Gold standard index · Pairwise comparison    │
└─────────────────────────────────────────────────┘

Installation & Setup

PluginEval lives in plugins/plugin-eval/ and uses uv for dependency management.

cd plugins/plugin-eval

# Install core dependencies (static analysis only)
uv sync

# Install with LLM support (Layers 2 & 3)
uv sync --extra llm

# Install with direct API support
uv sync --extra api

# Install dev dependencies (tests, linting)
uv sync --extra dev

Requirements

  • Python ≥ 3.12
  • Core: pydantic, typer, rich, pyyaml
  • LLM layers: claude-agent-sdk (uses Claude Code Max plan by default)
  • API alternative: anthropic SDK (requires ANTHROPIC_API_KEY)

CLI Commands

score — Evaluate a plugin or skill

# Quick evaluation (static only, instant)
uv run plugin-eval score path/to/skill --depth quick

# Standard evaluation (static + LLM judge)
uv run plugin-eval score path/to/skill --depth standard

# Deep evaluation (all three layers)
uv run plugin-eval score path/to/skill --depth deep

# Output formats
uv run plugin-eval score path/to/skill --output json
uv run plugin-eval score path/to/skill --output markdown
uv run plugin-eval score path/to/skill --output html

# CI gate: exit code 1 if below threshold
uv run plugin-eval score path/to/skill --threshold 70

Options:

| Option | Default | Description |
|--------|---------|-------------|
| --depth | standard | quick, standard, deep, thorough |
| --output | markdown | json, markdown, html |
| --verbose | false | Show detailed output |
| --concurrency | 4 | Max concurrent LLM calls (1–20) |
| --auth | max | Auth mode: max (Claude Code Max plan) or api-key |
| --threshold | none | Minimum score; exit 1 if below |

certify — Full certification with badge

Runs at deep depth (all three layers). Takes 15–20 minutes.

uv run plugin-eval certify path/to/skill --output markdown

compare — Head-to-head comparison

Compare two skills side-by-side across all dimensions.

uv run plugin-eval compare path/to/skill-a path/to/skill-b

init — Initialize corpus

Build a gold-standard corpus index from a plugins directory for Elo ranking.

uv run plugin-eval init plugins/ --corpus-dir ~/.plugineval/corpus

Claude Code Integration

PluginEval is also a Claude Code plugin with agents and commands.

Slash Commands

| Command | Description |
|---------|-------------|
| /eval <path> | Evaluate a plugin or skill (orchestrates static + judge) |
| /certify <path> | Full certification pipeline with badge |
| /compare <a> <b> | Head-to-head skill comparison |

Agents

| Agent | Model | Role |
|-------|-------|------|
| eval-orchestrator | Opus | Coordinates evaluation: runs CLI, dispatches judge, computes composite |
| eval-judge | Sonnet | LLM judge: scores 4 semantic dimensions with anchored rubrics |

Skill

The evaluation-methodology skill provides the full scoring methodology reference, including dimension definitions, rubric anchors, blend weights, and improvement guidance.

The Three Evaluation Layers

Layer 1: Static Analysis

Speed: < 2 seconds. Cost: Free (no LLM calls). Deterministic.

Runs six structural sub-checks against the parsed SKILL.md:

| Sub-check | Weight | What it measures |
|-----------|--------|------------------|
| frontmatter_quality | 35% | Name, description length, trigger-phrase quality ("Use when…", "Use PROACTIVELY") |
| orchestration_wiring | 25% | Output/input documentation, code examples, orchestrator anti-pattern |
| progressive_disclosure | 15% | Line count vs. sweet spot (200–600 lines), references/ and assets/ directories |
| structural_completeness | 10% | Heading density, code blocks, examples section, troubleshooting section |
| token_efficiency | 10% | MUST/NEVER/ALWAYS density, duplicate-line detection |
| ecosystem_coherence | 5% | Cross-references to other skills/agents, "related"/"see also" mentions |

Also detects anti-patterns (see below) and applies a multiplicative penalty.

Layer 2: LLM Judge

Speed: ~30 seconds. Cost: 4 LLM calls (Haiku + Sonnet). Requires claude-agent-sdk.

Uses Claude as a semantic evaluator across 4 dimensions with anchored rubrics:

| Dimension | Model | Method |
|-----------|-------|--------|
| triggering_accuracy | Haiku | Generates 10 synthetic prompts (5 should-trigger, 5 should-not), computes F1 |
| orchestration_fitness | Sonnet | Rates worker-vs-orchestrator role using 5-point anchored rubric |
| output_quality | Sonnet | Simulates 3 realistic tasks, evaluates expected output quality |
| scope_calibration | Sonnet | Rates scope appropriateness using 5-point anchored rubric |

All 4 assessments run concurrently with semaphore-based throttling.
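The concurrent dispatch can be sketched with asyncio's semaphore primitive. This is an illustrative sketch, not the actual judge code: `judge_dimension` is a hypothetical stand-in for a real claude-agent-sdk call.

```python
import asyncio

async def judge_dimension(name: str) -> float:
    """Stand-in for one LLM judge call (the real code calls claude-agent-sdk)."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return 0.8

async def run_judge(dimensions: list[str], concurrency: int = 4) -> dict[str, float]:
    """Run all dimension assessments concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def throttled(dim: str) -> tuple[str, float]:
        async with sem:  # at most `concurrency` calls in flight at once
            return dim, await judge_dimension(dim)

    results = await asyncio.gather(*(throttled(d) for d in dimensions))
    return dict(results)

scores = asyncio.run(run_judge(
    ["triggering_accuracy", "orchestration_fitness",
     "output_quality", "scope_calibration"]
))
```

The semaphore cap maps directly to the CLI's `--concurrency` option.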

Layer 3: Monte Carlo Simulation

Speed: ~2 minutes (50 runs) to ~5 minutes (100 runs). Cost: 50–100 LLM calls. Requires claude-agent-sdk.

Generates 15 varied prompts via Haiku, then runs N simulations to compute statistical reliability:

| Metric | Measure | Statistical Method |
|--------|---------|--------------------|
| Activation rate | % of runs where skill activated | Wilson score CI |
| Output consistency | Mean quality + coefficient of variation | Bootstrap CI (1000 resamples) |
| Failure rate | % of runs that errored | Clopper-Pearson exact CI |
| Token efficiency | Median tokens, IQR, outlier detection | Normalized against 8000-token cap |
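A percentile-method bootstrap of the kind used for output consistency needs only the standard library. A minimal sketch (the function name is illustrative, not the stats.py API):

```python
import random
import statistics

def bootstrap_ci(samples: list[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean: resample with replacement,
    collect the resampled means, take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# e.g. per-run quality scores from a Monte Carlo batch
low, high = bootstrap_ci([0.7, 0.8, 0.75, 0.9, 0.65, 0.85])
```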

Evaluation Depths

| Depth | Layers | Confidence Label | Time | Cost |
|-------|--------|------------------|------|------|
| quick | Static only | Estimated | < 2s | Free |
| standard | Static + Judge | Assessed | ~30s | 4 LLM calls |
| deep | Static + Judge + Monte Carlo (50 runs) | Certified | ~3 min | ~54 LLM calls |
| thorough | Static + Judge + Monte Carlo (100 runs) | Certified+ | ~6 min | ~104 LLM calls |

The 10 Quality Dimensions

Each dimension has a weight and receives scores from different layers, blended using per-dimension weights:

| Dimension | Weight | Static | Judge | Monte Carlo | What it measures |
|-----------|--------|--------|-------|-------------|------------------|
| triggering_accuracy | 25% | 0.15 | 0.25 | 0.60 | Does the description fire for the right prompts? |
| orchestration_fitness | 20% | 0.10 | 0.70 | 0.20 | Is it a composable worker, not an orchestrator? |
| output_quality | 15% | 0.00 | 0.40 | 0.60 | Would it produce correct, useful output? |
| scope_calibration | 12% | 0.30 | 0.55 | 0.15 | Is the scope well-sized for its domain? |
| progressive_disclosure | 10% | 0.80 | 0.20 | 0.00 | Does it use references/ for large content? |
| token_efficiency | 6% | 0.40 | 0.10 | 0.50 | Is it concise without repetition? |
| robustness | 5% | 0.00 | 0.20 | 0.80 | Does it handle varied inputs reliably? |
| structural_completeness | 3% | 0.90 | 0.10 | 0.00 | Does it have headings, code, examples? |
| code_template_quality | 2% | 0.30 | 0.70 | 0.00 | Are code examples production-ready? |
| ecosystem_coherence | 2% | 0.85 | 0.15 | 0.00 | Does it link to related skills/agents? |

Composite Score Formula

$ \text{Final} = \sum(\text{dimension\_weight} \times \text{blended\_score}) \times 100 \times \text{anti\_pattern\_penalty} $

Where blended_score for each dimension is a weighted combination of available layer scores, renormalized to the layers actually present.
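A minimal sketch of the blend-and-renormalize step, using the triggering_accuracy weights from the dimension table (function names are illustrative, not the engine's actual API):

```python
def blend(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Blend one dimension's layer scores, renormalizing the blend weights
    to only the layers that actually ran."""
    present = {layer: w for layer, w in weights.items() if layer in scores}
    total = sum(present.values())
    return sum(scores[layer] * w / total for layer, w in present.items())

def final_score(dim_weights: dict[str, float], blended: dict[str, float],
                penalty: float) -> float:
    """Composite: weighted sum of blended dimension scores, scaled to 100,
    then multiplied by the anti-pattern penalty."""
    return sum(w * blended[d] for d, w in dim_weights.items()) * 100 * penalty

# triggering_accuracy at `standard` depth: Monte Carlo did not run, so its
# 0.60 blend weight is dropped and static/judge renormalize to 0.15/0.40
# and 0.25/0.40 respectively.
ta = blend({"static": 0.85, "judge": 0.90},
           {"static": 0.15, "judge": 0.25, "monte_carlo": 0.60})
```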

Quality Badges

| Badge | Score | Elo | Stars | Meaning |
|-------|-------|-----|-------|---------|
| Platinum | ≥ 90 | ≥ 1600 | ★★★★★ | Reference quality |
| Gold | ≥ 80 | ≥ 1500 | ★★★★ | Production ready |
| Silver | ≥ 70 | ≥ 1400 | ★★★ | Functional, needs polish |
| Bronze | ≥ 60 | ≥ 1300 | ★★ | Minimum viable |

Badges require both score AND Elo thresholds when Elo data is available.

Letter Grades

Scores are also converted to letter grades:

| Grade | Score Range |
|-------|-------------|
| A+ | ≥ 97 |
| A | ≥ 93 |
| A- | ≥ 90 |
| B+ | ≥ 87 |
| B | ≥ 83 |
| B- | ≥ 80 |
| C+ | ≥ 77 |
| C | ≥ 73 |
| C- | ≥ 70 |
| D+ | ≥ 67 |
| D | ≥ 63 |
| D- | ≥ 60 |
| F | < 60 |
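The conversion is a simple threshold walk; a sketch:

```python
# Cutoffs from the grade table, highest first.
GRADE_CUTOFFS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]

def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade; anything below 60 is an F."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"
```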

Anti-Pattern Detection

The static analyzer detects these anti-patterns, each with a severity that contributes to a multiplicative penalty:

| Flag | Severity | Trigger |
|------|----------|---------|
| OVER_CONSTRAINED | 10% | > 15 MUST/ALWAYS/NEVER directives |
| EMPTY_DESCRIPTION | 10% | Description < 20 characters |
| MISSING_TRIGGER | 15% | No "Use when…" trigger phrase in description |
| BLOATED_SKILL | 10% | > 800 lines without a references/ directory |
| ORPHAN_REFERENCE | 5% | Dead link to a file in references/ |
| DEAD_CROSS_REF | 5% | Cross-reference to a non-existent skill/agent |

Penalty formula: penalty = max(0.5, 1.0 − 0.05 × count) — each anti-pattern reduces the score by 5%, flooring at 50%.
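The count-based formula as stated, expressed directly:

```python
def anti_pattern_penalty(count: int) -> float:
    """Each detected anti-pattern shaves 5% off the final score,
    floored at a 0.5 multiplier (per the count-based formula above)."""
    return max(0.5, 1.0 - 0.05 * count)
```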

Elo Ranking System

For relative quality comparison against a corpus of known skills:

  • Initial rating: 1500
  • K-factor: 32
  • Confidence intervals: Bootstrap resampling (500 resamples)
  • Corpus management: init command indexes all skills from a plugins directory
  • Reference selection: Matches by category and similar line count

The Elo system uses the standard formula: E(A) = 1 / (1 + 10^((Rb - Ra) / 400)).
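A minimal sketch of that formula plus the K = 32 update (function names illustrative, not the elo.py API):

```python
def expected(ra: float, rb: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def update(ra: float, rb: float, score_a: float,
           k: float = 32) -> tuple[float, float]:
    """Update both ratings after a pairwise comparison.
    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    ea = expected(ra, rb)
    return ra + k * (score_a - ea), rb + k * ((1 - score_a) - (1 - ea))
```

Two fresh skills at the initial 1500 rating each have a 0.5 expected score; a win moves the ratings to 1516 and 1484.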

Corpus Management

The corpus is a JSON index of all skills used for Elo comparisons:

# Build corpus from your plugins directory
uv run plugin-eval init plugins/ --corpus-dir ~/.plugineval/corpus

# The corpus stores:
# - Skill name, path, category, line count
# - Current Elo rating (updated after each comparison)

Reference skills are selected by matching category and approximate line count.

Statistical Methods

PluginEval uses rigorous statistical methods throughout:

| Method | Used For | Details |
|--------|----------|---------|
| Wilson score CI | Activation rate confidence | Handles small-sample binomial proportions |
| Bootstrap CI | Output quality confidence | 1000 resamples, percentile method |
| Clopper-Pearson | Failure rate confidence | Exact CI for small failure counts |
| Coefficient of variation | Output consistency | std/mean ratio; lower = more consistent |
| Cohen's kappa | Inter-rater agreement | For multi-judge scenarios |

All statistical functions are pure Python with no external dependencies (no scipy/numpy required).
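As an example of what these dependency-free implementations look like, here is a Wilson score interval sketch (not necessarily the exact code in stats.py):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI by default).
    Unlike the naive normal approximation, it behaves well for small n."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

# e.g. 42 activations out of 50 Monte Carlo runs
lo, hi = wilson_ci(42, 50)
```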

Parser

The parser extracts structured data from Claude Code plugin files:

  • Skills: Parses SKILL.md frontmatter (name, description), counts headings, code blocks, languages, MUST/NEVER/ALWAYS directives, cross-references, and detects references/ and assets/ directories
  • Agents: Parses agent .md frontmatter (name, description, model, tools), detects proactive triggers and skill references
  • Plugins: Aggregates all skills and agents from a plugin directory
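Frontmatter extraction can be sketched as follows. This illustrative `parse_frontmatter` handles only flat key: value pairs; the real parser handles full YAML (pyyaml is a core dependency) plus the structural counts listed above.

```python
import re

def parse_frontmatter(text: str) -> dict[str, str]:
    """Extract simple `key: value` pairs from the leading --- frontmatter
    block of a SKILL.md (illustrative sketch; flat YAML only)."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*\n", text, re.DOTALL)
    if not match:
        return {}
    pairs = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            pairs[key.strip()] = value.strip()
    return pairs

skill = parse_frontmatter(
    "---\nname: async-python-patterns\n"
    "description: Use when writing async code\n---\n\n# Async Python Patterns\n"
)
```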

Project Structure

plugins/plugin-eval/
├── .claude-plugin/
│   └── plugin.json              # Claude Code plugin manifest
├── agents/
│   ├── eval-orchestrator.md     # Orchestrates evaluation (Opus)
│   └── eval-judge.md            # LLM judge agent (Sonnet)
├── commands/
│   ├── eval.md                  # /eval slash command
│   ├── certify.md               # /certify slash command
│   └── compare.md               # /compare slash command
├── skills/
│   └── evaluation-methodology/
│       ├── SKILL.md             # Full methodology reference
│       └── references/
│           └── rubrics.md       # Detailed rubric anchors
├── src/plugin_eval/
│   ├── __init__.py
│   ├── cli.py                   # Typer CLI (score, certify, compare, init)
│   ├── engine.py                # Eval engine (layer coordination, composite scoring)
│   ├── models.py                # Pydantic models (Depth, Badge, EvalConfig, results)
│   ├── parser.py                # Plugin/skill/agent parser
│   ├── reporter.py              # JSON/Markdown/HTML output
│   ├── corpus.py                # Gold standard corpus for Elo ranking
│   ├── elo.py                   # Elo rating calculator with bootstrap CI
│   ├── stats.py                 # Statistical methods (Wilson, bootstrap, Clopper-Pearson)
│   └── layers/
│       ├── __init__.py
│       ├── static.py            # Layer 1: deterministic structural analysis
│       ├── judge.py             # Layer 2: LLM semantic evaluation
│       └── monte_carlo.py       # Layer 3: statistical reliability simulation
├── tests/                       # Comprehensive test suite
│   ├── conftest.py
│   ├── test_cli.py
│   ├── test_engine.py
│   ├── test_static.py
│   ├── test_judge.py
│   ├── test_monte_carlo.py
│   ├── test_models.py
│   ├── test_parser.py
│   ├── test_reporter.py
│   ├── test_corpus.py
│   ├── test_elo.py
│   ├── test_stats.py
│   └── test_e2e.py              # End-to-end tests against real plugins
├── pyproject.toml               # uv/hatch project config
└── uv.lock

Running Tests

cd plugins/plugin-eval

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=plugin_eval

# Run specific test file
uv run pytest tests/test_static.py

# Run e2e tests (requires real plugin corpus)
uv run pytest tests/test_e2e.py

Example Output

Markdown Report

# PluginEval Report

**Path:** `plugins/python-development/skills/async-python-patterns`
**Timestamp:** 2025-03-26T12:00:00+00:00
**Depth:** standard

## Overall Score

| Metric | Value |
|--------|-------|
| Score | **78.3/100** |
| Confidence | Assessed |
| Badge | Silver |

## Layer Breakdown

| Layer | Score | Anti-Patterns |
|-------|-------|---------------|
| static | 0.742 | 0 |
| judge | 0.811 | 0 |

## Dimension Scores

| Dimension | Weight | Score | Grade |
|-----------|--------|-------|-------|
| Triggering Accuracy | 25% | 0.850 | B |
| Orchestration Fitness | 20% | 0.780 | C+ |
| Output Quality | 15% | 0.820 | B- |
| Scope Calibration | 12% | 0.750 | C |
| Progressive Disclosure | 10% | 0.600 | D- |
| Token Efficiency | 6% | 0.910 | A- |
| ...

Tooling

  • Package manager: uv
  • Linter/formatter: ruff (target Python 3.12, line length 100)
  • Type checker: ty
  • Test framework: pytest with pytest-asyncio
  • Build system: hatchling