hermes-eval
May 16, 2026
hermes-eval is a deterministic Python CLI for regression-testing Hermes Agent
skills and scoring Hermes ShareGPT trajectories before they are used as Atropos
training data.
The project gives Hermes Agent a practical quality gate:
- run static regression tests for an agentskills.io-style SKILL.md
- score batch-run trajectories without LLM calls or external APIs
- filter weak trajectories before export
- convert high-quality runs into Atropos-compatible reward records
- snapshot skill changes with a local Git history
Installation
pip install hermes-eval
For local development:
pip install -e ".[dev]"
pytest tests/ -v
Five-minute Quickstart
hermes-eval --help
Expected output includes these top-level commands:
Commands:
diff
export
skill
traj
Run the bundled example skill tests:
hermes-eval skill test --skill examples/web-research --verbose
Expected result: a Rich table showing Score 1.00, Grade A, Passed 4,
and Failed 0.
Score a good trajectory:
hermes-eval traj score --input tests/fixtures/sample_trajectories/good_run.json --json
Expected result: JSON containing good-run-001 with a score above 0.7.
Filter a trajectory folder:
hermes-eval traj filter --input tests/fixtures/sample_trajectories/ --min-score 0.7 --json
Expected result: only good-run-001 remains.
Export to Atropos:
hermes-eval export atropos \
  --input tests/fixtures/sample_trajectories/ \
  --output .hermes-eval/atropos_out.json \
  --min-score 0.7 \
  --json
Expected result: one record with reward_signal equal to quality_score.
Track skill diffs without modifying the real Hermes installation:
hermes-eval diff \
  --skill-dir examples/web-research \
  --history-dir .hermes-eval/skill-history \
  --json
Atropos Integration
Atropos consumes trajectories with a reward signal. hermes-eval maps each
deterministic trajectory quality score directly to reward_signal, keeping the
training signal simple and auditable. Low-quality runs can be filtered before
they enter Nous Research's RL pipeline.
The exported record includes:
- trajectory_id
- quality_score
- grade
- original ShareGPT messages
- extracted tool_calls
- reward_signal
- metadata including model, scorer version, timestamp, and threshold
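To make the mapping concrete, here is a minimal sketch of how a scored trajectory might be turned into such a record. It is an illustration, not the project's actual exporter: the function names, the input dictionary layout, the "sketch-0" scorer version, and the inclusive threshold comparison are all assumptions. Only the output field names come from the list above.

```python
from datetime import datetime, timezone


def to_atropos_record(scored, min_score=0.7, scorer_version="sketch-0"):
    """Build one export record, or return None if the run is below threshold.

    `scored` is a hypothetical dict holding an already-computed quality score.
    """
    score = scored["quality_score"]
    if score < min_score:  # assumption: the threshold is inclusive
        return None
    return {
        "trajectory_id": scored["trajectory_id"],
        "quality_score": score,
        "grade": scored["grade"],
        "messages": scored["messages"],  # original ShareGPT messages, untouched
        "tool_calls": scored.get("tool_calls", []),
        "reward_signal": score,  # reward is the quality score, unchanged
        "metadata": {
            "model": scored.get("model", "unknown"),
            "scorer_version": scorer_version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "threshold": min_score,
        },
    }


def export_records(scored_runs, min_score=0.7):
    """Filter weak runs and convert the rest into export records."""
    records = (to_atropos_record(run, min_score) for run in scored_runs)
    return [record for record in records if record is not None]
```

Because reward_signal is copied straight from quality_score, anyone auditing the training data can reproduce a reward by re-running the scorer on the same trajectory.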
GitHub Actions CI
Copy .github/workflows/skill-eval.yml into a skill repository. It runs
hermes-eval skill test --skill . --json --fail-on-regression on pull requests
that touch Markdown skills or tests.yaml.
The --fail-on-regression flag exits with code 1 when any static test fails,
which makes it suitable for PR gating.
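The gating behaviour described above can be sketched in a few lines. This is an illustration under assumptions rather than the tool's actual code: the `exit_code` helper and the `passed` key are hypothetical names, and returning 0 when the flag is absent is an assumption.

```python
def exit_code(results, fail_on_regression):
    """Return the process exit code for a list of static test results.

    Each result is a hypothetical dict with a boolean "passed" key.
    """
    any_failed = any(not result["passed"] for result in results)
    # Exit 1 only when gating is requested and at least one test failed.
    return 1 if (fail_on_regression and any_failed) else 0
```

A CI runner treats any non-zero exit code as a failed step, which is what blocks the pull request.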
Skill Test Format
Skills follow the agentskills.io convention: a Markdown SKILL.md file plus a
sibling tests.yaml. Tests are deterministic and offline. They check routing
terms, expected tool mentions, forbidden phrases, and declared constraints
without calling an LLM.
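The kind of offline check described above might look like the following sketch. It is not the project's implementation, and the `spec` keys (`routing_terms`, `expected_tools`, `forbidden_phrases`) are hypothetical stand-ins for whatever tests.yaml actually declares. The matching is plain case-insensitive substring search.

```python
def run_static_checks(skill_text, spec):
    """Return a list of failure messages; an empty list means all checks pass.

    `spec` is a hypothetical dict; the real tests.yaml schema may differ.
    """
    text = skill_text.lower()
    failures = []
    for term in spec.get("routing_terms", []):
        if term.lower() not in text:
            failures.append(f"missing routing term: {term}")
    for tool in spec.get("expected_tools", []):
        if tool.lower() not in text:
            failures.append(f"missing tool mention: {tool}")
    for phrase in spec.get("forbidden_phrases", []):
        if phrase.lower() in text:
            failures.append(f"forbidden phrase present: {phrase}")
    return failures
```

Since every check is a pure function of the file contents, the same SKILL.md and spec always yield the same result, which is what keeps regressions reproducible in CI.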
Contributing
Keep modules narrow:
- skill/ owns skill loading, static tests, scoring, and diff snapshots
- trajectory/ owns ShareGPT loading, scoring, and filtering
- atropos/ owns export conversion
- cli.py is the integration layer
- report/renderer.py owns terminal and JSON rendering
Run before opening a PR:
pytest tests/ -v --tb=short
hermes-eval --help
bash examples/run_eval.sh