hermes-eval

May 16, 2026 ยท View on GitHub

hermes-eval is a deterministic Python CLI for regression testing Hermes Agent skills and scoring Hermes ShareGPT trajectories before they are used for Atropos training data.

The project gives Hermes Agent a practical quality gate:

  • run static regression tests for an agentskills.io-style SKILL.md
  • score batch-run trajectories without LLM calls or external APIs
  • filter weak trajectories before export
  • convert high-quality runs into Atropos-compatible reward records
  • snapshot skill changes with a local Git history

Installation

pip install hermes-eval

For local development:

pip install -e ".[dev]"
pytest tests/ -v

Five-minute Quickstart

hermes-eval --help

Expected output includes these top-level commands:

Commands:
  diff
  export
  skill
  traj

Run the bundled example skill tests:

hermes-eval skill test --skill examples/web-research --verbose

Expected result: a Rich table showing Score 1.00, Grade A, Passed 4, and Failed 0.

Score a good trajectory:

hermes-eval traj score --input tests/fixtures/sample_trajectories/good_run.json --json

Expected result: JSON containing good-run-001 with a score above 0.7.

Filter a trajectory folder:

hermes-eval traj filter --input tests/fixtures/sample_trajectories/ --min-score 0.7 --json

Expected result: only good-run-001 remains.

Export to Atropos:

hermes-eval export atropos \
  --input tests/fixtures/sample_trajectories/ \
  --output .hermes-eval/atropos_out.json \
  --min-score 0.7 \
  --json

Expected result: one record with reward_signal equal to quality_score.

Track skill diffs without modifying the real Hermes installation:

hermes-eval diff \
  --skill-dir examples/web-research \
  --history-dir .hermes-eval/skill-history \
  --json

Atropos Integration

Atropos consumes trajectories with a reward signal. hermes-eval maps each deterministic trajectory quality score directly to reward_signal, keeping the training signal simple and auditable. Low-quality runs can be filtered before they enter Nous Research's RL pipeline.

The exported record includes:

  • trajectory_id
  • quality_score
  • grade
  • original ShareGPT messages
  • extracted tool_calls
  • reward_signal
  • metadata including model, scorer version, timestamp, and threshold

GitHub Actions CI

Copy .github/workflows/skill-eval.yml into a skill repository. It runs hermes-eval skill test --skill . --json --fail-on-regression on pull requests that touch Markdown skills or tests.yaml.

The --fail-on-regression flag exits with code 1 when any static test fails, which makes it suitable for PR gating.

Skill Test Format

Skills follow the agentskills.io convention: a Markdown SKILL.md file plus a sibling tests.yaml. Tests are deterministic and offline. They check routing terms, expected tool mentions, forbidden phrases, and declared constraints without calling an LLM.

Contributing

Keep modules narrow:

  • skill/ owns skill loading, static tests, scoring, and diff snapshots
  • trajectory/ owns ShareGPT loading, scoring, and filtering
  • atropos/ owns export conversion
  • cli.py is the integration layer
  • report/renderer.py owns terminal and JSON rendering

Run before opening a PR:

pytest tests/ -v --tb=short
hermes-eval --help
bash examples/run_eval.sh