Guardian Eval Guide

May 29, 2026 · View on GitHub

The eval harness measures whether Guardian got better — across releases, agent prompt changes, KB updates, and new tool wrappers. It lives at evals/ with three independent tiers.

Tiers

TierWhat it measuresCostMarker
UnitTool parse_output correctness on golden fixturesfree(default)
WorkflowEnd-to-end findings against dockerised vulnerable appsdockerintegration
AgentLLM hallucination rate, finding precision/recall, CVSS validity, cost/tokenAPI tokensagent_eval

Running

pytest evals/                                     # unit tier only
pytest evals/ -m integration                      # adds workflow tier
pytest evals/ -m agent_eval                       # adds agent tier
pytest evals/ -m "integration or agent_eval"      # both
pytest tests/ evals/                              # everything that's free

Adding fixtures (Tier 1)

Drop two files under evals/fixtures/<tool>/<case>.{input.txt,expected.json}. The runner picks them up automatically — no code changes.

expected.json is a subset match: any keys you don't assert are ignored, so adding new fields to a tool's parsed output won't break old fixtures.

# Capture real tool output
nuclei -u https://example.com -json > evals/fixtures/nuclei/example_run.input.txt

# Author the expected subset
cat > evals/fixtures/nuclei/example_run.expected.json <<'JSON'
{"vulnerabilities": [{"template": "tech-detect:nginx", "severity": "info"}]}
JSON

Adding workflow cases (Tier 2)

Edit EVAL_CASES in evals/test_workflow_integration.py. Each case needs a runnable target (URL or IP), the workflow name, and an expected_findings list of identifiers the workflow MUST surface.

Bring up the docker target out-of-band (compose, k8s, whatever) and run with -m integration. The harness deliberately does NOT manage docker state — that lives in your engagement automation.

Adding agent eval cases (Tier 3)

Append to evals/datasets/<name>.jsonl. One JSON record per line:

{"tool": "nuclei", "raw_output": "<verbatim stdout>", "expected_cves": ["CVE-..."], "expected_severities": ["critical"], "notes": "..."}

Records are public — strip operator-internal IPs, hostnames, credentials.

Scoring primitives

Reuse evals.scoring.score_binary, score_hallucinations, score_cost, RankingScore so improvements across runs are comparable. Don't print inside scorers — return dataclasses; let the test layer assert.

Acceptance gates by item

ItemGate
A1 (RAG)CVE-match rate ≥85% on evals/datasets/kb_grounding.jsonl
A2 (Debate)F1 ≥ single-agent baseline + 5pp on labeled FP/TP set
A3 (Vision)≥75% correct visual descriptions on 20-case curated set
A4 (Ollama)Recon workflow completes, structured output validates
A5 (Ranker)Top-1 accuracy > LLM baseline on held-out splits
A7 (Judge)Cost ≤30% of round-N at equal quality
B-tierWorkflow integration test produces non-empty findings + valid session JSON