Guardian Eval Guide

May 29, 2026 · View on GitHub

The eval harness measures whether Guardian got better — across releases, agent prompt changes, KB updates, and new tool wrappers. It lives at evals/ with three independent tiers.

Tiers

Tier	What it measures	Cost	Marker
Unit	Tool `parse_output` correctness on golden fixtures	free	(default)
Workflow	End-to-end findings against dockerised vulnerable apps	docker	`integration`
Agent	LLM hallucination rate, finding precision/recall, CVSS validity, cost/token	API tokens	`agent_eval`

Running

pytest evals/                                     # unit tier only
pytest evals/ -m integration                      # adds workflow tier
pytest evals/ -m agent_eval                       # adds agent tier
pytest evals/ -m "integration or agent_eval"      # both
pytest tests/ evals/                              # everything that's free

Adding fixtures (Tier 1)

Drop two files under evals/fixtures/<tool>/<case>.{input.txt,expected.json}. The runner picks them up automatically — no code changes.

expected.json is a subset match: any keys you don't assert are ignored, so adding new fields to a tool's parsed output won't break old fixtures.

# Capture real tool output
nuclei -u https://example.com -json > evals/fixtures/nuclei/example_run.input.txt

# Author the expected subset
cat > evals/fixtures/nuclei/example_run.expected.json <<'JSON'
{"vulnerabilities": [{"template": "tech-detect:nginx", "severity": "info"}]}
JSON

Adding workflow cases (Tier 2)

Edit EVAL_CASES in evals/test_workflow_integration.py. Each case needs a runnable target (URL or IP), the workflow name, and an expected_findings list of identifiers the workflow MUST surface.

Bring up the docker target out-of-band (compose, k8s, whatever) and run with -m integration. The harness deliberately does NOT manage docker state — that lives in your engagement automation.

Adding agent eval cases (Tier 3)

Append to evals/datasets/<name>.jsonl. One JSON record per line:

{"tool": "nuclei", "raw_output": "<verbatim stdout>", "expected_cves": ["CVE-..."], "expected_severities": ["critical"], "notes": "..."}

Records are public — strip operator-internal IPs, hostnames, credentials.

Reuse evals.scoring.score_binary, score_hallucinations, score_cost, RankingScore so improvements across runs are comparable. Don't print inside scorers — return dataclasses; let the test layer assert.

Acceptance gates by item

Item	Gate
A1 (RAG)	CVE-match rate ≥85% on `evals/datasets/kb_grounding.jsonl`
A2 (Debate)	F1 ≥ single-agent baseline + 5pp on labeled FP/TP set
A3 (Vision)	≥75% correct visual descriptions on 20-case curated set
A4 (Ollama)	Recon workflow completes, structured output validates
A5 (Ranker)	Top-1 accuracy > LLM baseline on held-out splits
A7 (Judge)	Cost ≤30% of round-N at equal quality
B-tier	Workflow integration test produces non-empty findings + valid session JSON