SkillFortifyBench

April 24, 2026

A 540-skill, 3-format benchmark for evaluating AI agent skill supply chain security scanners.

About This Release

This release is a deterministic execution of the benchmark specification in Appendix B of Formal Analysis and Supply Chain Security for Agentic AI Skills (Bhardwaj 2026, arXiv:2603.00195). All 540 skills are produced byte-identically via python -m benchmarks.generator --seed 42, per Section B.3, Principle 2, of the paper. The novelty of this artifact is limited to engineering reproducibility (Docker pin, hash-pinned dependencies, SOURCE_DATE_EPOCH); the conceptual contribution rests with the paper. See Citations and Prior Work below for adjacent benchmarks.

Overview

SkillFortifyBench provides 540 skills across Claude (.md), MCP (.json), and OpenClaw (.yaml) formats — 270 malicious (13 attack types, A1-A13) and 270 benign (5 categories) — generated deterministically from seed=42.
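The determinism contract behind this is simple: with a fixed seed and PYTHONHASHSEED=0, every randomized choice replays identically, so the emitted bytes are stable across machines and runs. A minimal illustrative sketch of the pattern (not the generator's actual code; generate_skill and the per-skill seeding scheme are invented here):

import random

def generate_skill(seed: int, index: int) -> str:
    # Derive a per-skill RNG from (seed, index) so no global RNG state
    # leaks between skills and generation order cannot change the output.
    rng = random.Random(seed * 10_000 + index)
    attack = rng.choice([f"A{i}" for i in range(1, 14)])
    return f"# skill {index}\n# attack_type: {attack}\n"

# Same inputs, same bytes: the property the benchmark relies on.
assert generate_skill(42, 7) == generate_skill(42, 7)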

SkillFortify is part of the broader effort in AI Reliability Engineering: building tooling that makes agent ecosystems trustworthy by default.

Quick Start

pip install skillfortify
PYTHONHASHSEED=0 python -m benchmarks.generator --output ./benchmark-output --seed 42

Or via Docker (recommended for byte-identical reproduction):

docker run --rm \
    --security-opt seccomp=seccomp-default.json \
    --cap-drop=ALL \
    --read-only \
    --tmpfs /tmp:rw,nosuid,nodev,noexec \
    -v "$(pwd)/results:/bench/results:rw" \
    skillfortify-bench:v1.0.0

Contents

  • skills/claude/ — 180 Claude skills (90 malicious + 90 benign)
  • skills/mcp/ — 180 MCP server configs (90 malicious + 90 benign)
  • skills/openclaw/ — 180 OpenClaw skills (90 malicious + 90 benign)
  • manifest.json — 540-entry manifest with SHA-256 hashes
  • attack_taxonomy.json — A1-A13 attack type taxonomy
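To check an unpacked benchmark against the shipped hashes, a loop like the one below suffices. The manifest field names ("skills", "path", "sha256") are assumptions about the schema rather than documented API; adjust them to the manifest actually generated.

import hashlib
import json
import pathlib

root = pathlib.Path("benchmark-output")
manifest = json.loads((root / "manifest.json").read_text())

for entry in manifest["skills"]:  # assumed top-level key
    digest = hashlib.sha256((root / entry["path"]).read_bytes()).hexdigest()
    if digest != entry["sha256"]:
        raise SystemExit(f"hash mismatch: {entry['path']}")
print(f"verified {len(manifest['skills'])} entries")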

Attack Type Distribution (Table 11)

Type       Claude  MCP  OpenClaw  Total  Description
A1             10   10        10     30  HTTP exfiltration
A2              6    6         6     18  DNS exfiltration
A3             10   10        10     30  Prompt injection
A4             10   10        10     30  Tool poisoning
A5              6    6         6     18  Credential theft
A6              6    6         6     18  Privilege escalation
A7              8    8         8     24  Arbitrary code execution
A8              8    8         8     24  Indirect prompt injection
A9              8    8         8     24  Shadow tool registration
A10             4    4         4     12  Dependency confusion
A11             4    2         2      8  Skill squatting
A12             2    4         2      8  Typosquatting
A13             8    8        10     26  Multi-vector composite
Malicious      90   90        90    270
Benign         90   90        90    270  5 categories
Total         180  180       180    540
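The per-type counts above should be re-derivable from the manifest. A sketch of that tally, under the same schema assumptions as the verification snippet above (the "attack_type", "format", and "label" field names are hypothetical):

import collections
import json

manifest = json.load(open("benchmark-output/manifest.json"))
counts = collections.Counter(
    (e["attack_type"], e["format"])
    for e in manifest["skills"]
    if e["label"] == "malicious"
)
for atype in [f"A{i}" for i in range(1, 14)]:
    row = [counts[(atype, fmt)] for fmt in ("claude", "mcp", "openclaw")]
    print(atype, *row, sum(row))  # should reproduce Table 11 row by row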

Reproduction

Every run with seed=42 and PYTHONHASHSEED=0 produces byte-identical skill files. The manifest_content_sha256 field in manifest.json can be used to verify deterministic reproduction.
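A manifest-level check might look like the sketch below. How manifest_content_sha256 is canonicalized is an assumption here (hashing the concatenation of the per-file hashes); consult the generator source for the exact recipe.

import hashlib
import json

manifest = json.load(open("benchmark-output/manifest.json"))
concat = "".join(e["sha256"] for e in manifest["skills"])  # assumed recipe
digest = hashlib.sha256(concat.encode()).hexdigest()
print("match" if digest == manifest["manifest_content_sha256"] else "mismatch")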

Expected metrics (skillfortify 0.4.4, medium threshold):

  • Precision: 100% (254/254 flagged skills are malicious; zero false positives)
  • Recall: 94.07% (254/270) — 16 intentional false negatives
  • Wilson 95% CI for recall: [0.90592, 0.96320]
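The recall interval can be reproduced directly from these counts with the standard Wilson score formula; nothing here is SkillFortify-specific:

import math

tp, fn = 254, 16      # detected / missed malicious skills
n = tp + fn           # 270 malicious skills total
recall = tp / n       # 0.94074...

z = 1.959963984540054  # 97.5th percentile of the standard normal
denom = 1 + z * z / n
center = (recall + z * z / (2 * n)) / denom
half = z * math.sqrt(recall * (1 - recall) / n + z * z / (4 * n * n)) / denom
print(f"recall={recall:.5f}  CI=[{center - half:.5f}, {center + half:.5f}]")
# -> recall=0.94074  CI=[0.90592, 0.96320]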

Evaluation

Run the SkillFortify scanner against the benchmark:

skillfortify scan benchmark-output/skills/ --format json --severity-threshold medium
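To score a run against the ground-truth labels, join the scanner's JSON output with the manifest. The output shape assumed below (a list of findings with a "path" field, saved to scan-results.json) is a guess; adapt it to the real schema:

import json

manifest = json.load(open("benchmark-output/manifest.json"))
labels = {e["path"]: e["label"] for e in manifest["skills"]}

# scan-results.json: hypothetical file holding the scanner's JSON output
flagged = {f["path"] for f in json.load(open("scan-results.json"))}

tp = sum(1 for p, lab in labels.items() if lab == "malicious" and p in flagged)
fp = sum(1 for p, lab in labels.items() if lab == "benign" and p in flagged)
n_mal = sum(1 for lab in labels.values() if lab == "malicious")
precision = tp / (tp + fp) if tp + fp else 0.0
print(f"precision={precision:.4f}  recall={tp / n_mal:.4f}")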

Citation

See CITATION.cff or cite:

@article{bhardwaj2026skillfortify,
  title   = {SkillFortify: Static Analysis for AI Agent Skill Supply Chains},
  author  = {Bhardwaj, Varun Pratap},
  journal = {arXiv preprint arXiv:2603.00195},
  year    = {2026},
  doi     = {10.48550/arXiv.2603.00195}
}

Benchmark release DOI (Zenodo): 10.5281/zenodo.18787663

Citations and Prior Work

Static analysis of LLM agent security intersects with several concurrent efforts. SkillFortifyBench is designed as a complement to, not a replacement for, these benchmarks:

  • Holzbauer et al. 2026 — scanner-disagreement measurement across static tooling (arXiv:2603.16572).
  • SkillClone (Zhu et al., ASE 2026) — clone-detection benchmark for agent skills (arXiv:2603.22447).
  • MalTool (Hu et al. 2026) — tool-abuse pattern taxonomy (arXiv:2602.12194).
  • InjecAgent (Zhan et al. 2024) — prompt-injection benchmark (arXiv:2403.02691).
  • MCPTox (Wang et al., AAAI 2026) — MCP-specific attack corpus (arXiv:2508.14925).
  • HarmBench (Mazeika et al. 2024) — broad LLM-harm benchmark (arXiv:2402.04249).

Limitations (v1.0)

  1. Synthetic-only generation. No real-world seed tier. Natural-distribution prevalence of A1-A13 is out of scope.
  2. No hard-negative tier. Benign skills are diverse but not adversarially selected.
  3. No cross-scanner leaderboard. Comparison vs. Semgrep / Bandit / Slither remains in private notebooks; public leaderboard deferred to v1.1.
  4. No Gebru-style DATASHEET or MODEL-CARD. v1.1 will ship both.
  5. Single-analyzer-version coverage (skillfortify==0.4.4). v1.1 will cover a version matrix (0.4.4, 0.5.0, ...) with per-version tagged results.
  6. English-only skill bodies. No i18n in v1.0.

v1.1 Roadmap (target: 2026 Q3)

  • Real-world seed tier (small curated set of sanitized in-the-wild skills).
  • Hard-negative tier (adversarial benign).
  • Interactive leaderboard page.
  • DATASHEET following the Datasheets for Datasets format (Gebru et al.).
  • MODEL-CARD analog for the generator.
  • Multi-version analyzer matrix.

License

MIT — see LICENSE.

Note: The benchmarks/ subtree is MIT-licensed per paper Section 12; the rest of the SkillFortify repository is Elastic License 2.0.