SCOUT
April 24, 2026
Firmware Security Analysis Pipeline with Deterministic Evidence Packaging
Drop a firmware blob. Get SARIF findings, CycloneDX SBOM+VEX, hash-anchored evidence, and analyst-ready reasoning trails -- in one command.
SCOUT is optimized for deep analysis of a single firmware image: it acts as an analyst copilot grounded in evidence lineage, not an autonomous verdict engine. Ghidra P-code taint, adversarial LLM adjudication, reasoning persistence across findings/reports/viewer/TUI, zero pip dependencies.
- 1,123 corpus targets (Tier 1 refresh)
- 98.8% success rate (1,110 / 1,123)
- 146,943 CVE matches (Tier 1 refresh)
- 99.3% LLM-adjudicated FPR reduction (Tier 2 carry-over)
- Pair-eval FN/FP: pending (next lane)
Note
Tier 1 numbers in this README now reflect the fresh v2.6.1 corpus refresh (docs/carry_over_benchmark_v2.6.md): 1,123 targets, 1,110 success / 4 partial / 9 fatal. Tier 2 LLM numbers are still carry-over (v2.3.0, 36 firmware) until the pair-eval lane lands. See docs/benchmark_governance.md, docs/carry_over_benchmark_v2.6.md, and benchmarks/baselines/v2.5.0/manifest.json.
Tip
What's new in v2.7.2 (Phase 2C++ detection-engine integrity patch — no scorecard movement expected)
- Phase 2C++.1 — `DECOMPILED_COLOCATED_CAP = 0.45` promoted to a named constant. The `decompiled_colocated` taint method previously hardcoded a `0.50` ceiling inline. The 5-tier cap ladder (SYMBOL_COOCCURRENCE 0.40 < DECOMPILED_COLOCATED 0.45 < STATIC_CODE_VERIFIED 0.55 < STATIC_ONLY 0.60 < PCODE_VERIFIED 0.75) is now externally cited. Consumer impact: `decompiled_colocated` traces drop `0.50 → 0.45` (-0.05); ROC thresholds previously pinned at 0.50 should be retuned to 0.45. `priority_score` and `cve_scan`'s `STATIC_CODE_VERIFIED_CAP=0.55` are unchanged. Rationale: the v2.4.0 external review (`docs/upgrade_plz.md` Gap C) flagged the prior value as over-confident relative to body-text-only evidence.
- Phase 2C++.2 — legacy `addr_diff > 16` residues removed from `ghidra_analysis.py` and `ghidra_scripts/pcode_taint.py`. Commit `3352783` (v2.4.1) replaced the primary CALL-matching path with callee-name resolution but left a dead `trace_pcode_forward()` helper inside `_PYGHIDRA_SCRIPT` and an unreachable `else: addr_diff` fallback in the analyzeHeadless Strategy 1 loop. Both are now physically removed; `_trace_forward_pcode()`'s `source_api_name` parameter is required (no default). No runtime behaviour change — the production paths have done callee-name matching for 13 days already. Guard-rail tests in `tests/test_ghidra_dead_code_removed.py` pin the removal.
- No scorecard movement expected. Gap B was runtime-effective since v2.4.1. Gap C's new ceiling only binds on `decompiled_colocated` traces, which are emitted solely by the pyghidra fallback (`ghidra_analysis.py:609`) that Ghidra-12 environments don't exercise. Phase 2D' Entry Gate remains at the v2.7.1 figure of record (2/5 PASS). Pair-eval re-measurement is deferred to the session that evaluates Gap A (interprocedural taint) ROI.
- Pivot Option D (compliance-led identity) remains in force. v2.7.2 is detection-engine hygiene, not a behavioural pivot. The `compliance_report` stage and four standard mappings shipped in v2.7.0 are unchanged.
Tip
What's new in v2.7.1 (Phase 2C+.4 vendor corpus expansion — quantitative refinement of v2.7.0's scenario C)
- Pair-eval corpus extended 7 → 12 pairs — five new vendor/model entries: D-Link DIR-859 (CVE-2019-17621), D-Link DIR-878 (vendor advisory), ASUS RT-AC68U (CVE-2020-15498), Linksys WRT1900AC v2 (progression), and Linksys EA6700 (progression). Manifest registration alone clears Phase 2D' Entry Gate 5 (corpus ≥ 10).
- Phase 2D' Entry Gate scorecard: 1/5 → 2/5 PASS (Gate 4 Rerun + Gate 5 Corpus). Gate 1 recall improves 0.143 → 0.167 (+17% relative) but still FAIL; Gate 2 (tier variation) and Gate 3 (finding diversity 0.917) remain FAIL. The new TP/FP pair (DIR-859 vuln + patched both hit `aiedge.findings.web.exec_sink_overlap`) corroborates the v2.7.0 diagnosis that `findings.py`'s single-synthesis-finding selection is the structural Gate 1/3 limit.
- Honest measurement-of-record protocol — an intermediate 1st-pass measurement under partial WRT1900AC extractions transiently showed Gate 2 PASS due to `aiedge.findings.analysis_incomplete` populating the `unknown` tier; the figure of record is the ok-state measurement after the 2400-second budget rerun (Gate 2 reverts to FAIL). Documented in `docs/v2.7.1_release_plan.md`.
- Scorer reliability fix — `scripts/score_pair_corpus.py` now graceful-skips pairs with absent runs (`vulnerable_status="missing"` / `patched_status="missing"`) instead of aborting on `StopIteration`, so corpus growth and partial-coverage measurements no longer crash the release gate.
- Pivot Option D (compliance-led identity) remains in force — v2.7.1 is a quantitative refinement of v2.7.0's scenario C, not a re-pivot. The `compliance_report` stage and four standard mappings shipped in v2.7.0 are unchanged.
Why SCOUT?
Every finding has a hash-anchored evidence chain. No finding without a file path, byte offset, SHA-256 hash, and rationale. Artifacts are immutable and traceable from firmware blob to final verdict.
4-tier confidence caps with Ghidra P-code verification -- honest scoring. SYMBOL_COOCCURRENCE capped at 0.40, STATIC_CODE_VERIFIED at 0.55, STATIC_ONLY at 0.60, PCODE_VERIFIED at 0.75. Promotion to `confirmed` requires dynamic verification. We don't inflate scores.
SARIF + CycloneDX VEX + SLSA provenance -- standard formats. GitHub Code Scanning, VS Code, CI/CD integration out of the box.
Built for analyst-in-the-loop firmware review. SCOUT is strongest when used to start deep review on a single firmware image quickly, expose evidence paths, and preserve matched reasoning lineage across triage and reporting surfaces. Analyst hints loop back into next-run LLM adjudication via MCP, while final verdict ownership stays with the reviewer.
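The confidence-cap contract boils down to a one-line clamp. A minimal sketch, assuming illustrative names (the dict and `cap_confidence` are ours, not SCOUT's internal API):

```python
# Evidence tier sets a hard ceiling on confidence (values from the 4-tier
# ladder above). Names here are illustrative, not SCOUT's actual code.
CONFIDENCE_CAPS = {
    "SYMBOL_COOCCURRENCE": 0.40,
    "STATIC_CODE_VERIFIED": 0.55,
    "STATIC_ONLY": 0.60,
    "PCODE_VERIFIED": 0.75,
}

def cap_confidence(raw_score: float, evidence_tier: str) -> float:
    """Clamp a raw confidence score to the ceiling for its evidence tier."""
    return min(raw_score, CONFIDENCE_CAPS[evidence_tier])

print(cap_confidence(0.92, "STATIC_ONLY"))     # 0.6 (clamped)
print(cap_confidence(0.30, "PCODE_VERIFIED"))  # 0.3 (unchanged)
```

Whatever the upstream heuristics claim, a finding without P-code verification can never report more than 0.60; promotion past the ceilings requires dynamic verification.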
How It Works
firmware.bin ──> 42-stage pipeline ──> SARIF findings ──> Web viewer
(auto Ghidra) CycloneDX SBOM+VEX TUI dashboard
(auto CVE match) Evidence chain GitHub/VS Code
(optional LLM) SLSA attestation MCP for AI agents
# Full analysis
./scout analyze firmware.bin
# Static-only (no LLM, $0)
./scout analyze firmware.bin --no-llm
# Pre-extracted rootfs
./scout analyze firmware.img --rootfs /path/to/rootfs
# Web viewer
./scout serve aiedge-runs/<run_id> --port 8080
# TUI dashboard
./scout ti # interactive (latest run)
./scout tw # watch mode (auto-refresh)
# MCP server for AI agents
./scout mcp --project-id aiedge-runs/<run_id>
Comparison
| Feature | SCOUT | FirmAgent | EMBA | FACT | FirmAE |
|---|---|---|---|---|---|
| Scale (firmware tested) | 1,123 | 14 | -- | -- | 1,124 |
| SBOM (CycloneDX 1.6+VEX) | Yes | No | Yes | No | No |
| SARIF 2.1.0 Export | Yes | No | No | No | No |
| Hash-Anchored Evidence Chain | Yes | No | No | No | No |
| SLSA L2 Provenance | Yes | No | No | No | No |
| Known CVE Signature Matching | Yes (2,528 CVEs, 25 sigs) | No | No | No | No |
| Confidence Caps (honest scoring) | Yes | No | No | No | No |
| Ghidra Integration (auto-detect) | Yes | IDA Pro | Yes | No | No |
| AFL++ Fuzzing Pipeline | Yes | Yes | No | No | No |
| Cross-Binary IPC Chains | Yes (5 types) | No | No | No | No |
| Taint Propagation (LLM) | Yes | Yes (DeepSeek) | No | No | No |
| Adversarial FP Reduction | Yes | No | No | No | No |
| MCP Server (AI agent) | Yes | No | No | No | No |
| Web Report Viewer | Yes | No | Yes | Yes | No |
| Zero pip Dependencies | Yes | No | No | No | No |
Key Features
| | Feature | Description |
|---|---|---|
| :package: | SBOM & CVE | CycloneDX 1.6 + VEX + 25 known CVE signatures (8 vendors) + NVD scan + 2,528 local CVE DB + EPSS scoring (FIRST.org API, batched + cached) |
| :mag: | Binary Analysis | Ghidra P-code SSA dataflow taint + ELF hardening (NX/PIE/RELRO/Canary/FORTIFY) + .dynstr detection + 28 sink symbols + format string detection |
| :dart: | Attack Surface | Source→sink tracing, web server auto-detection, cross-binary IPC chains (5 types: unix socket, dbus, shm, pipe, exec) |
| :brain: | Taint Analysis | HTTP-aware inter-procedural taint, P-code SSA dataflow, call chain visualization, 4-strategy fallback (P-code → colocated → decompiled → interprocedural) |
| :robot: | LLM Engine | 4 backends (Codex CLI / Claude API / Claude Code CLI / Ollama) + centralized system prompts + structured JSON output + 5-stage parser (preamble/fence/raw/brace-counting/error-recovery) + temperature control |
| :crossed_swords: | LLM-Adjudicated Debate | Advocate/Critic LLM debate for LLM-adjudicated FPR reduction (99.3% on the Tier 2 carry-over baseline). Separate parse_failures vs llm_call_failures observability with quota_exhausted detection |
| :compass: | Explainability Surface (v2.6.1) | reasoning_trail persisted across findings, analyst Markdown, TUI, and web viewer so reviewers can inspect matched evidence lineage — not just a final score. Advocate / critic / decision / pattern-hit entries with 200-char excerpt redaction |
| :inbox_tray: | Analyst-in-the-loop Channel (v2.6.1) | 4 MCP tools for reasoning lookup, hint injection, verdict override, and category filtering. Hints loop back into next-run advocate prompt via AIEDGE_FEEDBACK_DIR (opt-in, fcntl.flock-safe) |
| :triangular_ruler: | Detection vs Priority Separation (v2.6.0) | confidence stays evidence-bound (≤0.55 static cap); priority_score / priority_inputs capture EPSS, reachability, backport, and CVSS as ranking signals. See docs/scoring_calibration.md |
| :speedboat: | Parallel DAG Execution (v2.6.0, PoC) | --experimental-parallel [N] opt-in level-wise stage parallelism (ThreadPoolExecutor + Kahn topo levels). 15 levels / max-width 7 on the 42-stage pipeline. Sequential path unchanged |
| :shield: | Security Assessment | X.509 cert scan, boot service audit, filesystem permission checks, credential mapping, hardcoded secret detection |
| :test_tube: | Fuzzing (optional) | AFL++ with CMPLOG, persistent mode, NVRAM faker, harness generation, crash triage |
| :bug: | Emulation | 4-tier (FirmAE / Pandawan+FirmSolo / QEMU user-mode / rootfs inspection) + GDB remote debug |
| :electric_plug: | MCP Server | 12 tools via Model Context Protocol for Claude Code/Desktop integration |
| :bar_chart: | Web Viewer | Glassmorphism dashboard with KPI bar, IPC map, risk heatmap, interactive evidence navigation |
| :link: | Evidence Chain | SHA-256 anchored artifacts + 4-tier confidence caps (0.40/0.55/0.60/0.75) + 5-tier exploit promotion ladder |
| :scroll: | Standard Output | SARIF 2.1.0 (GitHub Code Scanning) + CycloneDX 1.6 + VEX + SLSA Level 2 in-toto attestation |
| :gear: | CI/CD Integration | GitHub Action (.github/actions/scout-scan/) with composite Docker action + automatic SARIF upload to GitHub Security tab |
| :scales: | Regulatory Alignment | Output formats compatible with EU CRA Annex I (docs/compliance_mapping/cra_annex_i.md); SBOM output compatible with FDA Section 524B guidance; output formats compatible with ISO 21434 / UN R155 |
| :chart_with_upwards_trend: | Benchmarking | FirmAE dataset (1,123 firmware), analyst-readiness scoring, verifier-backed archive bundles, TP/FP analysis scripts |
| :key: | Vendor Decrypt | D-Link SHRS AES-128-CBC auto-decryption; Shannon entropy encryption detection (>7.9); binwalk v3 compatibility |
| :white_check_mark: | Zero Dependencies | Pure Python 3.10+ stdlib only — no pip dependencies, air-gap friendly deployment |
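The Shannon-entropy gate mentioned in the Vendor Decrypt row (flag regions scoring above 7.9 bits/byte as likely encrypted) is computable with the stdlib alone. A minimal sketch; the function name is ours and SCOUT's actual detector may differ:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte; approaches 8.0 for uniformly random (e.g. encrypted) data."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Repetitive plaintext scores low; a uniform byte distribution hits the
# theoretical maximum of 8.0, above the 7.9 threshold cited above.
print(shannon_entropy(b"AAAA" * 1024))          # 0.0
print(shannon_entropy(bytes(range(256)) * 16))  # 8.0
```

In practice such a detector is run over fixed-size windows of the firmware image, so an encrypted payload inside an otherwise plain blob still stands out.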
Analyst Copilot Surfaces
Explainability surface
- `reasoning_trail` and evidence lineage are preserved across findings, analyst Markdown, TUI, web viewer, and SARIF properties.
- This is where reviewers inspect why a finding was downgraded, upheld, or promoted.
Analyst-in-the-loop channel
- MCP tools and `AIEDGE_FEEDBACK_DIR` are the supported override/hint path.
- Human hints are allowed to influence the next run; final ownership still stays with the analyst.
Autonomous reasoning (future)
- SCOUT is not positioned as a fully autonomous exploit agent in v2.6.1.
- Multi-agent exploit chains, pair-grounded evaluation loops, and autonomous fuzz harness generation remain Phase 2D / reviewer-eval lane work.
Pipeline (42 Stages)
Firmware --> Unpack --> Profile --> Inventory --> Ghidra --> Semantic Classification
--> SBOM --> CVE Scan --> Reachability --> Endpoints --> Surfaces
--> Enhanced Source --> C-Source ID --> Taint Propagation
--> FP Verification --> Adversarial Triage
--> Graph --> Attack Surface --> Findings
--> LLM Triage --> LLM Synthesis --> Emulation --> [Fuzzing]
--> PoC Refinement --> Chain Construction --> Exploit Chain --> PoC --> Verification
Ghidra is auto-detected and enabled by default. Stages in [brackets] require optional external tools (AFL++/Docker).
Pipeline Stages Reference (42)
| Stage | Module | Purpose | LLM? | Cost |
|---|---|---|---|---|
tooling | tooling.py | External tool availability check (binwalk, Ghidra, Docker) | No | $0 |
extraction | extraction.py | Firmware unpacking (binwalk + vendor_decrypt + Shannon entropy detection) | No | $0 |
structure | structure.py | Filesystem structure analysis | No | $0 |
carving | carving.py | File carving from unstructured regions | No | $0 |
firmware_profile | firmware_profile.py | Architecture, kernel, init system fingerprinting | No | $0 |
inventory | inventory.py | Per-binary ELF hardening + symbol extraction | No | $0 |
ghidra_analysis | ghidra_analysis.py | Decompilation + P-code SSA dataflow analysis | No | $0 |
semantic_classification | semantic_classifier.py | 3-pass function classifier (static → haiku → sonnet) | Yes | Low |
sbom | sbom.py | CycloneDX 1.6 SBOM generation with VEX | No | $0 |
cve_scan | cve_scan.py | NVD + 25 known signatures + EPSS enrichment | No | $0 |
reachability | reachability.py | BFS-based call-graph reachability | No | $0 |
endpoints | endpoints.py | Network endpoint discovery | No | $0 |
surfaces | surfaces.py | Attack surface enumeration | No | $0 |
enhanced_source | enhanced_source.py | Web server auto-detection + INPUT_APIS scan (21 APIs) | No | $0 |
csource_identification | csource_identification.py | HTTP input source identification via static sentinel + QEMU | No | $0 |
taint_propagation | taint_propagation.py | Inter-procedural taint with 28 sinks + format string detection | Yes | Medium |
fp_verification | fp_verification.py | 3-pattern FP removal + LLM verification with parse/call failure separation | Yes | Low |
adversarial_triage | adversarial_triage.py | Advocate/Critic LLM debate (LLM-adjudicated FPR reduction, 99.3%) | Yes | Medium |
graph | graph.py | Communication graph (5 IPC edge types) | No | $0 |
attack_surface | attack_surface.py | Attack surface mapping with IPC chains | No | $0 |
attribution | attribution.py | Vendor/firmware attribution | No | $0 |
functional_spec | functional_spec.py | Functional specification extraction | No | $0 |
threat_model | threat_model.py | STRIDE-based threat modeling | No | $0 |
web_ui | web_ui.py | Web UI / CGI endpoint analysis | No | $0 |
findings | findings.py | Finding aggregation + SARIF export | No | $0 |
llm_triage | llm_triage.py | LLM finding triage (haiku/sonnet/opus auto-routing) | Yes | Variable |
llm_synthesis | llm_synthesis.py | LLM finding synthesis | Yes | Medium |
emulation | emulation.py | 4-tier emulation (FirmAE / Pandawan / QEMU / rootfs) | No | $0 |
dynamic_validation | dynamic_validation.py | Dynamic behavior verification | No | $0 |
fuzzing | fuzz_*.py | AFL++ fuzzing with NVRAM faker | No | $0 |
poc_refinement | poc_refinement.py | Iterative PoC generation (5 attempts) | Yes | Medium |
chain_construction | chain_constructor.py | Same-binary + cross-binary IPC exploit chains | No | $0 |
exploit_gate | stage_registry.py | Exploit promotion gate | No | $0 |
exploit_chain | exploit_chain.py | Exploit chain validation | No | $0 |
exploit_autopoc | exploit_autopoc.py | Automated PoC orchestration | Yes | Medium |
poc_validation | poc_validation.py | PoC reproduction validation | No | $0 |
exploit_policy | exploit_policy.py | Final exploit promotion decision | No | $0 |
OTA-specific stages: ota, ota_payload, ota_fs, ota_roots, ota_boottriage, firmware_lineage (Android-style OTA payload analysis).
Benchmarks
Tier 1 (Static, frozen baseline)
Baseline: v2.6.1, 2026-04-17, fresh corpus refresh (docs/carry_over_benchmark_v2.6.md)
- 1,123 firmware / 8 vendors / 98.8% success rate
- 1,110 success / 4 partial / 9 fatal
- 3,531 findings / 146,943 CVE matches
- 1,089 / 1,110 successful runs produced nonzero CVE output
Tier 2 (LLM-Adjudicated Adversarial Debate, GPT-5.3-Codex)
Baseline: v2.3.0, 2026-04-09, claude-code driver (carry-over; pair-eval lane still pending)
- 36 firmware / 9 vendors
- 2,430 findings debated → 2,412 downgraded + 18 maintained
- LLM-adjudicated FPR reduction: 99.3% | pair-grounded FN/FP: pending reviewer eval lane
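Both headline percentages follow directly from the raw counts above:

```python
# Sanity-check the headline rates from the reported counts (pure arithmetic).
tier1_success = 1110 / 1123      # Tier 1: successful runs over corpus targets
tier2_downgraded = 2412 / 2430   # Tier 2: debated findings that were downgraded

print(f"{tier1_success:.1%}")    # 98.8%
print(f"{tier2_downgraded:.1%}") # 99.3%
```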
v2.6.0 Post-merge Real-Firmware Validation
This section records post-release real-firmware validation runs, distinct from the carry-over corpus baselines above.
Validation target 1 — Netgear R7000 (codex driver, --experimental-parallel 4)
| Metric | v2.5.0 | v2.6.0 |
|---|---|---|
adversarial_triage parse_failures | 0/100 | 0/100 (100 debated, 97 downgraded, 3 maintained) |
fp_verification unverified | 0/100 | 0/100 (100 verified: 56 TP, 44 FP) |
reasoning_trail_count (top-level findings) | N/A | 0/3 top-level / 100/100 at adversarial_triage + fp_verification artifacts ¹ |
findings with priority_score | N/A | 3/3 (100% additive priority annotation) |
priority_bucket_counts | N/A | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | N/A | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} |
cve_scan EPSS enriched | 23/23 | 0 (stage skipped — sbom landed partial and cve_scan/reachability skip on sbom dependency failure ²) |
--experimental-parallel 4 wall-clock | N/A | ~170 minutes end-to-end across the registered pipeline (fp_verification dominant at 113 min; no sequential baseline for delta) |
¹ v2.6.0 → v2.6.1 follow-up (commit 7b36274): the top-level synthesis finding (web.exec_sink_overlap) now inherits matched downstream evidence lineage instead of relying only on the stage-level aggregate summary. Matching prefers run-relative binary path, falls back to binary SHA-256, and samples representative downstream trail entries deterministically so the synthesis finding reflects the alerts that actually informed it. This R7000 run reflects the v2.6.0 shipped behaviour.
² v2.6.0 → v2.6.1 follow-up (commit 8e0bb82): the R7000 extraction actually succeeded (1,664 files, 2,412 binaries scanned under squashfs-root), but the SBOM stage returned 0 components on this firmware due to a silent schema mismatch — _collect_so_files_from_inventory read inventory.file_list (a pre-v2.x key no longer emitted) and _detect_from_binary_analysis expected per-entry string_hits (replaced by matched_symbols in the current inventory schema). OpenWrt hid the bug because its opkg database alone contributes 100+ components. The fix makes both helpers walk inventory.roots directly and fall back to reading the binary file contents via a new _extract_ascii_runs helper. A clean re-run of just SbomStage on this R7000 run raises the component count from 0 to 4 (curl 7.36.0 via binary read, plus openssl 1.0.0 / libz 1 / libpthread 0 via .so* walking). Downstream cve_scan / reachability would then produce real CVE + EPSS numbers on a full pipeline re-run.
Validation target 2 — OpenWrt Archer C7 v5 (TP-Link, --no-llm)
| Metric | v2.6.0 |
|---|---|
| total findings | 3 |
reasoning_trail_count | 0 (no-llm: adversarial_triage and fp_verification are LLM-gated; trail is populated only when LLM stages run) |
findings with priority_score | 3 / 3 (100% — additive priority annotation succeeded for all findings) |
priority_bucket_counts | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} (PR #7a 3-category ontology, 0% unclassified rate) |
| notable caveats | OpenWrt is squashfs ext4 root; binwalk extracted cleanly; --no-llm path skipped reasoning_trail generation as expected. Run completed end-to-end through findings stage. |
See CHANGELOG.md for full version history and docs/scoring_calibration.md for the two-score contract.
Architecture
+--------------------------------------------------------------------+
| SCOUT (Evidence Engine) |
| |
| Firmware --> Unpack --> Profile --> Inventory --> SBOM --> CVE |
| | | | | |
| Ghidra Binary Audit 40+ sigs NVD+ |
| auto-detect NX/PIE/etc local DB|
| |
| --> Taint --> FP Filter --> Attack Surface --> Findings |
| (HTTP-aware) (3-pattern) (IPC chains) (SARIF 2.1.0) |
| |
| --> Emulation --> [Fuzzing] --> Exploit Chain --> PoC --> Verify |
| |
| 42 stages . SHA-256 manifests . 4-tier confidence caps (0.40/0.55/0.60/0.75) |
| Outputs: SARIF + CycloneDX VEX + SLSA L2 + Markdown reports |
+--------------------------------------------------------------------+
| Handoff (firmware_handoff.json) |
+--------------------------------------------------------------------+
| Terminator (Orchestrator) |
| LLM Tribunal --> Dynamic Validation --> Verified Chain |
+--------------------------------------------------------------------+
| Layer | Role | Deterministic? |
|---|---|---|
| SCOUT | Evidence production (42 stages) | Yes |
| Handoff | JSON contract between engine and orchestrator | Yes |
| Terminator | LLM tribunal, dynamic validation, exploit dev | No (auditable) |
Exploit Promotion Policy
| Level | Requirements | Placement |
|---|---|---|
dismissed | Critic rebuttal strong or confidence < 0.5 | Appendix only |
candidate | Confidence 0.5-0.8, evidence exists but chain incomplete | Report (flagged) |
high_confidence_static | Confidence >= 0.8, strong static evidence, no dynamic | Report (highlighted) |
confirmed | Confidence >= 0.8 AND >= 1 dynamic verification artifact | Report (top) |
verified_chain | Confirmed AND PoC reproduced 3x in sandbox | Exploit report |
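The placement rules above can be read as a small decision function. A hedged sketch: names and the parameter shape are ours, not SCOUT's actual policy code, and `verified_chain` additionally encodes the 3x sandbox reproduction requirement:

```python
def promotion_level(confidence: float, critic_strong_rebuttal: bool,
                    dynamic_artifacts: int, poc_repros: int) -> str:
    """Map evidence to a promotion level per the policy table (illustrative)."""
    if critic_strong_rebuttal or confidence < 0.5:
        return "dismissed"
    if confidence < 0.8:
        return "candidate"
    if dynamic_artifacts >= 1 and poc_repros >= 3:
        return "verified_chain"
    if dynamic_artifacts >= 1:
        return "confirmed"
    return "high_confidence_static"

print(promotion_level(0.85, False, 0, 0))  # high_confidence_static
print(promotion_level(0.85, False, 1, 0))  # confirmed
print(promotion_level(0.85, False, 1, 3))  # verified_chain
```

Note the ordering: a strong critic rebuttal dismisses a finding regardless of its numeric confidence.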
CLI Reference
| Command | Description |
|---|---|
./scout analyze <firmware> | Full 42-stage analysis pipeline |
./scout analyze <firmware> --quiet | Suppress real-time progress output (CI/scripted use) |
./scout analyze-8mb <firmware> | Truncated 8MB canonical track |
./scout stages <run_dir> --stages X,Y | Rerun specific stages |
./scout serve <run_dir> | Launch web report viewer |
./scout mcp [--project-id <id>] | Start MCP stdio server |
./scout tui <run_dir> | Terminal UI dashboard |
./scout ti | TUI interactive (latest run) |
./scout tw | TUI watch mode (auto-refresh) |
./scout to | TUI one-shot (latest run) |
./scout t | TUI default (latest run) |
./scout corpus-validate | Validate corpus manifest |
./scout quality-metrics | Compute quality metrics |
./scout quality-gate | Check quality thresholds |
./scout release-quality-gate | Unified release gate |
Exit codes: 0 success, 10 partial, 20 fatal, 30 policy violation
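For scripted or CI use, the exit-code contract can be consumed with a small lookup. A sketch; the mapping mirrors the line above, the function name is ours:

```python
# Map SCOUT's documented exit codes to labels for CI scripting, e.g. after
# subprocess.run(["./scout", "analyze", "firmware.bin"]).returncode.
EXIT_CODES = {0: "success", 10: "partial", 20: "fatal", 30: "policy violation"}

def classify_exit(returncode: int) -> str:
    return EXIT_CODES.get(returncode, f"unknown ({returncode})")

print(classify_exit(10))  # partial
print(classify_exit(42))  # unknown (42)
```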
Benchmarking
# FirmAE dataset benchmark (1,123 usable firmware images in the current frozen baseline)
./scripts/benchmark_firmae.sh --parallel 8 --time-budget 1800 --cleanup
# Options
--dataset-dir DIR # Firmware directory (default: aiedge-inputs/firmae-benchmark)
--results-dir DIR # Output directory
--file-list PATH # Explicit newline-delimited firmware list
--parallel N # Concurrent jobs (default: 4)
--time-budget S # Seconds per firmware (default: 600)
--stages STAGES # Specific stages (default: full pipeline)
--max-images N # Limit images (0 = all)
--llm # Enable LLM-backed stages
--8mb # Use 8MB truncated track
--full # Include dynamic stages
--cleanup # Preserve a verifier-friendly run replica under results/archives/, then delete original run dirs
--dry-run # List files without running
# Analyst-readiness re-evaluation for an existing benchmark-results tree
python3 scripts/reevaluate_benchmark_results.py \
--results-dir benchmark-results/<run>
# Normalize legacy bundles and rerun a stage subset (useful for debugging archive fidelity issues)
python3 scripts/rerun_benchmark_stages.py \
--results-dir benchmark-results/<legacy-run> \
--out-dir benchmark-results/<rerun-out> \
--stages attribution,graph,attack_surface \
--no-llm
# Post-benchmark analysis
PYTHONPATH=src python3 scripts/cve_rematch.py \
--results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
--nvd-dir data/nvd-cache \
--csv-out cve_matches.csv
PYTHONPATH=src python3 scripts/analyze_findings.py \
--results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
--output analysis_report.json
# FirmAE dataset setup
./scripts/unpack_firmae_dataset.sh [ZIP_FILE]
# Tier 1 frozen baseline docs
# - docs/tier1_rebenchmark_frozen_baseline.md
# - docs/tier1_rebenchmark_final_analysis.md
Current benchmark contract
- Archived benchmark bundles are now expected to be run replicas, not flattened JSON snapshots.
- Benchmark quality is reported in two layers:
  - analysis rate = pipeline completed (`success` + `partial`)
  - analyst-ready rate = archived bundle passes analyst/verifier checks and remains evidence-navigable
- `benchmark-results/legacy/tier2-llm-v2` is a legacy snapshot. It is useful for historical reference and re-evaluation, but it should not be used as the canonical analyst-readiness baseline.
- The current contract has been validated on a fresh single-sample run (`benchmark-results/tier2-single-fidelity`) where both analyst verifiers passed from the archived bundle.
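The two reporting layers can be sketched as follows. The Tier 1 success/partial/fatal split is real; the analyst-ready count here is a made-up placeholder, not a reported figure:

```python
# Illustrative two-layer reporting: pipeline completion vs analyst-readiness.
def benchmark_rates(success: int, partial: int, fatal: int,
                    analyst_ready: int) -> tuple[float, float]:
    total = success + partial + fatal
    return (success + partial) / total, analyst_ready / total

# 1110/4/9 are the Tier 1 refresh counts; 1000 is a placeholder.
analysis_rate, ready_rate = benchmark_rates(1110, 4, 9, 1000)
print(f"analysis rate:      {analysis_rate:.1%}")  # 99.2%
print(f"analyst-ready rate: {ready_rate:.1%}")     # 89.0%
```

The point of the split is that a run can complete (counted in the first rate) while its archived bundle still fails verifier checks (excluded from the second).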
Current LLM quality behavior
- `llm_triage` model routing: `<=10` haiku, `11-50` sonnet, `>50` or chain-backed opus
- `llm_triage` retries with `sonnet` if a `haiku` call exits non-zero
- `llm_triage`, `semantic_classification`, `adversarial_triage`, and `fp_verification` now write `stages/<stage>/llm_trace/*.json`
- Parse failures are handled fail-closed: repaired when possible, otherwise reported as degraded/partial instead of silently treated as clean success
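The routing rule reduces to a few comparisons. A sketch using the documented thresholds; function and parameter names are illustrative, not SCOUT's internals:

```python
# Route findings batches to a model tier: <=10 findings -> haiku,
# 11-50 -> sonnet, >50 or chain-backed -> opus (per the rules above).
def route_model(finding_count: int, chain_backed: bool) -> str:
    if chain_backed or finding_count > 50:
        return "opus"
    if finding_count > 10:
        return "sonnet"
    return "haiku"

print(route_model(8, False))   # haiku
print(route_model(30, False))  # sonnet
print(route_model(12, True))   # opus
```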
Environment Variables
Core
| Variable | Default | Description |
|---|---|---|
AIEDGE_LLM_DRIVER | codex | LLM provider: codex / claude / claude-code / ollama |
ANTHROPIC_API_KEY | -- | API key for Claude driver (not needed for claude-code) |
AIEDGE_OLLAMA_URL | http://localhost:11434 | Ollama server URL |
AIEDGE_LLM_BUDGET_USD | -- | LLM cost budget limit |
AIEDGE_PRIV_RUNNER | -- | Privileged command prefix for dynamic stages |
AIEDGE_FEEDBACK_DIR | aiedge-feedback | Terminator feedback directory |
Ghidra
| Variable | Default | Description |
|---|---|---|
AIEDGE_GHIDRA_HOME | auto-detect | Ghidra install path; probes /opt/ghidra_*, /usr/local/ghidra* |
AIEDGE_GHIDRA_MAX_BINARIES | 20 | Max binaries to analyze |
AIEDGE_GHIDRA_TIMEOUT_S | 300 | Per-binary analysis timeout |
SBOM & CVE
| Variable | Default | Description |
|---|---|---|
AIEDGE_NVD_API_KEY | -- | NVD API key (optional, improves rate limits) |
AIEDGE_NVD_CACHE_DIR | -- | Cross-run NVD response cache |
AIEDGE_SBOM_MAX_COMPONENTS | 500 | Maximum SBOM components |
AIEDGE_CVE_SCAN_MAX_COMPONENTS | 50 | Maximum components to CVE-scan |
AIEDGE_CVE_SCAN_TIMEOUT_S | 30 | Per-request NVD API timeout |
Fuzzing & Emulation
| Variable | Default | Description |
|---|---|---|
AIEDGE_AFLPP_IMAGE | aflplusplus/aflplusplus | AFL++ Docker image |
AIEDGE_FUZZ_BUDGET_S | 3600 | Fuzzing time budget (seconds) |
AIEDGE_FUZZ_MAX_TARGETS | 5 | Max fuzzing target binaries |
AIEDGE_EMULATION_IMAGE | scout-emulation:latest | Emulation Docker image |
AIEDGE_FIRMAE_ROOT | /opt/FirmAE | FirmAE installation path |
AIEDGE_QEMU_GDB_PORT | 1234 | QEMU GDB remote port |
Quality Gates
| Variable | Default | Description |
|---|---|---|
AIEDGE_QG_PRECISION_MIN | 0.9 | Minimum precision threshold |
AIEDGE_QG_RECALL_MIN | 0.6 | Minimum recall threshold |
AIEDGE_QG_FPR_MAX | 0.1 | Maximum false positive rate |
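A gate check against these thresholds might look like the sketch below. Reading the `AIEDGE_QG_*` variables mirrors the table; the function itself is illustrative, not SCOUT's `quality-gate` implementation:

```python
import os

def quality_gate(precision: float, recall: float, fpr: float) -> bool:
    """Pass only if all three metrics clear their configured thresholds."""
    p_min = float(os.environ.get("AIEDGE_QG_PRECISION_MIN", "0.9"))
    r_min = float(os.environ.get("AIEDGE_QG_RECALL_MIN", "0.6"))
    fpr_max = float(os.environ.get("AIEDGE_QG_FPR_MAX", "0.1"))
    return precision >= p_min and recall >= r_min and fpr <= fpr_max

print(quality_gate(0.95, 0.70, 0.05))  # True
print(quality_gate(0.95, 0.50, 0.05))  # False (recall below threshold)
```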
Run Directory Structure
aiedge-runs/<run_id>/
├── manifest.json
├── firmware_handoff.json
├── provenance.intoto.jsonl # SLSA L2 attestation
├── input/firmware.bin
├── stages/
│ ├── extraction/ # Unpacked filesystem
│ ├── inventory/
│ │ └── binary_analysis.json # Per-binary hardening + symbols
│ ├── enhanced_source/
│ │ └── sources.json # HTTP input sources + web server detection
│ ├── sbom/
│ │ ├── sbom.json # CycloneDX 1.6
│ │ └── vex.json # VEX exploitability
│ ├── cve_scan/
│ │ └── cve_matches.json # NVD + known signature matches
│ ├── taint_propagation/
│ │ └── taint_results.json # Taint paths + call chains
│ ├── ghidra_analysis/ # Decompiled functions (optional)
│ ├── chain_construction/
│ │ └── chains.json # Same-binary + cross-binary IPC chains
│ ├── findings/
│ │ ├── findings.json # All findings
│ │ ├── pattern_scan.json # Static pattern matches
│ │ ├── sarif.json # SARIF 2.1.0 export
│ │ └── stage.json # SHA-256 manifest
│ └── ... # 42 stage directories total
└── report/
├── viewer.html # Web dashboard
├── report.json
├── analyst_digest.json
└── executive_report.md
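Since every stage directory carries a SHA-256 manifest (e.g. `stages/findings/stage.json`), evidence integrity is a matter of recomputing digests. A sketch of that check; the `{"files": {relpath: sha256}}` manifest layout is an assumption for illustration, not SCOUT's documented `stage.json` schema:

```python
import hashlib
import tempfile
from pathlib import Path

def verify_manifest(stage_dir: Path, manifest: dict) -> list[str]:
    """Return the relative paths whose on-disk SHA-256 no longer matches."""
    mismatches = []
    for relpath, expected in manifest.get("files", {}).items():
        actual = hashlib.sha256((stage_dir / relpath).read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append(relpath)
    return mismatches

# Demo against a throwaway directory standing in for a stage dir:
with tempfile.TemporaryDirectory() as d:
    artifact = Path(d) / "findings.json"
    artifact.write_bytes(b"{}")
    manifest = {"files": {"findings.json": hashlib.sha256(b"{}").hexdigest()}}
    print(verify_manifest(Path(d), manifest))  # [] -> chain intact
```

The bundled verification scripts (next section) perform the production version of this check.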
Verification Scripts
# Evidence chain integrity
python3 scripts/verify_analyst_digest.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_verified_chain.py --run-dir aiedge-runs/<run_id>
# Report schema compliance
python3 scripts/verify_aiedge_final_report.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_aiedge_analyst_report.py --run-dir aiedge-runs/<run_id>
# Security invariants
python3 scripts/verify_run_dir_evidence_only.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_network_isolation.py --run-dir aiedge-runs/<run_id>
# Quality gates
./scout release-quality-gate aiedge-runs/<run_id>
Documentation
| Document | Purpose |
|---|---|
| Blueprint | Pipeline architecture and design rationale |
| Status | Current implementation status |
| Artifact Schema | Profiling + inventory contracts |
| Adapter Contract | Terminator-SCOUT handoff protocol |
| Report Contract | Report structure and governance |
| Analyst Digest | Digest schema and verdicts |
| Verified Chain | Evidence requirements |
| Duplicate Gate | Cross-run dedup rules |
| Known CVE Ground Truth | CVE validation dataset |
| Upgrade Plan v2 | v2.0 upgrade plan |
| LLM Roadmap | LLM integration strategy |
Security & Ethics
Authorized environments only.
SCOUT is intended for contracted security audits, vulnerability research (responsible disclosure), and CTF/training in lab environments. Dynamic validation runs in network-isolated sandbox containers. No weaponized payloads are included.
Contributing
- Read Blueprint for architecture context
- Run `pytest -q` -- all tests must pass
- Lint with `ruff check src/` -- zero violations
- Follow the Stage protocol (`src/aiedge/stage.py`)
- Zero pip dependencies -- stdlib only
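For orientation, a stage conforming to a protocol like `src/aiedge/stage.py` might follow the stdlib `typing.Protocol` pattern sketched below. The interface shown (the `name` attribute and `run` signature) is hypothetical; check the actual file for the real contract:

```python
from typing import Protocol

class Stage(Protocol):
    """Hypothetical shape of a pipeline stage (illustrative, not the real API)."""
    name: str
    def run(self, run_dir: str) -> dict: ...

class ExampleStage:
    """A structurally-conforming stage: no inheritance needed with Protocol."""
    name = "example"

    def run(self, run_dir: str) -> dict:
        return {"status": "success", "artifacts": []}

def execute(stage: Stage, run_dir: str) -> dict:
    return stage.run(run_dir)

print(execute(ExampleStage(), "aiedge-runs/demo")["status"])  # success
```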
License
Apache 2.0