SCOUT

April 24, 2026

Firmware Security Analysis Pipeline with Deterministic Evidence Packaging

Drop a firmware blob. Get SARIF findings, CycloneDX SBOM+VEX, hash-anchored evidence, and analyst-ready reasoning trails -- in one command.

SCOUT is optimized for deep analysis of a single firmware image: it acts as an analyst copilot grounded in evidence lineage, not an autonomous verdict engine. Ghidra P-code taint, adversarial LLM adjudication, reasoning persistence across findings/reports/viewer/TUI, zero pip dependencies.




| Metric | Value | Basis |
| --- | --- | --- |
| Corpus Targets | 1,123 | Tier 1 refresh |
| Success Rate | 98.8% (1,110 / 1,123) | Tier 1 refresh |
| CVE Matches | 146,943 | Tier 1 refresh |
| LLM-Adjudicated FPR Reduction | 99.3% | Tier 2 carry-over |
| Pair-Eval FN/FP | Pending | next lane |

Tier 1 fresh baseline: v2.6.1 corpus refresh, 2026-04-17, 1,123 firmware, success 1,110 / partial 4 / fatal 9 · Tier 2 carry-over: v2.3.0, 2026-04-09, claude-code driver, 36 firmware

English (this file) | 한국어


Note

Tier 1 numbers in this README now reflect the fresh v2.6.1 corpus refresh (docs/carry_over_benchmark_v2.6.md): 1,123 targets, 1110 success / 4 partial / 9 fatal. Tier 2 LLM numbers are still carry-over (v2.3.0, 36 firmware) until the pair-eval lane lands. See docs/benchmark_governance.md, docs/carry_over_benchmark_v2.6.md, and benchmarks/baselines/v2.5.0/manifest.json.

Tip

What's new in v2.7.2 (Phase 2C++ detection-engine integrity patch — no scorecard movement expected)

  • Phase 2C++.1 — DECOMPILED_COLOCATED_CAP = 0.45 promoted to a named constant. The decompiled_colocated taint method previously hardcoded a 0.50 ceiling inline. The 5-tier cap ladder (SYMBOL_COOCCURRENCE 0.40 < DECOMPILED_COLOCATED 0.45 < STATIC_CODE_VERIFIED 0.55 < STATIC_ONLY 0.60 < PCODE_VERIFIED 0.75) is now externally cited. Consumer impact: decompiled_colocated traces drop 0.50 → 0.45 (-0.05); ROC thresholds previously pinned at 0.50 should be retuned to 0.45. priority_score and cve_scan's STATIC_CODE_VERIFIED_CAP=0.55 unchanged. Rationale: the v2.4.0 external review (docs/upgrade_plz.md Gap C) flagged the prior value as over-confident relative to body-text-only evidence.
  • Phase 2C++.2 — legacy addr_diff > 16 residues removed from ghidra_analysis.py and ghidra_scripts/pcode_taint.py. Commit 3352783 (v2.4.1) replaced the primary CALL-matching path with callee-name resolution but left a dead trace_pcode_forward() helper inside _PYGHIDRA_SCRIPT and an unreachable else: addr_diff fallback in the analyzeHeadless Strategy 1 loop. Both are now physically removed; _trace_forward_pcode()'s source_api_name parameter is required (no default). No runtime behaviour change — the production paths have done callee-name matching for 13 days already. Guard-rail tests in tests/test_ghidra_dead_code_removed.py pin the removal.
  • No scorecard movement expected. Gap B was runtime-effective since v2.4.1. Gap C's new ceiling only binds on decompiled_colocated traces, which are emitted solely by the pyghidra fallback (ghidra_analysis.py:609) that Ghidra-12 environments don't exercise. Phase 2D' Entry Gate remains at the v2.7.1 figure of record (2/5 PASS). Pair-eval re-measurement is deferred to the session that evaluates Gap A (interprocedural taint) ROI.
  • Pivot Option D (compliance-led identity) remains in force. v2.7.2 is detection-engine hygiene, not a behavioural pivot. The compliance_report stage and four standard mappings shipped in v2.7.0 are unchanged.

Tip

What's new in v2.7.1 (Phase 2C+.4 vendor corpus expansion — quantitative refinement of v2.7.0's scenario C)

  • Pair-eval corpus extended 7 → 12 pairs — five new vendor/model entries: D-Link DIR-859 (CVE-2019-17621), D-Link DIR-878 (vendor advisory), ASUS RT-AC68U (CVE-2020-15498), Linksys WRT1900AC v2 (progression), and Linksys EA6700 (progression). Manifest registration alone clears Phase 2D' Entry Gate 5 (corpus ≥ 10).
  • Phase 2D' Entry Gate scorecard: 1/5 → 2/5 PASS (Gate 4 Rerun + Gate 5 Corpus). Gate 1 recall improves 0.143 → 0.167 (+17% relative) but still FAIL; Gate 2 (tier variation) and Gate 3 (finding diversity 0.917) remain FAIL. The new TP/FP pair (DIR-859 vuln + patched both hit aiedge.findings.web.exec_sink_overlap) corroborates the v2.7.0 diagnosis that findings.py's single-synthesis-finding selection is the structural Gate 1/3 limit.
  • Honest measurement-of-record protocol — an intermediate 1st-pass measurement under partial WRT1900AC extractions transiently showed Gate 2 PASS due to aiedge.findings.analysis_incomplete populating the unknown tier; the figure of record is the ok-state measurement after the 2400-second budget rerun (Gate 2 reverts to FAIL). Documented in docs/v2.7.1_release_plan.md.
  • Scorer reliability fix: scripts/score_pair_corpus.py now gracefully skips pairs with absent runs (vulnerable_status="missing" / patched_status="missing") instead of aborting on StopIteration, so corpus growth and partial-coverage measurements no longer crash the release gate.
  • Pivot Option D (compliance-led identity) remains in force — v2.7.1 is a quantitative refinement of v2.7.0's scenario C, not a re-pivot. The compliance_report stage and four standard mappings shipped in v2.7.0 are unchanged.

Why SCOUT?

Every finding has a hash-anchored evidence chain. No finding without a file path, byte offset, SHA-256 hash, and rationale. Artifacts are immutable and traceable from firmware blob to final verdict.
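As an illustration, a hash-anchor check of this kind can be written in a few lines of stdlib Python. The record fields (`path`, `offset`, `length`, `sha256`, `rationale`) are hypothetical names chosen for this sketch, not SCOUT's actual evidence schema:

```python
import hashlib

def verify_evidence(record: dict, blob: bytes) -> bool:
    """Check that an evidence record's hash anchor matches the bytes it cites.

    `record` is a hypothetical evidence entry carrying the four required
    properties: a file path, a byte offset (plus length), a SHA-256 digest,
    and a human-readable rationale.
    """
    for key in ("path", "offset", "length", "sha256", "rationale"):
        if key not in record:
            return False
    cited = blob[record["offset"] : record["offset"] + record["length"]]
    return hashlib.sha256(cited).hexdigest() == record["sha256"]

blob = b"\x7fELF...firmware bytes..."
record = {
    "path": "squashfs-root/usr/sbin/httpd",
    "offset": 0,
    "length": 4,
    "sha256": hashlib.sha256(blob[:4]).hexdigest(),
    "rationale": "ELF magic at file start",
}
print(verify_evidence(record, blob))  # True
```

Because the anchor is recomputable from the artifact alone, any consumer of the run directory can re-verify a finding without trusting the pipeline that produced it.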

5-tier confidence caps with Ghidra P-code verification -- honest scoring. SYMBOL_COOCCURRENCE is capped at 0.40, DECOMPILED_COLOCATED at 0.45, STATIC_CODE_VERIFIED at 0.55, STATIC_ONLY at 0.60, and PCODE_VERIFIED at 0.75. Promotion to confirmed requires dynamic verification. We don't inflate scores.
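A cap ladder of this kind reduces to a simple clamp. The values below are the ones quoted in this README; the dict layout and function name are illustrative, not SCOUT's module structure:

```python
# Evidence-tier confidence ceilings (values quoted from this README's cap ladder).
CONFIDENCE_CAPS = {
    "symbol_cooccurrence": 0.40,
    "decompiled_colocated": 0.45,
    "static_code_verified": 0.55,
    "static_only": 0.60,
    "pcode_verified": 0.75,
}

def capped_confidence(raw_score: float, evidence_tier: str) -> float:
    """Clamp a raw detector score to its evidence tier's ceiling."""
    return min(raw_score, CONFIDENCE_CAPS[evidence_tier])

print(capped_confidence(0.9, "symbol_cooccurrence"))  # 0.4
print(capped_confidence(0.5, "pcode_verified"))       # 0.5
```

The point of the clamp is that no amount of detector enthusiasm can exceed what the evidence tier justifies; only a stronger tier (ultimately dynamic verification) raises the ceiling.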

SARIF + CycloneDX VEX + SLSA provenance -- standard formats. GitHub Code Scanning, VS Code, CI/CD integration out of the box.

Built for analyst-in-the-loop firmware review. SCOUT is strongest when used to start deep review on a single firmware image quickly, expose evidence paths, and preserve matched reasoning lineage across triage and reporting surfaces. Analyst hints loop back into next-run LLM adjudication via MCP, while final verdict ownership stays with the reviewer.


How It Works

  firmware.bin  ──>  42-stage pipeline  ──>  SARIF findings       ──>  Web viewer
                     (auto Ghidra)          CycloneDX SBOM+VEX       TUI dashboard
                     (auto CVE match)       Evidence chain            GitHub/VS Code
                     (optional LLM)         SLSA attestation          MCP for AI agents
# Full analysis
./scout analyze firmware.bin

# Static-only (no LLM, $0)
./scout analyze firmware.bin --no-llm

# Pre-extracted rootfs
./scout analyze firmware.img --rootfs /path/to/rootfs

# Web viewer
./scout serve aiedge-runs/<run_id> --port 8080

# TUI dashboard
./scout ti                    # interactive (latest run)
./scout tw                    # watch mode (auto-refresh)

# MCP server for AI agents
./scout mcp --project-id aiedge-runs/<run_id>

Comparison

| Feature | SCOUT | FirmAgent | EMBA | FACT | FirmAE |
| --- | --- | --- | --- | --- | --- |
| Scale (firmware tested) | 1,123 | 14 | -- | -- | 1,124 |
| SBOM (CycloneDX 1.6 + VEX) | Yes | No | Yes | No | No |
| SARIF 2.1.0 Export | Yes | No | No | No | No |
| Hash-Anchored Evidence Chain | Yes | No | No | No | No |
| SLSA L2 Provenance | Yes | No | No | No | No |
| Known CVE Signature Matching | Yes (2,528 CVEs, 25 sigs) | No | No | No | No |
| Confidence Caps (honest scoring) | Yes | No | No | No | No |
| Ghidra Integration (auto-detect) | Yes | IDA Pro | Yes | No | No |
| AFL++ Fuzzing Pipeline | Yes | Yes | No | No | No |
| Cross-Binary IPC Chains | Yes (5 types) | No | No | No | No |
| Taint Propagation (LLM) | Yes | Yes (DeepSeek) | No | No | No |
| Adversarial FP Reduction | Yes | No | No | No | No |
| MCP Server (AI agent) | Yes | No | No | No | No |
| Web Report Viewer | Yes | No | Yes | Yes | No |
| Zero pip Dependencies | Yes | No | No | No | No |

Key Features

| Feature | Description |
| --- | --- |
| :package: SBOM & CVE | CycloneDX 1.6 + VEX + 25 known CVE signatures (8 vendors) + NVD scan + 2,528 local CVE DB + EPSS scoring (FIRST.org API, batched + cached) |
| :mag: Binary Analysis | Ghidra P-code SSA dataflow taint + ELF hardening (NX/PIE/RELRO/Canary/FORTIFY) + .dynstr detection + 28 sink symbols + format string detection |
| :dart: Attack Surface | Source→sink tracing, web server auto-detection, cross-binary IPC chains (5 types: unix socket, dbus, shm, pipe, exec) |
| :brain: Taint Analysis | HTTP-aware inter-procedural taint, P-code SSA dataflow, call chain visualization, 4-strategy fallback (P-code → colocated → decompiled → interprocedural) |
| :robot: LLM Engine | 4 backends (Codex CLI / Claude API / Claude Code CLI / Ollama) + centralized system prompts + structured JSON output + 5-stage parser (preamble/fence/raw/brace-counting/error-recovery) + temperature control |
| :crossed_swords: LLM-Adjudicated Debate | Advocate/Critic LLM debate for LLM-adjudicated FPR reduction (99.3% on the Tier 2 carry-over baseline); separate parse_failures vs llm_call_failures observability with quota_exhausted detection |
| :compass: Explainability Surface (v2.6.1) | reasoning_trail persisted across findings, analyst Markdown, TUI, and web viewer so reviewers can inspect matched evidence lineage, not just a final score; advocate / critic / decision / pattern-hit entries with 200-char excerpt redaction |
| :inbox_tray: Analyst-in-the-loop Channel (v2.6.1) | 4 MCP tools for reasoning lookup, hint injection, verdict override, and category filtering; hints loop back into the next-run advocate prompt via AIEDGE_FEEDBACK_DIR (opt-in, fcntl.flock-safe) |
| :triangular_ruler: Detection vs Priority Separation (v2.6.0) | confidence stays evidence-bound (≤0.55 static cap); priority_score / priority_inputs capture EPSS, reachability, backport, and CVSS as ranking signals; see docs/scoring_calibration.md |
| :speedboat: Parallel DAG Execution (v2.6.0, PoC) | `--experimental-parallel [N]` opt-in level-wise stage parallelism (ThreadPoolExecutor + Kahn topo levels); 15 levels / max width 7 on the 42-stage pipeline; sequential path unchanged |
| :shield: Security Assessment | X.509 cert scan, boot service audit, filesystem permission checks, credential mapping, hardcoded secret detection |
| :test_tube: Fuzzing (optional) | AFL++ with CMPLOG, persistent mode, NVRAM faker, harness generation, crash triage |
| :bug: Emulation | 4-tier (FirmAE / Pandawan+FirmSolo / QEMU user-mode / rootfs inspection) + GDB remote debug |
| :electric_plug: MCP Server | 12 tools via Model Context Protocol for Claude Code/Desktop integration |
| :bar_chart: Web Viewer | Glassmorphism dashboard with KPI bar, IPC map, risk heatmap, interactive evidence navigation |
| :link: Evidence Chain | SHA-256 anchored artifacts + 5-tier confidence caps (0.40/0.45/0.55/0.60/0.75) + 5-tier exploit promotion ladder |
| :scroll: Standard Output | SARIF 2.1.0 (GitHub Code Scanning) + CycloneDX 1.6 + VEX + SLSA Level 2 in-toto attestation |
| :gear: CI/CD Integration | GitHub Action (.github/actions/scout-scan/) with composite Docker action + automatic SARIF upload to the GitHub Security tab |
| :scales: Regulatory Alignment | Output formats compatible with EU CRA Annex I (docs/compliance_mapping/cra_annex_i.md); SBOM output compatible with FDA Section 524B guidance; output formats compatible with ISO 21434 / UN R155 |
| :chart_with_upwards_trend: Benchmarking | FirmAE dataset (1,123 firmware), analyst-readiness scoring, verifier-backed archive bundles, TP/FP analysis scripts |
| :key: Vendor Decrypt | D-Link SHRS AES-128-CBC auto-decryption; Shannon entropy encryption detection (>7.9); binwalk v3 compatibility |
| :white_check_mark: Zero Dependencies | Pure Python 3.10+ stdlib only — no pip dependencies, air-gap friendly deployment |
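The Shannon-entropy encryption check mentioned above (threshold >7.9 bits/byte) can be sketched in stdlib Python. This is an illustration of the technique, not SCOUT's extraction-stage code:

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for constant data, up to 8.0 for uniform."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_encrypted(data: bytes, threshold: float = 7.9) -> bool:
    """Firmware regions above ~7.9 bits/byte are likely encrypted or compressed."""
    return shannon_entropy(data) > threshold

print(shannon_entropy(b"\x00" * 4096))       # 0.0
print(looks_encrypted(os.urandom(1 << 20)))  # almost certainly True
```

High entropy alone cannot distinguish encryption from compression, which is why a detector like this is only a trigger for vendor-specific decryption attempts rather than a verdict.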

Analyst Copilot Surfaces

Explainability surface

  • reasoning_trail and evidence lineage are preserved across findings, analyst Markdown, TUI, web viewer, and SARIF properties.
  • This is where reviewers inspect why a finding was downgraded, upheld, or promoted.

Analyst-in-the-loop channel

  • MCP tools and AIEDGE_FEEDBACK_DIR are the supported override/hint path.
  • Human hints are allowed to influence the next run; final ownership still stays with the analyst.
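A hypothetical sketch of what a cooperating hint writer on the AIEDGE_FEEDBACK_DIR channel could look like, using fcntl.flock as the feature table describes. The hints.jsonl file name and the JSON fields are assumptions made for this sketch, not SCOUT's contract:

```python
import fcntl
import json
import os
from pathlib import Path

def append_hint(feedback_dir: str, hint: dict) -> None:
    """Append one analyst hint as a JSON line under an exclusive lock,
    so concurrent writers (analyst tooling, MCP tools) cannot interleave."""
    path = Path(feedback_dir) / "hints.jsonl"  # hypothetical file name
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            fh.write(json.dumps(hint) + "\n")
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

append_hint(os.environ.get("AIEDGE_FEEDBACK_DIR", "aiedge-feedback"),
            {"finding_id": "demo-0001", "note": "likely FP: sink unreachable"})
```

Append-only JSON lines plus an advisory lock keeps the channel auditable: every hint that influenced a later run remains on disk in arrival order.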

Autonomous reasoning (future)

  • SCOUT is not positioned as a fully autonomous exploit agent in v2.6.1.
  • Multi-agent exploit chains, pair-grounded evaluation loops, and autonomous fuzz harness generation remain Phase 2D / reviewer-eval lane work.

Pipeline (42 Stages)

Firmware --> Unpack --> Profile --> Inventory --> Ghidra --> Semantic Classification
    --> SBOM --> CVE Scan --> Reachability --> Endpoints --> Surfaces
    --> Enhanced Source --> C-Source ID --> Taint Propagation
    --> FP Verification --> Adversarial Triage
    --> Graph --> Attack Surface --> Findings
    --> LLM Triage --> LLM Synthesis --> Emulation --> [Fuzzing]
    --> PoC Refinement --> Chain Construction --> Exploit Chain --> PoC --> Verification

Ghidra is auto-detected and enabled by default. Stages in [brackets] require optional external tools (AFL++/Docker).

Pipeline Stages Reference (42)
| Stage | Module | Purpose | LLM? | Cost |
| --- | --- | --- | --- | --- |
| tooling | `tooling.py` | External tool availability check (binwalk, Ghidra, Docker) | No | $0 |
| extraction | `extraction.py` | Firmware unpacking (binwalk + vendor_decrypt + Shannon entropy detection) | No | $0 |
| structure | `structure.py` | Filesystem structure analysis | No | $0 |
| carving | `carving.py` | File carving from unstructured regions | No | $0 |
| firmware_profile | `firmware_profile.py` | Architecture, kernel, init system fingerprinting | No | $0 |
| inventory | `inventory.py` | Per-binary ELF hardening + symbol extraction | No | $0 |
| ghidra_analysis | `ghidra_analysis.py` | Decompilation + P-code SSA dataflow analysis | No | $0 |
| semantic_classification | `semantic_classifier.py` | 3-pass function classifier (static → haiku → sonnet) | Yes | Low |
| sbom | `sbom.py` | CycloneDX 1.6 SBOM generation with VEX | No | $0 |
| cve_scan | `cve_scan.py` | NVD + 25 known signatures + EPSS enrichment | No | $0 |
| reachability | `reachability.py` | BFS-based call-graph reachability | No | $0 |
| endpoints | `endpoints.py` | Network endpoint discovery | No | $0 |
| surfaces | `surfaces.py` | Attack surface enumeration | No | $0 |
| enhanced_source | `enhanced_source.py` | Web server auto-detection + INPUT_APIS scan (21 APIs) | No | $0 |
| csource_identification | `csource_identification.py` | HTTP input source identification via static sentinel + QEMU | No | $0 |
| taint_propagation | `taint_propagation.py` | Inter-procedural taint with 28 sinks + format string detection | Yes | Medium |
| fp_verification | `fp_verification.py` | 3-pattern FP removal + LLM verification with parse/call failure separation | Yes | Low |
| adversarial_triage | `adversarial_triage.py` | Advocate/Critic LLM debate (LLM-adjudicated FPR reduction, 99.3%) | Yes | Medium |
| graph | `graph.py` | Communication graph (5 IPC edge types) | No | $0 |
| attack_surface | `attack_surface.py` | Attack surface mapping with IPC chains | No | $0 |
| attribution | `attribution.py` | Vendor/firmware attribution | No | $0 |
| functional_spec | `functional_spec.py` | Functional specification extraction | No | $0 |
| threat_model | `threat_model.py` | STRIDE-based threat modeling | No | $0 |
| web_ui | `web_ui.py` | Web UI / CGI endpoint analysis | No | $0 |
| findings | `findings.py` | Finding aggregation + SARIF export | No | $0 |
| llm_triage | `llm_triage.py` | LLM finding triage (haiku/sonnet/opus auto-routing) | Yes | Variable |
| llm_synthesis | `llm_synthesis.py` | LLM finding synthesis | Yes | Medium |
| emulation | `emulation.py` | 4-tier emulation (FirmAE / Pandawan / QEMU / rootfs) | No | $0 |
| dynamic_validation | `dynamic_validation.py` | Dynamic behavior verification | No | $0 |
| fuzzing | `fuzz_*.py` | AFL++ fuzzing with NVRAM faker | No | $0 |
| poc_refinement | `poc_refinement.py` | Iterative PoC generation (5 attempts) | Yes | Medium |
| chain_construction | `chain_constructor.py` | Same-binary + cross-binary IPC exploit chains | No | $0 |
| exploit_gate | `stage_registry.py` | Exploit promotion gate | No | $0 |
| exploit_chain | `exploit_chain.py` | Exploit chain validation | No | $0 |
| exploit_autopoc | `exploit_autopoc.py` | Automated PoC orchestration | Yes | Medium |
| poc_validation | `poc_validation.py` | PoC reproduction validation | No | $0 |
| exploit_policy | `exploit_policy.py` | Final exploit promotion decision | No | $0 |
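For instance, the BFS-based call-graph reachability in the reachability stage reduces to a standard breadth-first traversal. A toy sketch with illustrative function names (not SCOUT's data structures):

```python
from collections import deque

def reachable_sinks(call_graph: dict, entry_points: set, sinks: set) -> set:
    """BFS from attacker-facing entry points; return which sinks are reachable."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen & sinks

graph = {
    "httpd_main": ["parse_request"],
    "parse_request": ["handle_cgi", "log_line"],
    "handle_cgi": ["system"],   # classic command-injection sink
    "maintenance": ["strcpy"],  # not reachable from the HTTP entry point
}
print(reachable_sinks(graph, {"httpd_main"}, {"system", "strcpy"}))  # {'system'}
```

Reachability of this kind is what lets a CVE match be ranked down when no attacker-facing path to the vulnerable symbol exists.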

OTA-specific stages: ota, ota_payload, ota_fs, ota_roots, ota_boottriage, firmware_lineage (Android-style OTA payload analysis).

Benchmarks

Tier 1 (Static, frozen baseline)

Baseline: v2.6.1, 2026-04-17, fresh corpus refresh (docs/carry_over_benchmark_v2.6.md)

  • 1,123 firmware / 8 vendors / 98.8% success rate
  • 1,110 success / 4 partial / 9 fatal
  • 3,531 findings / 146,943 CVE matches
  • 1,089 / 1,110 successful runs produced nonzero CVE output

Tier 2 (LLM-Adjudicated Adversarial Debate, GPT-5.3-Codex)

Baseline: v2.3.0, 2026-04-09, claude-code driver (carry-over; pair-eval lane still pending)

  • 36 firmware / 9 vendors
  • 2,430 findings debated → 2,412 downgraded + 18 maintained
  • LLM-adjudicated FPR reduction: 99.3% | pair-grounded FN/FP: pending reviewer eval lane

v2.6.0 Post-merge Real-Firmware Validation

This section records post-release real-firmware validation runs, distinct from the carry-over corpus baselines above.

Validation target 1 — Netgear R7000 (codex driver, --experimental-parallel 4)

| Metric | v2.5.0 | v2.6.0 |
| --- | --- | --- |
| adversarial_triage parse_failures | 0/100 | 0/100 (100 debated, 97 downgraded, 3 maintained) |
| fp_verification unverified | 0/100 | 0/100 (100 verified: 56 TP, 44 FP) |
| reasoning_trail_count (top-level findings) | N/A | 0/3 top-level / 100/100 at adversarial_triage + fp_verification artifacts ¹ |
| findings with priority_score | N/A | 3/3 (100% additive priority annotation) |
| priority_bucket_counts | N/A | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | N/A | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} |
| cve_scan EPSS enriched | 23/23 | 0 (stage skipped — sbom landed partial and cve_scan/reachability skip on sbom dependency failure ²) |
| `--experimental-parallel 4` wall-clock | N/A | ~170 minutes end-to-end across the registered pipeline (fp_verification dominant at 113 min; no sequential baseline for delta) |

¹ v2.6.0 → v2.6.1 follow-up (commit 7b36274): the top-level synthesis finding (web.exec_sink_overlap) now inherits matched downstream evidence lineage instead of relying only on the stage-level aggregate summary. Matching prefers run-relative binary path, falls back to binary SHA-256, and samples representative downstream trail entries deterministically so the synthesis finding reflects the alerts that actually informed it. This R7000 run reflects the v2.6.0 shipped behaviour.

² v2.6.0 → v2.6.1 follow-up (commit 8e0bb82): the R7000 extraction actually succeeded (1,664 files, 2,412 binaries scanned under squashfs-root), but the SBOM stage returned 0 components on this firmware due to a silent schema mismatch — _collect_so_files_from_inventory read inventory.file_list (a pre-v2.x key no longer emitted) and _detect_from_binary_analysis expected per-entry string_hits (replaced by matched_symbols in the current inventory schema). OpenWrt hid the bug because its opkg database alone contributes 100+ components. The fix makes both helpers walk inventory.roots directly and fall back to reading the binary file contents via a new _extract_ascii_runs helper. A clean re-run of just SbomStage on this R7000 run raises the component count from 0 to 4 (curl 7.36.0 via binary read, plus openssl 1.0.0 / libz 1 / libpthread 0 via .so* walking). Downstream cve_scan / reachability would then produce real CVE + EPSS numbers on a full pipeline re-run.

Validation target 2 (OpenWrt, --no-llm run)

| Metric | v2.6.0 |
| --- | --- |
| total findings | 3 |
| reasoning_trail_count | 0 (no-llm: adversarial_triage and fp_verification are LLM-gated; the trail is populated only when LLM stages run) |
| findings with priority_score | 3/3 (100% — additive priority annotation succeeded for all findings) |
| priority_bucket_counts | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} (PR #7a 3-category ontology, 0% unclassified rate) |
| notable caveats | OpenWrt uses a squashfs/ext4 root; binwalk extracted cleanly; the --no-llm path skipped reasoning_trail generation as expected. Run completed end-to-end through the findings stage. |

See CHANGELOG.md for full version history and docs/scoring_calibration.md for the two-score contract.


Architecture

+--------------------------------------------------------------------+
|                       SCOUT (Evidence Engine)                      |
|                                                                    |
|  Firmware --> Unpack --> Profile --> Inventory --> SBOM --> CVE    |
|                          |            |            |          |    |
|                       Ghidra     Binary Audit   40+ sigs    NVD+   |
|                       auto-detect  NX/PIE/etc              local DB|
|                                                                    |
|  --> Taint --> FP Filter --> Attack Surface --> Findings           |
|     (HTTP-aware)  (3-pattern)   (IPC chains)    (SARIF 2.1.0)      |
|                                                                    |
|  --> Emulation --> [Fuzzing] --> Exploit Chain --> PoC --> Verify  |
|                                                                    |
|  42 stages · SHA-256 manifests · 5-tier confidence caps (0.40/0.45/0.55/0.60/0.75) |
|  Outputs: SARIF + CycloneDX VEX + SLSA L2 + Markdown reports       |
+--------------------------------------------------------------------+
|                    Handoff (firmware_handoff.json)                 |
+--------------------------------------------------------------------+
|                     Terminator (Orchestrator)                      |
|  LLM Tribunal --> Dynamic Validation --> Verified Chain            |
+--------------------------------------------------------------------+
| Layer | Role | Deterministic? |
| --- | --- | --- |
| SCOUT | Evidence production (42 stages) | Yes |
| Handoff | JSON contract between engine and orchestrator | Yes |
| Terminator | LLM tribunal, dynamic validation, exploit dev | No (auditable) |

Exploit Promotion Policy

| Level | Requirements | Placement |
| --- | --- | --- |
| dismissed | Critic rebuttal strong or confidence < 0.5 | Appendix only |
| candidate | Confidence 0.5-0.8, evidence exists but chain incomplete | Report (flagged) |
| high_confidence_static | Confidence >= 0.8, strong static evidence, no dynamic | Report (highlighted) |
| confirmed | Confidence >= 0.8 AND >= 1 dynamic verification artifact | Report (top) |
| verified_chain | Confirmed AND PoC reproduced 3x in sandbox | Exploit report |
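The policy table maps onto a small decision function. This is a readability sketch only; the exploit_policy stage is the implementation of record, and the parameter names here are illustrative:

```python
def promotion_level(confidence: float, critic_strong: bool,
                    dynamic_artifacts: int, poc_reproductions: int) -> str:
    """Evaluate the promotion ladder top-down: dismissal first, then
    progressively stronger evidence requirements."""
    if critic_strong or confidence < 0.5:
        return "dismissed"
    if confidence < 0.8:
        return "candidate"
    if dynamic_artifacts < 1:
        return "high_confidence_static"
    if poc_reproductions >= 3:
        return "verified_chain"
    return "confirmed"

print(promotion_level(0.85, False, 1, 3))  # verified_chain
print(promotion_level(0.85, False, 0, 0))  # high_confidence_static
```

Note the ordering: a strong critic rebuttal dismisses a finding regardless of confidence, and verified_chain is only reachable through the confirmed requirements plus triple PoC reproduction.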

CLI Reference
| Command | Description |
| --- | --- |
| `./scout analyze <firmware>` | Full 42-stage analysis pipeline |
| `./scout analyze <firmware> --quiet` | Suppress real-time progress output (CI/scripted use) |
| `./scout analyze-8mb <firmware>` | Truncated 8MB canonical track |
| `./scout stages <run_dir> --stages X,Y` | Rerun specific stages |
| `./scout serve <run_dir>` | Launch web report viewer |
| `./scout mcp [--project-id <id>]` | Start MCP stdio server |
| `./scout tui <run_dir>` | Terminal UI dashboard |
| `./scout ti` | TUI interactive (latest run) |
| `./scout tw` | TUI watch mode (auto-refresh) |
| `./scout to` | TUI one-shot (latest run) |
| `./scout t` | TUI default (latest run) |
| `./scout corpus-validate` | Validate corpus manifest |
| `./scout quality-metrics` | Compute quality metrics |
| `./scout quality-gate` | Check quality thresholds |
| `./scout release-quality-gate` | Unified release gate |

Exit codes: 0 success, 10 partial, 20 fatal, 30 policy violation
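A minimal CI wrapper sketch around these exit codes (`classify_exit` is a hypothetical helper, not part of SCOUT):

```shell
#!/bin/sh
# Map SCOUT exit codes to CI outcomes (sketch).
classify_exit() {
  case "$1" in
    0)  echo "success" ;;
    10) echo "partial" ;;          # keep artifacts, mark build unstable
    20) echo "fatal" ;;            # fail the build
    30) echo "policy-violation" ;; # fail the build
    *)  echo "unknown" ;;
  esac
}

classify_exit 10   # prints "partial"
```

In practice a wrapper like this would run `./scout analyze` first and pass `$?` to the helper, failing the job only on 20 or 30 so that partial runs still publish their artifacts.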

Benchmarking
# FirmAE dataset benchmark (1,123 usable firmware images in the current frozen baseline)
./scripts/benchmark_firmae.sh --parallel 8 --time-budget 1800 --cleanup

# Options
--dataset-dir DIR       # Firmware directory (default: aiedge-inputs/firmae-benchmark)
--results-dir DIR       # Output directory
--file-list PATH        # Explicit newline-delimited firmware list
--parallel N            # Concurrent jobs (default: 4)
--time-budget S         # Seconds per firmware (default: 600)
--stages STAGES         # Specific stages (default: full pipeline)
--max-images N          # Limit images (0 = all)
--llm                   # Enable LLM-backed stages
--8mb                   # Use 8MB truncated track
--full                  # Include dynamic stages
--cleanup               # Preserve a verifier-friendly run replica under results/archives/, then delete original run dirs
--dry-run               # List files without running

# Analyst-readiness re-evaluation for an existing benchmark-results tree
python3 scripts/reevaluate_benchmark_results.py \
  --results-dir benchmark-results/<run>

# Normalize legacy bundles and rerun a stage subset (useful for debugging archive fidelity issues)
python3 scripts/rerun_benchmark_stages.py \
  --results-dir benchmark-results/<legacy-run> \
  --out-dir benchmark-results/<rerun-out> \
  --stages attribution,graph,attack_surface \
  --no-llm

# Post-benchmark analysis
PYTHONPATH=src python3 scripts/cve_rematch.py \
  --results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
  --nvd-dir data/nvd-cache \
  --csv-out cve_matches.csv

PYTHONPATH=src python3 scripts/analyze_findings.py \
  --results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
  --output analysis_report.json

# FirmAE dataset setup
./scripts/unpack_firmae_dataset.sh [ZIP_FILE]

# Tier 1 frozen baseline docs
# - docs/tier1_rebenchmark_frozen_baseline.md
# - docs/tier1_rebenchmark_final_analysis.md

Current benchmark contract

  • Archived benchmark bundles are now expected to be run replicas, not flattened JSON snapshots.
  • Benchmark quality is reported in two layers:
    • analysis rate = pipeline completed (success + partial)
    • analyst-ready rate = archived bundle passes analyst/verifier checks and remains evidence-navigable
  • benchmark-results/legacy/tier2-llm-v2 is a legacy snapshot. It is useful for historical reference and re-evaluation, but it should not be used as the canonical analyst-readiness baseline.
  • The current contract has been validated on a fresh single-sample run (benchmark-results/tier2-single-fidelity) where both analyst verifiers passed from the archived bundle.
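The two-layer reporting above can be sketched as a small computation. The success/partial/fatal counts are the Tier 1 v2.6.1 figures from this README; the analyst_ready count is purely illustrative, not a measured number:

```python
def benchmark_rates(success: int, partial: int, fatal: int,
                    analyst_ready: int) -> dict:
    """Two-layer benchmark reporting: analysis rate counts completed
    pipelines (success + partial); analyst-ready rate counts archived
    bundles that pass the analyst/verifier checks."""
    total = success + partial + fatal
    return {
        "analysis_rate": (success + partial) / total,
        "analyst_ready_rate": analyst_ready / total,
    }

# 1,110 / 4 / 9 are the Tier 1 v2.6.1 refresh figures; 1050 is a placeholder.
rates = benchmark_rates(success=1110, partial=4, fatal=9, analyst_ready=1050)
print(f"analysis rate: {rates['analysis_rate']:.3f}")  # analysis rate: 0.992
```

Keeping the two rates separate makes regressions visible: a pipeline can keep completing (high analysis rate) while archive fidelity quietly degrades (falling analyst-ready rate).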

Current LLM quality behavior

  • llm_triage model routing: <=10 haiku, 11-50 sonnet, >50 or chain-backed opus
  • llm_triage retries with sonnet if a haiku call exits non-zero
  • llm_triage, semantic_classification, adversarial_triage, and fp_verification now write stages/<stage>/llm_trace/*.json
  • Parse failures are handled fail-closed: repaired when possible, otherwise reported as degraded/partial instead of silently treated as clean success
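The routing thresholds above can be sketched as a pure function (illustrative; the actual logic lives in llm_triage.py, and the retry-on-failure behaviour is not modeled here):

```python
def route_model(finding_count: int, chain_backed: bool = False) -> str:
    """llm_triage model routing rule: small batches go to haiku, medium
    batches to sonnet, large batches or chain-backed findings to opus."""
    if chain_backed or finding_count > 50:
        return "opus"
    if finding_count > 10:
        return "sonnet"
    return "haiku"

print(route_model(7))                      # haiku
print(route_model(30))                     # sonnet
print(route_model(5, chain_backed=True))   # opus
```

The design intent is cost-proportional triage: cheap models handle the common small-batch case, and the expensive model is reserved for high-volume or chain-backed evidence.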
Environment Variables

Core

| Variable | Default | Description |
| --- | --- | --- |
| `AIEDGE_LLM_DRIVER` | codex | LLM provider: codex / claude / claude-code / ollama |
| `ANTHROPIC_API_KEY` | -- | API key for Claude driver (not needed for claude-code) |
| `AIEDGE_OLLAMA_URL` | http://localhost:11434 | Ollama server URL |
| `AIEDGE_LLM_BUDGET_USD` | -- | LLM cost budget limit |
| `AIEDGE_PRIV_RUNNER` | -- | Privileged command prefix for dynamic stages |
| `AIEDGE_FEEDBACK_DIR` | aiedge-feedback | Terminator feedback directory |

Ghidra

| Variable | Default | Description |
| --- | --- | --- |
| `AIEDGE_GHIDRA_HOME` | auto-detect | Ghidra install path; probes `/opt/ghidra_*`, `/usr/local/ghidra*` |
| `AIEDGE_GHIDRA_MAX_BINARIES` | 20 | Max binaries to analyze |
| `AIEDGE_GHIDRA_TIMEOUT_S` | 300 | Per-binary analysis timeout |

SBOM & CVE

| Variable | Default | Description |
| --- | --- | --- |
| `AIEDGE_NVD_API_KEY` | -- | NVD API key (optional, improves rate limits) |
| `AIEDGE_NVD_CACHE_DIR` | -- | Cross-run NVD response cache |
| `AIEDGE_SBOM_MAX_COMPONENTS` | 500 | Maximum SBOM components |
| `AIEDGE_CVE_SCAN_MAX_COMPONENTS` | 50 | Maximum components to CVE-scan |
| `AIEDGE_CVE_SCAN_TIMEOUT_S` | 30 | Per-request NVD API timeout |

Fuzzing & Emulation

| Variable | Default | Description |
| --- | --- | --- |
| `AIEDGE_AFLPP_IMAGE` | aflplusplus/aflplusplus | AFL++ Docker image |
| `AIEDGE_FUZZ_BUDGET_S` | 3600 | Fuzzing time budget (seconds) |
| `AIEDGE_FUZZ_MAX_TARGETS` | 5 | Max fuzzing target binaries |
| `AIEDGE_EMULATION_IMAGE` | scout-emulation:latest | Emulation Docker image |
| `AIEDGE_FIRMAE_ROOT` | /opt/FirmAE | FirmAE installation path |
| `AIEDGE_QEMU_GDB_PORT` | 1234 | QEMU GDB remote port |

Quality Gates

| Variable | Default | Description |
| --- | --- | --- |
| `AIEDGE_QG_PRECISION_MIN` | 0.9 | Minimum precision threshold |
| `AIEDGE_QG_RECALL_MIN` | 0.6 | Minimum recall threshold |
| `AIEDGE_QG_FPR_MAX` | 0.1 | Maximum false positive rate |
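How the three thresholds combine can be sketched by reading the documented variables with their defaults (illustrative only; the real gate is `./scout quality-gate`):

```python
import os

def quality_gate(precision: float, recall: float, fpr: float) -> bool:
    """Pass only if all three documented thresholds are satisfied."""
    p_min = float(os.environ.get("AIEDGE_QG_PRECISION_MIN", "0.9"))
    r_min = float(os.environ.get("AIEDGE_QG_RECALL_MIN", "0.6"))
    f_max = float(os.environ.get("AIEDGE_QG_FPR_MAX", "0.1"))
    return precision >= p_min and recall >= r_min and fpr <= f_max

print(quality_gate(0.95, 0.70, 0.05))  # True
print(quality_gate(0.95, 0.50, 0.05))  # False (recall below the 0.6 floor)
```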
Run Directory Structure
aiedge-runs/<run_id>/
├── manifest.json
├── firmware_handoff.json
├── provenance.intoto.jsonl           # SLSA L2 attestation
├── input/firmware.bin
├── stages/
│   ├── extraction/                   # Unpacked filesystem
│   ├── inventory/
│   │   └── binary_analysis.json      # Per-binary hardening + symbols
│   ├── enhanced_source/
│   │   └── sources.json              # HTTP input sources + web server detection
│   ├── sbom/
│   │   ├── sbom.json                 # CycloneDX 1.6
│   │   └── vex.json                  # VEX exploitability
│   ├── cve_scan/
│   │   └── cve_matches.json          # NVD + known signature matches
│   ├── taint_propagation/
│   │   └── taint_results.json        # Taint paths + call chains
│   ├── ghidra_analysis/              # Decompiled functions (optional)
│   ├── chain_construction/
│   │   └── chains.json               # Same-binary + cross-binary IPC chains
│   ├── findings/
│   │   ├── findings.json             # All findings
│   │   ├── pattern_scan.json         # Static pattern matches
│   │   ├── sarif.json                # SARIF 2.1.0 export
│   │   └── stage.json                # SHA-256 manifest
│   └── ...                           # 42 stage directories total
└── report/
    ├── viewer.html                   # Web dashboard
    ├── report.json
    ├── analyst_digest.json
    └── executive_report.md
Verification Scripts
# Evidence chain integrity
python3 scripts/verify_analyst_digest.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_verified_chain.py --run-dir aiedge-runs/<run_id>

# Report schema compliance
python3 scripts/verify_aiedge_final_report.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_aiedge_analyst_report.py --run-dir aiedge-runs/<run_id>

# Security invariants
python3 scripts/verify_run_dir_evidence_only.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_network_isolation.py --run-dir aiedge-runs/<run_id>

# Quality gates
./scout release-quality-gate aiedge-runs/<run_id>

Documentation

| Document | Purpose |
| --- | --- |
| Blueprint | Pipeline architecture and design rationale |
| Status | Current implementation status |
| Artifact Schema | Profiling + inventory contracts |
| Adapter Contract | Terminator-SCOUT handoff protocol |
| Report Contract | Report structure and governance |
| Analyst Digest | Digest schema and verdicts |
| Verified Chain | Evidence requirements |
| Duplicate Gate | Cross-run dedup rules |
| Known CVE Ground Truth | CVE validation dataset |
| Upgrade Plan v2 | v2.0 upgrade plan |
| LLM Roadmap | LLM integration strategy |

Security & Ethics

Authorized environments only.

SCOUT is intended for contracted security audits, vulnerability research (responsible disclosure), and CTF/training in lab environments. Dynamic validation runs in network-isolated sandbox containers. No weaponized payloads are included.


Contributing

  1. Read Blueprint for architecture context
  2. Run pytest -q -- all tests must pass
  3. Lint ruff check src/ -- zero violations
  4. Follow the Stage protocol (src/aiedge/stage.py)
  5. Zero pip dependencies -- stdlib only

License

Apache 2.0


Built for the security research community. Not for unauthorized access.


github.com/R00T-Kim/SCOUT