AgentRE-Bench
May 12, 2026 · View on GitHub
A benchmark for evaluating LLM agents on long-horizon reverse engineering tasks with deterministic scoring.
Platform: Linux/Unix (ELF x86-64). Windows PE support planned for a future release.
AgentRE-Bench gives an LLM agent a compiled ELF binary and a set of Linux static analysis tools (strings, objdump, readelf, etc.), then measures how well it can identify C2 infrastructure, encoding schemes, anti-analysis techniques, and communication protocols — all without human guidance.
Why This Benchmark?
Why Synthetic?
All 13 binaries are compiled from purpose-built C sources with known ground truths. This gives us:
- Deterministic judging — every field has an exact expected answer, no ambiguity
- Controlled difficulty progression — from plaintext TCP shells (level 1) to metamorphic droppers with RC4 encryption (level 13)
- Reproducibility — anyone can compile identical binaries and verify scores
Real malware would require subjective expert judgment and introduce licensing, ethics, and reproducibility issues. Synthetic samples eliminate all of that while testing the same analytical capabilities.
Why Agentic?
Traditional RE benchmarks ask a model a question and check the answer. AgentRE-Bench requires the agent to:
- Plan which tools to use and in what order
- Interpret raw tool output (hex dumps, disassembly, symbol tables)
- Synthesize findings across multiple tool calls into a structured analysis
- Manage a budget of 25 tool calls — wasting calls on redundant queries means running out before finding the answer
This tests reasoning, tool selection, and information synthesis — not just pattern matching.
Why Long-Horizon?
Simple RE questions ("What architecture is this binary?") don't differentiate models. The hard problems require chains of 10-25 tool calls where each call's output informs the next decision. Level 13 requires the agent to:
- Identify encrypted strings via entropy analysis
- Locate the encryption key in the binary
- Determine the key storage mechanism (XOR mask)
- Decode the actual C2 URL
- Identify 18 distinct techniques across anti-debugging, process injection, and network evasion
This is where agent capability differences become visible.
Why Deterministic Judging?
Every agent answer is scored against a fixed ground truth with weighted fields and Jaccard overlap for set comparisons. There is no LLM-as-judge, no subjective rubric, no human grader variance. The same answer always produces the same score.
Hallucinations are penalized: claiming a technique not present in the binary costs -0.05 per false claim. This means models can't game scores by guessing everything.
Task Difficulty Progression
| Level | Task | Techniques | Difficulty |
|---|---|---|---|
| 1 | TCP Reverse Shell | Plaintext C2, socket connect, dup2, execve | Trivial |
| 2 | XOR Encoded Strings | XOR encoding, string obfuscation | Easy |
| 3 | Anti-Debugging Shell | ptrace detection, timing checks | Easy |
| 4 | Polymorphic Shell | Self-modifying code, runtime decryption | Medium |
| 5 | Multistage Shell | Staged payload delivery | Medium |
| 6 | ICMP Covert Channel | ICMP protocol C2, covert channel | Medium |
| 7 | DNS Tunnel Shell | DNS-based C2 tunneling | Medium |
| 8 | Process Hollowing | Process injection, memory manipulation | Hard |
| 9 | Shared Object Injection | .so injection, dlopen/dlsym | Hard |
| 10 | AES Encrypted Shell | AES encryption, key recovery | Hard |
| 11 | Fork Bomb Shell | Process evasion, fork techniques | Hard |
| 12 | JIT Compiled Shellcode | Runtime code generation, JIT | Very Hard |
| 13 | Metamorphic Dropper | RC4 encryption, anti-analysis, metamorphic code | Bonus |
Levels 1-12 are standard tasks (averaged to 1.0 pt max). Level 13 is a bonus task with a deeper rubric (1.0 pt max). Total possible: 2.0 points.
Scoring Model
Standard Levels (1-12)
Each task is scored across 5 weighted fields:
| Field | Weight | Scoring |
|---|---|---|
decoded_c2 | 0.40 | Exact match = 1.0, host-only match = 0.5 |
techniques | 0.30 | Jaccard overlap between predicted and ground truth sets |
file_type | 0.10 | Exact match (case-insensitive) |
encoded_strings | 0.10 | Exact match (boolean) |
c2_protocol | 0.10 | Exact match (case-insensitive) |
Hallucination penalty: -0.05 per technique claimed but not in ground truth.
Bonus Level (13)
10 weighted fields including encryption algorithm, key, key storage mechanism, decoded strings, and anti-analysis techniques. Lighter hallucination penalty (-0.03) given the larger technique set.
Aggregate Scoring
Main Score = average(level_1_score, ..., level_12_score) # 0.0 - 1.0
Bonus Score = level_13_score # 0.0 - 1.0
Total Score = Main Score + Bonus Score # 0.0 - 2.0
Benchmark Metrics
Beyond correctness, AgentRE-Bench records research-grade metrics for every task:
Per-Task Metrics:
| Metric | Description |
|---|---|
score | Final weighted score after hallucination penalty |
field_scores | Per-field breakdown (decoded_c2, techniques, etc.) |
tool_calls_total | Number of tool calls used |
tool_calls_by_type | Distribution across tool types |
redundant_tool_calls | Identical tool calls repeated (same name + args) |
invalid_tool_calls | Tool calls that returned errors |
invalid_json_attempts | Times the agent responded with text instead of a tool call |
hallucinated_techniques | Techniques claimed but not in ground truth |
missing_techniques | Ground truth techniques the agent failed to identify |
steps_to_answer | Tool calls before submitting final answer |
max_steps_hit | Whether the agent exhausted its 25-call budget |
wall_time_seconds | End-to-end wall clock time |
input_tokens / output_tokens | Token consumption |
Aggregate Metrics:
| Metric | Description |
|---|---|
success_rate | Fraction of tasks with a valid submitted answer |
avg_tool_calls_per_task | Mean tool calls across all tasks |
avg_tool_calls_per_success | Mean tool calls for tasks that got an answer |
avg_hallucination_rate | Mean hallucinated technique count per task |
episode_length_* | Wall time distribution (min/max/mean/median) |
tool_usage_distribution | Which tools models prefer across all tasks |
max_steps_hit_count | How often agents exhaust their budget |
These metrics enable failure taxonomy — categorizing failures into:
- Byte-level reasoning failure
- Control-flow misinterpretation
- API hallucination
- Tool misuse
- Early termination
- JSON format violation
Architecture
run_benchmark.py CLI entry point
|
v
harness/
config.py Configuration (dataclass, .env loading)
runner.py Orchestrator (load tasks, run agent, score, report)
agent.py Provider-agnostic agent loop (tool calling)
tools.py Tool schemas + ToolExecutor dispatch
sandbox.py PathValidator + DockerRunner / SubprocessRunner
metrics.py TaskMetrics + AggregateMetrics collection
providers/
base.py Abstract AgentProvider + ProviderResponse
anthropic.py Claude (raw HTTP to Messages API)
openai_provider.py GPT (raw HTTP to Chat Completions API)
gemini.py Gemini (raw HTTP to GenerativeAI API)
deepseek.py DeepSeek (extends OpenAI-compatible provider)
scorer.py Deterministic scorer (standalone + used by harness)
tasks.json Task manifest (13 entries)
build_binaries.sh Docker cross-compile script
Dockerfile.tools Sandboxed tool execution image
Zero Python dependencies. All LLM provider calls use Python's built-in urllib.request. No SDKs required.
Tool Sandbox
All tools execute inside Docker containers with strict isolation:
docker run --rm --platform linux/amd64 \
--network=none --read-only --memory=512m --cpus=1 \
-v binaries:/workspace:ro \
agentre-bench-tools:latest <command>
--network=none— no network access--read-only— immutable filesystem--memory=512m— memory cap- Workspace mounted read-only
Available Tools
| Tool | Description |
|---|---|
file | File type identification |
strings | Extract printable strings (configurable min length) |
readelf | ELF headers, sections, symbols, program headers |
objdump | Disassembly, symbol tables, section contents |
nm | Symbol listing |
hexdump | Hex + ASCII dump at specific offsets |
xxd | Hex dump (alternative format) |
entropy | Shannon entropy per sliding window (detects encrypted/compressed data) |
Plus final_answer — a structured submission tool the agent calls when done.
Setup
Prerequisites
- Python 3.10+
- Linux x86-64: gcc, binutils, file, xxd, python3 (for local builds and
--no-dockermode) - macOS: Docker (for cross-compilation and sandboxed tool execution)
1. Clone and Configure
git clone https://github.com/agentrebench/AgentRE-Bench.git
cd AgentRE-Bench
# Create .env with your API key(s)
cp .env.example .env
# Edit .env — add at least one provider key
2. Build Binaries
chmod +x build_binaries.sh
./build_binaries.sh
On Linux x86-64: uses local gcc directly (install with apt install gcc if needed — no Docker required).
On macOS / Apple Silicon: uses Docker with --platform linux/amd64 to cross-compile.
3. Build Tools Image
docker build --platform linux/amd64 -t agentre-bench-tools:latest -f Dockerfile.tools .
4. Run
# Single task with verbose output
python run_benchmark.py --task level1_TCPServer -v
# Full benchmark
python run_benchmark.py --all
# Different providers
python run_benchmark.py --all --provider anthropic --model claude-opus-4-6
python run_benchmark.py --all --provider openai --model gpt-4o
python run_benchmark.py --all --provider gemini --model gemini-2.0-flash
python run_benchmark.py --all --provider deepseek --model deepseek-chat
# Custom output directory
python run_benchmark.py --all --report results/opus_run1/
CLI Flags
| Flag | Default | Description |
|---|---|---|
--all | Run all 13 tasks | |
--task ID | Run a single task by ID | |
--provider | anthropic | anthropic, openai, gemini, deepseek |
--model | per-provider | Model name |
--api-key | from .env | API key override |
--report DIR | results/ | Output directory |
--max-tool-calls | 25 | Tool call budget per task |
--max-tokens | 4096 | Max tokens per LLM response |
--no-docker | Run tools via local subprocess | |
-v | Verbose: show agent reasoning + tool I/O live |
Output
results/
agent_outputs/ Raw agent JSON answers (one per task)
transcripts/ Per-task scoring, metrics, and full message logs
benchmark_report.json Aggregate report with all metrics and scores
Standalone Scorer
The scorer works independently of the agent harness:
# Single sample
python scorer.py -g ground_truths/level1_TCPServer.json \
-a agent_outputs/level1_TCPServer.json
# Batch
python scorer.py -G ground_truths/ -A agent_outputs/ -r report.json
Known Limitations
- Static analysis only — no dynamic execution, debugging, or sandboxed runtime. Tests static RE reasoning specifically.
- Synthetic samples — designed to test real RE skills, but production malware has additional complexity (packers, anti-VM, polymorphism at scale) not fully represented.
- Fixed tool set — agents can't install tools, write scripts, or use Ghidra/IDA. Standardizes evaluation but limits agent creativity.
- Single-agent — no multi-agent collaboration or human-in-the-loop.
- Token cost — a full 13-task run uses ~5-10M tokens on frontier models. Budget accordingly.
- Linux/Unix only — all binaries are ELF x86-64 targeting Linux/Unix systems. No Windows PE, ARM, or MIPS samples yet.
Roadmap
- Failure taxonomy — systematic categorization of failure modes across models
- Baseline comparisons — published results for Claude, GPT, Gemini, and open models
- Dynamic analysis tools — strace, ltrace, sandboxed execution
- Windows PE support — Windows binaries, ARM targets, packed samples
- Multi-turn refinement — tasks requiring iterative hypothesis refinement
- Public leaderboard — model comparison across providers and versions
License
MIT