AgentRE-Bench

May 12, 2026 · View on GitHub

A benchmark for evaluating LLM agents on long-horizon reverse engineering tasks with deterministic scoring.

Platform: Linux/Unix (ELF x86-64). Windows PE support planned for a future release.

AgentRE-Bench gives an LLM agent a compiled ELF binary and a set of Linux static analysis tools (strings, objdump, readelf, etc.), then measures how well it can identify C2 infrastructure, encoding schemes, anti-analysis techniques, and communication protocols — all without human guidance.

Why This Benchmark?

Why Synthetic?

All 13 binaries are compiled from purpose-built C sources with known ground truths. This gives us:

  • Deterministic judging — every field has an exact expected answer, no ambiguity
  • Controlled difficulty progression — from plaintext TCP shells (level 1) to metamorphic droppers with RC4 encryption (level 13)
  • Reproducibility — anyone can compile identical binaries and verify scores

Real malware would require subjective expert judgment and introduce licensing, ethics, and reproducibility issues. Synthetic samples eliminate all of that while testing the same analytical capabilities.

Why Agentic?

Traditional RE benchmarks ask a model a question and check the answer. AgentRE-Bench requires the agent to:

  • Plan which tools to use and in what order
  • Interpret raw tool output (hex dumps, disassembly, symbol tables)
  • Synthesize findings across multiple tool calls into a structured analysis
  • Manage a budget of 25 tool calls — wasting calls on redundant queries means running out before finding the answer

This tests reasoning, tool selection, and information synthesis — not just pattern matching.

Why Long-Horizon?

Simple RE questions ("What architecture is this binary?") don't differentiate models. The hard problems require chains of 10-25 tool calls where each call's output informs the next decision. Level 13 requires the agent to:

  1. Identify encrypted strings via entropy analysis
  2. Locate the encryption key in the binary
  3. Determine the key storage mechanism (XOR mask)
  4. Decode the actual C2 URL
  5. Identify 18 distinct techniques across anti-debugging, process injection, and network evasion

This is where agent capability differences become visible.

Why Deterministic Judging?

Every agent answer is scored against a fixed ground truth with weighted fields and Jaccard overlap for set comparisons. There is no LLM-as-judge, no subjective rubric, no human grader variance. The same answer always produces the same score.

Hallucinations are penalized: claiming a technique not present in the binary costs -0.05 per false claim. This means models can't game scores by guessing everything.

Task Difficulty Progression

LevelTaskTechniquesDifficulty
1TCP Reverse ShellPlaintext C2, socket connect, dup2, execveTrivial
2XOR Encoded StringsXOR encoding, string obfuscationEasy
3Anti-Debugging Shellptrace detection, timing checksEasy
4Polymorphic ShellSelf-modifying code, runtime decryptionMedium
5Multistage ShellStaged payload deliveryMedium
6ICMP Covert ChannelICMP protocol C2, covert channelMedium
7DNS Tunnel ShellDNS-based C2 tunnelingMedium
8Process HollowingProcess injection, memory manipulationHard
9Shared Object Injection.so injection, dlopen/dlsymHard
10AES Encrypted ShellAES encryption, key recoveryHard
11Fork Bomb ShellProcess evasion, fork techniquesHard
12JIT Compiled ShellcodeRuntime code generation, JITVery Hard
13Metamorphic DropperRC4 encryption, anti-analysis, metamorphic codeBonus

Levels 1-12 are standard tasks (averaged to 1.0 pt max). Level 13 is a bonus task with a deeper rubric (1.0 pt max). Total possible: 2.0 points.

Scoring Model

Standard Levels (1-12)

Each task is scored across 5 weighted fields:

FieldWeightScoring
decoded_c20.40Exact match = 1.0, host-only match = 0.5
techniques0.30Jaccard overlap between predicted and ground truth sets
file_type0.10Exact match (case-insensitive)
encoded_strings0.10Exact match (boolean)
c2_protocol0.10Exact match (case-insensitive)

Hallucination penalty: -0.05 per technique claimed but not in ground truth.

Bonus Level (13)

10 weighted fields including encryption algorithm, key, key storage mechanism, decoded strings, and anti-analysis techniques. Lighter hallucination penalty (-0.03) given the larger technique set.

Aggregate Scoring

Main Score  = average(level_1_score, ..., level_12_score)    # 0.0 - 1.0
Bonus Score = level_13_score                                  # 0.0 - 1.0
Total Score = Main Score + Bonus Score                        # 0.0 - 2.0

Benchmark Metrics

Beyond correctness, AgentRE-Bench records research-grade metrics for every task:

Per-Task Metrics:

MetricDescription
scoreFinal weighted score after hallucination penalty
field_scoresPer-field breakdown (decoded_c2, techniques, etc.)
tool_calls_totalNumber of tool calls used
tool_calls_by_typeDistribution across tool types
redundant_tool_callsIdentical tool calls repeated (same name + args)
invalid_tool_callsTool calls that returned errors
invalid_json_attemptsTimes the agent responded with text instead of a tool call
hallucinated_techniquesTechniques claimed but not in ground truth
missing_techniquesGround truth techniques the agent failed to identify
steps_to_answerTool calls before submitting final answer
max_steps_hitWhether the agent exhausted its 25-call budget
wall_time_secondsEnd-to-end wall clock time
input_tokens / output_tokensToken consumption

Aggregate Metrics:

MetricDescription
success_rateFraction of tasks with a valid submitted answer
avg_tool_calls_per_taskMean tool calls across all tasks
avg_tool_calls_per_successMean tool calls for tasks that got an answer
avg_hallucination_rateMean hallucinated technique count per task
episode_length_*Wall time distribution (min/max/mean/median)
tool_usage_distributionWhich tools models prefer across all tasks
max_steps_hit_countHow often agents exhaust their budget

These metrics enable failure taxonomy — categorizing failures into:

  • Byte-level reasoning failure
  • Control-flow misinterpretation
  • API hallucination
  • Tool misuse
  • Early termination
  • JSON format violation

Architecture

run_benchmark.py              CLI entry point
  |
  v
harness/
  config.py                   Configuration (dataclass, .env loading)
  runner.py                   Orchestrator (load tasks, run agent, score, report)
  agent.py                    Provider-agnostic agent loop (tool calling)
  tools.py                    Tool schemas + ToolExecutor dispatch
  sandbox.py                  PathValidator + DockerRunner / SubprocessRunner
  metrics.py                  TaskMetrics + AggregateMetrics collection
  providers/
    base.py                   Abstract AgentProvider + ProviderResponse
    anthropic.py              Claude (raw HTTP to Messages API)
    openai_provider.py        GPT (raw HTTP to Chat Completions API)
    gemini.py                 Gemini (raw HTTP to GenerativeAI API)
    deepseek.py               DeepSeek (extends OpenAI-compatible provider)

scorer.py                     Deterministic scorer (standalone + used by harness)
tasks.json                    Task manifest (13 entries)
build_binaries.sh             Docker cross-compile script
Dockerfile.tools              Sandboxed tool execution image

Zero Python dependencies. All LLM provider calls use Python's built-in urllib.request. No SDKs required.

Tool Sandbox

All tools execute inside Docker containers with strict isolation:

docker run --rm --platform linux/amd64 \
  --network=none --read-only --memory=512m --cpus=1 \
  -v binaries:/workspace:ro \
  agentre-bench-tools:latest <command>
  • --network=none — no network access
  • --read-only — immutable filesystem
  • --memory=512m — memory cap
  • Workspace mounted read-only

Available Tools

ToolDescription
fileFile type identification
stringsExtract printable strings (configurable min length)
readelfELF headers, sections, symbols, program headers
objdumpDisassembly, symbol tables, section contents
nmSymbol listing
hexdumpHex + ASCII dump at specific offsets
xxdHex dump (alternative format)
entropyShannon entropy per sliding window (detects encrypted/compressed data)

Plus final_answer — a structured submission tool the agent calls when done.

Setup

Prerequisites

  • Python 3.10+
  • Linux x86-64: gcc, binutils, file, xxd, python3 (for local builds and --no-docker mode)
  • macOS: Docker (for cross-compilation and sandboxed tool execution)

1. Clone and Configure

git clone https://github.com/agentrebench/AgentRE-Bench.git
cd AgentRE-Bench

# Create .env with your API key(s)
cp .env.example .env
# Edit .env — add at least one provider key

2. Build Binaries

chmod +x build_binaries.sh
./build_binaries.sh

On Linux x86-64: uses local gcc directly (install with apt install gcc if needed — no Docker required). On macOS / Apple Silicon: uses Docker with --platform linux/amd64 to cross-compile.

3. Build Tools Image

docker build --platform linux/amd64 -t agentre-bench-tools:latest -f Dockerfile.tools .

4. Run

# Single task with verbose output
python run_benchmark.py --task level1_TCPServer -v

# Full benchmark
python run_benchmark.py --all

# Different providers
python run_benchmark.py --all --provider anthropic --model claude-opus-4-6
python run_benchmark.py --all --provider openai --model gpt-4o
python run_benchmark.py --all --provider gemini --model gemini-2.0-flash
python run_benchmark.py --all --provider deepseek --model deepseek-chat

# Custom output directory
python run_benchmark.py --all --report results/opus_run1/

CLI Flags

FlagDefaultDescription
--allRun all 13 tasks
--task IDRun a single task by ID
--provideranthropicanthropic, openai, gemini, deepseek
--modelper-providerModel name
--api-keyfrom .envAPI key override
--report DIRresults/Output directory
--max-tool-calls25Tool call budget per task
--max-tokens4096Max tokens per LLM response
--no-dockerRun tools via local subprocess
-vVerbose: show agent reasoning + tool I/O live

Output

results/
  agent_outputs/              Raw agent JSON answers (one per task)
  transcripts/                Per-task scoring, metrics, and full message logs
  benchmark_report.json       Aggregate report with all metrics and scores

Standalone Scorer

The scorer works independently of the agent harness:

# Single sample
python scorer.py -g ground_truths/level1_TCPServer.json \
                 -a agent_outputs/level1_TCPServer.json

# Batch
python scorer.py -G ground_truths/ -A agent_outputs/ -r report.json

Known Limitations

  • Static analysis only — no dynamic execution, debugging, or sandboxed runtime. Tests static RE reasoning specifically.
  • Synthetic samples — designed to test real RE skills, but production malware has additional complexity (packers, anti-VM, polymorphism at scale) not fully represented.
  • Fixed tool set — agents can't install tools, write scripts, or use Ghidra/IDA. Standardizes evaluation but limits agent creativity.
  • Single-agent — no multi-agent collaboration or human-in-the-loop.
  • Token cost — a full 13-task run uses ~5-10M tokens on frontier models. Budget accordingly.
  • Linux/Unix only — all binaries are ELF x86-64 targeting Linux/Unix systems. No Windows PE, ARM, or MIPS samples yet.

Roadmap

  • Failure taxonomy — systematic categorization of failure modes across models
  • Baseline comparisons — published results for Claude, GPT, Gemini, and open models
  • Dynamic analysis tools — strace, ltrace, sandboxed execution
  • Windows PE support — Windows binaries, ARM targets, packed samples
  • Multi-turn refinement — tasks requiring iterative hypothesis refinement
  • Public leaderboard — model comparison across providers and versions

License

MIT