Code Audit Methodology

April 26, 2026 · View on GitHub

View Landing Page | GitHub

This document explains the research-backed principles behind Heavy3 Code Audit's design.

The Synthesis Table - Our trademark feature
The Council - Three specialized reviewers
Why Multi-Model? - Research backing
Plan Review - Review before you code
Context Engineering - What context to send
Context Positioning - Lost in the Middle
Large Change Handling - 50+ files, 10K+ lines
Implementation Details - Code references
Academic References - Peer-reviewed sources

The Synthesis Table (Trademark Feature)

After council review, Claude synthesizes all findings into a 3-column comparison table—our trademark feature that differentiates Heavy3 Code Audit from other tools.

Example output:

Aspect	Correctness (GPT 5.5)	Performance (Gemini 3.1)	Security (Grok 4)
Focus	Bugs, Logic, Edge Cases	Scaling, Memory, N+1	Vulnerabilities, Auth
Findings	❌ Null check missing	⚠️ Potential N+1 query	✅ No issues found
Verdict	REQUEST CHANGES	APPROVE WITH NOTES	APPROVE

Legend: ✅ = No issues | ⚠️ = Warning | ❌ = Critical issue

What you get:

Consensus Issues - Problems flagged by 2+ reviewers (high confidence)
Notable Findings - Unique insights from each specialist
Final Recommendation - APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES
Priority Actions - Ranked list of fixes

Why this matters:

Shows exactly where reviewers agree and disagree
Provides transparency no other tool offers
Helps you make informed decisions about which issues to address

The Council

Three specialized reviewers, each with web search:

Role	Model	Focus	Search
Correctness Expert	GPT 5.5	Bugs, logic errors, edge cases, race conditions	Bing
Performance Critic	Gemini 3.1 Pro	N+1 queries, memory leaks, scaling bottlenecks	Exa
Security Analyst	Grok 4	Vulnerabilities, auth issues, data exposure	Exa

Why Grok 4 for Security?

Grok 4 was selected as Security Analyst based on independent security benchmarks:

Benchmark	Score	Source
Kilo AI Exploit Test	100% detection on advanced exploits	Kilo Blog
WMDP-Cyber	79-81% accuracy (vulnerability detection, reverse engineering)	xAI Model Card
CyBench CTF	43% success on 40 capture-the-flag challenges	xAI Model Card
Veracode Security	55% secure code generation (mid-tier)	Veracode

The Kilo AI test specifically evaluated Grok 4 on prototype pollution, agentic AI supply-chain attacks, and OS command injection—achieving fix quality scores of 83-85/100 with references to OWASP AI Top 10 and NIST AI RMF standards.

Three Pillars of Diversity:

Pillar	Implementation	Benefit
Different Models	GPT 5.5 + Gemini 3.1 Pro + Grok 4	Different training data, different blind spots
Specialized Roles	Correctness + Performance + Security	Forces comprehensive coverage
Different Search Sources	Bing + Exa	Facts verified across independent indexes

You code with Claude. Our council (GPT + Gemini + Grok) catches what Claude misses.

Why Multi-Model?

Single models classify code correctness only ~68% of the time. Research on LLM code review accuracy shows significant room for improvement:

Study	Finding
LLM Code Review (2025)	GPT-4o correctly classifies code correctness 68.50% of the time
Gemini Flash Study	Gemini 2.0 Flash achieves 63.89% accuracy
False Positive Rate	Up to 24.80% of correct code receives incorrect suggestions

Every model has different blind spots. The GPT-5.4 / Gemini 3.1 Pro / Opus 4.5 numbers below come from Sonar's Dec 2025 analysis of millions of lines of generated code. Heavy3 has since upgraded the Correctness role from GPT 5.4 to GPT 5.5, but the underlying insight — that each architecture fails differently — still motivates the council design.

Model	Strength	Weakness
GPT-5.4	Cleanest control flow (22/MLOC)	2x more concurrency bugs (470/MLOC)
Gemini 3.1 Pro	Highest pass rate (81.7%)	4x more control flow mistakes (200/MLOC)
Opus 4.5	Best overall accuracy	Lowest error rate (55/MLOC)

The insight: The probability of two different architectures hallucinating the same bug is significantly lower than one. Using 3 specialized models catches what any single model misses.

What Context to Send for High-Quality Reviews

Research Finding: Quality of context matters more than quantity. Well-selected 8K-32K tokens outperforms noisy 100K+ contexts.

Based on LAURA (Zhang et al., IEEE ASE 2025) and CodeRabbit's context engineering:

Context Type	What to Include	Priority	Our Implementation
Conversation History	Original request, approach notes, 3-5 relevant exchanges	Essential	`conversation_context` dict
Code Diff	Changed lines + 3-5 surrounding lines	Essential	`git diff HEAD`
Commit Messages	Fine-grained change documentation	Essential	PR body included
File Paths	Full paths of changed files	Essential	`changed_files` array
Full File Contents	Complete current state of changed files	High	`file_contents` dict
Related Tests	Test files matching changed files	High	`test_files` dict
Problem Description	PR description, issue context	High	`pr_metadata.body`
Documentation	CLAUDE.md, architecture docs	Medium	`documentation` dict
Cross-File Dependencies	Files that import/call changed code	For breaking changes	`dependent_files` dict

Key Research Insight: "Incorporating problem descriptions into prompts consistently improved performance, highlighting the importance of code comments and pull request descriptions." (Evaluating LLMs for Code Review, 2025)

Context Positioning: The "Lost in the Middle" Problem

The Problem: LLMs exhibit a U-shaped attention curve where they best process information at the beginning and end of long contexts, while information in the middle is often neglected.

Research Source: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (TACL 2024)

"Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts."

Psychological Parallel: This mirrors the "serial-position effect" (Ebbinghaus, 1913) in human psychology.

Our Implementation (build_user_message() in review.py and council.py):

┌─────────────────────────────────────────┐
│ START: Critical context                  │
│   - Developer intent (conversation)      │  ← conversation_context first
│   - PR description (intent)              │  ← pr_metadata second
│   - Code diff (actual changes)           │  ← diff third
├─────────────────────────────────────────┤
│ MIDDLE: Supporting context               │
│   - Full file contents                   │  ← file_contents
│   - Documentation                        │  ← documentation
│   - Test files                           │  ← test_files
│   - Cross-file dependencies              │  ← dependent_files
├─────────────────────────────────────────┤
│ END: Instructions (in system prompt)     │
│   - Review focus areas                   │  ← SYSTEM_PROMPT
│   - Output format requirements           │
└─────────────────────────────────────────┘

Multi-Model Consensus: Why Three Reviewers

The Problem with Single-Model Review (Sonar LLM Leaderboard, Dec 2025):

Even top-tier models have significant blind spots. Per million lines of generated code:

Model	Pass Rate	Weakness	Error Rate
GPT-5.4 High	80.66%	Concurrency errors	470/MLOC (2x others)
Gemini 3.1 Pro	81.72%	Control flow mistakes	200/MLOC (4x best)
Claude Sonnet 4.5	~77%	Resource management leaks	195/MLOC
Opus 4.5 Thinking	Best pass rate	Control flow	55/MLOC (lowest)

Key Insight: Each model excels in different areas and fails in different ways. Multi-model consensus catches what any single model misses.

Research Support:

Hashgraph-Inspired Consensus (2025): Treating each model as a "peer" in a distributed consensus system achieves "blockchain-grade properties of consistency and Byzantine fault tolerance."
Disagreement as Data (Jan 2025): "LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability."

Plan/Architecture Review: The Missing Piece

No major code review tool effectively reviews plans before implementation.

Most tools focus on implementation-level code review:

Syntax errors, linting violations
Bug detection after code is written
Code style compliance

Heavy3 Code Audit uniquely offers pre-implementation validation:

Review Type	Focus	Council Roles
Architecture Design	Patterns, SOLID, separation of concerns	Design Expert (GPT 5.5)
Plan Feasibility	Is this approach realistic? What are the risks?	All three reviewers
Scalability Assessment	Will this scale? What are the bottlenecks?	Scalability Analyst (Gemini 3.1 Pro)
Security Architecture	Threat model, attack surface, auth design	Security Architect (Grok 4)

Why this matters for AI-assisted coding (vibe coding):

The plan determines 80% of success
Catching design issues early saves massive rework
Architecture review is where multi-model consensus shines

Usage:

/h3 plan.md           # Review a specific plan file
/h3                   # Smart detection finds plan.md in cwd or ~/.claude/plans/

Our Implementation (Verified)

Documented Practice	Code Location	Status
Conversation context at very start	`review.py:350-366`	Verified
Diff at start of context	`review.py:385-388`	Verified
Full file contents included	`review.py:390-393`	Verified
Test files included	`review.py:400-403`	Verified
Documentation included	`review.py:395-398`	Verified
PR metadata for PR reviews	`review.py:373-383`	Verified
Context truncation with marker	`review.py:472-473`	Verified
Retry with exponential backoff	`review.py:38-70`	Verified
Streaming response	`review.py:531-556`	Verified
3-model parallel execution	`council.py:514-540`	Verified
Specialized prompts per role	`council.py:73-226`	Verified
Web search integration	`council.py:375-377`	Verified
Cross-file dependencies	`review.py:406-409`, `council.py:333-336`	Verified

JSON Structure Sent to Models:

{
  "review_type": "code|plan|pr",
  "conversation_context": {
    "original_request": "Add logout button to navbar",
    "approach_notes": "Using existing Button component",
    "relevant_exchanges": [
      {"role": "user", "content": "Can you add a logout button?"},
      {"role": "assistant", "content": "I'll add it using the existing Button component..."}
    ],
    "previous_review_findings": "Prior review suggested adding confirmation dialog"
  },
  "diff": "--- a/src/foo.ts\n+++ b/src/foo.ts\n...",
  "file_contents": {
    "src/foo.ts": "// Full file content..."
  },
  "test_files": {
    "src/foo.test.ts": "// Test content..."
  },
  "dependent_files": {
    "src/bar.ts": "import { foo } from './foo';\n...\nfoo(data);"
  },
  "documentation": {
    "CLAUDE.md": "// Project guidelines..."
  },
  "pr_metadata": {
    "number": 123,
    "title": "Add feature X",
    "body": "This PR adds..."
  }
}

Context Engineering Best Practices

Based on CodeRabbit's "Context Engineering" blog and our implementation:

Deduplicate Redundant Context: We use a single-pass context builder that avoids duplication.
Respect Token Budgets:
- 200K tokens (max_context)
- Graceful truncation with [... truncated due to length ...] marker
Include Developer Intent: PR descriptions and commit context included via pr_metadata.
Streaming for UX: Single mode streams tokens as they arrive.
Progress Indicators: Council shows completion status for each model.

Large Change Handling

For changes exceeding context limits (50+ files, 10K+ lines):

Detection: Count files and lines via git stats
Module Grouping: Break into logical modules
Sequential Review: Review each module with progress tracking
Cross-Module Summary: Final pass for dependencies

References

Peer-Reviewed Academic Work

Published in peer-reviewed venues (conferences, journals):

Paper	Venue	Year	Key Finding
Lost in the Middle	TACL	2024	U-shaped attention; position info at start/end
ReConcile	ACL	2024	Multi-model consensus improves reasoning 11.4%
AI-powered Code Review	ICSE	2024	Multi-agent systems produce consistent outcomes
LAURA	IEEE ASE	2025	Commit messages + file paths + code context = optimal

Academic Preprints

ArXiv preprints (not yet peer-reviewed):

Paper	Year	Key Finding
Combining LLMs with Static Analyzers	2025	Hybrid approach improves accuracy 16%
Disagreement as Data	2025	Model disagreement correlates with reliability
Evaluating LLMs for Code Review	2025	PR descriptions improve performance

Industry Analysis

Reports and analysis from industry sources:

Source	Organization	Key Finding
LLM Code Quality Analysis	Sonar	Each model has distinct blind spots in generated code
Context Engineering	CodeRabbit	Quality of context > quantity for reviews
Handling Ballooning Context	CodeRabbit	Strategic context selection for large codebases
LLM Coding Workflow 2026	Addy Osmani	Multi-model validation best practices

Summary

Principle	Research Basis	Implementation
Quality > quantity	Context quality > size	Smart selection of relevant files
Position matters	Lost in the Middle	Intent first, diff second, docs in middle
Diversity reduces errors	Multi-model consensus	3 models + roles + search sources
Include intent	Problem descriptions	Conversation context + PR body
Specialized focus	Role-based prompting	Correctness/Performance/Security
Graceful degradation	Token limits	Truncation markers, module-by-module
Streaming UX	Response latency	Single streams, Council shows progress
Retry resilience	Transient failures	Exponential backoff (2s, 4s, 8s)