Code Audit Methodology

April 26, 2026

This document explains the research-backed principles behind Heavy3 Code Audit's design.

Table of Contents

  1. The Synthesis Table - Our trademark feature
  2. The Council - Three specialized reviewers
  3. Why Multi-Model? - Research backing
  4. Plan Review - Review before you code
  5. Context Engineering - What context to send
  6. Context Positioning - Lost in the Middle
  7. Large Change Handling - 50+ files, 10K+ lines
  8. Implementation Details - Code references
  9. Academic References - Peer-reviewed sources

The Synthesis Table (Trademark Feature)

After council review, Claude synthesizes all findings into a 3-column comparison table—our trademark feature that differentiates Heavy3 Code Audit from other tools.

Example output:

| Aspect | Correctness (GPT 5.5) | Performance (Gemini 3.1) | Security (Grok 4) |
| --- | --- | --- | --- |
| Focus | Bugs, Logic, Edge Cases | Scaling, Memory, N+1 | Vulnerabilities, Auth |
| Findings | ❌ Null check missing | ⚠️ Potential N+1 query | ✅ No issues found |
| Verdict | REQUEST CHANGES | APPROVE WITH NOTES | APPROVE |

Legend: ✅ = No issues | ⚠️ = Warning | ❌ = Critical issue

What you get:

  • Consensus Issues - Problems flagged by 2+ reviewers (high confidence)
  • Notable Findings - Unique insights from each specialist
  • Final Recommendation - APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES
  • Priority Actions - Ranked list of fixes
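The split between consensus issues (flagged by 2+ reviewers) and single-reviewer findings can be sketched as follows. This is a minimal illustration, not Heavy3's actual synthesis logic: the `Finding` type, the `issue_key` field used to match findings across reviewers, and the `split_consensus` function are all hypothetical. In the real tool Claude performs the synthesis, so matching is presumably semantic rather than by exact key.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    reviewer: str   # e.g. "correctness", "performance", "security"
    issue_key: str  # hypothetical normalized identifier for the issue
    detail: str


def split_consensus(findings, min_reviewers=2):
    """Partition findings into consensus issues and single-reviewer findings."""
    by_issue = defaultdict(list)
    for f in findings:
        by_issue[f.issue_key].append(f)

    consensus, notable = {}, {}
    for key, group in by_issue.items():
        reviewers = {f.reviewer for f in group}
        # An issue is "consensus" when enough distinct reviewers flagged it.
        target = consensus if len(reviewers) >= min_reviewers else notable
        target[key] = sorted(reviewers)
    return consensus, notable


findings = [
    Finding("correctness", "null-check-user", "Null check missing"),
    Finding("security", "null-check-user", "Unvalidated user object"),
    Finding("performance", "n-plus-one-orders", "Potential N+1 query"),
]
consensus, notable = split_consensus(findings)
print(consensus)  # {'null-check-user': ['correctness', 'security']}
print(notable)    # {'n-plus-one-orders': ['performance']}
```

Counting distinct reviewers rather than raw findings keeps one reviewer who reports the same issue twice from producing a false consensus.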

Why this matters:

  • Shows exactly where reviewers agree and disagree
  • Provides transparency no other tool offers
  • Helps you make informed decisions about which issues to address

The Council

Three specialized reviewers, each with web search:

| Role | Model | Focus | Search |
| --- | --- | --- | --- |
| Correctness Expert | GPT 5.5 | Bugs, logic errors, edge cases, race conditions | Bing |
| Performance Critic | Gemini 3.1 Pro | N+1 queries, memory leaks, scaling bottlenecks | Exa |
| Security Analyst | Grok 4 | Vulnerabilities, auth issues, data exposure | Exa |

Why Grok 4 for Security?

Grok 4 was selected as Security Analyst based on independent security benchmarks:

| Benchmark | Score | Source |
| --- | --- | --- |
| Kilo AI Exploit Test | 100% detection on advanced exploits | Kilo Blog |
| WMDP-Cyber | 79-81% accuracy (vulnerability detection, reverse engineering) | xAI Model Card |
| CyBench CTF | 43% success on 40 capture-the-flag challenges | xAI Model Card |
| Veracode Security | 55% secure code generation (mid-tier) | Veracode |

The Kilo AI test specifically evaluated Grok 4 on prototype pollution, agentic AI supply-chain attacks, and OS command injection—achieving fix quality scores of 83-85/100 with references to OWASP AI Top 10 and NIST AI RMF standards.

Three Pillars of Diversity:

| Pillar | Implementation | Benefit |
| --- | --- | --- |
| Different Models | GPT 5.5 + Gemini 3.1 Pro + Grok 4 | Different training data, different blind spots |
| Specialized Roles | Correctness + Performance + Security | Forces comprehensive coverage |
| Different Search Sources | Bing + Exa | Facts verified across independent indexes |

You code with Claude. Our council (GPT + Gemini + Grok) catches what Claude misses.


Why Multi-Model?

A single model classifies code correctness accurately only about 68% of the time. Research on LLM code review accuracy shows significant room for improvement:

| Study | Finding |
| --- | --- |
| LLM Code Review (2025) | GPT-4o correctly classifies code correctness 68.50% of the time |
| Gemini Flash Study | Gemini 2.0 Flash achieves 63.89% accuracy |
| False Positive Rate | Up to 24.80% of correct code receives incorrect suggestions |

Every model has different blind spots. The GPT-5.4 / Gemini 3.1 Pro / Opus 4.5 numbers below come from Sonar's Dec 2025 analysis of millions of lines of generated code. Heavy3 has since upgraded the Correctness role from GPT 5.4 to GPT 5.5, but the underlying insight — that each architecture fails differently — still motivates the council design.

| Model | Strength | Weakness |
| --- | --- | --- |
| GPT-5.4 | Cleanest control flow (22/MLOC) | 2x more concurrency bugs (470/MLOC) |
| Gemini 3.1 Pro | Highest pass rate (81.7%) | 4x more control flow mistakes (200/MLOC) |
| Opus 4.5 | Best overall accuracy | Control flow mistakes, though at the lowest error rate (55/MLOC) |

The insight: The probability that two different architectures hallucinate the same bug is significantly lower than the probability that a single model hallucinates it. Three specialized models therefore catch more of what any single model misses.
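A back-of-the-envelope calculation shows why this matters. The sketch below assumes reviewer errors are statistically independent, which is an idealization: real models share training data, so their errors are correlated and the true joint probability is higher than shown here. The function name `joint_error_probability` is hypothetical, and the 31.5% figure is simply the error rate implied by the 68.50% accuracy cited above, used for illustration only.

```python
import math


def joint_error_probability(error_rates):
    """Probability that every reviewer errs on the same item, assuming independence."""
    return math.prod(error_rates)


# Illustrative only: 31.5% is the single-model error rate implied by the
# 68.50% accuracy figure cited above; real reviewers are not fully independent.
single = 0.315
print(round(joint_error_probability([single]), 4))      # 0.315
print(round(joint_error_probability([single] * 2), 4))  # 0.0992
print(round(joint_error_probability([single] * 3), 4))  # 0.0313
```

Under these assumptions, the chance that all three reviewers are wrong about the same item drops from roughly 1 in 3 to roughly 1 in 32. Correlated errors shrink that gain but do not remove it.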


What Context to Send for High-Quality Reviews

Research Finding: Quality of context matters more than quantity. Well-selected context of 8K-32K tokens outperforms noisy context of 100K+ tokens.

Based on LAURA (Zhang et al., IEEE ASE 2025) and CodeRabbit's context engineering:

| Context Type | What to Include | Priority | Our Implementation |
| --- | --- | --- | --- |
| Conversation History | Original request, approach notes, 3-5 relevant exchanges | Essential | conversation_context dict |
| Code Diff | Changed lines + 3-5 surrounding lines | Essential | git diff HEAD |
| Commit Messages | Fine-grained change documentation | Essential | PR body included |
| File Paths | Full paths of changed files | Essential | changed_files array |
| Full File Contents | Complete current state of changed files | High | file_contents dict |
| Related Tests | Test files matching changed files | High | test_files dict |
| Problem Description | PR description, issue context | High | pr_metadata.body |
| Documentation | CLAUDE.md, architecture docs | Medium | documentation dict |
| Cross-File Dependencies | Files that import/call changed code | For breaking changes | dependent_files dict |

Key Research Insight: "Incorporating problem descriptions into prompts consistently improved performance, highlighting the importance of code comments and pull request descriptions." (Evaluating LLMs for Code Review, 2025)


Context Positioning: The "Lost in the Middle" Problem

The Problem: LLMs exhibit a U-shaped performance curve over long contexts: they use information best when it appears at the beginning or end, and often neglect information in the middle.

Research Source: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (TACL 2024)

"Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts."

Psychological Parallel: This mirrors the "serial-position effect" (Ebbinghaus, 1913) in human psychology.

Our Implementation (build_user_message() in review.py and council.py):

┌─────────────────────────────────────────┐
│ START: Critical context                  │
│   - Developer intent (conversation)      │  ← conversation_context first
│   - PR description (intent)              │  ← pr_metadata second
│   - Code diff (actual changes)           │  ← diff third
├─────────────────────────────────────────┤
│ MIDDLE: Supporting context               │
│   - Full file contents                   │  ← file_contents
│   - Documentation                        │  ← documentation
│   - Test files                           │  ← test_files
│   - Cross-file dependencies              │  ← dependent_files
├─────────────────────────────────────────┤
│ END: Instructions (in system prompt)     │
│   - Review focus areas                   │  ← SYSTEM_PROMPT
│   - Output format requirements           │
└─────────────────────────────────────────┘
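The ordering in the diagram can be sketched as a fixed list of sections rendered in sequence. This is an illustrative approximation, not the actual `build_user_message()` from review.py or council.py: the function name `build_user_message_sketch`, the `SECTION_ORDER` constant, the section titles, and the heading format are assumptions. The dictionary keys match the context fields named in this document. Instructions are omitted because, per the diagram, they live in the system prompt.

```python
import json

# Section order follows the diagram: critical context first, supporting context after.
SECTION_ORDER = [
    ("conversation_context", "Developer Intent"),
    ("pr_metadata", "PR Description"),
    ("diff", "Code Diff"),
    ("file_contents", "Full File Contents"),
    ("documentation", "Documentation"),
    ("test_files", "Test Files"),
    ("dependent_files", "Cross-File Dependencies"),
]


def build_user_message_sketch(context):
    """Render the context sections in a fixed order, skipping missing or empty ones."""
    parts = []
    for key, title in SECTION_ORDER:
        value = context.get(key)
        if not value:
            continue
        # Strings (such as the diff) pass through; dicts are serialized as JSON.
        body = value if isinstance(value, str) else json.dumps(value, indent=2)
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)


# Keys are deliberately given out of order; the output order is still fixed.
message = build_user_message_sketch({
    "file_contents": {"src/foo.ts": "// Full file content..."},
    "diff": "--- a/src/foo.ts\n+++ b/src/foo.ts\n...",
    "conversation_context": {"original_request": "Add logout button to navbar"},
})
print(message)
```

Driving the layout from one ordered list means the position of each section is decided in a single place, independent of how the caller assembled the context dictionary.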

Multi-Model Consensus: Why Three Reviewers

The Problem with Single-Model Review (Sonar LLM Leaderboard, Dec 2025):

Even top-tier models have significant blind spots. Per million lines of generated code:

| Model | Pass Rate | Weakness | Error Rate |
| --- | --- | --- | --- |
| GPT-5.4 High | 80.66% | Concurrency errors | 470/MLOC (2x others) |
| Gemini 3.1 Pro | 81.72% | Control flow mistakes | 200/MLOC (4x best) |
| Claude Sonnet 4.5 | ~77% | Resource management leaks | 195/MLOC |
| Opus 4.5 Thinking | Best pass rate | Control flow | 55/MLOC (lowest) |

Key Insight: Each model excels in different areas and fails in different ways. Multi-model consensus catches what any single model misses.

Research Support:

  1. Hashgraph-Inspired Consensus (2025): Treating each model as a "peer" in a distributed consensus system achieves "blockchain-grade properties of consistency and Byzantine fault tolerance."

  2. Disagreement as Data (Jan 2025): "LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability."


Plan/Architecture Review: The Missing Piece

No major code review tool effectively reviews plans before implementation.

Most tools focus on implementation-level code review:

  • Syntax errors, linting violations
  • Bug detection after code is written
  • Code style compliance

Heavy3 Code Audit uniquely offers pre-implementation validation:

| Review Type | Focus | Council Roles |
| --- | --- | --- |
| Architecture Design | Patterns, SOLID, separation of concerns | Design Expert (GPT 5.5) |
| Plan Feasibility | Is this approach realistic? What are the risks? | All three reviewers |
| Scalability Assessment | Will this scale? What are the bottlenecks? | Scalability Analyst (Gemini 3.1 Pro) |
| Security Architecture | Threat model, attack surface, auth design | Security Architect (Grok 4) |

Why this matters for AI-assisted coding (vibe coding):

  • The plan determines 80% of success
  • Catching design issues early saves massive rework
  • Architecture review is where multi-model consensus shines

Usage:

/h3 plan.md           # Review a specific plan file
/h3                   # Smart detection finds plan.md in cwd or ~/.claude/plans/
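The smart detection described in the usage comment could look roughly like the sketch below. The function name `find_plan_file`, its parameters, and the precedence order (explicit argument, then the current directory, then `~/.claude/plans/`) are assumptions made for illustration; the source only states that `plan.md` is looked up in those two locations.

```python
from pathlib import Path


def find_plan_file(explicit=None, cwd=None, plans_dir=None):
    """Return the plan file to review, or None if nothing is found."""
    cwd = Path(cwd) if cwd else Path.cwd()
    plans_dir = Path(plans_dir) if plans_dir else Path.home() / ".claude" / "plans"

    # An explicitly named file (as in `/h3 plan.md`) always wins.
    if explicit:
        path = Path(explicit)
        if not path.is_absolute():
            path = cwd / path
        return path if path.is_file() else None

    # Otherwise look for plan.md in the working directory, then the plans directory.
    # (Assumed precedence; the source does not say which location is checked first.)
    for directory in (cwd, plans_dir):
        candidate = directory / "plan.md"
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_plan_file()` with no arguments mirrors the bare `/h3` invocation, while `find_plan_file("plan.md")` mirrors passing a specific file.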

Our Implementation (Verified)

| Documented Practice | Code Location | Status |
| --- | --- | --- |
| Conversation context at very start | review.py:350-366 | Verified |
| Diff at start of context | review.py:385-388 | Verified |
| Full file contents included | review.py:390-393 | Verified |
| Test files included | review.py:400-403 | Verified |
| Documentation included | review.py:395-398 | Verified |
| PR metadata for PR reviews | review.py:373-383 | Verified |
| Context truncation with marker | review.py:472-473 | Verified |
| Retry with exponential backoff | review.py:38-70 | Verified |
| Streaming response | review.py:531-556 | Verified |
| 3-model parallel execution | council.py:514-540 | Verified |
| Specialized prompts per role | council.py:73-226 | Verified |
| Web search integration | council.py:375-377 | Verified |
| Cross-file dependencies | review.py:406-409, council.py:333-336 | Verified |
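To illustrate the "3-model parallel execution" row, here is one way to run three reviewers concurrently. This is a sketch under stated assumptions: this document does not say whether council.py uses asyncio, threads, or another mechanism, and `run_reviewer` is a hypothetical stub standing in for a real model API call.

```python
import asyncio

# Role names come from the council table; the reviewer call itself is a stub.
ROLES = ["correctness", "performance", "security"]


async def run_reviewer(role, payload):
    """Hypothetical stand-in for one model API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"role": role, "verdict": "APPROVE", "chars_reviewed": len(payload)}


async def run_council(payload):
    """Run all reviewers concurrently; one failure does not cancel the others."""
    results = await asyncio.gather(
        *(run_reviewer(role, payload) for role in ROLES),
        return_exceptions=True,
    )
    reviews = {}
    for role, result in zip(ROLES, results):
        if isinstance(result, Exception):
            # Keep the failure visible instead of dropping the reviewer.
            reviews[role] = {"role": role, "error": str(result)}
        else:
            reviews[role] = result
    return reviews


reviews = asyncio.run(run_council("--- a/src/foo.ts\n+++ b/src/foo.ts\n..."))
print(sorted(reviews))  # ['correctness', 'performance', 'security']
```

Passing `return_exceptions=True` means a single provider outage yields a partial council result rather than aborting the whole review, which suits a design where each reviewer is independent.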

JSON Structure Sent to Models:

{
  "review_type": "code|plan|pr",
  "conversation_context": {
    "original_request": "Add logout button to navbar",
    "approach_notes": "Using existing Button component",
    "relevant_exchanges": [
      {"role": "user", "content": "Can you add a logout button?"},
      {"role": "assistant", "content": "I'll add it using the existing Button component..."}
    ],
    "previous_review_findings": "Prior review suggested adding confirmation dialog"
  },
  "diff": "--- a/src/foo.ts\n+++ b/src/foo.ts\n...",
  "file_contents": {
    "src/foo.ts": "// Full file content..."
  },
  "test_files": {
    "src/foo.test.ts": "// Test content..."
  },
  "dependent_files": {
    "src/bar.ts": "import { foo } from './foo';\n...\nfoo(data);"
  },
  "documentation": {
    "CLAUDE.md": "// Project guidelines..."
  },
  "pr_metadata": {
    "number": 123,
    "title": "Add feature X",
    "body": "This PR adds..."
  }
}

Context Engineering Best Practices

Based on CodeRabbit's "Context Engineering" blog and our implementation:

  1. Deduplicate Redundant Context: We use a single-pass context builder that avoids duplication.

  2. Respect Token Budgets:

    • 200K tokens (max_context)
    • Graceful truncation with [... truncated due to length ...] marker

  3. Include Developer Intent: PR descriptions and commit context included via pr_metadata.

  4. Streaming for UX: Single mode streams tokens as they arrive.

  5. Progress Indicators: Council shows completion status for each model.
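The token-budget practice above can be sketched as a simple truncation helper. The marker text and the 200K limit come from this document; the 4-characters-per-token estimate, the constant names, and the function `truncate_to_budget` are assumptions for illustration. A real implementation would more likely count tokens with the model's tokenizer.

```python
TRUNCATION_MARKER = "[... truncated due to length ...]"
MAX_CONTEXT_TOKENS = 200_000  # max_context from the list above
CHARS_PER_TOKEN = 4           # rough heuristic, an assumption for this sketch


def truncate_to_budget(text, max_tokens=MAX_CONTEXT_TOKENS):
    """Cut text to an approximate token budget and append the truncation marker."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    # Reserve room for the marker and a newline so the result fits the budget.
    keep = max(0, max_chars - len(TRUNCATION_MARKER) - 1)
    return text[:keep] + "\n" + TRUNCATION_MARKER


short = truncate_to_budget("small diff")
long = truncate_to_budget("x" * 1000, max_tokens=50)  # 50 tokens, about 200 chars
print(short)                             # small diff
print(len(long))                         # 200
print(long.endswith(TRUNCATION_MARKER))  # True
```

Ending the text with an explicit marker tells the reviewing model that content is missing, so it can qualify its findings instead of assuming it saw the whole change.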


Large Change Handling

For changes exceeding context limits (50+ files, 10K+ lines):

  1. Detection: Count files and lines via git stats
  2. Module Grouping: Break into logical modules
  3. Sequential Review: Review each module with progress tracking
  4. Cross-Module Summary: Final pass for dependencies
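The detection and grouping steps above might be sketched as follows. This is not the tool's actual code: the helper names are hypothetical, the thresholds are read as "either limit triggers large-change mode" (the source does not say whether both must be exceeded), and a "module" is assumed to mean a top-level directory. The input format is that of `git diff --numstat`, which prints added lines, deleted lines, and path separated by tabs.

```python
from collections import defaultdict
from pathlib import PurePosixPath

FILE_THRESHOLD = 50      # "50+ files" from the text above
LINE_THRESHOLD = 10_000  # "10K+ lines" from the text above


def parse_numstat(numstat_output):
    """Parse `git diff --numstat` output into (path, lines_changed) pairs."""
    stats = []
    for line in numstat_output.strip().splitlines():
        added, deleted, path = line.split("\t", 2)
        # Binary files report "-" for both counts; treat them as 0 lines.
        changed = sum(int(n) for n in (added, deleted) if n.isdigit())
        stats.append((path, changed))
    return stats


def is_large_change(stats):
    """Assumed rule: exceeding either threshold counts as a large change."""
    total_lines = sum(lines for _, lines in stats)
    return len(stats) >= FILE_THRESHOLD or total_lines >= LINE_THRESHOLD


def group_by_module(stats):
    """Group files by top-level directory (an assumed definition of 'module')."""
    modules = defaultdict(list)
    for path, _ in stats:
        parts = PurePosixPath(path).parts
        modules[parts[0] if len(parts) > 1 else "(root)"].append(path)
    return dict(modules)


sample = "10\t2\tsrc/api/users.py\n300\t0\tsrc/api/orders.py\n-\t-\tassets/logo.png\n5\t5\tREADME.md\n"
stats = parse_numstat(sample)
modules = group_by_module(stats)
print(is_large_change(stats))  # False
print(sorted(modules))         # ['(root)', 'assets', 'src']
```

Each group can then be reviewed in sequence, with a final cross-module pass over the per-module summaries.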

References

Peer-Reviewed Academic Work

Published in peer-reviewed venues (conferences, journals):

| Paper | Venue | Year | Key Finding |
| --- | --- | --- | --- |
| Lost in the Middle | TACL | 2024 | U-shaped attention; position info at start/end |
| ReConcile | ACL | 2024 | Multi-model consensus improves reasoning 11.4% |
| AI-powered Code Review | ICSE | 2024 | Multi-agent systems produce consistent outcomes |
| LAURA | IEEE ASE | 2025 | Commit messages + file paths + code context = optimal |

Academic Preprints

ArXiv preprints (not yet peer-reviewed):

| Paper | Year | Key Finding |
| --- | --- | --- |
| Combining LLMs with Static Analyzers | 2025 | Hybrid approach improves accuracy 16% |
| Disagreement as Data | 2025 | Model disagreement correlates with reliability |
| Evaluating LLMs for Code Review | 2025 | PR descriptions improve performance |

Industry Analysis

Reports and analysis from industry sources:

| Source | Organization | Key Finding |
| --- | --- | --- |
| LLM Code Quality Analysis | Sonar | Each model has distinct blind spots in generated code |
| Context Engineering | CodeRabbit | Quality of context > quantity for reviews |
| Handling Ballooning Context | CodeRabbit | Strategic context selection for large codebases |
| LLM Coding Workflow 2026 | Addy Osmani | Multi-model validation best practices |

Summary

| Principle | Research Basis | Implementation |
| --- | --- | --- |
| Quality > quantity | Context quality > size | Smart selection of relevant files |
| Position matters | Lost in the Middle | Intent first, diff second, docs in middle |
| Diversity reduces errors | Multi-model consensus | 3 models + roles + search sources |
| Include intent | Problem descriptions | Conversation context + PR body |
| Specialized focus | Role-based prompting | Correctness/Performance/Security |
| Graceful degradation | Token limits | Truncation markers, module-by-module |
| Streaming UX | Response latency | Single streams, Council shows progress |
| Retry resilience | Transient failures | Exponential backoff (2s, 4s, 8s) |
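The backoff schedule in the last row (2s, 4s, 8s) corresponds to a base delay of 2 seconds doubled on each retry. The sketch below shows that pattern in isolation; it is not the code at review.py:38-70, and the function name `retry_with_backoff`, its parameters, and the choice to retry on any exception are assumptions. The `sleep` parameter is injectable so the demo runs instantly.

```python
import time


def retry_with_backoff(call, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call `call()`, retrying on exception with delays of 2s, 4s, 8s by default."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            sleep(base_delay * (2 ** attempt))


# Demo with a fake sleep that records delays instead of waiting.
delays = []
attempts = {"count": 0}


def flaky():
    """Fail three times with a transient error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise ConnectionError("transient failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [2.0, 4.0, 8.0]
```

A production version would usually retry only on errors known to be transient (timeouts, rate limits) and might add random jitter so that parallel reviewers do not retry in lockstep.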