Evaluating API-Indexed Retrieval vs Textual Search for LLM Agent Codebase Navigation

February 28, 2026

Overview

Five real-world codebases. Five languages. One question: how much does a pre-indexed API surface save an LLM agent compared to Grep+Read?

Each benchmark follows a realistic cross-cutting research workflow — the kind of investigation an agent performs before implementing a feature. Every MCP token count comes from actual tool responses measured against the indexed project.

Cross-Language Results

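| Language | Project | Files | Records | MCP (tokens) | Skilled (tokens) | Naive (tokens) | MCP vs Skilled | MCP vs Naive |
|---|---|---|---|---|---|---|---|---|
| C# | Unity game | 129 | 1,034 | 1,021 | 4,453 | 11,825 | 77% fewer | 91% fewer |
| TypeScript | immich | 694 | 8,344 | 1,451 | 4,500 | 14,550 | 68% fewer | 90% fewer |
| Java | guava | 891 | 8,377 | 1,851 | 4,200 | 26,700 | 56% fewer | 93% fewer |
| Go | gin | 38 | 534 | 1,791 | 2,770 | 15,300 | 35% fewer | 88% fewer |
| Python | codesurface | 9 | 40 | 753 | 2,000 | 10,400 | 62% fewer | 93% fewer |
| TOTAL | | 1,761 | 18,329 | 6,867 | 17,923 | 78,775 | 62% fewer | 91% fewer |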

Methodology

Three agent strategies are compared for each workflow:

| Strategy | Description |
|---|---|
| MCP | Pre-indexed API server. Returns only public signatures, pre-ranked. 1 tool call per lookup. |
| Skilled Agent | Targeted Grep + partial Read. Uses Grep -C 3 for signatures, Read with offset/limit (~40 lines) for classes. Assumes the agent already knows file paths or finds them efficiently. |
| Naive Agent | Grep to find file → Read the entire file. When multiple steps target the same file, the cost is counted on first access only (the agent retains file content in context). |

Token estimate: len(text) / 4 (standard approximation for code).
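As a minimal sketch of that estimate, the helper below applies the 4-characters-per-token rule to one of the MCP responses quoted later in this write-up. The function name and the use of integer division are assumptions for illustration; the benchmark's own rounding is not specified.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count with the 4-characters-per-token rule.

    Integer division is an assumption; the write-up only states len(text) / 4.
    """
    return len(text) // 4


# MCP response for get_signature("GetHeader"), copied from the Go benchmark.
GET_HEADER_RESPONSE = """[METHOD] gin.Context.GetHeader
  Signature: GetHeader(key string) string
  Summary: GetHeader returns value from request headers.
  File: context.go:1088"""

if __name__ == "__main__":
    # Lands in the same ballpark as the 40 tokens reported for this lookup.
    print("estimated tokens:", estimate_tokens(GET_HEADER_RESPONSE))
```

The estimate depends only on character count, so it treats code, comments, and whitespace alike; a real tokenizer would give somewhat different numbers.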

Important caveats:

  • The Skilled Agent numbers assume optimal tool usage — real agents vary between Skilled and Naive.
  • Grep simulations use case-insensitive substring matching (similar to ripgrep). In practice, an agent might need multiple grep attempts.
  • MCP returns are pure signal (public API only). Grep+Read returns include implementation noise the LLM must mentally filter.
  • For small files (< 30 lines), all three approaches converge — the advantage grows with file size and codebase complexity.

C# — Unity Game (129 files, 1,034 records)

Feature: When a player wins a blast level, spawn reward items in their camp.

Why this workflow? It crosses both game modes (Blast and Camp), touches shared events, requires understanding DI wiring, and involves multiple services. This is a realistic cross-cutting feature that an LLM agent would need to research before implementing.

Results

| # | Developer Question | MCP | Skilled | Naive | MCP vs Skilled (% fewer) |
|---|---|---|---|---|---|
| 1 | What data comes back from a completed blast level? | 57 | 119 | 120 | 52% |
| 2 | How does the game mode context bridge work? | 93 | 100 | 332 | 7% |
| 3 | What orchestrates game mode transitions? | 122 | 461 | 1,950 | 74% |
| 4 | What events fire when a blast level ends? | 62 | 163 | 390 | 62% |
| 5 | How do I spawn items into the camp grid? | 153 | 1,614 | 2,746 | 91% |
| 6 | How do I find empty cells near a location? | 221 | 353 | 836 | 37% |
| 7 | Where do I wire new camp logic? (entry point) | 54 | 571 | 3,803 | 91% |
| 8 | What shared events bridge camp and blast? | 34 | 196 | 407 | 83% |
| 9 | What's the EventBus API? | 60 | 402 | 476 | 85% |
| 10 | Does a reward-related event or model already exist? | 165 | 474 | 765 | 65% |
| TOTAL | | 1,021 | 4,453 | 11,825 | 77% |

Highlighted step: Where do I wire new camp logic?

MCP returns 54 tokens — just the class declaration and 2 public methods:

Class: CampEntryPoint
Namespace: CampGame.Scopes
Declaration: class CampEntryPoint : IStartable, IDisposable
File: CampGame/Scopes/CampEntryPoint.cs

-- METHODS (2) --
  void Dispose()
  void Start()

Total: 2 members

The Skilled Agent reads ~40 lines (571 tokens). The Naive Agent reads the full file — 3,803 tokens of constructor injection, Start() body, Dispose() body, and private fields the agent doesn't need for discovery.

Key observation

The biggest gaps appear on class lookups for large files (steps 3, 5, 7). A 200-line file with 9 public methods contains ~190 lines of implementation the LLM doesn't need. MCP returns only the public surface.
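To illustrate what "public surface only" means, here is a hedged sketch that extracts public method signatures from a class using Python's standard ast module. The source string is a hypothetical Python analogue of CampEntryPoint, and public_surface is an invented helper; this is not how the benchmarked indexer parses C#.

```python
import ast

# Hypothetical Python rendition of an entry-point class (not the real C# file).
SOURCE = '''
class CampEntryPoint:
    def __init__(self, bus, grid):
        self._bus = bus
        self._grid = grid

    def start(self) -> None:
        self._bus.subscribe(self._on_level_won)

    def dispose(self) -> None:
        self._bus.unsubscribe(self._on_level_won)

    def _on_level_won(self, event):
        self._grid.spawn(event.reward)
'''


def public_surface(source: str) -> list[str]:
    """Return class names and public method signatures, dropping all bodies."""
    lines = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        lines.append(f"class {node.name}")
        for item in node.body:
            # Names starting with "_" (including __init__) count as non-public here.
            if isinstance(item, ast.FunctionDef) and not item.name.startswith("_"):
                returns = f" -> {ast.unparse(item.returns)}" if item.returns else ""
                lines.append(f"  {item.name}({ast.unparse(item.args)}){returns}")
    return lines


if __name__ == "__main__":
    surface = "\n".join(public_surface(SOURCE))
    print(surface)
    print(f"{len(SOURCE) // 4} tokens of source vs {len(surface) // 4} of surface")
```

Note that the constructor is dropped along with the private handler, which mirrors the limitation discussed later: the surface is enough for discovery but not for wiring.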


TypeScript — immich (694 files, 8,344 records)

Feature: Add user notification when an album is shared via link.

Why this workflow? Immich is a self-hosted photo management app. Adding share notifications touches controllers, DTOs, repositories, the notification system, shared link infrastructure, and background jobs. This crosses multiple architectural layers in a real production codebase.

Results

| # | Developer Question | Tool | MCP | Skilled | Naive |
|---|---|---|---|---|---|
| 1 | How does album sharing work? | search("album share") | 170 | 380 | 1,800 |
| 2 | What's the album controller API? | get_class("AlbumController") | 300 | 750 | 2,100 |
| 3 | What notification system exists? | search("notification") | 145 | 480 | 2,800 |
| 4 | What's the notification controller API? | get_class("NotificationController") | 160 | 550 | 1,300 |
| 5 | What DTO carries notification data? | get_class("NotificationDto") | 83 | 260 | 900 |
| 6 | How does authentication work? | search("AuthDto") | 85 | 380 | 700 |
| 7 | How are shared links created? | search("SharedLink create") | 170 | 450 | 2,000 |
| 8 | What's the notification storage layer? | get_class("NotificationRepository") | 120 | 550 | 1,000 |
| 9 | What background job system exists? | search("job queue") | 133 | 320 | 1,200 |
| 10 | What's the download/retrieval pattern? | get_class("DownloadRepository") | 85 | 380 | 750 |
| TOTAL | | | 1,451 | 4,500 | 14,550 |

Highlighted step: What's the album controller API?

MCP returns 300 tokens — 13 method signatures with their decorator patterns:

Class: AlbumController
Namespace: server.src.controllers.album.controller
Declaration: class AlbumController
File: server/src/controllers/album.controller.ts:25

-- METHODS (13) --
  addAssetsToAlbum(@Auth() auth: AuthDto, @Param() { id }: UUIDParamDto, @Body() dto: BulkIdsDto,)
  addUsersToAlbum(@Auth() auth: AuthDto, @Param() { id }: UUIDParamDto, @Body() dto: AddUsersDto,)
  constructor(private service: AlbumService)
  createAlbum(@Auth() auth: AuthDto, @Body() dto: CreateAlbumDto): Promise<AlbumResponseDto>
  deleteAlbum(@Auth() auth: AuthDto, @Param() { id }: UUIDParamDto)
  getAlbumInfo(@Auth() auth: AuthDto, @Param() { id }: UUIDParamDto, @Query() dto: AlbumInfoDto,)
  getAlbumStatistics(@Auth() auth: AuthDto): Promise<AlbumStatisticsResponseDto>
  getAllAlbums(@Auth() auth: AuthDto, @Query() query: GetAlbumsDto): Promise<AlbumResponseDto[]>
  removeAssetFromAlbum(...)
  removeUserFromAlbum(...)
  updateAlbumInfo(...)
  updateAlbumUser(...)

Total: 13 members

The agent immediately sees: AlbumController injects AlbumService, every endpoint uses @Auth(), users are added via addUsersToAlbum, and the sharing DTO is AddUsersDto. No method bodies, no import statements, no route decorators — pure API surface.

Key observation

TypeScript controllers with decorators (@Auth(), @Body(), @Param()) are particularly noisy to Grep+Read. Each method is 5-10 lines in the source but 1 line in MCP. The decorator-heavy patterns in NestJS-style apps amplify MCP's advantage.


Java — Guava (891 files, 8,377 records)

Feature: Build a cache-backed user profile lookup service using Guava's caching library.

Why this workflow? Guava is a foundational Java utility library (8,377 indexed records across 891 files). Its Javadoc-heavy source files make Grep+Read particularly expensive — a 500-line file might contain 300 lines of Javadoc. This workflow researches the full caching stack: interfaces, builders, loaders, stats, eviction, and async patterns.

Results

| # | Developer Question | Tool | MCP | Skilled | Naive |
|---|---|---|---|---|---|
| 1 | What cache types exist? | search("Cache") | 200 | 350 | 2,400 |
| 2 | What's the Cache interface? | get_class("Cache") | 213 | 600 | 2,400 |
| 3 | How does LoadingCache extend it? | get_class("LoadingCache") | 138 | 450 | 1,200 |
| 4 | How do I implement a loader? | get_class("CacheLoader") | 175 | 600 | 3,500 |
| 5 | How do I handle eviction callbacks? | get_class("RemovalListener") | 75 | 250 | 700 |
| 6 | How do I configure async removal? | get_signature("asynchronous") | 75 | 200 | 500 |
| 7 | How do I monitor cache performance? | get_class("CacheStats") | 450 | 600 | 2,500 |
| 8 | How do I set cache size limits? | get_signature("maximumSize") | 175 | 350 | 5,500 |
| 9 | What input validation exists? | search("Preconditions check") | 175 | 400 | 6,000 |
| 10 | What async patterns complement caching? | search("ListenableFuture") | 175 | 400 | 2,000 |
| TOTAL | | | 1,851 | 4,200 | 26,700 |

Highlighted step: What input validation exists?

MCP returns 175 tokens — the top 5 Preconditions methods by relevance:

Found 5 result(s) for 'Preconditions check':

[METHOD] com.google.common.base.Preconditions.checkArgument(boolean)
  Signature: static void checkArgument(boolean expression)
  File: Preconditions.java:125

[METHOD] com.google.common.base.Preconditions.checkNotNull(T)
  Signature: static T checkNotNull(T reference)
  File: Preconditions.java:954

[METHOD] com.google.common.base.Preconditions.checkArgument(boolean,Object)
  Signature: static void checkArgument(boolean expression, Object errorMessage)
  File: Preconditions.java:139

...

The Naive Agent reads the full Preconditions.java — a 1,000+ line file with 80 method overloads (checkArgument × 26 parameter combos, checkNotNull × 26, checkState × 26, plus index checks). That's 6,000 tokens of nearly identical signatures the LLM must wade through. MCP's BM25 ranking surfaces the 5 most relevant methods.

Key observation

Java's Javadoc convention makes Naive reads extremely expensive. A 150-line interface can occupy 500 lines in source, so documentation outweighs code by more than 2:1. Guava is close to a worst case for this: CacheBuilder.java has a 190-line class-level Javadoc before the first method. MCP strips all of this, returning only signatures and truncated summaries. The Javadoc-heavy pattern amplifies MCP's advantage: MCP vs Naive = 93% fewer tokens.


Go — gin (38 files, 534 records)

Feature: Add JWT authentication middleware to a gin REST API.

Why this workflow? gin is a compact but dense framework — 534 records in 38 files. The Context struct alone has 128 methods in a single 1,200-line file. This workflow shows how MCP handles "god files" through targeted get_signature lookups instead of reading 1,200 lines to find a 2-line method.

Results

| # | Developer Question | Tool | MCP | Skilled | Naive |
|---|---|---|---|---|---|
| 1 | How is middleware structured? | search("HandlerFunc middleware") | 150 | 300 | 3,200 |
| 2 | How do I read request headers? | get_signature("GetHeader") | 40 | 120 | 9,600 |
| 3 | How do I reject unauthorized requests? | get_signature("AbortWithStatusJSON") | 50 | 150 | † |
| 4 | How do I store auth data in context? | get_signature("Context.Set") | 45 | 200 | † |
| 5 | How do I chain to the next handler? | get_signature("Next") | 38 | 80 | † |
| 6 | How does route grouping work? | get_class("RouterGroup") | 450 | 600 | 1,600 |
| 7 | What error types exist? | search("Error") | 130 | 350 | 900 |
| 8 | What built-in middleware exists? | search("Logger Recovery") | 68 | 200 | ‡ |
| 9 | How does the engine start? | get_class("Engine") | 780 | 650 | ‡ |
| 10 | How do I send JSON responses? | get_signature("Context.JSON") | 40 | 120 | † |
| TOTAL | | | 1,791 | 2,770 | 15,300 |

† context.go (1,200 lines) already loaded in step 2. ‡ gin.go (400 lines) already loaded in step 1.

Naive file reads: context.go (9,600) + gin.go (3,200) + routergroup.go (1,600) + errors.go (900) = 15,300 tokens from 4 unique files.
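The first-access-only accounting can be sketched as follows. The per-file token counts come from the line above; the step-to-file mapping is taken from the table footnotes, with steps 6 and 7 assigned to routergroup.go and errors.go on the assumption that their Naive costs (1,600 and 900) identify those files. The names naive_costs, FILE_TOKENS, and STEP_FILES are illustrative.

```python
# Tokens for a full read of each file, from the Naive file reads listed above.
FILE_TOKENS = {
    "gin.go": 3_200,
    "context.go": 9_600,
    "routergroup.go": 1_600,
    "errors.go": 900,
}

# File each of the 10 steps lands on (assumed from the table and footnotes).
STEP_FILES = [
    "gin.go", "context.go", "context.go", "context.go", "context.go",
    "routergroup.go", "errors.go", "gin.go", "gin.go", "context.go",
]


def naive_costs(step_files, file_tokens):
    """Charge a file's full token cost on first access and zero afterwards."""
    loaded = set()
    costs = []
    for name in step_files:
        if name in loaded:
            costs.append(0)  # already in context, no new tokens
        else:
            loaded.add(name)
            costs.append(file_tokens[name])
    return costs


if __name__ == "__main__":
    costs = naive_costs(STEP_FILES, FILE_TOKENS)
    print(costs)
    print("total:", sum(costs))
```

Under this accounting, the order of steps changes which step gets charged but not the total, which depends only on the set of unique files touched.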

Highlighted step: How do I read request headers?

MCP returns 40 tokens — a single targeted signature:

[METHOD] gin.Context.GetHeader
  Signature: GetHeader(key string) string
  Summary: GetHeader returns value from request headers.
  File: context.go:1088

The Skilled Agent greps for GetHeader with 3 lines of context: 120 tokens. The Naive Agent reads all of context.go, 9,600 tokens (128 methods, binding helpers, cookie utilities, query parsing, multipart handling, template rendering, and serialization methods), to find a 2-line function.

This is where MCP's advantage is most dramatic: 40 tokens vs 9,600 tokens — a 240x reduction.

Key observation

gin's Context is a god struct — 128 methods in one file. This pattern is common in Go (net/http.Request, testing.T, etc.). MCP turns 1,200 lines into targeted 1-line lookups. However, for get_class("Engine") (38 members, 780 tokens), MCP returns MORE tokens than a Skilled Agent's targeted 50-line Read (650 tokens). MCP's advantage is strongest on search and signature lookups, not on full class dumps of large types.


Python — codesurface (9 files, 40 records)

Feature: Add a Rust parser to the codesurface project.

Why this workflow? This is a small codebase (9 files, 40 records) — deliberately included to show how MCP scales at the lower end. The scenario is realistic: a contributor needs to understand the parser plugin system, the base class contract, reference implementations, and the registration mechanism.

Results

| # | Developer Question | Tool | MCP | Skilled | Naive |
|---|---|---|---|---|---|
| 1 | What's the parser base class? | get_class("BaseParser") | 75 | 200 | 200 |
| 2 | How does the Go parser work? | get_class("GoParser") | 63 | 300 | 1,200 |
| 3 | What parsers already exist? | search("parse") | 225 | 200 | 6,200 |
| 4 | How do I register file extensions? | search("extension") | 113 | 250 | 400 |
| 5 | How does the TypeScript parser differ? | get_class("TypeScriptParser") | 63 | 300 | 1,200 |
| 6 | How is parser selection handled? | get_signature("get_parser") | 63 | 150 | † |
| 7 | How does the Java parser work? | get_class("JavaParser") | 63 | 300 | 1,200 |
| 8 | What's the overall architecture? | get_stats() | 88 | 300 | ‡ |
| TOTAL | | | 753 | 2,000 | 10,400 |

† __init__.py already loaded in step 4. ‡ No file equivalent.

Highlighted step: What parsers already exist?

MCP returns 225 tokens — a ranked overview of every parser in the project:

Found 10 result(s) for 'parse':

[TYPE] codesurface.parsers.python_parser.PythonParser
  Signature: class PythonParser(BaseParser)
  File: codesurface/parsers/python_parser.py:69

[TYPE] codesurface.parsers.go.GoParser
  Signature: class GoParser(BaseParser)
  File: codesurface/parsers/go.py:137

[TYPE] codesurface.parsers.java.JavaParser
  Signature: class JavaParser(BaseParser)
  File: codesurface/parsers/java.py:127

[TYPE] codesurface.parsers.csharp.CSharpParser
  Signature: class CSharpParser(BaseParser)
  File: codesurface/parsers/csharp.py:101

[TYPE] codesurface.parsers.typescript.TypeScriptParser
  Signature: class TypeScriptParser(BaseParser)
  File: codesurface/parsers/typescript.py:145

[METHOD] codesurface.parsers.get_parser(str)
  Signature: get_parser(lang: str) -> BaseParser
  File: codesurface/parsers/__init__.py:22

...

One tool call gives the contributor a complete map: all 5 parser classes, their file locations, and the get_parser() factory function. A Naive Agent would grep for "Parser" and read each of the 5 parser files (~200 lines each) = 6,200 tokens.

Key observation

For step 1 (BaseParser), both Skilled and Naive return ~200 tokens: the file is only 30 lines, so reading it whole costs the same as a targeted read. MCP still returns fewer tokens (75), but the absolute gap is small, which is consistent with MCP's advantage scaling with file size. The biggest savings come from step 3 (cross-file discovery) and steps 2/5/7 (parser reference files at ~200 lines each).


Cross-Language Analysis

Token efficiency by language

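| Language | MCP Total | Skilled Total | Naive Total | MCP vs Skilled | MCP vs Naive |
|---|---|---|---|---|---|
| C# | 1,021 | 4,453 | 11,825 | 4.4x | 11.6x |
| TypeScript | 1,451 | 4,500 | 14,550 | 3.1x | 10.0x |
| Java | 1,851 | 4,200 | 26,700 | 2.3x | 14.4x |
| Go | 1,791 | 2,770 | 15,300 | 1.5x | 8.5x |
| Python | 753 | 2,000 | 10,400 | 2.7x | 13.8x |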
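The multipliers in this table and the "% fewer" figures in the Cross-Language Results table are two views of the same totals. A short sketch of the conversion, with helper names chosen for illustration:

```python
# (MCP, Skilled, Naive) token totals per language, from the table above.
TOTALS = {
    "C#": (1_021, 4_453, 11_825),
    "TypeScript": (1_451, 4_500, 14_550),
    "Java": (1_851, 4_200, 26_700),
    "Go": (1_791, 2_770, 15_300),
    "Python": (753, 2_000, 10_400),
}


def reduction_ratio(baseline: int, mcp: int) -> float:
    """How many times larger the baseline is than the MCP total."""
    return baseline / mcp


def percent_fewer(ratio: float) -> int:
    """Convert an unrounded reduction ratio into a whole-number '% fewer'."""
    return round(100 * (1 - 1 / ratio))


if __name__ == "__main__":
    for lang, (mcp, skilled, naive) in TOTALS.items():
        rs, rn = reduction_ratio(skilled, mcp), reduction_ratio(naive, mcp)
        print(f"{lang}: {rs:.1f}x ({percent_fewer(rs)}% fewer) vs Skilled, "
              f"{rn:.1f}x ({percent_fewer(rn)}% fewer) vs Naive")
```

The conversion should use the unrounded ratio: starting from a rounded multiplier such as 1.5x for Go would give 33% rather than the 35% in the results table.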

What drives the ratio?

| Factor | Increases MCP advantage | Decreases MCP advantage |
|---|---|---|
| File size | Large files (100+ lines) → more noise to skip | Small files (< 30 lines) → reading whole file is cheap |
| Javadoc/comments | Heavy documentation in source → inflates Grep+Read | Minimal comments → source is compact |
| Decorator patterns | NestJS/Spring decorators add 3-5x line overhead | Plain function definitions → 1 line per function |
| God files | 100+ members in one file → targeted lookup saves most | 1 class per file → little waste in full read |
| Codebase size | More files → harder for agent to find the right one | Few files → agent already knows where to look |
| Overloaded methods | Java's 80-overload Preconditions → MCP filters by relevance | Unique method names → grep finds exactly 1 match |

Where MCP wins most

  1. Cross-file discovery (search tool): Finding which classes/events exist across a large codebase. The agent doesn't know file paths — MCP's FTS5 index finds them in 1 call.
  2. God file navigation (get_signature tool): Extracting 1 method from a 1,200-line file. MCP returns 40 tokens; Grep+Read returns 120-9,600 tokens.
  3. Large class reference (get_class tool): Getting 13 method signatures from a 160-line controller. MCP returns the public surface; Grep+Read includes method bodies.
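The write-up names FTS5 and BM25 but does not show the index schema, so the following is only a sketch of how such a lookup could work with Python's sqlite3 module. The table layout, the split_identifier helper, and the sample records are assumptions; it also requires an SQLite build with FTS5 enabled.

```python
import re
import sqlite3

# Sample records (fully qualified name, signature); illustrative only.
RECORDS = [
    ("com.google.common.base.Preconditions.checkArgument",
     "static void checkArgument(boolean expression)"),
    ("com.google.common.base.Preconditions.checkNotNull",
     "static T checkNotNull(T reference)"),
    ("com.google.common.cache.CacheBuilder.maximumSize",
     "CacheBuilder<K, V> maximumSize(long maximumSize)"),
]


def split_identifier(name: str) -> str:
    """Split dotted and camelCase/PascalCase names into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return " ".join(part.lower() for part in parts)


def build_index(records):
    """Create an in-memory FTS5 table whose searchable column holds split words."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE api USING fts5(fqn UNINDEXED, signature UNINDEXED, terms)"
    )
    conn.executemany(
        "INSERT INTO api (fqn, signature, terms) VALUES (?, ?, ?)",
        [(fqn, sig, split_identifier(fqn)) for fqn, sig in records],
    )
    return conn


def search(conn, query: str, limit: int = 5):
    """Return (fqn, signature) rows matching every query word, best BM25 first."""
    words = split_identifier(query).split()
    if not words:
        return []
    match = " ".join(f'"{word}"' for word in words)  # quoted terms, implicit AND
    return conn.execute(
        "SELECT fqn, signature FROM api WHERE api MATCH ? ORDER BY bm25(api) LIMIT ?",
        (match, limit),
    ).fetchall()


if __name__ == "__main__":
    for fqn, signature in search(build_index(RECORDS), "Preconditions check"):
        print(fqn, "|", signature)
```

Splitting identifiers before indexing is what lets a query word like "check" reach checkArgument; FTS5's default tokenizer would otherwise treat the whole identifier as one token.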

Where MCP wins least

  1. Small files (< 30 lines): Reading the whole file costs little more than the MCP response. For Python's BaseParser (30 lines), Skilled and Naive both cost 200 tokens, and MCP saves only 125 tokens.
  2. Large classes with many members or heavy docs: get_class("Engine") in gin returns 780 tokens (38 members). A Skilled Agent's targeted 50-line Read returns 650 tokens. MCP loses on this step.
  3. Single well-named classes: When a grep for className returns exactly 1 file and the file is small, the Skilled Agent nearly matches MCP.

Honest Assessment: Where MCP Is Insufficient

MCP returns public API surface only. There are cases in every workflow where that's not enough:

C# — CampEntryPoint (step 7)

MCP returns Start() and Dispose(). To wire a new controller, the agent needs the constructor (DI parameters) and the body of Start() (initialization sequence). Verdict: MCP saves discovery, but a follow-up Read is required.

TypeScript — AlbumController (step 2)

MCP shows all 13 methods, but the agent doesn't see the @Controller('albums') decorator, route path structure, or HTTP method annotations from the source. To understand URL routing, the agent needs the source. Verdict: MCP shows WHAT exists; the source shows HOW it's wired.

Java — CacheBuilder (not shown individually)

get_class("CacheBuilder") returns 23 methods + a massive Javadoc summary (~1,450 tokens). This is actually LARGER than a Skilled Agent's targeted Read. For classes with enormous Javadoc, MCP's class-level tool returns more than necessary. Verdict: Use get_signature for specific builder methods instead of get_class for the whole builder.

Go — Context (step 2)

If the agent uses get_class("Context") instead of targeted get_signature calls, the response is 2,500 tokens (128 methods). The Skilled Agent reading 60 lines gets 900 tokens. Verdict: For god structs, get_signature is better than get_class. MCP's advantage depends on using the right tool.

Python — BaseParser (step 1)

MCP returns 75 tokens. Reading the full 30-line file returns 200 tokens. The gap is 125 tokens — meaningful in aggregate but negligible for a single lookup. Verdict: Small codebases benefit less from MCP.


The Realistic Workflow: MCP + Targeted Read

MCP does not eliminate file reading. Across all 5 benchmarks, 10 of 48 steps (~21%) require a follow-up Read after MCP discovery:

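| Language | Total Steps | Steps Needing Follow-up Read | Follow-up Token Cost |
|---|---|---|---|
| C# | 10 | 3 | ~1,032 |
| TypeScript | 10 | 2 | ~800 |
| Java | 10 | 2 | ~600 |
| Go | 10 | 2 | ~500 |
| Python | 8 | 1 | ~200 |
| Total | 48 | 10 | ~3,132 |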

Hybrid totals

| Workflow | MCP Discovery | Follow-up Reads | Total | vs Skilled | vs Naive |
|---|---|---|---|---|---|
| MCP + Targeted Read | 6,867 | 3,132 | 9,999 | 44% fewer | 87% fewer |
| Skilled Agent | | | 17,923 | | 77% fewer |
| Naive Agent | | | 78,775 | | |

Even with follow-up reads, the hybrid approach uses 44% fewer tokens than the Skilled Agent and 87% fewer than the Naive Agent.

The key insight: MCP eliminates the 38 exploratory lookups that don't need implementation detail, and narrows the 10 that do to targeted reads of known files at known line numbers.
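A follow-up read "at known line numbers" can be sketched as parsing the File: line of an MCP response and reading only a small window around it. The regular expression follows the response format shown in the examples above; the function names and the default window size are assumptions.

```python
import re
from itertools import islice

# Matches lines such as "File: context.go:1088"; the line number is optional
# because some responses above list only a path.
FILE_REF = re.compile(
    r"^\s*File:\s*(?P<path>\S+?)(?::(?P<line>\d+))?\s*$", re.MULTILINE
)

GO_RESPONSE = """[METHOD] gin.Context.GetHeader
  Signature: GetHeader(key string) string
  Summary: GetHeader returns value from request headers.
  File: context.go:1088"""


def parse_file_ref(response: str):
    """Return (path, line) from an MCP response; line is None when absent."""
    found = FILE_REF.search(response)
    if found is None:
        return None
    line = found.group("line")
    return found.group("path"), int(line) if line else None


def line_window(line: int, before: int = 2, after: int = 20):
    """1-indexed inclusive range of lines to read around a known line."""
    return max(1, line - before), line + after


def targeted_read(path: str, line: int, before: int = 2, after: int = 20) -> str:
    """Read only the window around `line` instead of the whole file."""
    start, stop = line_window(line, before, after)
    with open(path, encoding="utf-8") as handle:
        return "".join(islice(handle, start - 1, stop))


if __name__ == "__main__":
    path, line = parse_file_ref(GO_RESPONSE)
    print(path, line, line_window(line))
```

With the defaults, the read covers 23 lines regardless of file length, which is what keeps follow-up costs to a few hundred tokens each.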


Context Window Impact

The token savings become critical at smaller context windows:

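| Window | MCP+Read % | Skilled % | Naive % | Impact |
|---|---|---|---|---|
| 8K | 125% | 224% | 985% | No strategy fits all five workflows; MCP+Read overshoots least |
| 32K | 31% | 56% | 246% | Naive agent exhausts context on research alone |
| 128K | 8% | 14% | 62% | Moderate advantage, with more room for implementation |
| 200K | 5% | 9% | 39% | Marginal: optimization rather than necessity |

At 8K context (common for smaller models and tool-use scenarios), even the hybrid workflow overshoots the window (125%) when all five workflows are summed. MCP discovery alone (6,867 tokens, about 86% of 8K) is the only total that fits, and it leaves little headroom for the agent to write code after researching. At 200K, all three fit comfortably.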
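The percentages in the table can be reproduced from the combined totals, assuming "8K" means 8,000 tokens (an inference from the figures, not something the write-up states). The fits helper and its reserve parameter are illustrative additions.

```python
# Combined research totals across all five benchmarks, from the hybrid table.
TOTALS = {"MCP+Read": 9_999, "Skilled": 17_923, "Naive": 78_775}

# Assumed window sizes: "K" is taken as 1,000 tokens.
WINDOWS = {"8K": 8_000, "32K": 32_000, "128K": 128_000, "200K": 200_000}


def utilization(tokens: int, window: int) -> int:
    """Share of the context window consumed by research, as a whole percent."""
    return round(100 * tokens / window)


def fits(tokens: int, window: int, reserve: float = 0.0) -> bool:
    """True if research fits while keeping `reserve` of the window free."""
    return tokens <= window * (1 - reserve)


if __name__ == "__main__":
    for label, window in WINDOWS.items():
        row = {name: f"{utilization(t, window)}%" for name, t in TOTALS.items()}
        print(label, row)
    # MCP discovery alone (6,867 tokens) is the only total under 8,000.
    print("MCP discovery fits 8K:", fits(6_867, WINDOWS["8K"]))
```

Passing a non-zero reserve, for example 0.5, gives a stricter check that leaves half the window free for implementation work.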


Operational Cost

MCP requires a pre-indexing step that Grep+Read does not:

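| Project | Files | Records | Parse Time | Index Time | Total |
|---|---|---|---|---|---|
| Unity game (C#) | 129 | 1,034 | 0.024s | 0.019s | 0.043s |
| gin (Go) | 38 | 534 | <0.1s | <0.1s | <0.1s |
| immich (TypeScript) | 694 | 8,344 | 0.6s | <0.1s | 0.6s |
| guava (Java) | 891 | 8,377 | 2.4s | <0.1s | 2.4s |
| codesurface (Python) | 9 | 40 | <0.1s | <0.1s | <0.1s |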

Storage is in-memory SQLite (no disk I/O). The index rebuilds on server restart and updates incrementally via reindex() — only changed files are re-parsed.
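The write-up does not say how reindex() detects changed files. One common approach is to compare content hashes, sketched below with a hypothetical IncrementalIndex class and an injected parse function; timestamps or another mechanism would work just as well.

```python
import hashlib


class IncrementalIndex:
    """Re-parse only files whose content hash changed (illustrative sketch)."""

    def __init__(self, parse):
        self._parse = parse   # callable: source text -> list of records
        self._hashes = {}     # path -> sha256 of the last parsed content
        self.records = {}     # path -> records from the last parse

    def reindex(self, files):
        """Sync with `files` (path -> source text); return the re-parsed paths."""
        # Drop entries for files that no longer exist.
        for path in set(self._hashes) - set(files):
            del self._hashes[path]
            del self.records[path]

        reparsed = []
        for path, text in files.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self._hashes.get(path) != digest:
                self._hashes[path] = digest
                self.records[path] = self._parse(text)
                reparsed.append(path)
        return sorted(reparsed)


if __name__ == "__main__":
    index = IncrementalIndex(parse=str.split)
    print(index.reindex({"a.py": "def a", "b.py": "def b"}))   # both parsed
    print(index.reindex({"a.py": "def a", "b.py": "def b2"}))  # only b.py
```

Hashing still reads every file, so it trades some I/O for skipping the parser, which the timing table suggests is the more expensive step on larger projects.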


Beyond Token Counting: Reducing Entropy in Agent Reasoning

The deeper value of a pre-indexed API is not token savings. It is determinism.

| Property | MCP | Grep+Read |
|---|---|---|
| What the LLM sees | Canonical API surface: every public member, nothing else | Variable: depends on grep pattern, context lines, file structure |
| Hallucination surface | Low: signatures are authoritative | Higher: LLM may infer behavior from partial context |
| Consistency | Same query always returns same result | Grep results vary with pattern and file ordering |
| Discovery | Ranked full-text search across all types (FTS5 + BM25) | Pattern matching on text: misses abbreviations, PascalCase splits |

Concrete example: inferring behavior from partial context

In the C# benchmark (step 2), the agent needs to know how ReportLevelCompleted works.

Grep+Read agent reads GameModeManager.cs and sees:

public void ReportLevelCompleted(LevelResult result)
{
    _currentMode?.OnLevelCompleted(result);
    UnloadGameMode();
}

The agent sees _currentMode?.OnLevelCompleted(result) and may infer that OnLevelCompleted publishes a LevelCompletedEvent — which does not exist. The actual event is LevelWonEvent, published elsewhere.

MCP agent sees only the signature:

void ReportLevelCompleted(LevelResult result)

No implementation to over-interpret. When the agent needs level-end events, it runs search("LevelWon") and gets the correct answer directly.

The critical difference is not tokens. It is inference chain reliability. The Grep+Read agent made a plausible wrong inference from true context. The MCP agent never had that gap — the tool's constrained output forced explicit disambiguation.

Measurable outcomes (not captured here)

  • Fewer hallucinated method signatures
  • Fewer incorrect parameter types
  • Fewer "let me check that file again" round-trips
  • Faster convergence to correct implementation

These require A/B testing with real agent task completion, which is a meaningful next step.