LAP Benchmark

March 2, 2026

View Full Benchmark Report -- 500 runs, 50 specs, 5 formats

This benchmark measures how well AI coding agents perform when given LAP-compressed API documentation instead of the original spec formats.

Benchmark Matrix

  • 50 production API specs across 5 formats (OpenAPI, AsyncAPI, GraphQL, Postman, Protobuf)
  • 5 documentation tiers (4 compression levels + no-doc baseline)
  • 2 tasks per spec, so 50 specs x 5 tiers x 2 tasks = 500 runs per model
  • Automated scoring: endpoint identification 60%, parameter accuracy 30%, code quality 10%
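The run count follows directly from the matrix above. As a minimal sketch (the spec and tier identifiers below are hypothetical placeholders, not the names used in registry/), the full set of runs can be enumerated like this:

```python
from itertools import product

# Hypothetical identifiers for illustration; the real registry may name these differently.
FORMATS = ["openapi", "asyncapi", "graphql", "postman", "protobuf"]
TIERS = ["none", "pretty", "minified", "lap_standard", "lap_lean"]
SPECS_PER_FORMAT = 10
TASKS_PER_SPEC = 2


def build_run_matrix():
    """Return one dict per (spec, tier, task) combination."""
    specs = [
        f"{fmt}-{i:02d}"
        for fmt in FORMATS
        for i in range(1, SPECS_PER_FORMAT + 1)
    ]
    tasks = range(1, TASKS_PER_SPEC + 1)
    return [
        {"spec": spec, "tier": tier, "task": task}
        for spec, tier, task in product(specs, TIERS, tasks)
    ]


runs = build_run_matrix()
print(len(runs))  # 50 specs x 5 tiers x 2 tasks = 500
```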

Documentation Tiers

Tier          Description
None          No documentation provided (prior-knowledge baseline)
Pretty        Original spec, properly formatted
Minified      Whitespace and comments stripped
LAP Standard  Full LAP format with descriptions
LAP Lean      LAP format, types only (maximum compression)

Spec Coverage

OpenAPI (10)

Figma, Stripe, Twilio, GitHub REST, DigitalOcean, Slack, Spotify, Box, Plaid, Resend

AsyncAPI (10)

Streetlights, Slack RTM, Adeo Kafka, Social Media, Gitter Streaming, Gemini WebSocket, Kraken WebSocket, Correlation ID, Operation Security, RPC Server

GraphQL (10)

GitHub, SWAPI, Yelp, Shopify, Artsy, Linear, Saleor, Elasticsearch, Coral, Unraid

Postman (10)

Twilio, Postman Echo, Adobe, SAP, Stripe, Azure DevOps, Auth0, Braintree, InfluxDB, Akeneo

Protobuf / gRPC (10)

Google Storage, Pub/Sub, Vision, Data Catalog, Translate, Spanner, Firestore, Talent, Language, Billing

Project Structure

registry/       Spec definitions + task manifests with ground truth
sources/        Raw specs fetched from GitHub
compiled/       200 doc variants (50 specs x 4 tiers)
harness/        Benchmark runner, executor, scorer
prompts/        Agent prompt template
scripts/        Compilation, validation, analysis
results/        Benchmark results and reports

Quick Start

pip install tiktoken pyyaml   # dependencies

# Compile all doc variants
python scripts/compile_variants.py

# Validate registry + manifests
python scripts/validate_registry.py

# Pilot run (7 specs, all tiers)
python -m harness.runner --pilot

# Full run (50 specs, all tiers)
python -m harness.runner --full

Scoring

Each run produces a score from 0.0 to 1.0 across three components:

  • Endpoint identification (60%) - Did the agent call the correct API endpoint?
  • Parameter accuracy (30%) - Did the agent use the right parameters? Matched via word-boundary detection in structured CALL blocks and code blocks only.
  • Code quality (10%) - Does the generated Python code reference the correct endpoints and parameters?
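To make the weighting concrete, here is a minimal sketch of how the three components could combine into one score. It assumes each component has already been scored on a 0.0 to 1.0 scale; the function and key names are illustrative and not taken from harness/.

```python
# Weights as stated in the Scoring section.
WEIGHTS = {"endpoint": 0.6, "params": 0.3, "code": 0.1}


def composite_score(endpoint: float, params: float, code: float) -> float:
    """Combine per-component scores (each in [0.0, 1.0]) into one weighted score."""
    components = {"endpoint": endpoint, "params": params, "code": code}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0.0, 1.0], got {value}")
    return sum(WEIGHTS[name] * value for name, value in components.items())


# Correct endpoint, half the parameters right, code references everything:
print(round(composite_score(endpoint=1.0, params=0.5, code=1.0), 2))  # 0.85
```

Because endpoint identification carries 60% of the weight, a run that picks the wrong endpoint can score at most 0.4 under this scheme.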

Confounding Variable Controls

  • No-doc baseline - The None tier measures what the model already knows without any documentation
  • Neutral filenames - All docs delivered as api_docs.txt (no tier or format leakage)
  • Business-language tasks - Task descriptions avoid endpoint-revealing technical terms
  • Python only - Eliminates language choice as a variable
  • No library hints - Prompt says "appropriate libraries", not specific package names
  • Word-boundary param matching - Parameters scored only in structured sections, not free prose
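The word-boundary control can be illustrated with a short sketch. It assumes structured sections are fenced code blocks plus lines beginning with `CALL`; the actual CALL block format and the scorer's function names are not specified here, so treat both as hypothetical.

```python
import re

# The fence marker is built indirectly so this example can itself sit inside a fence.
FENCE = "`" * 3
CODE_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)
# Assumption: a structured call is a single line starting with the word CALL.
CALL_LINE_RE = re.compile(r"^CALL\b.*$", re.MULTILINE)


def structured_sections(response: str) -> str:
    """Keep only code-block bodies and CALL lines; free prose is dropped."""
    parts = CODE_BLOCK_RE.findall(response) + CALL_LINE_RE.findall(response)
    return "\n".join(parts)


def param_accuracy(expected: list[str], response: str) -> float:
    """Fraction of expected parameter names found as whole words in structured text."""
    if not expected:
        return 1.0
    scope = structured_sections(response)
    hits = sum(
        1 for name in expected
        if re.search(rf"\b{re.escape(name)}\b", scope)
    )
    return hits / len(expected)
```

With this approach, a parameter mentioned only in explanatory prose earns no credit, and a short name such as `id` does not match inside `customer_id`, because the underscore counts as a word character for `\b`.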

Status

Pilot completed: 70 runs across 7 specs, all 5 tiers. Full benchmark run pending.