geo-benchmark-utility

June 11, 2026

Measure how visible your brand is in LLM answers — hit rate, semantic rank (MRR), competitor share of voice, citations, and sentiment across OpenAI, Anthropic, Google, Perplexity, xAI, and OpenRouter, with bootstrap confidence intervals and a local dashboard.


Dashboard

geobench dash serves a local, read-only dashboard at http://localhost:5173.


Provider breakdown: hit rate with 95% CI, MRR, SoV, citation rate/share, latency, errors, plus answer-intent and sentiment distributions per engine.
Share of voice & citations: target vs. declared competitors across all providers, and the domains LLMs actually cite.
Query drill-down: every scored response with rank, intent, and sentiment; expand a row to read the answer, the judge's evidence quote, and per-layer scores.
Trends: per-provider movement over time for any metric, scoped to a single product so runs stay comparable.
Run comparison: baseline-vs-candidate deltas per provider, color-coded by whether the change is good.

Screenshots use seeded demo data (`bun scripts/seed-demo-runs.ts`), so no API spend is required to try the dashboard.

Install

bun install

Usage

Run a benchmark

# Estimate cost first (no API calls)
geobench estimate --product fixtures/sample.yaml --providers openai,anthropic

# Run benchmark (requires API keys in .env)
cp .env.example .env
# Edit .env with your API keys
geobench bench --product fixtures/sample.yaml --providers openai,anthropic --mode benchmark

# Resume an interrupted run
geobench bench resume --run-id <run-id>

View results

# Launch dashboard (http://localhost:5173)
geobench dash

# Compare two runs
geobench diff --run-id-1 <id1> --run-id-2 <id2>

The dashboard shows per-provider hit rate / MRR / share of voice / citation metrics with 95% bootstrap CIs, competitor share-of-voice leaderboards, sentiment and intent distributions, top cited domains, quality advisories, multi-run trends, and run-vs-run diffs. Runs are indexed into runs/index.db automatically after geobench bench; backfill older runs with geobench db reindex.

To explore the dashboard without spending anything on API calls, seed deterministic demo runs (clearly labeled with a demo- prefix):

bun scripts/seed-demo-runs.ts

Probe providers

# Verify provider API surfaces (requires API keys)
geobench probe

Database

# Run migrations
geobench db migrate

# Dry-run migrations
geobench db migrate --dry-run

# Rebuild the dashboard index from raw.jsonl artifacts (backfill old runs)
geobench db reindex

Configuration

Copy .env.example to .env and fill in your API keys:

OPENAI_API_KEY — OpenAI
ANTHROPIC_API_KEY — Anthropic Claude
GOOGLE_GENERATIVE_AI_API_KEY — Google Gemini
PERPLEXITY_API_KEY — Perplexity Sonar
XAI_API_KEY — xAI Grok
OPENROUTER_API_KEY — OpenRouter (covers Groq, Together, Mistral)

Product Spec

Create a YAML product spec (see fixtures/sample.yaml for a Korean example):

name: "Your Product Name"
aliases: ["Alias1", "Alias2"]
romanizations: ["RomanizedAlias"]
category: "product-category"
description: "Brief product description"
competitors: ["Rival A", "라이벌 B"] # required for industry-standard share of voice
cited_domains: ["yourproduct.com"]
target_languages: ["en", "ko"]
target_audience: ["consumer"]

Methodology

geobench reports two distinct metric systems that should not be conflated: GEO operational metrics (does my brand show up well in LLM answers?) and surface fidelity (does the API match the browser UI for the same query?). See docs/methodology.md for the full definitions.

GEO Operational Metrics

These metrics answer: "How visible is the target product across LLM-generated responses?"

Metric	Question it answers	Formula
Hit Rate (± 95% CI)	Was the target mentioned?	`#hits / total_queries`
MRR (± 95% CI)	At what rank was the target recommended?	`(1/N) × Σ(1/rank_i)` (non-mentions = 0)
Share of Voice	Of all brand mentions (target + competitors), what fraction is mine?	`target_mentions / (target_mentions + competitor_mentions)`
Provider Hit Share	What fraction of cross-provider hits comes from this provider?	`hits(p) / total_hits_all_providers`
Citation Rate (± 95% CI)	Was my domain cited as a source?	`#queries_with_target_domain / total_queries`
Citation Share	How much of the provider's citation budget do I own?	`target_citations / all_citations`
Sentiment	How favorably am I mentioned?	judge-labeled `positive/neutral/negative` distribution
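
The formulas in the table can be sketched in a few lines of TypeScript. This is a minimal illustration, not geobench's implementation: the `ScoredResponse` type and its field names are hypothetical, it assumes every mention of the target receives a rank, and returning null when nobody is mentioned at all is an assumption of this sketch.

```typescript
// One scored response. Field names are illustrative, not geobench's schema.
type ScoredResponse = {
  rank: number | null;        // 1-based rank of the target; null = not mentioned
  targetMentions: number;     // mentions of the target brand in the answer
  competitorMentions: number; // mentions of declared competitors in the answer
};

// Hit Rate: #hits / total_queries
function hitRate(rows: ScoredResponse[]): number {
  if (rows.length === 0) return 0;
  return rows.filter((r) => r.rank !== null).length / rows.length;
}

// MRR: (1/N) × Σ(1/rank_i), where a non-mention contributes 0
function mrr(rows: ScoredResponse[]): number {
  if (rows.length === 0) return 0;
  const sum = rows.reduce((acc, r) => acc + (r.rank === null ? 0 : 1 / r.rank), 0);
  return sum / rows.length;
}

// Share of Voice: target_mentions / (target_mentions + competitor_mentions).
// Returns null (unknown) rather than 0 when no competitor set is declared.
function shareOfVoice(rows: ScoredResponse[], hasCompetitorSet: boolean): number | null {
  if (!hasCompetitorSet) return null;
  const target = rows.reduce((acc, r) => acc + r.targetMentions, 0);
  const competitors = rows.reduce((acc, r) => acc + r.competitorMentions, 0);
  const total = target + competitors;
  // Assumption: with zero brand mentions overall, SoV is also unknown.
  return total === 0 ? null : target / total;
}
```

For four responses ranked 1, not mentioned, 2, not mentioned, `hitRate` gives 2/4 = 0.5 and `mrr` gives (1 + 0.5) / 4 = 0.375, which shows why the two numbers are read together.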

Rate metrics carry seeded 1,000-resample bootstrap 95% confidence intervals so single-run point estimates are never over-read (LLM answers are stochastic — see "Don't Measure Once", arXiv:2604.07585). Share of Voice requires a competitors list in the spec; runs without one get a missing_competitor_set quality advisory and report SoV as unknown (null), never 0.
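
A seeded percentile bootstrap for a rate metric might look like the sketch below. It is an illustration under stated assumptions, not geobench's code: the function names, the default seed, the choice of a simple LCG as the random generator, and the percentile method for the interval are all hypothetical. Only the 1,000 resamples and the 95% level come from the text above.

```typescript
// Seeded pseudo-random generator (a classic 32-bit LCG) so reruns give identical intervals.
// The generator choice is an assumption; the text only says the bootstrap is seeded.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // uniform in [0, 1)
  };
}

type Interval = { point: number; lo: number; hi: number };

// Percentile bootstrap CI for the mean of per-query outcomes (1 = hit, 0 = miss).
function bootstrapCI(outcomes: number[], resamples = 1000, seed = 42, level = 0.95): Interval {
  const n = outcomes.length;
  if (n === 0) throw new Error("need at least one outcome");
  const rand = seededRandom(seed);
  const point = outcomes.reduce((a, b) => a + b, 0) / n;

  // Resample n outcomes with replacement and record each resample's mean.
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += outcomes[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Take the (1 - level) / 2 and (1 + level) / 2 percentiles of the resampled means.
  const alpha = (1 - level) / 2;
  const lo = means[Math.floor(alpha * resamples)];
  const hi = means[Math.min(resamples - 1, Math.ceil((1 - alpha) * resamples) - 1)];
  return { point, lo, hi };
}
```

Because the generator is seeded, calling `bootstrapCI` twice on the same outcomes returns the same interval, which keeps run-to-run comparisons from shifting due to resampling noise alone.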

MRR uses semantic-first judging: a gpt-4o LLM judge reads each response and assigns a semantic endorsement rank (with answer_intent ∈ recommendation/comparison/neutral_info/unknown). Evidence quotes must be exact substrings of the response (literal-match validation). A deterministic list parser runs in parallel as audit-only calibration (deterministic_rank, deterministicAgreementRate); it never overrides the judge. Read MRR alongside Hit Rate, because MRR alone misleads when hits are rare.
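
The two checks described here, literal-match validation of evidence quotes and the judge-versus-parser agreement rate, can be sketched as follows. The `JudgedResponse` type and both function names are hypothetical, and treating "agreement" as exact rank equality (including both being null) is an assumption; the actual definition lives in docs/methodology.md.

```typescript
// Illustrative shape only; not geobench's stored schema.
type JudgedResponse = {
  response: string;                 // full LLM answer text
  evidenceQuote: string;            // quote the judge cites for its rank
  judgeRank: number | null;         // semantic endorsement rank from the judge
  deterministicRank: number | null; // rank from the list parser (audit only)
};

// Literal-match validation: the quote must appear verbatim in the response.
function isLiteralEvidence(response: string, quote: string): boolean {
  return quote.length > 0 && response.indexOf(quote) !== -1;
}

// Share of responses where the parser and the judge produce the same rank.
// The parser never overrides the judge; this number is calibration only.
function agreementRate(rows: JudgedResponse[]): number | null {
  if (rows.length === 0) return null;
  const agree = rows.filter((r) => r.judgeRank === r.deterministicRank).length;
  return agree / rows.length;
}
```

A paraphrased quote fails `isLiteralEvidence` even if it is faithful to the answer, which is the point of the check: the judge cannot cite text the model never produced.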

Surface Fidelity

⚠️ Surface Fidelity is NOT a GEO performance metric. It checks whether an API response matches the same provider's web/browser response for a given query — a consistency check, not a brand-visibility KPI.

Fidelity = 0.6 × Jaccard(citations) + 0.4 × cosine(answer embeddings)

Pass threshold: median provider Jaccard ≥ 0.70 AND median provider cosine ≥ 0.85.
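
As a worked illustration of the fidelity formula and pass threshold, here is a minimal TypeScript sketch. All function names are hypothetical. It assumes "median provider" means the median over one provider's per-query values, and that two empty citation lists count as a perfect match; neither is confirmed by the text above.

```typescript
// Deduplicate a small list (citation lists are short, so O(n^2) is fine here).
function unique(xs: string[]): string[] {
  return xs.filter((x, i) => xs.indexOf(x) === i);
}

// Jaccard similarity of two citation lists: |A ∩ B| / |A ∪ B|.
function jaccard(a: string[], b: string[]): number {
  const ua = unique(a);
  const ub = unique(b);
  if (ua.length === 0 && ub.length === 0) return 1; // assumption: empty vs. empty matches
  const shared = ua.filter((x) => ub.indexOf(x) !== -1).length;
  return shared / (ua.length + ub.length - shared);
}

// Cosine similarity of two answer embeddings of equal dimension.
function cosine(u: number[], v: number[]): number {
  if (u.length !== v.length) throw new Error("embedding dimensions differ");
  let dot = 0, nu = 0, nv = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    nu += u[i] * u[i];
    nv += v[i] * v[i];
  }
  return nu === 0 || nv === 0 ? 0 : dot / Math.sqrt(nu * nv);
}

// Fidelity = 0.6 × Jaccard(citations) + 0.4 × cosine(answer embeddings)
function fidelity(citationJaccard: number, answerCosine: number): number {
  return 0.6 * citationJaccard + 0.4 * answerCosine;
}

function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("median of empty list");
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Pass: median Jaccard ≥ 0.70 AND median cosine ≥ 0.85.
function passesFidelity(jaccards: number[], cosines: number[]): boolean {
  return median(jaccards) >= 0.7 && median(cosines) >= 0.85;
}
```

Note that the pass check applies both thresholds separately rather than to the blended score, so a high cosine cannot compensate for poor citation overlap.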

Run via geobench triangulate ... to compare API vs CLI vs browser surfaces.

Profiling unknown products

Most products aren't well-known to LLMs. If you run geobench bench on a niche product with only category: "AI agent skills", the synthesizer generates generic queries like "best AI tools" — useless for measuring real GEO visibility. Profiling fixes this by extracting a use-case-driven understanding of what user problems the product actually solves.

The profiler fetches your product's public pages, extracts use cases with evidence quotes, sanitizes out any identifying information, and writes an enriched_profile block back to your spec yaml. The synthesizer then uses those use cases as its primary axis instead of category.

⚠️ Leakage guard: the synthesizer LLM never sees the product's name, aliases, romanizations, or domains. Only anonymized use-case scenarios reach the synthesis prompt. See docs/methodology.md — Part C for the full sanitization pipeline.
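
To make the leakage guard concrete, the sketch below redacts product identifiers from a piece of text before it would reach a synthesis prompt. It is a hypothetical illustration only: `redactIdentifiers`, `escapeRegExp`, and the `[PRODUCT]` placeholder are invented for this example, and the real pipeline described in docs/methodology.md may work quite differently.

```typescript
// Escape regex metacharacters so identifiers such as "acme.io" match literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every occurrence of a name, alias, romanization, or domain
// (case-insensitive) with a neutral placeholder.
function redactIdentifiers(
  text: string,
  identifiers: string[],
  placeholder = "[PRODUCT]",
): string {
  // Longest first, so "Acme Cloud" is replaced before the shorter "Acme".
  const ordered = identifiers
    .filter((id) => id.length > 0)
    .slice()
    .sort((a, b) => b.length - a.length);

  let out = text;
  for (const id of ordered) {
    // A replacer function avoids "$" in the placeholder being treated specially.
    out = out.replace(new RegExp(escapeRegExp(id), "gi"), () => placeholder);
  }
  return out;
}
```

With identifiers `["Acme", "Acme Cloud", "acme.io"]`, the text `"Try Acme Cloud at acme.io or Acme"` becomes `"Try [PRODUCT] at [PRODUCT] or [PRODUCT]"`. Simple string replacement like this will miss misspellings and transliterations, which is one reason to treat it as a sketch rather than a complete guard.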

The profiling flow

1. Add discovery_sources to your spec yaml.

name: "Your Product Name"
aliases: ["Alias1"]
cited_domains: ["yourproduct.com"]
target_languages: ["en", "ko"]
discovery_sources:
  - "https://github.com/your-org/your-repo"
  - "https://yourproduct.com"
description: "Optional: a sentence or two about what the product does. Richer context improves use-case extraction."

2. Run the profiler.

geobench profile fixtures/your-spec.yaml

The profiler fetches each URL, extracts use cases with evidence quotes, validates quotes against source content, sanitizes product identifiers, and writes the result back to the yaml.

3. Inspect the new enriched_profile block.

enriched_profile:
  generated_at: "2026-05-07T10:00:00Z"
  profiler_model: "gpt-4o"
  value_proposition: "Helps users find and operate specialized AI-agent capabilities..."
  use_cases:
    - problem_statement: "When a Claude Code user wants to add domain-specific real-world task capabilities without writing custom integrations."
      audience: "AI coding agent users seeking ready-made skills"
      evidence_quotes:
        - "real-world task"
        - "AI agents"
      confidence: 0.9
      language: "en"
    - problem_statement: "한국에서 일상 업무 자동화를 원하지만 영어 위주 AI 도구로는 한계가 있을 때..."
      audience: "Korean-speaking productivity users"
      evidence_quotes:
        - "한국 실생활"
      confidence: 0.85
      language: "ko"

Check that you have 5–10 use cases, each with an anonymized problem_statement (no product name), and that the evidence quotes look like real phrases from your source pages.
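
The manual checklist above could also be run as a small script. The sketch below is hypothetical: the `UseCase` type mirrors the fields shown in the example block, but `reviewProfile` is not part of geobench, and it assumes you have already parsed the yaml and collected your source page text yourself.

```typescript
// Mirrors the use_cases entries shown in the enriched_profile example.
type UseCase = {
  problem_statement: string;
  audience: string;
  evidence_quotes: string[];
  confidence: number;
  language: string;
};

// Returns a list of human-readable problems; an empty list means the checks passed.
function reviewProfile(
  useCases: UseCase[],
  identifiers: string[], // product name, aliases, romanizations, domains
  sourceText: string,    // concatenated text of the discovery_sources pages
): string[] {
  const problems: string[] = [];

  // Check 1: 5 to 10 use cases.
  if (useCases.length < 5 || useCases.length > 10) {
    problems.push(`expected 5-10 use cases, found ${useCases.length}`);
  }

  useCases.forEach((uc, i) => {
    // Check 2: the problem statement must not name the product.
    const statement = uc.problem_statement.toLowerCase();
    for (const id of identifiers) {
      if (id.length > 0 && statement.indexOf(id.toLowerCase()) !== -1) {
        problems.push(`use case ${i + 1}: problem_statement contains "${id}"`);
      }
    }
    // Check 3: every evidence quote should appear verbatim in the source pages.
    for (const quote of uc.evidence_quotes) {
      if (sourceText.indexOf(quote) === -1) {
        problems.push(`use case ${i + 1}: quote not found in sources: "${quote}"`);
      }
    }
  });

  return problems;
}
```

A script like this only flags mechanical problems; whether the use cases actually describe what the product does still needs a human read.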

4. Run bench as usual.

geobench bench --product fixtures/your-spec.yaml --providers openai --tier cheap

The synthesizer now generates queries shaped around real user scenarios rather than category keywords.

Failure hint

If you run geobench bench on a spec that has discovery_sources but no enriched_profile, the command fails with:

Error: spec has discovery_sources but no enriched_profile. Run `geobench profile` first.

For a full step-by-step walkthrough including dry-run, cost estimation, and troubleshooting, see docs/runbooks/profile-workflow.md.

Multi-Surface Setup

Model Tiers

Use --tier to select model quality:

cheap (default): fast/cheap models (gpt-4o-mini, claude-haiku, gemini-flash)
consumer: flagship models matching web UI (gpt-4o, claude-sonnet-4-5, gemini-2.5-pro)
premium: best available (o3, claude-opus-4-5)

geobench bench --product fixtures/sample.yaml --tier consumer --providers openai,anthropic

CLI Agent Adapters

First probe which CLI agents are available:

geobench probe-agents

Install CLI agents as needed:

Claude Code: npm install -g @anthropic-ai/claude-code then claude login
Codex: npm install -g @openai/codex then codex login
Gemini CLI: npm install -g @google/gemini-cli then gemini login
Hermes/Grok: see xAI documentation

Run with CLI agents:

geobench bench --product fixtures/sample.yaml \
  --providers cli:claude-code,cli:codex,api:openai \
  --tier consumer --shots 10

Cross-Validation Triangulation

Compare API vs CLI vs browser surfaces:

geobench triangulate \
  --query-set fixtures/triangulation-frozen-50.yaml \
  --providers api:openai,api:anthropic,cli:claude-code \
  --baseline api:anthropic \
  --out report.json

Browser Automation

ToS Warning: Browser automation may violate provider terms of service. Use at your own risk with a secondary/burner account. Never use your primary account.

Supported targets:

Perplexity (most permissive): browser:perplexity

Requires explicit risk acknowledgment:

geobench bench --product fixtures/sample.yaml \
  --providers browser:perplexity \
  --i-accept-browser-automation-risk

Set BROWSER_PROFILE_DIR to your Chrome profile directory (must be logged in to the target site).

System Prompts

System prompts improve API fidelity by replicating web UI behavior.

Fetch system prompts (not stored in git):

# Official Anthropic prompt (safe)
bun run scripts/refresh-system-prompts.ts --id anthropic-published-claude-2025-Q1

# Leaked prompts (unofficial, use at your own risk)
bun run scripts/refresh-system-prompts.ts --id openai-leaked-chatgpt-2025-Q1

Disclaimer: Leaked system prompts are unofficial and may be outdated. Providers may detect and rotate them. Use for research purposes only.

Run with system prompts:

geobench bench --product fixtures/sample.yaml \
  --system-prompt-set anthropic-published-claude-2025-Q1 \
  --providers anthropic