LLM Benchmark Costco 🛒

March 11, 2026

A curated, searchable database of 378 LLM evaluation benchmarks across 10 capability dimensions, with inline PDF reading, Mermaid build flowcharts, a bilingual UI, dark mode, neon glow effects, and automated CI/CD.

๐ŸŒ Live Demo ยท ๐Ÿ“Š Browse Benchmarks ยท ๐Ÿค Contribute



Why LLM Benchmark Costco?

| Feature | Costco | PapersWithCode | HuggingFace Datasets | arXiv Search |
|---|---|---|---|---|
| Curated LLM benchmarks only | ✅ | ❌ (all ML) | ❌ (all datasets) | ❌ |
| Inline PDF reading | ✅ | ❌ | ❌ | ❌ |
| Build process flowcharts | ✅ | ❌ | ❌ | ❌ |
| Multi-dim filtering (year/difficulty/openness) | ✅ | Partial | Partial | ❌ |
| Bilingual (EN/ZH) | ✅ | ❌ | ❌ | ❌ |
| Related benchmarks & family lineage | ✅ | ❌ | ❌ | ❌ |
| Dark mode with neon glow effects | ✅ | ❌ | ❌ | ❌ |
| Automated CI/CD deployment | ✅ | ❌ | ❌ | ❌ |

Features

  • 378 Benchmarks across 10 capability dimensions: Agent Capability (71), General Language (39), Multimodal (72), Code (40), Science & Reasoning (18), Safety & Alignment (24), Medical & Health (58), and more.
  • Neon Glow & Shimmer Effects: Interactive neon glow effect on card hover and a subtle shimmer animation on the logo in dark mode.
  • Inline PDF Reading: Click any card to open the details drawer and read the full paper without leaving the page. Most entries embed the original arXiv PDF directly.
  • Build Process Flowcharts: Over 200 benchmarks include Mermaid-rendered diagrams explaining exactly how the dataset was constructed. Now with fullscreen mode for complex flowcharts.
  • Powerful Filtering: Filter by L1 capability category, year (including 2025/2026 latest), difficulty level (Basic → Frontier), and data openness (Public / Partly / In-house).
  • Family & Lineage: Explore benchmark families (e.g., MMLU, GAIA, SWE-bench) and related benchmarks to understand the evaluation landscape.
  • Bilingual UI: Full English and Chinese interface with bilingual data fields.
  • Automated CI/CD: GitHub Actions automatically validate and deploy updates to GitHub Pages when benchmarks.json is changed.
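The multi-dimensional filtering described above can be sketched in TypeScript as follows. This is a minimal illustration, not the project's actual useBenchmarks implementation: the Benchmark shape, its field names, and the filterBenchmarks helper are all assumptions made for the example, and the real schema is the one documented in CONTRIBUTING.md.

```typescript
// Hypothetical entry shape; the real fields are defined by the project's schema.
interface Benchmark {
  name: string;
  category: string; // L1 capability category, e.g. "Code"
  year: number;
  difficulty: string; // e.g. "Basic" or "Frontier"
  openness: "Public" | "Partly" | "In-house";
}

// Each filter is optional; an omitted filter matches every entry.
interface BenchmarkFilter {
  category?: string;
  year?: number;
  difficulty?: string;
  openness?: Benchmark["openness"];
}

function filterBenchmarks(items: Benchmark[], f: BenchmarkFilter): Benchmark[] {
  return items.filter(
    (b) =>
      (f.category === undefined || b.category === f.category) &&
      (f.year === undefined || b.year === f.year) &&
      (f.difficulty === undefined || b.difficulty === f.difficulty) &&
      (f.openness === undefined || b.openness === f.openness),
  );
}

// Made-up sample data for illustration only.
const sample: Benchmark[] = [
  { name: "example-a", category: "Code", year: 2025, difficulty: "Frontier", openness: "Public" },
  { name: "example-b", category: "Code", year: 2024, difficulty: "Basic", openness: "Partly" },
  { name: "example-c", category: "Multimodal", year: 2025, difficulty: "Basic", openness: "Public" },
];

// Combine two filters: only entries matching both are kept.
const result = filterBenchmarks(sample, { category: "Code", year: 2025 });
console.log(result.map((b) => b.name));
```

Treating every filter as optional keeps the function composable: the UI can pass only the controls the user has actually set, and an empty filter object returns the full list.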

Quick Start

# Install dependencies
pnpm install

# Local development
pnpm dev

# Build for GitHub Pages
pnpm build:ghpages

Deployment

This project uses GitHub Actions for automated deployment. Any push to the main branch that includes changes to client/public/benchmarks.json will trigger a new build and deployment to the gh-pages branch.

A daily cron job also runs to sync any external changes to benchmarks.json.

Manual Deployment

If you need to deploy manually:

  1. Fork or clone this repository
  2. Go to Settings → Pages
  3. Set Source to Deploy from a branch → gh-pages
  4. Run pnpm build:ghpages && npx gh-pages -d dist-ghpages to deploy

Access at: https://<username>.github.io/llm-benchmark-costco/

Sub-path configuration: If deploying under a sub-path, set base: '/your-repo-name/' in vite.ghpages.config.ts.
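As a rough sketch, a sub-path configuration in vite.ghpages.config.ts might look like the following. defineConfig, base, and build.outDir are standard Vite options; the dist-ghpages output directory is inferred from the manual deployment command above, and the rest of the project's real config (plugins, aliases, root) is not shown here.

```typescript
import { defineConfig } from "vite";

export default defineConfig({
  // Should start and end with "/" and contain no stray whitespace.
  base: "/your-repo-name/",
  build: {
    // Assumed to match the directory passed to `gh-pages -d`.
    outDir: "dist-ghpages",
  },
});
```

With base set this way, Vite prefixes asset URLs with the repository name, which is what a project page served from https://<username>.github.io/your-repo-name/ needs.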

Updating Benchmark Data

The data lives in client/public/benchmarks.json. Before updating, read CONTRIBUTING.md for the complete workflow covering data schema, validation, and CI process.
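To give a feel for the kind of checks a validation step can perform, here is a minimal TypeScript sketch. It is not a port of scripts/validate_benchmarks.py: the required field names (id, name, year) and the validateEntries helper are hypothetical, and the authoritative schema is the one described in CONTRIBUTING.md.

```typescript
// Hypothetical required fields; consult CONTRIBUTING.md for the real schema.
const REQUIRED_FIELDS = ["id", "name", "year"] as const;

type Entry = Record<string, unknown>;

// Returns a list of human-readable problems; an empty list means the data passed.
function validateEntries(entries: Entry[]): string[] {
  const errors: string[] = [];
  const seenIds = new Set<unknown>();

  entries.forEach((entry, index) => {
    // Check that every required field is present and non-empty.
    for (const field of REQUIRED_FIELDS) {
      const value = entry[field];
      if (value === undefined || value === null || value === "") {
        errors.push(`entry ${index}: missing "${field}"`);
      }
    }
    // Flag ids that have already appeared earlier in the list.
    const id = entry["id"];
    if (id !== undefined) {
      if (seenIds.has(id)) {
        errors.push(`entry ${index}: duplicate id "${String(id)}"`);
      }
      seenIds.add(id);
    }
  });

  return errors;
}

// Made-up data: the second entry lacks a year, the third reuses an id.
const problems = validateEntries([
  { id: "a", name: "Example A", year: 2025 },
  { id: "b", name: "Example B" },
  { id: "a", name: "Example C", year: 2024 },
]);
console.log(problems);
```

Collecting all problems into a list, rather than failing on the first one, lets a contributor fix every issue in a single pass before pushing.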

Tech Stack

| Layer | Technology |
|---|---|
| Frontend | React 19 + TypeScript |
| Styling | Tailwind CSS 4 |
| Build | Vite 7 |
| Routing | Wouter |
| CI/CD | GitHub Actions |
| Icons | Lucide React |
| Diagrams | Mermaid |
| Deployment | GitHub Pages |

Project Structure

llm-benchmark-costco/
├── .github/workflows/              # GitHub Actions CI/CD
│   ├── ci.yml                      # PR validation
│   ├── deploy.yml                  # Deploy on data change
│   └── sync-and-deploy.yml         # Daily sync
├── client/
│   ├── public/
│   │   └── benchmarks.json         # 378 benchmark entries
│   └── src/
│       ├── components/
│       │   ├── BenchmarkCard.tsx   # Card component with neon glow
│       │   ├── BenchmarkDrawer.tsx # Detail drawer + PDF + flowchart
│       │   ├── FilterBar.tsx       # Filter controls
│       │   └── Navbar.tsx          # Top navigation with logo shimmer
│       ├── contexts/
│       │   └── LangContext.tsx     # i18n (EN/ZH)
│       ├── hooks/
│       │   └── useBenchmarks.ts    # Data loading & filtering
│       └── types/
│           └── benchmark.ts        # TypeScript types
├── scripts/
│   └── validate_benchmarks.py      # Data validation script
├── vite.ghpages.config.ts          # GitHub Pages build config
└── README.md

Contributing

We welcome contributions! The easiest way to contribute is to submit a new benchmark via GitHub Issues using the Submit New Benchmark template. No coding is required.

For code contributions, please read CONTRIBUTING.md.

Contributors

License

MIT