ELO Router

December 29, 2025

Overview

The ELO Router is a rating-based routing method that ranks LLMs using the Elo rating system, originally developed for chess. It converts historical performance data into pairwise comparisons and computes a global ranking. All queries are routed to the single highest-rated LLM.

Paper Reference

This router is inspired by the Elo Rating System and RouteLLM:

  • RouteLLM: Learning to Route LLMs with Preference Data

    • Ong, I., et al. (2024). arXiv:2406.18665. Published at ICLR 2025.
    • Implements sw_ranking router using weighted Elo calculation.
  • Original Elo System:

    • Elo, A. E. (1978). "The Rating of Chessplayers, Past and Present." Arco Publishing.
  • Application to LLMs:

    • Zheng, L., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS.
    • Bradley-Terry Model: Bradley, R. A., & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs." Biometrika.

How It Works

Architecture

Historical Data → Pairwise Battles → Elo Computation → Single Best Model Selection
                  (winner/loser)     (Logistic Regression MLE)

Routing Mechanism

  1. Training Phase:

    • For each query in training data, identify the best-performing LLM
    • Create pairwise "battles": winner (best LLM) vs. each loser (other LLMs)
    • Use logistic regression to estimate Elo scores via Maximum Likelihood Estimation
    • Save Elo scores to disk
  2. Inference Phase:

    • Load precomputed Elo scores
    • Always select the LLM with the highest Elo rating
    • Route ALL queries to this single model (query-independent routing)
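The inference step reduces to an argmax over the saved ratings. A minimal sketch, assuming the scores are a plain {model_name: elo_score} dictionary; `select_best_model` is a hypothetical helper for illustration, not part of the llmrouter API:

```python
def select_best_model(elo_scores: dict[str, float]) -> str:
    """Return the name of the model with the highest Elo rating."""
    if not elo_scores:
        raise ValueError("elo_scores is empty; nothing to route to")
    # The query is never consulted, so every call returns the same model.
    return max(elo_scores, key=elo_scores.get)


# Illustrative ratings only
elo_scores = {"GPT-4": 1250.0, "Claude": 1180.0, "Llama": 950.0}
best_model = select_best_model(elo_scores)
```

Because the result does not depend on the query, it can be computed once at load time and reused for every request.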

Key Characteristics

  • Global Ranking: Computes a single global rating for each LLM
  • Query-Independent: Unlike KNN/MLP/SVM, ignores query content during inference
  • Pairwise Comparisons: Based on relative performance, not absolute scores
  • Statistical Foundation: Grounded in Bradley-Terry model and MLE

Elo Computation Formula

For each pairwise battle (model A vs. model B):

P(A wins) = 1 / (1 + 10^((Rating_B - Rating_A) / 400))

The trainer uses logistic regression to find Elo ratings that maximize the likelihood of observed battle outcomes.
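The formula above can be written as a small function. This is a sketch for illustration; the name `expected_win_probability` and its keyword arguments are assumptions that mirror the SCALE and BASE constants, not the trainer's actual code:

```python
def expected_win_probability(rating_a: float, rating_b: float,
                             scale: float = 400.0, base: float = 10.0) -> float:
    """Probability that model A beats model B under the Elo formula."""
    return 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))


p_equal = expected_win_probability(1000.0, 1000.0)  # equal ratings
p_200 = expected_win_probability(1200.0, 1000.0)    # A is 200 points ahead
```

The two probabilities for a pair always sum to 1, so swapping the arguments gives the loser's expected win rate.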

Training Process

1. Build Battle Data

For each query:

  • Identify the best-performing model (winner)
  • Create battles: winner vs. all other models (losers)
  • Generate symmetric battles (A vs B and B vs A) for balanced training

Example:

Query: "Explain gravity"
Performance: GPT-4 (0.95), Claude (0.85), Llama (0.70)

Battles created:
  GPT-4 vs Claude → GPT-4 wins
  GPT-4 vs Llama  → GPT-4 wins
  Claude vs GPT-4 → Claude loses
  Llama vs GPT-4  → Llama loses
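The battle construction in this example can be sketched as follows. `build_battles` and the (model_a, model_b, a_wins) row layout are assumptions made for illustration; the trainer's real data structures may differ:

```python
def build_battles(performance: dict[str, float]) -> list[tuple[str, str, int]]:
    """Turn one query's per-model scores into (model_a, model_b, a_wins) rows."""
    # max() returns the first maximal key, so ties go to the model listed first.
    winner = max(performance, key=performance.get)
    losers = [m for m in performance if m != winner]
    battles = [(winner, loser, 1) for loser in losers]   # winner listed first
    battles += [(loser, winner, 0) for loser in losers]  # mirrored copy
    return battles


battles = build_battles({"GPT-4": 0.95, "Claude": 0.85, "Llama": 0.70})
```

Note that only winner-versus-loser pairs are produced: Claude and Llama never face each other for this query, even though their scores differ.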

2. Estimate Elo Scores

Uses logistic regression MLE to find Elo ratings that best explain battle outcomes:

  • Start every model from a baseline rating of 1000 (INIT_RATING)
  • Fit logistic regression to predict battle winners
  • Convert the fitted coefficients to Elo scores by scaling them by 400 (SCALE)
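To make the estimation step concrete, here is a dependency-free sketch. It does not use scikit-learn: it swaps in plain gradient ascent on the same Bradley-Terry log-likelihood. The function name, the (model_a, model_b, a_wins) row layout, the learning rate, the step count, and the choice to centre ratings on init_rating are all assumptions of this sketch:

```python
import math


def fit_elo_mle(battles, scale=400.0, base=10.0, init_rating=1000.0,
                learning_rate=1.0, steps=2000):
    """Fit Elo ratings to (model_a, model_b, a_wins) rows by maximum likelihood."""
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    strength = {m: 0.0 for m in models}  # rating in natural-log units

    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for a, b, a_wins in battles:
            p_a = 1.0 / (1.0 + math.exp(strength[b] - strength[a]))
            error = a_wins - p_a  # log-likelihood gradient for this battle
            grad[a] += error
            grad[b] -= error
        for m in models:
            strength[m] += learning_rate * grad[m] / len(battles)

    # Convert natural-log strengths to the Elo scale, centred on init_rating.
    mean = sum(strength.values()) / len(models)
    factor = scale / math.log(base)
    return {m: init_rating + factor * (strength[m] - mean) for m in models}


# Toy data: A beats B in three of four battles.
toy_battles = [("A", "B", 1), ("A", "B", 1), ("A", "B", 1), ("A", "B", 0)]
ratings = fit_elo_mle(toy_battles)
```

On the toy data the fitted gap between A and B is about 191 points, which corresponds to the observed 75% win rate. If one model wins every battle it appears in, the likelihood has no finite maximum, so an unregularized fit keeps widening the gap as the number of steps grows.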

3. Save Rankings

Saves Elo scores as a dictionary: {"GPT-4": 1250, "Claude": 1180, "Llama": 950}
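A minimal round trip through pickle, assuming the file holds nothing but the bare dictionary (the exact on-disk layout written by the trainer is not shown in this document). The temporary path stands in for save_model_path / load_model_path:

```python
import pickle
import tempfile
from pathlib import Path

elo_scores = {"GPT-4": 1250.0, "Claude": 1180.0, "Llama": 950.0}  # illustrative values

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "elorouter.pkl"  # stand-in for save_model_path
    with model_path.open("wb") as f:
        pickle.dump(elo_scores, f)
    with model_path.open("rb") as f:
        restored = pickle.load(f)
```

Pickle can execute arbitrary code on load, so only load .pkl files from sources you trust.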

Configuration Parameters

Training Parameters

No hyperparameters to tune! The Elo computation is deterministic given the training data.

Fixed Constants (in trainer code):

  • SCALE: 400.0 - Standard Elo scale factor
  • BASE: 10.0 - Elo probability base
  • INIT_RATING: 1000.0 - Starting rating for all models

Data Paths

Parameter             Description
---------             -----------
query_data_train      Training queries in JSONL format
routing_data_train    Historical routing performance data (query-LLM pairs with performance scores)
llm_data              LLM candidate information (models, API names, metadata)

Model Paths

Parameter          Purpose                              Usage
---------          -------                              -----
save_model_path    Where to save computed Elo scores    Training: saves {model_name: elo_score} dictionary
load_model_path    Elo scores to load for inference     Testing: path to saved .pkl file

Inference Parameters

During inference:

  • Loads Elo scores from load_model_path
  • Selects the model with the highest rating
  • Routes all queries to this single model
  • No query-specific routing decisions

CLI Usage

The ELO Router can be used via the llmrouter command-line interface:

Training

# Compute Elo rankings
llmrouter train --router elorouter --config configs/model_config_train/elorouter.yaml

# Train with quiet mode
llmrouter train --router elorouter --config configs/model_config_train/elorouter.yaml --quiet

Inference

# Route a single query (always selects highest-rated model)
llmrouter infer --router elorouter --config configs/model_config_test/elorouter.yaml \
    --query "What is the meaning of life?"

# Route queries from a file
llmrouter infer --router elorouter --config configs/model_config_test/elorouter.yaml \
    --input queries.jsonl --output results.json

# Route only (without calling LLM API)
llmrouter infer --router elorouter --config configs/model_config_test/elorouter.yaml \
    --query "Explain quantum mechanics" --route-only

Interactive Chat

# Launch chat interface
llmrouter chat --router elorouter --config configs/model_config_test/elorouter.yaml

# Launch with custom port
llmrouter chat --router elorouter --config configs/model_config_test/elorouter.yaml --port 8080

# Create a public shareable link
llmrouter chat --router elorouter --config configs/model_config_test/elorouter.yaml --share

Usage Examples

Training the ELO Router

from llmrouter.models import EloRouter, EloRouterTrainer

# Initialize router with training configuration
router = EloRouter(yaml_path="configs/model_config_train/elorouter.yaml")

# Create trainer
trainer = EloRouterTrainer(router=router, device="cpu")

# Compute Elo scores
trainer.train()
# Elo scores will be saved to the path specified in save_model_path

# View the computed rankings
print("Elo Rankings:")
for model, score in sorted(router.elo_scores.items(), key=lambda x: -x[1]):
    print(f"  {model}: {score:.2f}")

Command Line Training:

python tests/train_test/test_elorouter.py --yaml_path configs/model_config_train/elorouter.yaml

Inference: Routing Queries

from llmrouter.models import EloRouter

# Initialize router with test configuration (loads Elo scores)
router = EloRouter(yaml_path="configs/model_config_test/elorouter.yaml")

# Route a single query
query = {"query": "What is the meaning of life?"}
result = router.route_single(query)

print(f"Selected Model: {result['model_name']}")
# Note: This will ALWAYS be the same model (highest Elo rating)

Batch Routing with API Execution

from llmrouter.models import EloRouter

# Initialize router
router = EloRouter(yaml_path="configs/model_config_test/elorouter.yaml")

# Prepare batch of queries
queries = [
    {"query": "Explain quantum mechanics", "ground_truth": "..."},
    {"query": "Write a poem about AI", "ground_truth": "..."},
    {"query": "Solve x^2 + 5x + 6 = 0", "ground_truth": "..."}
]

# Route and execute (all queries go to the same best model)
results = router.route_batch(batch=queries, task_name="general")

# All queries routed to the same model
unique_models = set(r['model_name'] for r in results)
print(f"Number of unique models used: {len(unique_models)}")  # Always 1

YAML Configuration Example

Training Configuration (configs/model_config_train/elorouter.yaml):

data_path:
  query_data_train: 'data/example_data/query_data/default_query_train.jsonl'
  routing_data_train: 'data/example_data/routing_data/default_routing_train_data.jsonl'
  llm_data: 'data/example_data/llm_candidates/default_llm.json'

model_path:
  ini_model_path: ''
  save_model_path: 'saved_models/elorouter/elorouter.pkl'

metric:
  weights:
    performance: 1    # Primary criterion for determining winners
    cost: 0
    llm_judge: 0

Testing Configuration (configs/model_config_test/elorouter.yaml):

data_path:
  llm_data: 'data/example_data/llm_candidates/default_llm.json'

model_path:
  load_model_path: 'saved_models/elorouter/elorouter.pkl'

Advantages

  • Simple and Interpretable: Single global ranking that's easy to understand
  • Statistically Grounded: Based on Bradley-Terry model and MLE
  • No Hyperparameters: No tuning required, fully deterministic
  • Handles Imbalanced Comparisons: Elo naturally handles varying numbers of battles per model
  • Battle-Tested: Proven system used in chess, sports, and now LLM leaderboards
  • Fast Inference: Picking the top-rated model is a max over a small dictionary (linear in the number of models, independent of the query)

Limitations

  • Query-Agnostic: Ignores query content, always routes to the same model
  • No Specialization: Cannot leverage model strengths for specific query types
  • Single Model: Cannot distribute load or use ensembles
  • Assumes Transitivity: Assumes if A > B and B > C, then A > C (may not hold for LLMs)
  • Static Rankings: Must retrain to update Elo scores
  • No Cost-Performance Trade-off: Always chooses highest-rated model regardless of cost
  • Data Hungry: Needs sufficient pairwise comparisons for accurate rankings

When to Use ELO Router

Good Use Cases:

  • Want a simple baseline that always uses the "best" model
  • Need a global ranking of LLM capabilities
  • Have abundant training data with consistent evaluation metrics
  • Don't need query-specific routing (all queries are similar)
  • Want interpretable, explainable routing (just show the Elo ranking)

NOT Recommended When:

  • Queries have diverse types (coding, math, creative writing, etc.)
  • Need to optimize cost (Elo always picks highest-rated, often most expensive model)
  • Want to leverage specialized model strengths
  • Need to distribute load across multiple models
  • Have limited training data

Understanding Elo Scores

Interpretation

Elo Score    Meaning
---------    -------
1400+        Dominant model, wins most battles
1200-1400    Strong model, competitive
1000-1200    Average model, mixed performance
800-1000     Weak model, loses most battles
<800         Very weak model, rarely wins

Elo Difference and Win Probability

Elo Diff     Expected Win Rate
--------     -----------------
0            50%
100          64%
200          76%
400          91%

If Model A has Elo 1200 and Model B has Elo 1000 (diff = 200), Model A is expected to win ~76% of battles.
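The 76% figure follows directly from the Elo formula with the standard scale of 400 and base of 10:

```latex
P(A \text{ wins}) = \frac{1}{1 + 10^{(1000 - 1200)/400}}
                  = \frac{1}{1 + 10^{-0.5}}
                  \approx \frac{1}{1 + 0.316}
                  \approx 0.76
```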

Comparison with Other Routers

Aspect              ELO Router         KNN Router          MLP/SVM Router
------              ----------         ----------          --------------
Query-Specific      ❌ No              ✅ Yes              ✅ Yes
Training Speed      Fast               None                Medium
Inference Speed     Instant            Medium              Fast
Interpretability    High (rankings)    High (neighbors)    Low
Model Diversity     Single model       Multiple models     Multiple models
Hyperparameters     None               Few                 Many
Data Efficiency     Medium             High                Medium

Implementation Details

  • Framework: Custom implementation using scikit-learn's LogisticRegression
  • Battle Generation: Symmetric battles (A vs B and B vs A) for balanced training
  • MLE Solver: LBFGS optimizer for logistic regression
  • Output: Dictionary mapping model names to Elo scores
  • Serialization: Saved as .pkl files using pickle

Tips for Best Performance

  1. Training Data Quality:

    • Ensure performance metrics are reliable and consistent
    • Include diverse queries to avoid bias
    • Use enough queries (50+ recommended) for stable rankings
  2. Performance Metric Selection:

    • Use metric.weights.performance = 1 for accuracy-based ranking
    • Cost can be incorporated if needed, but doing so defeats the purpose of a pure performance ranking
    • Ensure metric is comparable across different query types
  3. Model Pool:

    • Works best with 3-10 models
    • Too few models → limited routing value
    • Too many models → sparse battle data per pair
  4. Retraining Strategy:

    • Retrain periodically as new data arrives
    • Monitor if model capabilities change over time
    • Update when adding new models to the pool
  5. Use as Baseline:

    • ELO Router is excellent as a baseline for comparison
    • Compare query-specific routers against ELO to measure the value of query-aware routing

Relation to Chatbot Arena

This router is directly inspired by Chatbot Arena (LMSYS):

  • Chatbot Arena uses Elo ratings to rank LLMs based on human preferences
  • Users vote on pairwise comparisons → Elo scores computed
  • Creates public LLM leaderboard

Key Difference:

  • Chatbot Arena: Human preference battles
  • ELO Router: Automated performance metric battles

Advanced Usage

Custom Elo Parameters

While the defaults work well, you can modify the Elo constants in trainer.py:

# Larger SCALE → bigger rating differences
elo_scores = compute_elo_mle(battles_df, SCALE=500.0, BASE=10.0, INIT_RATING=1500.0)

Incorporating Costs

You can modify the battle generation to consider cost-adjusted performance:

# In custom trainer
df["adjusted_performance"] = df["performance"] / (df["cost"] ** 0.5)
# Then use adjusted_performance to determine winners
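The same idea without pandas, as a hedged sketch: `pick_winner_cost_adjusted` is a hypothetical helper, the cost values are made up, and the small floor on cost is an addition of this sketch to avoid dividing by zero when a model is free:

```python
import math


def pick_winner_cost_adjusted(performance, cost, min_cost=1e-9):
    """Return (winner, adjusted_scores) using performance / sqrt(cost)."""
    adjusted = {
        model: performance[model] / math.sqrt(max(cost[model], min_cost))
        for model in performance
    }
    return max(adjusted, key=adjusted.get), adjusted


# Hypothetical scores and per-query costs
performance = {"GPT-4": 0.95, "Claude": 0.85, "Llama": 0.70}
cost = {"GPT-4": 9.0, "Claude": 4.0, "Llama": 1.0}
winner, adjusted = pick_winner_cost_adjusted(performance, cost)
```

With these numbers the cheapest model wins the battle even though its raw score is lowest, which shows how strongly the adjustment can reshape the resulting ranking.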

Multi-Metric Elo

Compute separate Elo rankings for different metrics (accuracy, speed, cost-efficiency) and combine them.
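The combination rule is left open here. One simple option, shown purely as an assumption, is a weighted average of the per-metric ratings; `combine_elo`, the metric names, and all numbers below are hypothetical:

```python
def combine_elo(rankings, weights):
    """Weighted average of per-metric Elo scores for models rated on every metric."""
    total = sum(weights[metric] for metric in rankings)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Only keep models that have a rating under every metric.
    shared = set.intersection(*(set(scores) for scores in rankings.values()))
    return {
        model: sum(weights[metric] * rankings[metric][model] for metric in rankings) / total
        for model in shared
    }


rankings = {
    "accuracy": {"GPT-4": 1250.0, "Llama": 950.0},
    "cost_efficiency": {"GPT-4": 900.0, "Llama": 1300.0},
}
weights = {"accuracy": 0.7, "cost_efficiency": 0.3}
combined = combine_elo(rankings, weights)
```

Averaging only makes sense if every per-metric ranking was computed with the same SCALE and INIT_RATING, so that the ratings are on a comparable scale.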

Related Routers

  • Largest LLM Router: Always picks the largest model (simpler heuristic)
  • Smallest LLM Router: Always picks the smallest model (cost-focused)
  • Hybrid LLM Router: Weighted combination of multiple routing strategies
  • Matrix Factorization Router: Learns query-model affinity (query-specific alternative)

For questions or issues, please refer to the main LLMRouter documentation or open an issue on GitHub.