Chat Model Selection Strategy

March 13, 2026 · View on GitHub

Why different models are used for different tasks in Listopia's chat system.


Quick Reference Table

TaskModelLatencyWhy This Model
Intent + Complexity Detectiongpt-4.1-nano~2sFast classification, low cost
Question Generation (Pre-creation)gpt-4.1-nano~1-2sTemplate-based, speed is priority
List Refinement Questionsgpt-5~2-3sReliability > speed (critical UX)
General Chat Responsesgpt-5-mini~2-3sBalanced capability/speed
Parameter Extractiongpt-5-nano~1sStructured parsing needed
Tool Calling (Advanced)gpt-5-mini~2-3sTools need full reasoning

Model Hierarchy in Listopia

Speed ──────────────────────────────────────► Capability
  ↓                                              ↑
gpt-4.1-nano (Fast, cheap, simple)
  ├─ Use for: Classification, templates
  ├─ Speed: ~0.5-2s
  └─ Cost: Lowest

gpt-5-nano (Fast, capable)
  ├─ Use for: Extraction, detection
  ├─ Speed: ~1-2s
  └─ Cost: Low-medium

gpt-5-mini (Balanced)
  ├─ Use for: General conversation, tool calling
  ├─ Speed: ~2-3s
  └─ Cost: Medium

gpt-5 (Full capability)
  ├─ Use for: Critical features, extended reasoning
  ├─ Speed: ~2-4s
  └─ Cost: Higher

Why gpt-4.1-nano for Intent Detection?

The Task

Classify user intent into categories:

  • Is this a list creation request?
  • Is it complex or simple?
  • What parameters are mentioned?
  • What's the confidence level?

Why gpt-4.1-nano (Not gpt-5-nano)?

1. Task Type: Classification, Not Reasoning

Intent detection is CLASSIFICATION:

INPUT: "Help me plan a US roadshow"

CLASSIFY:
✓ Intent? → "create_list"
✓ Complex? → true (multi_location indicator)
✓ Domain? → "event"

NOT reasoning:

INPUT: "Help me plan a US roadshow"

REASON ABOUT:
✗ Why would they want a roadshow?
✗ What are their hidden objectives?
✗ What would make this successful?
✗ What are they really trying to solve?

Result: gpt-4.1-nano is perfect for classification tasks.

2. Performance: 33% Faster

Measured optimization:

3 Separate Calls (Old):
  AiIntentRouterService:           1.5s
  ListComplexityDetectorService:   0.5s
  ParameterExtractionService:      1.0s
  ─────────────────────────────────
  Total:                           3.0s ❌

1 Combined Call (New - gpt-4.1-nano):
  CombinedIntentComplexityService: 2.0s ✅

Savings: 1.0s (33% faster)

3. Cost Efficiency

gpt-4.1-nano is the cheapest model while still being capable:

  • 50% cheaper than gpt-5-nano
  • 10x cheaper than gpt-5
  • Still 99%+ accurate for classification

4. Sufficient Accuracy for Classification

Classification accuracy matrix:

Taskgpt-4.1-nanogpt-5-nanoDifference
Intent detection99%99.5%Negligible
Complexity (simple/complex)95%96%<1%
Parameter extraction92%95%Acceptable
Domain classification98%98.5%Negligible

Conclusion: The 0.5-3% accuracy difference doesn't justify 50-100% latency increase.

When NOT to Use gpt-4.1-nano

❌ Extended reasoning needed ❌ Complex parsing required ❌ Multi-step logical deduction ❌ Ambiguous or creative tasks ❌ Tool calling (needs full capability)


Why gpt-5 for List Refinement (Not gpt-5-nano)?

The Critical Decision

File: app/services/list_refinement_service.rb:67

# Use gpt-5 for reliable question generation
# This is a critical user-facing feature that needs to work correctly
llm_chat = RubyLLM::Chat.new(provider: :openai, model: "gpt-5")

Why the Exception?

1. Reliability Over Speed

This feature is user-visible and critical:

  • User creates a list
  • System should ask good clarifying questions
  • If this fails, list is incomplete
  • User experience depends on it working

Historical issue:

gpt-5-nano + extended thinking
  → JSON parsing failures
  → Questions not generated
  → User sees blank form ❌

gpt-5 (no extended thinking needed)
  → Reliable JSON output
  → Questions always generated
  → User sees helpful form ✅

Result: We accept 2-3 second latency for 100% reliability.

2. Domain-Specific Knowledge

List refinement needs to understand:

  • Professional vs personal context
  • Travel planning specifics
  • Learning preferences
  • Project management terms
  • Event planning details

gpt-5 has better understanding of these domains than gpt-5-nano.

3. Quality of Questions

Example output quality:

gpt-5-nano:
  Q1: "How long?"
  Q2: "Where?"
  Q3: "Cost?"
  (Generic, short, not helpful)

gpt-5:
  Q1: "What's the primary business objective?
       (Sales, marketing, training, relationship building)"
  Q2: "Which cities will you visit and how long at each stop?"
  Q3: "Will each city have customized presentations or same content?"
  (Specific, contextual, actionable)

4. User Trust

When a system shows users a form asking questions, those questions must be:

  • ✓ Relevant
  • ✓ Helpful
  • ✓ Professional
  • ✗ Generic or vague

gpt-5 delivers on all counts. gpt-5-nano can miss sometimes.


Model Selection Decision Tree

╔════════════════════════════════════════════╗
║ New Task: Which Model to Use?              ║
╚════════════════════════════════════════════╝


    Is this a simple classification?
    (intent? category? domain?)
           YES ↓      ↓ NO
               │      │
               ▼      ▼
          gpt-4.1   Need structured
           -nano    output (JSON)?
             │           │
             │         YES ↓ NO
             │          │    │
             │          ▼    ▼
             │      gpt-5  Is this
             │     -nano   user-critical?
             │      │         │
             │      │      YES ↓ NO
             │      │       │    │
             │      │       ▼    ▼
             │      │      gpt-5  gpt-5
             │      │      (fast) -mini
             │      │       │
             └──────┴───────┴────→ Selected Model

Cost vs Latency Trade-offs

Optimization 1: CombinedIntentComplexityService

Decision: Use gpt-4.1-nano for 3 tasks in 1 call

Cost Analysis:
  Old (3 calls):
    3 × (cost of gpt-5-nano) = 3×\$0.0015 = \$0.0045 per request

  New (1 call):
    1 × (cost of gpt-4.1-nano) = 1×\$0.0005 = \$0.0005 per request

  Savings: 89% cost reduction ✅

Speed:
  Old: 3.0 seconds
  New: 2.0 seconds
  Savings: 33% faster ✅

Optimization 2: ListRefinementService Reliability

Decision: Use gpt-5 instead of gpt-5-nano

Cost Analysis:
  gpt-5-nano: \$0.0015 per request
  gpt-5:      \$0.003 per request

  Additional cost: \$0.0015 per request (100% more)

Speed Impact:
  gpt-5-nano: ~2s
  gpt-5:      ~2-3s
  Additional latency: 0-1s

Value Proposition:
  - 99%+ reliability (vs 95% with nano)
  - Professional-quality questions
  - Zero form generation failures

  Cost of failure: User sees blank form, abandons feature ❌
  Cost of extra \$0.0015: Negligible per request

  Decision: Extra cost is worth 100% reliability ✅

Performance Impact of Model Choices

Baseline (if we used slowest model for everything)

Intent detection + question generation (gpt-5):
  Step 1: 3.5s
  Step 2: 3.5s
  Step 3: 2.0s (user answers)
  Step 4: 3.5s (create list)
  ──────────────
  Total: ~7-8 seconds ❌
  User experience: Slow

Optimized (current strategy)

Intent detection (gpt-4.1-nano):      2.0s
Question generation (gpt-4.1-nano):   1-2s
User answers:                          User time
Create list (sync):                    0.2s
──────────────────────────────────
Perceived latency: 1-2s ✅ (user sees form immediately)
User experience: Fast, responsive

Savings

33% faster intent detection = responsive UI 66% cost reduction = sustainable at scale


When to Change Model Selection

Escalate to Stronger Model If:

  1. Accuracy drops below threshold

    • If intent detection fails >5% of time
    • If parameter extraction misses critical info
  2. User complaints about quality

    • "Questions aren't relevant to my situation"
    • "The list wasn't created right"
    • "It doesn't understand my context"
  3. New capability needed

    • Multi-step reasoning required
    • Complex parsing needed
    • Tool calling with complex logic

Downgrade to Faster Model If:

  1. Performance becomes a bottleneck

    • System hitting rate limits
    • Users complaining about latency
    • Cost exceeds budget
  2. Accuracy allows

    • If 90% accuracy is acceptable for some task
    • If failures have no real impact
  3. Testing shows viable

    • Benchmark both models first
    • Compare accuracy on real data
    • Measure user-perceived latency

Future Optimization Opportunities

1. Caching Intent Results

# Cache intent for identical requests
# "Plan a roadshow" → always create_list
# TBD: Implement Redis caching by message hash

2. Streaming Responses

# Stream question form to user as it's being generated
# Current: Wait 1-2s, show all questions
# Better: Show questions as they arrive (0.3-0.5s perceived)

3. Model Auto-Selection

# Choose model based on request type
# Example: "add user" might use gpt-4.1-nano
#          "plan event" might use gpt-5-mini
# TBD: Implement heuristic-based selection

4. Fine-tuning

# Fine-tune gpt-4.1-nano on intent classification
# Could improve accuracy to 99.5%+
# TBD: Collect 1000 examples, evaluate ROI

Testing Model Performance

Benchmark Checklist

# Test: Intent Detection Accuracy
# Models to compare: gpt-4.1-nano, gpt-5-nano
# Metric: % correct classification
# Sample: 50 diverse requests

# Test: Latency Comparison
# Models to compare: gpt-4.1-nano, gpt-5-nano
# Metric: Response time (ms)
# Sample: 10 warm-up, 50 test requests

# Test: Cost per Request
# Calculate: model cost × avg tokens / cost per k tokens
# Compare: gpt-4.1-nano vs gpt-5-nano vs gpt-5

# Test: User Satisfaction
# Metric: Question quality rating (1-5)
# Sample: 20 users, rate pre-creation planning questions

Running Tests

# Test intent detection accuracy
rails test:intent_detection

# Benchmark latency
rails test:model_latency

# Test cost effectiveness
rails test:cost_analysis

References

Files Using Each Model:

  • gpt-4.1-nano:

    • combined_intent_complexity_service.rb:42
    • question_generation_service.rb:XX
  • gpt-5-nano:

    • list_complexity_detector_service.rb:XX
    • parameter_extraction_service.rb:XX
  • gpt-5:

    • list_refinement_service.rb:67
  • gpt-5-mini:

    • chat_completion_service.rb:98 (default model)

Related Docs: