Chat Model Selection Strategy

March 13, 2026

Why different models are used for different tasks in Listopia's chat system.

Quick Reference Table

Task	Model	Latency	Why This Model
Intent + Complexity Detection	gpt-4.1-nano	~2s	Fast classification, low cost
Question Generation (Pre-creation)	gpt-4.1-nano	~1-2s	Template-based, speed is priority
List Refinement Questions	gpt-5	~2-3s	Reliability > speed (critical UX)
General Chat Responses	gpt-5-mini	~2-3s	Balanced capability/speed
Parameter Extraction	gpt-5-nano	~1s	Structured parsing needed
Tool Calling (Advanced)	gpt-5-mini	~2-3s	Tools need full reasoning
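
The table can also be expressed as a single lookup so that services do not hard-code model names. This is a minimal sketch, not Listopia's actual code: `MODEL_FOR_TASK`, `model_for`, and the task keys are hypothetical names chosen for illustration.

```ruby
# Hypothetical registry mirroring the quick reference table.
# The task keys are illustrative and do not come from Listopia's codebase.
MODEL_FOR_TASK = {
  intent_detection:     "gpt-4.1-nano",
  question_generation:  "gpt-4.1-nano",
  list_refinement:      "gpt-5",
  general_chat:         "gpt-5-mini",
  parameter_extraction: "gpt-5-nano",
  tool_calling:         "gpt-5-mini"
}.freeze

# Look up the model for a task. Hash#fetch raises KeyError on an unknown
# task, so a typo never silently falls back to some default model.
def model_for(task)
  MODEL_FOR_TASK.fetch(task)
end

puts model_for(:list_refinement)
```

Using `fetch` rather than `[]` is a deliberate choice here: a missing task fails loudly instead of returning `nil`.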

Model Hierarchy in Listopia

Speed ──────────────────────────────────────► Capability
  ↓                                              ↑
gpt-4.1-nano (Fast, cheap, simple)
  ├─ Use for: Classification, templates
  ├─ Speed: ~0.5-2s
  └─ Cost: Lowest

gpt-5-nano (Fast, capable)
  ├─ Use for: Extraction, detection
  ├─ Speed: ~1-2s
  └─ Cost: Low-medium

gpt-5-mini (Balanced)
  ├─ Use for: General conversation, tool calling
  ├─ Speed: ~2-3s
  └─ Cost: Medium

gpt-5 (Full capability)
  ├─ Use for: Critical features, extended reasoning
  ├─ Speed: ~2-4s
  └─ Cost: Higher

Why gpt-4.1-nano for Intent Detection?

The Task

Classify user intent into categories:

Is this a list creation request?
Is it complex or simple?
What parameters are mentioned?
What's the confidence level?

Why gpt-4.1-nano (Not gpt-5-nano)?

1. Task Type: Classification, Not Reasoning

Intent detection is CLASSIFICATION:

INPUT: "Help me plan a US roadshow"

CLASSIFY:
✓ Intent? → "create_list"
✓ Complex? → true (multi_location indicator)
✓ Domain? → "event"

NOT reasoning:

INPUT: "Help me plan a US roadshow"

REASON ABOUT:
✗ Why would they want a roadshow?
✗ What are their hidden objectives?
✗ What would make this successful?
✗ What are they really trying to solve?

Result: gpt-4.1-nano is perfect for classification tasks.
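
The classification output is small and flat, which is part of why a lightweight model is enough. The sketch below assumes the model returns a JSON object with `intent`, `complex`, `domain`, and `confidence` keys. Those key names, `IntentResult`, and `parse_intent` are illustrative assumptions, not confirmed by the source.

```ruby
require "json"

# Hypothetical container for one classification result.
IntentResult = Struct.new(:intent, :complex, :domain, :confidence, keyword_init: true)

# Parse the model's JSON reply into an IntentResult.
# Raises KeyError if a required key is missing and JSON::ParserError if the
# reply is not valid JSON, so malformed output is never mistaken for a result.
def parse_intent(raw_json)
  data = JSON.parse(raw_json)
  IntentResult.new(
    intent:     data.fetch("intent"),
    complex:    data.fetch("complex") == true,
    domain:     data.fetch("domain"),
    confidence: Float(data.fetch("confidence", 0.0))
  )
end

reply  = '{"intent":"create_list","complex":true,"domain":"event","confidence":0.97}'
result = parse_intent(reply)
puts "#{result.intent} (complex: #{result.complex}, domain: #{result.domain})"
```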

2. Performance: 33% Faster

Measured optimization:

3 Separate Calls (Old):
  AiIntentRouterService:           1.5s
  ListComplexityDetectorService:   0.5s
  ParameterExtractionService:      1.0s
  ─────────────────────────────────
  Total:                           3.0s ❌

1 Combined Call (New - gpt-4.1-nano):
  CombinedIntentComplexityService: 2.0s ✅

Savings: 1.0s (33% faster)

3. Cost Efficiency

gpt-4.1-nano is the cheapest model while still being capable:

About 67% cheaper than gpt-5-nano per request ($0.0005 vs $0.0015)
6x cheaper than gpt-5 per request ($0.0005 vs $0.003)
Still 99%+ accurate for classification

4. Sufficient Accuracy for Classification

Classification accuracy matrix:

Task	gpt-4.1-nano	gpt-5-nano	Difference
Intent detection	99%	99.5%	Negligible
Complexity (simple/complex)	95%	96%	1%
Parameter extraction	92%	95%	Acceptable
Domain classification	98%	98.5%	Negligible

Conclusion: A 0.5-3 point accuracy gap doesn't justify a 50-100% latency increase.

When NOT to Use gpt-4.1-nano

❌ Extended reasoning needed
❌ Complex parsing required
❌ Multi-step logical deduction
❌ Ambiguous or creative tasks
❌ Tool calling (needs full capability)

The Critical Decision

File: app/services/list_refinement_service.rb:67

# Use gpt-5 for reliable question generation
# This is a critical user-facing feature that needs to work correctly
llm_chat = RubyLLM::Chat.new(provider: :openai, model: "gpt-5")

Why the Exception?

1. Reliability Over Speed

This feature is user-visible and critical:

User creates a list
System should ask good clarifying questions
If this fails, list is incomplete
User experience depends on it working

Historical issue:

gpt-5-nano + extended thinking
  → JSON parsing failures
  → Questions not generated
  → User sees blank form ❌

gpt-5 (no extended thinking needed)
  → Reliable JSON output
  → Questions always generated
  → User sees helpful form ✅

Result: We accept 2-3 seconds of latency in exchange for reliable question generation (99%+).
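
Independent of model choice, the blank-form failure described above can also be guarded against at the parsing step. The following is a defensive sketch, not a description of what `list_refinement_service.rb` does: `extract_questions`, `FALLBACK_QUESTIONS`, and the `"questions"` key are hypothetical.

```ruby
require "json"

# Hypothetical generic questions, shown only when the model output is unusable.
FALLBACK_QUESTIONS = [
  "What is the main goal of this list?",
  "Is there a deadline or time frame?"
].freeze

# Return an array of question strings. Falls back to generic questions when
# the response is not valid JSON or lacks a non-empty "questions" array of
# strings, so the user never sees an empty form.
def extract_questions(raw_response)
  data = JSON.parse(raw_response.to_s)
  questions = data.is_a?(Hash) ? data["questions"] : nil
  return FALLBACK_QUESTIONS unless questions.is_a?(Array) && !questions.empty?
  return FALLBACK_QUESTIONS unless questions.all? { |q| q.is_a?(String) }

  questions
rescue JSON::ParserError
  FALLBACK_QUESTIONS
end

puts extract_questions('{"questions":["Which cities will you visit?"]}')
puts extract_questions("Let me think about this... {")
```

A fallback like this trades question quality for availability, so it complements a reliable model rather than replacing it.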

2. Domain-Specific Knowledge

List refinement needs to understand:

Professional vs personal context
Travel planning specifics
Learning preferences
Project management terms
Event planning details

gpt-5 has better understanding of these domains than gpt-5-nano.

3. Quality of Questions

Example output quality:

gpt-5-nano:
  Q1: "How long?"
  Q2: "Where?"
  Q3: "Cost?"
  (Generic, short, not helpful)

gpt-5:
  Q1: "What's the primary business objective?
       (Sales, marketing, training, relationship building)"
  Q2: "Which cities will you visit and how long at each stop?"
  Q3: "Will each city have customized presentations or same content?"
  (Specific, contextual, actionable)

4. User Trust

When a system shows users a form asking questions, those questions must be:

✓ Relevant
✓ Helpful
✓ Professional
✗ Never generic or vague

gpt-5 delivers on all counts. gpt-5-nano sometimes misses.

Model Selection Decision Tree

╔════════════════════════════════════════════╗
║ New Task: Which Model to Use?              ║
╚════════════════════════════════════════════╝
                     │
                     ▼
    Is this a simple classification?
    (intent? category? domain?)
           YES ↓      ↓ NO
               │      │
               ▼      ▼
          gpt-4.1   Need structured
           -nano    output (JSON)?
             │           │
             │         YES ↓ NO
             │          │    │
             │          ▼    ▼
             │      gpt-5  Is this
             │     -nano   user-critical?
             │      │         │
             │      │      YES ↓ NO
             │      │       │    │
             │      │       ▼    ▼
             │      │      gpt-5  gpt-5
             │      │      (fast) -mini
             │      │       │
             └──────┴───────┴────→ Selected Model
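
The same tree can be written as a short method. This is a sketch under the assumption that each branch reduces to a boolean; `select_model` and its keyword names are hypothetical.

```ruby
# Walk the decision tree top to bottom; the first matching branch wins.
def select_model(simple_classification:, structured_output: false, user_critical: false)
  return "gpt-4.1-nano" if simple_classification # intent? category? domain?
  return "gpt-5-nano"   if structured_output     # needs JSON output
  return "gpt-5"        if user_critical         # failure is visible to the user

  "gpt-5-mini"                                   # balanced default
end

puts select_model(simple_classification: true)
puts select_model(simple_classification: false, user_critical: true)
```

The order of the guards matters: a task that is both a simple classification and user-critical still resolves to gpt-4.1-nano, exactly as the tree reads.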

Cost vs Latency Trade-offs

Optimization 1: CombinedIntentComplexityService

Decision: Use gpt-4.1-nano for 3 tasks in 1 call

Cost Analysis:
  Old (3 calls):
    3 × (cost of gpt-5-nano) = 3 × $0.0015 = $0.0045 per request

  New (1 call):
    1 × (cost of gpt-4.1-nano) = 1 × $0.0005 = $0.0005 per request

  Savings: 89% cost reduction ✅

Speed:
  Old: 3.0 seconds
  New: 2.0 seconds
  Savings: 33% faster ✅
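
The two savings percentages follow from the figures above and can be reproduced with a few lines of arithmetic. `percent_saved` is a hypothetical helper name; the per-request costs and latencies are the estimates quoted in this section.

```ruby
# Percentage reduction when moving from an old value to a new one.
def percent_saved(old_value, new_value)
  ((old_value - new_value) / old_value.to_f * 100).round
end

old_cost = 3 * 0.0015 # three gpt-5-nano calls per request
new_cost = 1 * 0.0005 # one gpt-4.1-nano call per request

puts "Cost saved:    #{percent_saved(old_cost, new_cost)}%" # 89%
puts "Latency saved: #{percent_saved(3.0, 2.0)}%"           # 33%
```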

Optimization 2: ListRefinementService Reliability

Decision: Use gpt-5 instead of gpt-5-nano

Cost Analysis:
  gpt-5-nano: $0.0015 per request
  gpt-5:      $0.003 per request

  Additional cost: $0.0015 per request (100% more)

Speed Impact:
  gpt-5-nano: ~2s
  gpt-5:      ~2-3s
  Additional latency: 0-1s

Value Proposition:
  - 99%+ reliability (vs 95% with nano)
  - Professional-quality questions
  - Zero form generation failures

  Cost of failure: User sees blank form, abandons feature ❌
  Cost of extra $0.0015: Negligible per request

  Decision: Extra cost is worth the reliability gain ✅
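
To make the value proposition concrete, the extra spend can be divided by the number of failed forms it prevents. The sketch below uses the figures from this analysis and treats "99%+" as exactly 0.99, which is a conservative assumption; `cost_per_avoided_failure` is a hypothetical helper.

```ruby
# Figures from the cost analysis; 0.99 is the lower bound of "99%+".
NANO_COST        = 0.0015
FULL_COST        = 0.003
NANO_RELIABILITY = 0.95
FULL_RELIABILITY = 0.99

# Extra dollars spent per form failure avoided, over a batch of requests.
# Returns nil when the batch is too small to avoid even one failure.
def cost_per_avoided_failure(requests)
  extra_cost = (FULL_COST - NANO_COST) * requests
  avoided    = ((FULL_RELIABILITY - NANO_RELIABILITY) * requests).round
  return nil if avoided.zero?

  (extra_cost / avoided).round(4)
end

puts cost_per_avoided_failure(1_000)
```

Under these assumptions, 1,000 requests cost about $1.50 more and avoid roughly 40 blank forms, or a little under 4 cents per avoided failure.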

Performance Impact of Model Choices

Baseline (if we used slowest model for everything)

Intent detection + question generation (gpt-5):
  Step 1: 3.5s
  Step 2: 3.5s
  Step 3: 2.0s (user answers)
  Step 4: 3.5s (create list)
  ──────────────
  Total: 12.5s (7.0s for Steps 1-2 before the user sees the form) ❌
  User experience: Slow

Optimized (current strategy)

Intent detection (gpt-4.1-nano):      2.0s
Question generation (gpt-4.1-nano):   1-2s
User answers:                          User time
Create list (sync):                    0.2s
──────────────────────────────────
Perceived latency: 1-2s ✅ (user sees form immediately)
User experience: Fast, responsive

Savings

33% faster intent detection = responsive UI
67% lower per-call cost (89% per request once the three calls are combined) = sustainable at scale

When to Change Model Selection

Escalate to Stronger Model If:

Accuracy drops below threshold
  - If intent detection fails >5% of time
  - If parameter extraction misses critical info
User complaints about quality
  - "Questions aren't relevant to my situation"
  - "The list wasn't created right"
  - "It doesn't understand my context"
New capability needed
  - Multi-step reasoning required
  - Complex parsing needed
  - Tool calling with complex logic
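
The accuracy criterion can be checked mechanically once failures are counted. This is a minimal sketch assuming failures and totals are already being tracked somewhere; `escalate?` and `ESCALATION_FAILURE_RATE` are hypothetical names, and only the 5% threshold comes from the list above.

```ruby
# Escalate when intent detection fails more than 5% of the time.
ESCALATION_FAILURE_RATE = 0.05

# True when the observed failure rate is strictly above the threshold.
# With no traffic there is nothing to judge, so the answer is false.
def escalate?(failures:, total:)
  return false if total.zero?

  failures.to_f / total > ESCALATION_FAILURE_RATE
end

puts escalate?(failures: 6, total: 100) # above 5%
puts escalate?(failures: 5, total: 100) # exactly 5%, not above
```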

Downgrade to Faster Model If:

Performance becomes a bottleneck
  - System hitting rate limits
  - Users complaining about latency
  - Cost exceeds budget
Accuracy allows
  - If 90% accuracy is acceptable for some task
  - If failures have no real impact
Testing shows viable
  - Benchmark both models first
  - Compare accuracy on real data
  - Measure user-perceived latency

Future Optimization Opportunities

1. Caching Intent Results

# Cache intent for identical requests
# "Plan a roadshow" → always create_list
# TBD: Implement Redis caching by message hash
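
A rough sketch of what caching by message hash could look like. An in-memory Hash stands in for Redis so the example runs on its own; `IntentCache` and its key format are hypothetical, and normalizing case and whitespace before hashing is an assumption about what should count as an "identical" request.

```ruby
require "digest"

# Caches one intent result per normalized message.
class IntentCache
  def initialize(store = {})
    @store = store # any Hash-like store; a Redis wrapper could be swapped in
  end

  # Return the cached intent for this message, or run the block once
  # (the expensive LLM call) and remember its result.
  def fetch(message)
    key = cache_key(message)
    return @store[key] if @store.key?(key)

    @store[key] = yield(message)
  end

  private

  # Hash the normalized message so keys have a fixed length and do not
  # expose user text in the cache.
  def cache_key(message)
    normalized = message.strip.downcase.gsub(/\s+/, " ")
    "intent:#{Digest::SHA256.hexdigest(normalized)}"
  end
end

cache = IntentCache.new
llm_calls = 0
2.times { cache.fetch("Plan a roadshow") { llm_calls += 1; "create_list" } }
puts "LLM calls made: #{llm_calls}"
```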

2. Streaming Responses

# Stream question form to user as it's being generated
# Current: Wait 1-2s, show all questions
# Better: Show questions as they arrive (0.3-0.5s perceived)
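
Streaming requires turning arbitrary text chunks into complete questions as soon as each one is finished. The sketch below covers only that buffering step and assumes the model emits one question per line, which the source does not specify; `QuestionStreamBuffer` is a hypothetical class, and both methods must be called with a block.

```ruby
# Accumulates streamed chunks and yields each complete line as soon as its
# trailing newline arrives.
class QuestionStreamBuffer
  def initialize
    @buffer = +""
  end

  # Append a chunk and yield every question completed by it.
  def push(chunk)
    @buffer << chunk
    while (newline = @buffer.index("\n"))
      line = @buffer.slice!(0..newline).strip
      yield line unless line.empty?
    end
  end

  # Yield whatever is left once the stream ends without a final newline.
  def flush
    rest = @buffer.strip
    @buffer = +""
    yield rest unless rest.empty?
  end
end

buffer = QuestionStreamBuffer.new
["Which cities will you vi", "sit?\nHow long at ", "each stop?\n"].each do |chunk|
  buffer.push(chunk) { |question| puts "show: #{question}" }
end
buffer.flush { |question| puts "show: #{question}" }
```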

3. Model Auto-Selection

# Choose model based on request type
# Example: "add user" might use gpt-4.1-nano
#          "plan event" might use gpt-5-mini
# TBD: Implement heuristic-based selection
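
One possible shape for such a heuristic is a keyword match on the request text. This is purely illustrative: `PLANNING_PATTERN`, its word list, and `auto_select_model` are assumptions, and a real heuristic would need to be validated against actual traffic.

```ruby
# Hypothetical keywords that suggest an open-ended planning request.
PLANNING_PATTERN = /\b(plan|organize|schedule|prepare)\b/i

# Route planning-style requests to the balanced model and everything else
# to the cheapest one.
def auto_select_model(message)
  PLANNING_PATTERN.match?(message) ? "gpt-5-mini" : "gpt-4.1-nano"
end

puts auto_select_model("add user")
puts auto_select_model("Plan event for the sales team")
```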

4. Fine-tuning

# Fine-tune gpt-4.1-nano on intent classification
# Could improve accuracy to 99.5%+
# TBD: Collect 1000 examples, evaluate ROI

Testing Model Performance

Benchmark Checklist

# Test: Intent Detection Accuracy
# Models to compare: gpt-4.1-nano, gpt-5-nano
# Metric: % correct classification
# Sample: 50 diverse requests

# Test: Latency Comparison
# Models to compare: gpt-4.1-nano, gpt-5-nano
# Metric: Response time (ms)
# Sample: 10 warm-up, 50 test requests

# Test: Cost per Request
# Calculate: (avg tokens per request / 1000) × price per 1k tokens
# Compare: gpt-4.1-nano vs gpt-5-nano vs gpt-5

# Test: User Satisfaction
# Metric: Question quality rating (1-5)
# Sample: 20 users, rate pre-creation planning questions
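
The accuracy and latency checks can share one harness. The sketch below assumes the classifier is any callable that takes text and returns a label, so a stub can stand in for a real model call; `benchmark_classifier` and the sample format are hypothetical, and `samples` is assumed to be non-empty.

```ruby
require "benchmark"

# Run warm-up requests, then measure accuracy and latency over all samples.
# classifier: callable taking text and returning a predicted label.
# samples:    array of { text:, expected: } hashes.
def benchmark_classifier(classifier, samples, warmup: 10)
  samples.first(warmup).each { |s| classifier.call(s[:text]) } # results discarded

  correct = 0
  latencies_ms = samples.map do |s|
    predicted = nil
    elapsed = Benchmark.realtime { predicted = classifier.call(s[:text]) }
    correct += 1 if predicted == s[:expected]
    elapsed * 1000.0
  end

  sorted = latencies_ms.sort
  {
    accuracy:  correct.to_f / samples.size,
    median_ms: sorted[sorted.size / 2],
    max_ms:    sorted.last
  }
end

# Stub classifier standing in for a real model call.
stub = ->(text) { text.match?(/plan/i) ? "create_list" : "chat" }
samples = [
  { text: "Help me plan a US roadshow", expected: "create_list" },
  { text: "What can you do?",           expected: "chat" }
]
puts benchmark_classifier(stub, samples, warmup: 1)
```

Running the same samples through each candidate model gives directly comparable accuracy and latency numbers.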

Running Tests

# Test intent detection accuracy
rails test:intent_detection

# Benchmark latency
rails test:model_latency

# Test cost effectiveness
rails test:cost_analysis

References

Files Using Each Model:

gpt-4.1-nano:
  - combined_intent_complexity_service.rb:42
  - question_generation_service.rb:XX
gpt-5-nano:
  - list_complexity_detector_service.rb:XX
  - parameter_extraction_service.rb:XX
gpt-5:
  - list_refinement_service.rb:67
gpt-5-mini:
  - chat_completion_service.rb:98 (default model)

Related Docs:

CHAT_FLOW.md - Overall flow
CHAT_FEATURES.md - Feature guide