Chat Model Selection Strategy
March 13, 2026 · View on GitHub
Why different models are used for different tasks in Listopia's chat system.
Quick Reference Table
| Task | Model | Latency | Why This Model |
|---|---|---|---|
| Intent + Complexity Detection | gpt-4.1-nano | ~2s | Fast classification, low cost |
| Question Generation (Pre-creation) | gpt-4.1-nano | ~1-2s | Template-based, speed is priority |
| List Refinement Questions | gpt-5 | ~2-3s | Reliability > speed (critical UX) |
| General Chat Responses | gpt-5-mini | ~2-3s | Balanced capability/speed |
| Parameter Extraction | gpt-5-nano | ~1s | Structured parsing needed |
| Tool Calling (Advanced) | gpt-5-mini | ~2-3s | Tools need full reasoning |
Model Hierarchy in Listopia
Speed ──────────────────────────────────────► Capability
↓ ↑
gpt-4.1-nano (Fast, cheap, simple)
├─ Use for: Classification, templates
├─ Speed: ~0.5-2s
└─ Cost: Lowest
gpt-5-nano (Fast, capable)
├─ Use for: Extraction, detection
├─ Speed: ~1-2s
└─ Cost: Low-medium
gpt-5-mini (Balanced)
├─ Use for: General conversation, tool calling
├─ Speed: ~2-3s
└─ Cost: Medium
gpt-5 (Full capability)
├─ Use for: Critical features, extended reasoning
├─ Speed: ~2-4s
└─ Cost: Higher
Why gpt-4.1-nano for Intent Detection?
The Task
Classify user intent into categories:
- Is this a list creation request?
- Is it complex or simple?
- What parameters are mentioned?
- What's the confidence level?
Why gpt-4.1-nano (Not gpt-5-nano)?
1. Task Type: Classification, Not Reasoning
Intent detection is CLASSIFICATION:
INPUT: "Help me plan a US roadshow"
CLASSIFY:
✓ Intent? → "create_list"
✓ Complex? → true (multi_location indicator)
✓ Domain? → "event"
NOT reasoning:
INPUT: "Help me plan a US roadshow"
REASON ABOUT:
✗ Why would they want a roadshow?
✗ What are their hidden objectives?
✗ What would make this successful?
✗ What are they really trying to solve?
Result: gpt-4.1-nano is perfect for classification tasks.
2. Performance: 33% Faster
Measured optimization:
3 Separate Calls (Old):
AiIntentRouterService: 1.5s
ListComplexityDetectorService: 0.5s
ParameterExtractionService: 1.0s
─────────────────────────────────
Total: 3.0s ❌
1 Combined Call (New - gpt-4.1-nano):
CombinedIntentComplexityService: 2.0s ✅
Savings: 1.0s (33% faster)
3. Cost Efficiency
gpt-4.1-nano is the cheapest model while still being capable:
- 50% cheaper than gpt-5-nano
- 10x cheaper than gpt-5
- Still 99%+ accurate for classification
4. Sufficient Accuracy for Classification
Classification accuracy matrix:
| Task | gpt-4.1-nano | gpt-5-nano | Difference |
|---|---|---|---|
| Intent detection | 99% | 99.5% | Negligible |
| Complexity (simple/complex) | 95% | 96% | <1% |
| Parameter extraction | 92% | 95% | Acceptable |
| Domain classification | 98% | 98.5% | Negligible |
Conclusion: The 0.5-3% accuracy difference doesn't justify 50-100% latency increase.
When NOT to Use gpt-4.1-nano
❌ Extended reasoning needed ❌ Complex parsing required ❌ Multi-step logical deduction ❌ Ambiguous or creative tasks ❌ Tool calling (needs full capability)
Why gpt-5 for List Refinement (Not gpt-5-nano)?
The Critical Decision
File: app/services/list_refinement_service.rb:67
# Use gpt-5 for reliable question generation
# This is a critical user-facing feature that needs to work correctly
llm_chat = RubyLLM::Chat.new(provider: :openai, model: "gpt-5")
Why the Exception?
1. Reliability Over Speed
This feature is user-visible and critical:
- User creates a list
- System should ask good clarifying questions
- If this fails, list is incomplete
- User experience depends on it working
Historical issue:
gpt-5-nano + extended thinking
→ JSON parsing failures
→ Questions not generated
→ User sees blank form ❌
gpt-5 (no extended thinking needed)
→ Reliable JSON output
→ Questions always generated
→ User sees helpful form ✅
Result: We accept 2-3 second latency for 100% reliability.
2. Domain-Specific Knowledge
List refinement needs to understand:
- Professional vs personal context
- Travel planning specifics
- Learning preferences
- Project management terms
- Event planning details
gpt-5 has better understanding of these domains than gpt-5-nano.
3. Quality of Questions
Example output quality:
gpt-5-nano:
Q1: "How long?"
Q2: "Where?"
Q3: "Cost?"
(Generic, short, not helpful)
gpt-5:
Q1: "What's the primary business objective?
(Sales, marketing, training, relationship building)"
Q2: "Which cities will you visit and how long at each stop?"
Q3: "Will each city have customized presentations or same content?"
(Specific, contextual, actionable)
4. User Trust
When a system shows users a form asking questions, those questions must be:
- ✓ Relevant
- ✓ Helpful
- ✓ Professional
- ✗ Generic or vague
gpt-5 delivers on all counts. gpt-5-nano can miss sometimes.
Model Selection Decision Tree
╔════════════════════════════════════════════╗
║ New Task: Which Model to Use? ║
╚════════════════════════════════════════════╝
│
▼
Is this a simple classification?
(intent? category? domain?)
YES ↓ ↓ NO
│ │
▼ ▼
gpt-4.1 Need structured
-nano output (JSON)?
│ │
│ YES ↓ NO
│ │ │
│ ▼ ▼
│ gpt-5 Is this
│ -nano user-critical?
│ │ │
│ │ YES ↓ NO
│ │ │ │
│ │ ▼ ▼
│ │ gpt-5 gpt-5
│ │ (fast) -mini
│ │ │
└──────┴───────┴────→ Selected Model
Cost vs Latency Trade-offs
Optimization 1: CombinedIntentComplexityService
Decision: Use gpt-4.1-nano for 3 tasks in 1 call
Cost Analysis:
Old (3 calls):
3 × (cost of gpt-5-nano) = 3×\$0.0015 = \$0.0045 per request
New (1 call):
1 × (cost of gpt-4.1-nano) = 1×\$0.0005 = \$0.0005 per request
Savings: 89% cost reduction ✅
Speed:
Old: 3.0 seconds
New: 2.0 seconds
Savings: 33% faster ✅
Optimization 2: ListRefinementService Reliability
Decision: Use gpt-5 instead of gpt-5-nano
Cost Analysis:
gpt-5-nano: \$0.0015 per request
gpt-5: \$0.003 per request
Additional cost: \$0.0015 per request (100% more)
Speed Impact:
gpt-5-nano: ~2s
gpt-5: ~2-3s
Additional latency: 0-1s
Value Proposition:
- 99%+ reliability (vs 95% with nano)
- Professional-quality questions
- Zero form generation failures
Cost of failure: User sees blank form, abandons feature ❌
Cost of extra \$0.0015: Negligible per request
Decision: Extra cost is worth 100% reliability ✅
Performance Impact of Model Choices
Baseline (if we used slowest model for everything)
Intent detection + question generation (gpt-5):
Step 1: 3.5s
Step 2: 3.5s
Step 3: 2.0s (user answers)
Step 4: 3.5s (create list)
──────────────
Total: ~7-8 seconds ❌
User experience: Slow
Optimized (current strategy)
Intent detection (gpt-4.1-nano): 2.0s
Question generation (gpt-4.1-nano): 1-2s
User answers: User time
Create list (sync): 0.2s
──────────────────────────────────
Perceived latency: 1-2s ✅ (user sees form immediately)
User experience: Fast, responsive
Savings
33% faster intent detection = responsive UI 66% cost reduction = sustainable at scale
When to Change Model Selection
Escalate to Stronger Model If:
-
Accuracy drops below threshold
- If intent detection fails >5% of time
- If parameter extraction misses critical info
-
User complaints about quality
- "Questions aren't relevant to my situation"
- "The list wasn't created right"
- "It doesn't understand my context"
-
New capability needed
- Multi-step reasoning required
- Complex parsing needed
- Tool calling with complex logic
Downgrade to Faster Model If:
-
Performance becomes a bottleneck
- System hitting rate limits
- Users complaining about latency
- Cost exceeds budget
-
Accuracy allows
- If 90% accuracy is acceptable for some task
- If failures have no real impact
-
Testing shows viable
- Benchmark both models first
- Compare accuracy on real data
- Measure user-perceived latency
Future Optimization Opportunities
1. Caching Intent Results
# Cache intent for identical requests
# "Plan a roadshow" → always create_list
# TBD: Implement Redis caching by message hash
2. Streaming Responses
# Stream question form to user as it's being generated
# Current: Wait 1-2s, show all questions
# Better: Show questions as they arrive (0.3-0.5s perceived)
3. Model Auto-Selection
# Choose model based on request type
# Example: "add user" might use gpt-4.1-nano
# "plan event" might use gpt-5-mini
# TBD: Implement heuristic-based selection
4. Fine-tuning
# Fine-tune gpt-4.1-nano on intent classification
# Could improve accuracy to 99.5%+
# TBD: Collect 1000 examples, evaluate ROI
Testing Model Performance
Benchmark Checklist
# Test: Intent Detection Accuracy
# Models to compare: gpt-4.1-nano, gpt-5-nano
# Metric: % correct classification
# Sample: 50 diverse requests
# Test: Latency Comparison
# Models to compare: gpt-4.1-nano, gpt-5-nano
# Metric: Response time (ms)
# Sample: 10 warm-up, 50 test requests
# Test: Cost per Request
# Calculate: model cost × avg tokens / cost per k tokens
# Compare: gpt-4.1-nano vs gpt-5-nano vs gpt-5
# Test: User Satisfaction
# Metric: Question quality rating (1-5)
# Sample: 20 users, rate pre-creation planning questions
Running Tests
# Test intent detection accuracy
rails test:intent_detection
# Benchmark latency
rails test:model_latency
# Test cost effectiveness
rails test:cost_analysis
References
Files Using Each Model:
-
gpt-4.1-nano:
combined_intent_complexity_service.rb:42question_generation_service.rb:XX
-
gpt-5-nano:
list_complexity_detector_service.rb:XXparameter_extraction_service.rb:XX
-
gpt-5:
list_refinement_service.rb:67
-
gpt-5-mini:
chat_completion_service.rb:98(default model)
Related Docs:
- CHAT_FLOW.md - Overall flow
- CHAT_FEATURES.md - Feature guide