Chapter 5: LLM Integration & Configuration
April 13, 2026
Welcome to Chapter 5: LLM Integration & Configuration. In this installment of RAGFlow Tutorial: Complete Guide to Open-Source RAG Engine, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Connect RAGFlow with various Large Language Models for intelligent question answering.
Overview
This chapter covers how to integrate different Large Language Models (LLMs) with RAGFlow to power your RAG applications. You'll learn to configure various LLM providers and optimize their performance for document-based question answering.
Supported LLM Providers
RAGFlow supports a wide range of LLM providers for different use cases and deployment scenarios:
Cloud Providers
- OpenAI - GPT-4, GPT-3.5-turbo
- Anthropic - Claude 3, Claude 2
- Google - Gemini 1.5, Gemini 1.0
- Azure OpenAI - Enterprise-grade deployments
- AWS Bedrock - Amazon's LLM service
Local & Self-Hosted
- Ollama - Local model inference
- LM Studio - Local model management
- Hugging Face - Direct model integration
- vLLM - High-throughput inference
- LocalAI - Unified local AI API
Specialized Providers
- Together AI - Optimized inference
- Replicate - Model marketplace
- Fireworks AI - Fast inference
- DeepInfra - Cost-effective models
Configuration Steps
Step 1: Access LLM Settings
- Log into RAGFlow web interface
- Navigate to System Settings > Model Providers
- Click Add Provider to configure a new LLM
Step 2: Configure API Keys
# Set environment variables for different providers
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
Step 3: Provider-Specific Setup
OpenAI Configuration
{
"provider": "OpenAI",
"model": "gpt-4o",
"api_key": "sk-...",
"temperature": 0.1,
"max_tokens": 2000,
"top_p": 0.9
}
Anthropic Configuration
{
"provider": "Anthropic",
"model": "claude-3-5-sonnet-20241022",
"api_key": "sk-ant-...",
"temperature": 0.1,
"max_tokens": 4000,
"system_prompt": "You are a helpful assistant that answers questions based on provided context."
}
Local Ollama Setup
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull llama3.1:8b
# Start Ollama service
ollama serve
Then configure in RAGFlow:
{
"provider": "Ollama",
"model": "llama3.1:8b",
"base_url": "http://localhost:11434",
"temperature": 0.1,
"num_ctx": 4096
}
Advanced LLM Configuration
Temperature & Sampling
{
"temperature": 0.1, // Lower = more deterministic
"top_p": 0.9, // Nucleus sampling
"top_k": 40, // Top-k sampling
"repetition_penalty": 1.1, // Reduce repetition
"max_tokens": 2000 // Response length limit
}
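To build intuition for what these knobs do, here is a minimal, self-contained sketch of how temperature, top-k, and top-p (nucleus) filtering reshape a token distribution before sampling. This is illustrative only; `apply_sampling` is a hypothetical helper, not RAGFlow or provider code, and real providers apply this filtering server-side.

```python
import math

def apply_sampling(logits, temperature=1.0, top_p=0.9, top_k=40):
    """Reshape raw logits into a filtered, renormalized distribution."""
    # Temperature scaling: lower values sharpen the distribution,
    # making the most likely token dominate (more deterministic output).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: rank tokens by probability and keep at most k candidates.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative mass
    # reaches top_p; the long tail of unlikely tokens is cut off.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving candidates.
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

Note how the two settings interact: at `temperature=0.1` nearly all probability mass lands on one token, so the nucleus collapses to a single candidate regardless of `top_p`.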
Context Window Management
{
"max_context_length": 8192, // Maximum context tokens
"overlap_size": 200, // Chunk overlap for retrieval
"compression_ratio": 0.7, // Context compression
"hierarchical_retrieval": true // Multi-level retrieval
}
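The `overlap_size` setting corresponds to a standard windowing idea: adjacent chunks share a band of tokens so a fact that straddles a chunk boundary survives intact in at least one chunk. A minimal sketch under that assumption (the `chunk_with_overlap` helper is hypothetical, not the RAGFlow implementation):

```python
def chunk_with_overlap(tokens, max_len=8192, overlap=200):
    """Split a token sequence into windows of at most max_len tokens,
    each sharing `overlap` tokens with its predecessor."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    step = max_len - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already covers the tail
    return chunks
```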
Model Switching & Fallbacks
Primary-Secondary Model Setup
{
"models": [
{
"name": "gpt-4o",
"provider": "OpenAI",
"priority": 1,
"fallback": false
},
{
"name": "claude-3-5-sonnet",
"provider": "Anthropic",
"priority": 2,
"fallback": true
},
{
"name": "llama3.1:8b",
"provider": "Ollama",
"priority": 3,
"fallback": true
}
]
}
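At runtime, a priority list like this boils down to "walk the models in ascending priority order and stop at the first successful call." A hedged sketch of that control flow (`call_with_fallback` is an illustrative helper, not RAGFlow's actual API):

```python
def call_with_fallback(models, invoke):
    """Try each configured model in ascending priority order;
    on any error, fall through to the next entry."""
    last_err = None
    for m in sorted(models, key=lambda m: m["priority"]):
        try:
            return m["name"], invoke(m)
        except Exception as err:
            last_err = err  # remember why this model failed
    raise RuntimeError("all configured models failed") from last_err
```

For example, if the priority-1 model raises a rate-limit error, the call transparently lands on the priority-2 fallback.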
Load Balancing Configuration
{
"load_balancing": {
"enabled": true,
"strategy": "round_robin",
"health_check_interval": 30,
"timeout": 10
}
}
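A round-robin strategy simply rotates requests across endpoints, skipping any that are known to be unhealthy. The sketch below simplifies the `health_check_interval` away by letting callers report failures directly; the class name and API are hypothetical, not RAGFlow internals.

```python
import itertools

class RoundRobinBalancer:
    """Rotate requests across provider endpoints, skipping ones
    the caller has marked as down."""
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.down = set()
        self._cycle = itertools.cycle(self.endpoints)

    def mark_down(self, endpoint):
        # In a real system a periodic health check would add/remove
        # entries here; we let the caller report failures instead.
        self.down.add(endpoint)

    def next(self):
        # At most one full rotation before giving up.
        for _ in range(len(self.endpoints)):
            ep = next(self._cycle)
            if ep not in self.down:
                return ep
        raise RuntimeError("no healthy endpoints")
```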
Performance Optimization
Caching Strategies
{
"caching": {
"enabled": true,
"ttl": 3600, // Cache TTL in seconds
"max_cache_size": "1GB", // Maximum cache size
"compression": true // Enable response compression
}
}
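The core of a TTL cache is small enough to sketch in full: store each value with a timestamp, and treat any entry older than `ttl` seconds as a miss. This is a simplified illustration (the `TTLCache` class is hypothetical, and the injectable `clock` exists only to make the behavior testable), not RAGFlow's cache layer.

```python
import time

class TTLCache:
    """Cache responses keyed by e.g. (model, prompt); entries
    expire after ttl seconds."""
    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable for deterministic tests
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stored_at = hit
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock())
```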
Batch Processing
{
"batch_processing": {
"enabled": true,
"max_batch_size": 10,
"timeout": 30,
"concurrency_limit": 5
}
}
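The `max_batch_size` setting corresponds to a simple grouping step: pending requests are sliced into order-preserving batches no larger than the limit (concurrency and timeouts are then applied per batch). A minimal sketch with a hypothetical helper name:

```python
def make_batches(requests, max_batch_size=10):
    """Group pending requests into batches of at most
    max_batch_size, preserving arrival order."""
    return [requests[i:i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]
```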
Use Case Optimization
Different Configurations for Different Tasks
Document Q&A
{
"task": "document_qa",
"model": "gpt-4o",
"temperature": 0.1,
"max_tokens": 1000,
"system_prompt": "Answer questions based solely on the provided document context."
}
Creative Writing
{
"task": "creative_writing",
"model": "claude-3-5-sonnet",
"temperature": 0.8,
"max_tokens": 2000,
"system_prompt": "Generate creative content while staying relevant to the document context."
}
Code Generation
{
"task": "code_generation",
"model": "gpt-4o",
"temperature": 0.2,
"max_tokens": 1500,
"system_prompt": "Generate code based on the documentation and requirements provided."
}
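Task-specific configurations like the three above are naturally expressed as a preset table with a safe default. The sketch below mirrors those JSON snippets; the `TASK_PRESETS` mapping and `config_for` function are hypothetical names for illustration.

```python
# Presets mirroring the task configurations above; unknown tasks
# fall back to the conservative document_qa settings.
TASK_PRESETS = {
    "document_qa":      {"model": "gpt-4o",            "temperature": 0.1, "max_tokens": 1000},
    "creative_writing": {"model": "claude-3-5-sonnet", "temperature": 0.8, "max_tokens": 2000},
    "code_generation":  {"model": "gpt-4o",            "temperature": 0.2, "max_tokens": 1500},
}

def config_for(task):
    """Look up the preset for a task, defaulting to document_qa."""
    return TASK_PRESETS.get(task, TASK_PRESETS["document_qa"])
```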
Monitoring & Analytics
Response Quality Metrics
{
"monitoring": {
"response_time_tracking": true,
"token_usage_monitoring": true,
"quality_scoring": true,
"error_rate_tracking": true
}
}
Custom Metrics Dashboard
{
"dashboard": {
"real_time_metrics": true,
"historical_trends": true,
"model_comparison": true,
"cost_analysis": true
}
}
Troubleshooting Common Issues
API Rate Limits
# Monitor rate limits
curl -X GET "http://localhost:80/api/rate-limits" \
-H "Authorization: Bearer YOUR_TOKEN"
# Implement exponential backoff
# RAGFlow handles this automatically with retry logic
Model Compatibility Issues
# Check model compatibility
curl -X POST "http://localhost:80/api/models/check-compatibility" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "task": "document_qa"}'
Context Window Overflow
{
"error_handling": {
"context_overflow_strategy": "truncate",
"chunk_reduction_ratio": 0.8,
"fallback_model": "gpt-3.5-turbo"
}
}
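A "truncate" overflow strategy means: keep retrieved chunks in rank order until adding the next one would exceed the model's context budget, and drop the rest. A hedged sketch of that idea (the helper is hypothetical; for the demo, `count=len` treats one character as one token, where a real implementation would use the model's tokenizer):

```python
def truncate_context(chunks, max_tokens, count=len):
    """Keep highest-ranked chunks until the token budget is hit;
    drop everything after the first chunk that would overflow."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > max_tokens:
            break  # adding this chunk would overflow the context
        kept.append(chunk)
        used += cost
    return kept
```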
Security Best Practices
API Key Management
# Use environment variables
export RAGFLOW_ENCRYPTION_KEY="your-encryption-key"
# Rotate keys regularly
curl -X POST "http://localhost:80/api/keys/rotate" \
-H "Authorization: Bearer ADMIN_TOKEN"
Model Access Control
{
"access_control": {
"user_roles": ["admin", "editor", "viewer"],
"model_permissions": {
"gpt-4": ["admin", "editor"],
"claude-3": ["admin", "editor", "viewer"],
"local-models": ["admin", "editor", "viewer"]
}
}
}
Production Deployment Considerations
High Availability Setup
# docker-compose.prod.yml
version: '3.8'
services:
ragflow:
image: infiniflow/ragflow:latest
environment:
- LLM_PROVIDER_BACKUP=true
- LOAD_BALANCER_ENABLED=true
- CACHE_LAYER=redis
depends_on:
- redis
- postgres
redis:
image: redis:7-alpine
postgres:
image: postgres:15
Scaling Strategies
{
"scaling": {
"auto_scaling": true,
"min_instances": 2,
"max_instances": 10,
"cpu_threshold": 70,
"memory_threshold": 80
}
}
Best Practices
Model Selection Guidelines
| Use Case | Recommended Models | Rationale |
|---|---|---|
| Document Q&A | GPT-4, Claude 3 | High accuracy, good context understanding |
| Creative Tasks | Claude 3, GPT-4 | Better at generating natural, creative responses |
| Code Generation | GPT-4, Claude 3 | Strong code understanding and generation |
| Cost-Effective | GPT-3.5, Claude Instant | Good balance of cost and performance |
| Local/Offline | Llama 3, Mistral | Privacy-focused, no API costs |
Performance Optimization Tips
- Use Appropriate Model Sizes: Larger models for complex tasks, smaller for simple queries
- Implement Caching: Cache frequent queries and responses
- Monitor Usage: Track token consumption and costs
- Load Balancing: Distribute requests across multiple model instances
- Fallback Strategies: Have backup models for reliability
Next Steps
Now that you have configured LLMs for RAGFlow, you're ready to:
- Chapter 6: Chatbot Development - Build conversational interfaces
- Chapter 7: Advanced Features - Explore advanced RAGFlow capabilities
- Chapter 8: Production Deployment - Deploy at scale
Ready to build intelligent chatbots? Continue to Chapter 6: Chatbot Development!
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries around settings like `model`, `temperature`, and `provider` so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about LLM integration and configuration as an operating subsystem of RAGFlow, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `model`, `temperature`, and `max_tokens` as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, LLM integration in RAGFlow usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the selected model.
- Input normalization: shape incoming data so downstream parameters such as `temperature` receive stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the provider.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
LLM Integration Architecture
flowchart LR
A[RAGFlow Backend] --> B{LLM Provider}
B --> C[OpenAI GPT-4o]
B --> D[Anthropic Claude]
B --> E[Ollama Local]
B --> F[Azure OpenAI]
C --> G[Answer Generation]
D --> G
E --> G
F --> G
G --> H[RAG Response with Citations]
Source Walkthrough
Use the following upstream sources to verify implementation details while reading this chapter:
- GitHub Repository (github.com). Why it matters: the authoritative reference for the RAGFlow implementation.
- AI Codebase Knowledge Builder (github.com). Why it matters: the authoritative reference for this tutorial series.
Suggested trace strategy:
- search upstream code for `model` and `temperature` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production