Chapter 5: LLM Integration & Configuration
April 13, 2026
Welcome to Chapter 5: LLM Integration & Configuration. In this installment of RAGFlow Tutorial: Complete Guide to Open-Source RAG Engine, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Connect RAGFlow with various Large Language Models for intelligent question answering.
Overview
This chapter covers how to integrate different Large Language Models (LLMs) with RAGFlow to power your RAG applications. You'll learn to configure various LLM providers and optimize their performance for document-based question answering.
Supported LLM Providers
RAGFlow supports a wide range of LLM providers for different use cases and deployment scenarios:
Cloud Providers
- OpenAI - GPT-4, GPT-3.5-turbo
- Anthropic - Claude 3, Claude 2
- Google - Gemini 1.5, Gemini 1.0
- Azure OpenAI - Enterprise-grade deployments
- AWS Bedrock - Amazon's LLM service
Local & Self-Hosted
- Ollama - Local model inference
- LM Studio - Local model management
- Hugging Face - Direct model integration
- vLLM - High-throughput inference
- LocalAI - Unified local AI API
Specialized Providers
- Together AI - Optimized inference
- Replicate - Model marketplace
- Fireworks AI - Fast inference
- DeepInfra - Cost-effective models
Configuration Steps
Step 1: Access LLM Settings
- Log into RAGFlow web interface
- Navigate to System Settings > Model Providers
- Click Add Provider to configure a new LLM
Step 2: Configure API Keys
# Set environment variables for different providers
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
Step 3: Provider-Specific Setup
OpenAI Configuration
{
"provider": "OpenAI",
"model": "gpt-4o",
"api_key": "sk-...",
"temperature": 0.1,
"max_tokens": 2000,
"top_p": 0.9
}
Anthropic Configuration
{
"provider": "Anthropic",
"model": "claude-3-5-sonnet-20241022",
"api_key": "sk-ant-...",
"temperature": 0.1,
"max_tokens": 4000,
"system_prompt": "You are a helpful assistant that answers questions based on provided context."
}
Local Ollama Setup
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull llama3.1:8b
# Start Ollama service
ollama serve
Then configure in RAGFlow:
{
"provider": "Ollama",
"model": "llama3.1:8b",
"base_url": "http://localhost:11434",
"temperature": 0.1,
"num_ctx": 4096
}
Advanced LLM Configuration
Temperature & Sampling
{
"temperature": 0.1, // Lower = more deterministic
"top_p": 0.9, // Nucleus sampling
"top_k": 40, // Top-k sampling
"repetition_penalty": 1.1, // Reduce repetition
"max_tokens": 2000 // Response length limit
}
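To build intuition for what these knobs do, here is a minimal, self-contained sketch of how temperature, top-k, and top-p (nucleus) filtering reshape a token distribution before sampling. This is illustrative only; `apply_sampling` is a hypothetical helper, not RAGFlow or provider code, and real providers apply this filtering server-side.

```python
import math

def apply_sampling(logits, temperature=1.0, top_p=0.9, top_k=40):
    """Reshape raw logits into a filtered, renormalized distribution."""
    # Temperature scaling: lower values sharpen the distribution,
    # making the most likely token dominate (more deterministic output).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: rank tokens by probability and keep at most k candidates.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative mass
    # reaches top_p; the long tail of unlikely tokens is cut off.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving candidates.
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

Note how the two settings interact: at `temperature=0.1` nearly all probability mass lands on one token, so the nucleus collapses to a single candidate regardless of `top_p`.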
Context Window Management
{
"max_context_length": 8192, // Maximum context tokens
"overlap_size": 200, // Chunk overlap for retrieval
"compression_ratio": 0.7, // Context compression
"hierarchical_retrieval": true // Multi-level retrieval
}
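The `overlap_size` setting corresponds to a standard windowing idea: adjacent chunks share a band of tokens so a fact that straddles a chunk boundary survives intact in at least one chunk. A minimal sketch under that assumption (the `chunk_with_overlap` helper is hypothetical, not the RAGFlow implementation):

```python
def chunk_with_overlap(tokens, max_len=8192, overlap=200):
    """Split a token sequence into windows of at most max_len tokens,
    each sharing `overlap` tokens with its predecessor."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    step = max_len - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already covers the tail
    return chunks
```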
Model Switching & Fallbacks
Primary-Secondary Model Setup
{
"models": [
{
"name": "gpt-4o",
"provider": "OpenAI",
"priority": 1,
"fallback": false
},
{
"name": "claude-3-5-sonnet",
"provider": "Anthropic",
"priority": 2,
"fallback": true
},
{
"name": "llama3.1:8b",
"provider": "Ollama",
"priority": 3,
"fallback": true
}
]
}
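At runtime, a priority list like this boils down to "walk the models in ascending priority order and stop at the first successful call." A hedged sketch of that control flow (`call_with_fallback` is an illustrative helper, not RAGFlow's actual API):

```python
def call_with_fallback(models, invoke):
    """Try each configured model in ascending priority order;
    on any error, fall through to the next entry."""
    last_err = None
    for m in sorted(models, key=lambda m: m["priority"]):
        try:
            return m["name"], invoke(m)
        except Exception as err:
            last_err = err  # remember why this model failed
    raise RuntimeError("all configured models failed") from last_err
```

For example, if the priority-1 model raises a rate-limit error, the call transparently lands on the priority-2 fallback.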
Load Balancing Configuration
{
"load_balancing": {
"enabled": true,
"strategy": "round_robin",
"health_check_interval": 30,
"timeout": 10
}
}
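A round-robin strategy simply rotates requests across endpoints, skipping any that are known to be unhealthy. The sketch below simplifies the `health_check_interval` away by letting callers report failures directly; the class name and API are hypothetical, not RAGFlow internals.

```python
import itertools

class RoundRobinBalancer:
    """Rotate requests across provider endpoints, skipping ones
    the caller has marked as down."""
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.down = set()
        self._cycle = itertools.cycle(self.endpoints)

    def mark_down(self, endpoint):
        # In a real system a periodic health check would add/remove
        # entries here; we let the caller report failures instead.
        self.down.add(endpoint)

    def next(self):
        # At most one full rotation before giving up.
        for _ in range(len(self.endpoints)):
            ep = next(self._cycle)
            if ep not in self.down:
                return ep
        raise RuntimeError("no healthy endpoints")
```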
Performance Optimization
Caching Strategies
{
"caching": {
"enabled": true,
"ttl": 3600, // Cache TTL in seconds
"max_cache_size": "1GB", // Maximum cache size
"compression": true // Enable response compression
}
}
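The core of a TTL cache is small enough to sketch in full: store each value with a timestamp, and treat any entry older than `ttl` seconds as a miss. This is a simplified illustration (the `TTLCache` class is hypothetical, and the injectable `clock` exists only to make the behavior testable), not RAGFlow's cache layer.

```python
import time

class TTLCache:
    """Cache responses keyed by e.g. (model, prompt); entries
    expire after ttl seconds."""
    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable for deterministic tests
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stored_at = hit
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock())
```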
Batch Processing
{
"batch_processing": {
"enabled": true,
"max_batch_size": 10,
"timeout": 30,
"concurrency_limit": 5
}
}
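The `max_batch_size` setting corresponds to a simple grouping step: pending requests are sliced into order-preserving batches no larger than the limit (concurrency and timeouts are then applied per batch). A minimal sketch with a hypothetical helper name:

```python
def make_batches(requests, max_batch_size=10):
    """Group pending requests into batches of at most
    max_batch_size, preserving arrival order."""
    return [requests[i:i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]
```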
Use Case Optimization
Different Configurations for Different Tasks
Document Q&A
{
"task": "document_qa",
"model": "gpt-4o",
"temperature": 0.1,
"max_tokens": 1000,
"system_prompt": "Answer questions based solely on the provided document context."
}
Creative Writing
{
"task": "creative_writing",
"model": "claude-3-5-sonnet",
"temperature": 0.8,
"max_tokens": 2000,
"system_prompt": "Generate creative content while staying relevant to the document context."
}
Code Generation
{
"task": "code_generation",
"model": "gpt-4o",
"temperature": 0.2,
"max_tokens": 1500,
"system_prompt": "Generate code based on the documentation and requirements provided."
}
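Task-specific configurations like the three above are naturally expressed as a preset table with a safe default. The sketch below mirrors those JSON snippets; the `TASK_PRESETS` mapping and `config_for` function are hypothetical names for illustration.

```python
# Presets mirroring the task configurations above; unknown tasks
# fall back to the conservative document_qa settings.
TASK_PRESETS = {
    "document_qa":      {"model": "gpt-4o",            "temperature": 0.1, "max_tokens": 1000},
    "creative_writing": {"model": "claude-3-5-sonnet", "temperature": 0.8, "max_tokens": 2000},
    "code_generation":  {"model": "gpt-4o",            "temperature": 0.2, "max_tokens": 1500},
}

def config_for(task):
    """Look up the preset for a task, defaulting to document_qa."""
    return TASK_PRESETS.get(task, TASK_PRESETS["document_qa"])
```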
Monitoring & Analytics
Response Quality Metrics
{
"monitoring": {
"response_time_tracking": true,
"token_usage_monitoring": true,
"quality_scoring": true,
"error_rate_tracking": true
}
}
Custom Metrics Dashboard
{
"dashboard": {
"real_time_metrics": true,
"historical_trends": true,
"model_comparison": true,
"cost_analysis": true
}
}
Troubleshooting Common Issues
API Rate Limits
# Monitor rate limits
curl -X GET "http://localhost:80/api/rate-limits" \
-H "Authorization: Bearer YOUR_TOKEN"
# Implement exponential backoff
# RAGFlow handles this automatically with retry logic
Model Compatibility Issues
# Check model compatibility
curl -X POST "http://localhost:80/api/models/check-compatibility" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "task": "document_qa"}'
Context Window Overflow
{
"error_handling": {
"context_overflow_strategy": "truncate",
"chunk_reduction_ratio": 0.8,
"fallback_model": "gpt-3.5-turbo"
}
}
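A "truncate" overflow strategy means: keep retrieved chunks in rank order until adding the next one would exceed the model's context budget, and drop the rest. A hedged sketch of that idea (the helper is hypothetical; for the demo, `count=len` treats one character as one token, where a real implementation would use the model's tokenizer):

```python
def truncate_context(chunks, max_tokens, count=len):
    """Keep highest-ranked chunks until the token budget is hit;
    drop everything after the first chunk that would overflow."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > max_tokens:
            break  # adding this chunk would overflow the context
        kept.append(chunk)
        used += cost
    return kept
```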
Security Best Practices
API Key Management
# Use environment variables
export RAGFLOW_ENCRYPTION_KEY="your-encryption-key"
# Rotate keys regularly
curl -X POST "http://localhost:80/api/keys/rotate" \
-H "Authorization: Bearer ADMIN_TOKEN"
Model Access Control
{
"access_control": {
"user_roles": ["admin", "editor", "viewer"],
"model_permissions": {
"gpt-4": ["admin", "editor"],
"claude-3": ["admin", "editor", "viewer"],
"local-models": ["admin", "editor", "viewer"]
}
}
}
Production Deployment Considerations
High Availability Setup
# docker-compose.prod.yml
version: '3.8'
services:
ragflow:
image: infiniflow/ragflow:latest
environment:
- LLM_PROVIDER_BACKUP=true
- LOAD_BALANCER_ENABLED=true
- CACHE_LAYER=redis
depends_on:
- redis
- postgres
redis:
image: redis:7-alpine
postgres:
image: postgres:15
Scaling Strategies
{
"scaling": {
"auto_scaling": true,
"min_instances": 2,
"max_instances": 10,
"cpu_threshold": 70,
"memory_threshold": 80
}
}
Best Practices
Model Selection Guidelines
| Use Case | Recommended Models | Rationale |
|---|---|---|
| Document Q&A | GPT-4, Claude 3 | High accuracy, good context understanding |
| Creative Tasks | Claude 3, GPT-4 | Better at generating natural, creative responses |
| Code Generation | GPT-4, Claude 3 | Strong code understanding and generation |
| Cost-Effective | GPT-3.5, Claude Instant | Good balance of cost and performance |
| Local/Offline | Llama 3, Mistral | Privacy-focused, no API costs |
Performance Optimization Tips
- Use Appropriate Model Sizes: Larger models for complex tasks, smaller for simple queries
- Implement Caching: Cache frequent queries and responses
- Monitor Usage: Track token consumption and costs
- Load Balancing: Distribute requests across multiple model instances
- Fallback Strategies: Have backup models for reliability
Next Steps
Now that you have configured LLMs for RAGFlow, you're ready to:
- Chapter 6: Chatbot Development - Build conversational interfaces
- Chapter 7: Advanced Features - Explore advanced RAGFlow capabilities
- Chapter 8: Production Deployment - Deploy at scale
Ready to build intelligent chatbots? Continue to Chapter 6: Chatbot Development!
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries around settings like `model`, `temperature`, and `provider` so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about LLM integration and configuration as an operating subsystem of RAGFlow, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `model`, `temperature`, and `max_tokens` as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, LLM integration in RAGFlow usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the selected model.
- Input normalization: shape incoming data so downstream parameters such as `temperature` receive stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the provider.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
LLM Integration Architecture
flowchart LR
A[RAGFlow Backend] --> B{LLM Provider}
B --> C[OpenAI GPT-4o]
B --> D[Anthropic Claude]
B --> E[Ollama Local]
B --> F[Azure OpenAI]
C --> G[Answer Generation]
D --> G
E --> G
F --> G
G --> H[RAG Response with Citations]
Source Walkthrough
Use the following upstream sources to verify implementation details while reading this chapter:
- GitHub Repository (github.com). Why it matters: the authoritative reference for the RAGFlow implementation.
- AI Codebase Knowledge Builder (github.com). Why it matters: the authoritative reference for this tutorial series.
Suggested trace strategy:
- search upstream code for `model` and `temperature` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production