Chapter 8: Production Deployment

March 2, 2026

Welcome to Chapter 8: Production Deployment. In this part of the Quivr Tutorial: Open-Source RAG Framework for Document Ingestion, you will first build a mental model of a production deployment, then work through concrete implementation details and the tradeoffs that come with running at scale.

In Chapter 7, you customized Quivr with domain-specific processors, rerankers, prompts, and plugins. Now it is time to take everything to production. A development setup running on localhost is fine for experimentation, but serving real users at scale requires proper containerization, infrastructure design, security hardening, monitoring, and cost management.

This chapter covers the complete journey from a single Docker container to a production-grade deployment: infrastructure architecture, Docker and Kubernetes configurations, database and vector store scaling, security hardening, observability, performance tuning, backup strategies, and a comprehensive go-live checklist.

Production Architecture

flowchart TD
    A[Users] --> B[Load Balancer / CDN]
    B --> C[API Gateway]
    C --> D[Auth Service]
    C --> E[Quivr API Cluster]

    E --> F[Ingestion Workers]
    E --> G[Query Workers]

    F --> H[Document Queue]
    H --> I[Processing Pipeline]
    I --> J[Embedding Service]

    J --> K[(Vector Database)]
    E --> K
    G --> K

    E --> L[(PostgreSQL)]
    E --> M[LLM Provider]
    E --> N[Object Storage]

    O[Monitoring Stack] --> E
    O --> K
    O --> L

    classDef user fill:#e1f5fe,stroke:#01579b
    classDef gateway fill:#f3e5f5,stroke:#4a148c
    classDef service fill:#fff3e0,stroke:#ef6c00
    classDef data fill:#e8f5e8,stroke:#1b5e20
    classDef monitor fill:#fce4ec,stroke:#c62828

    class A user
    class B,C,D gateway
    class E,F,G,H,I,J service
    class K,L,M,N data
    class O monitor

Docker Deployment

Production Docker Compose

# docker-compose.prod.yml
version: "3.8"

services:
  api:
    image: quivrhq/quivr-api:${QUIVR_VERSION:-latest}
    restart: always
    ports:
      - "8000:8000"
    env_file: .env.production
    environment:
      - WORKERS=4
      - MAX_REQUESTS=1000
      - MAX_REQUESTS_JITTER=50
      - TIMEOUT=120
      - GRACEFUL_TIMEOUT=30
    depends_on:
      db:
        condition: service_healthy
      vectordb:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 8G
        reservations:
          cpus: "2"
          memory: 4G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    volumes:
      - upload_tmp:/tmp/uploads
    networks:
      - quivr-net

  worker:
    image: quivrhq/quivr-worker:${QUIVR_VERSION:-latest}
    restart: always
    env_file: .env.production
    environment:
      - CELERY_CONCURRENCY=4
      - CELERY_MAX_TASKS_PER_CHILD=100
    depends_on:
      - api
      - redis
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "4"
          memory: 8G
    networks:
      - quivr-net

  db:
    image: postgres:15-alpine
    restart: always
    environment:
      POSTGRES_DB: quivr
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G
    networks:
      - quivr-net

  vectordb:
    image: qdrant/qdrant:v1.7.4
    restart: always
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__STORAGE__STORAGE_PATH=/qdrant/storage
      - QDRANT__STORAGE__OPTIMIZERS__INDEXING_THRESHOLD=20000
    healthcheck:
      # Assumes curl is available inside the image; verify this for your Qdrant tag
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 16G
        reservations:
          memory: 8G
    networks:
      - quivr-net

  redis:
    image: redis:7-alpine
    restart: always
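    # Note: allkeys-lru can evict any key under memory pressure, including Celery
    # queue entries, because the broker (CELERY_BROKER_URL) shares this instance.
    command: redis-server --maxmemory 1gb --maxmemory-policy allkeys-lru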
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - quivr-net

  nginx:
    image: nginx:alpine
    restart: always
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - api
    networks:
      - quivr-net

volumes:
  postgres_data:
  qdrant_data:
  redis_data:
  upload_tmp:

networks:
  quivr-net:
    driver: bridge
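
The health checks in the Compose file probe `/health` on the API, and `curl -f` fails on any HTTP status of 400 or above. The following is a minimal sketch of how such an endpoint might aggregate dependency probes. The function name `aggregate_health` and the probe callables are hypothetical, not part of Quivr's actual API.

```python
from typing import Callable, Dict


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency probe and report an overall status.

    A probe that raises is treated as failed, so one broken
    dependency cannot crash the health endpoint itself.
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    return {
        "status": "ok" if healthy else "degraded",
        # 503 makes `curl -f` exit non-zero, which marks the container unhealthy
        "http_status": 200 if healthy else 503,
        "checks": results,
    }


def _failing_probe() -> bool:
    # Stand-in for a real Redis ping that cannot connect
    raise ConnectionError("redis unreachable")


report = aggregate_health({
    "db": lambda: True,
    "vectordb": lambda: True,
    "redis": _failing_probe,
})
print(report["status"], report["http_status"])
```

In a real service each probe would be a cheap call with a short timeout, such as `SELECT 1` against PostgreSQL or a Redis `PING`.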

Production Environment Variables

# .env.production configuration reference
production_config = {
    # Database
    "DB_HOST": "db",
    "DB_PORT": "5432",
    "DB_NAME": "quivr",
    "DB_USER": "quivr_app",
    "DB_PASSWORD": "use-a-secret-manager",  # Never hardcode
    "DB_POOL_SIZE": "20",
    "DB_MAX_OVERFLOW": "10",

    # Vector Database
    "VECTOR_DB_HOST": "vectordb",
    "VECTOR_DB_PORT": "6333",
    "VECTOR_DB_COLLECTION": "quivr-production",

    # Redis
    "REDIS_URL": "redis://redis:6379/0",
    "CELERY_BROKER_URL": "redis://redis:6379/1",

    # LLM Provider
    "OPENAI_API_KEY": "use-a-secret-manager",
    "LLM_MODEL": "gpt-4-turbo-preview",
    "LLM_TEMPERATURE": "0.3",
    "LLM_MAX_TOKENS": "1000",

    # Embedding
    "EMBEDDING_MODEL": "text-embedding-3-small",
    "EMBEDDING_DIMENSIONS": "1536",
    "EMBEDDING_BATCH_SIZE": "100",

    # Security
    "SECRET_KEY": "use-a-secret-manager",
    "CORS_ORIGINS": "https://app.yourcompany.com",
    "MAX_UPLOAD_SIZE_MB": "25",
    "RATE_LIMIT_RPM": "100",

    # Observability
    "LOG_LEVEL": "INFO",
    "LOG_FORMAT": "json",
    "SENTRY_DSN": "https://your-sentry-dsn",
    "PROMETHEUS_ENABLED": "true",
}
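
Because the reference above uses placeholder values for secrets, it can help to fail fast at startup if one of them was never replaced. The sketch below is one way to do that, assuming the key names shown in the reference. `validate_config` and the constant names are hypothetical helpers, not part of Quivr.

```python
PLACEHOLDER = "use-a-secret-manager"
REQUIRED_SECRETS = ("DB_PASSWORD", "OPENAI_API_KEY", "SECRET_KEY")
INTEGER_KEYS = (
    "DB_POOL_SIZE", "DB_MAX_OVERFLOW", "EMBEDDING_BATCH_SIZE",
    "MAX_UPLOAD_SIZE_MB", "RATE_LIMIT_RPM",
)


def validate_config(env: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_SECRETS:
        value = env.get(key, "")
        if not value or value == PLACEHOLDER:
            problems.append(f"{key} is missing or still a placeholder")
    for key in INTEGER_KEYS:
        raw = env.get(key)
        if raw is None:
            continue  # optional keys fall back to application defaults
        if not raw.isdigit() or int(raw) <= 0:
            problems.append(f"{key} must be a positive integer, got {raw!r}")
    if env.get("CORS_ORIGINS", "").strip() == "*":
        problems.append("CORS_ORIGINS must not be a wildcard in production")
    return problems


sample = {
    "DB_PASSWORD": "use-a-secret-manager",   # placeholder never replaced
    "OPENAI_API_KEY": "sk-test",
    "SECRET_KEY": "s3cret",
    "DB_POOL_SIZE": "twenty",                # not a number
    "CORS_ORIGINS": "https://app.yourcompany.com",
}
for problem in validate_config(sample):
    print(problem)
```

At process start you would pass `dict(os.environ)` and exit with a non-zero status if the returned list is not empty.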

Kubernetes Deployment

For larger-scale deployments, Kubernetes provides auto-scaling, rolling updates, and self-healing.

flowchart LR
    A[Ingress Controller] --> B[API Service]
    B --> C[API Pods x3]
    A --> D[Worker Service]
    D --> E[Worker Pods x2]

    C --> F[PostgreSQL StatefulSet]
    C --> G[Qdrant StatefulSet]
    C --> H[Redis Deployment]
    E --> H

    I[HPA] --> C
    I --> E

    classDef ingress fill:#e1f5fe,stroke:#01579b
    classDef service fill:#f3e5f5,stroke:#4a148c
    classDef stateful fill:#e8f5e8,stroke:#1b5e20
    classDef autoscale fill:#fff3e0,stroke:#ef6c00

    class A ingress
    class B,D service
    class C,E service
    class F,G,H stateful
    class I autoscale

Kubernetes API Deployment

# k8s/api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quivr-api
  labels:
    app: quivr
    component: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: quivr
      component: api
  template:
    metadata:
      labels:
        app: quivr
        component: api
    spec:
      containers:
        - name: api
          image: quivrhq/quivr-api:1.0.0
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: quivr-secrets
            - configMapRef:
                name: quivr-config
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 30
          volumeMounts:
            - name: tmp-uploads
              mountPath: /tmp/uploads
      volumes:
        - name: tmp-uploads
          emptyDir:
            sizeLimit: 5Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: quivr-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: quivr-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
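
To reason about how the autoscaler above will behave, it helps to work through the scaling rule documented by Kubernetes: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. When several metrics are configured, the largest proposal wins. The sketch below is a simplified model under those assumptions. It ignores stabilization windows and pods that are not yet ready, and the function name is hypothetical.

```python
import math


def desired_replicas(current, metrics, min_replicas=2, max_replicas=10,
                     tolerance=0.1):
    """Simplified HPA calculation.

    metrics: list of (current_utilization, target_utilization) pairs,
    for example [(cpu_now, 70), (memory_now, 80)].
    """
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: this metric proposes no change
            proposals.append(current)
        else:
            proposals.append(math.ceil(current * ratio))
    wanted = max(proposals)  # the most demanding metric wins
    return max(min_replicas, min(max_replicas, wanted))


# 3 pods at 90% CPU (target 70) and 60% memory (target 80)
print(desired_replicas(3, [(90, 70), (60, 80)]))
```

With these inputs CPU proposes 4 replicas and memory proposes 3, so the autoscaler would move to 4. The memory target is worth watching: memory often does not drop after a load spike, which can keep replica counts high.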

Security Hardening

Security Configuration

from quivr.security import SecurityConfig

security = SecurityConfig(
    # TLS / HTTPS
    tls_enabled=True,
    tls_cert_path="/etc/ssl/certs/quivr.crt",
    tls_key_path="/etc/ssl/private/quivr.key",
    min_tls_version="1.2",

    # Authentication
    auth_provider="oauth2",          # "api_key", "oauth2", "saml"
    oauth2_issuer="https://auth.yourcompany.com",
    oauth2_audience="quivr-api",
    token_expiry_minutes=60,

    # Authorization
    default_role="viewer",
    enforce_kb_permissions=True,

    # Input validation
    max_upload_size_mb=25,
    allowed_file_types=[".pdf", ".txt", ".md", ".docx", ".html"],
    max_query_length=1000,
    sanitize_inputs=True,

    # Rate limiting
    rate_limit_per_user_rpm=60,
    rate_limit_per_key_rpm=100,
    rate_limit_burst=10,

    # Data protection
    encrypt_at_rest=True,
    encrypt_vectors=False,           # Performance tradeoff
    pii_detection_enabled=True,
    audit_log_enabled=True
)
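
The rate-limit settings above pair a sustained rate (`rate_limit_per_user_rpm`) with a burst allowance (`rate_limit_burst`). A token bucket is one common way to implement that combination. The sketch below is illustrative only: how Quivr interprets these parameters internally is an assumption, and the `TokenBucket` class is hypothetical.

```python
import time


class TokenBucket:
    """Allow `burst` requests at once, refilled at `rate_per_minute`."""

    def __init__(self, rate_per_minute, burst, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)      # start with a full bucket
        self.clock = clock              # injectable for testing
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that passed, never beyond capacity
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.rate_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Deterministic demo with a fake clock instead of real time
fake_now = [0.0]
bucket = TokenBucket(rate_per_minute=60, burst=10, clock=lambda: fake_now[0])
burst_results = [bucket.allow() for _ in range(12)]   # 12 requests at t=0
fake_now[0] += 2.0                                    # two seconds later
after_wait = [bucket.allow() for _ in range(3)]
print(burst_results.count(True), after_wait)
```

For per-user limits you would keep one bucket per user ID, typically in Redis rather than in process memory so that all API replicas share the same counters.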

Security Checklist

| Category | Requirement | Priority |
| --- | --- | --- |
| Transport | TLS 1.2+ for all connections | Critical |
| Authentication | OAuth2 / SAML for users | Critical |
| Authentication | Scoped API keys with expiration | Critical |
| Authorization | RBAC on all knowledge bases | High |
| Input Validation | File type and size limits | High |
| Input Validation | Query length limits | Medium |
| Data Protection | Encryption at rest for databases | High |
| Data Protection | PII detection and masking | Medium |
| Rate Limiting | Per-user and per-key limits | High |
| Audit | Structured audit logs for all operations | High |
| Network | Internal services not exposed publicly | Critical |
| Secrets | All credentials in secret manager | Critical |
| Dependencies | Regular vulnerability scanning | Medium |
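
The input-validation rows in the checklist can be enforced with a few lines at the API boundary. The sketch below assumes the limits from the security configuration (25 MB uploads, five allowed extensions, 1000-character queries). The function names are hypothetical, and an extension check alone does not prove what a file actually contains.

```python
from pathlib import PurePosixPath

ALLOWED_FILE_TYPES = {".pdf", ".txt", ".md", ".docx", ".html"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
MAX_QUERY_LENGTH = 1000


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload violates type or size limits."""
    # Only the final suffix counts, so "report.pdf.exe" is rejected
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix not in ALLOWED_FILE_TYPES:
        raise ValueError(f"file type {suffix or '(none)'} is not allowed")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 25 MB upload limit")


def validate_query(query: str) -> str:
    """Return the trimmed query, or raise ValueError if it is unusable."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query is empty")
    if len(cleaned) > MAX_QUERY_LENGTH:
        raise ValueError("query exceeds 1000 characters")
    return cleaned


validate_upload("report.PDF", 1024)            # passes: case-insensitive match
print(validate_query("  How do we deploy?  "))
```

Checking the size again in the application, in addition to `client_max_body_size` at the proxy, protects deployments where traffic can reach the API without passing through Nginx.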

Nginx Reverse Proxy Configuration

# Generate nginx configuration for Quivr.
# These blocks belong inside nginx's http { } context (for example, a file
# under /etc/nginx/conf.d/). If you mount this as the main nginx.conf, wrap
# them in http { } and add an events { } block.
nginx_config = """
# limit_req_zone is only valid at the http level, so it is declared here
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

upstream quivr_api {
    least_conn;
    server api:8000;
}

server {
    listen 443 ssl http2;
    server_name api.yourcompany.com;

    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Rate limiting (uses the zone declared above)
    limit_req zone=api burst=20 nodelay;

    # Upload size limit
    client_max_body_size 25m;

    location /api/ {
        proxy_pass http://quivr_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }

    location /health {
        proxy_pass http://quivr_api/health;
        access_log off;
    }
}

server {
    listen 80;
    server_name api.yourcompany.com;
    return 301 https://$host$request_uri;
}
"""

Observability

Metrics with Prometheus

from quivr.monitoring.metrics import QuivrMetrics

metrics = QuivrMetrics(
    namespace="quivr",
    port=9090
)

# Key metrics exposed automatically
exposed_metrics = {
    # Ingestion metrics
    "quivr_ingestion_total": "Total documents ingested",
    "quivr_ingestion_errors_total": "Total ingestion errors",
    "quivr_ingestion_duration_seconds": "Document processing time",
    "quivr_chunks_created_total": "Total chunks created",

    # Query metrics
    "quivr_queries_total": "Total queries processed",
    "quivr_query_duration_seconds": "End-to-end query latency",
    "quivr_retrieval_duration_seconds": "Vector search latency",
    "quivr_generation_duration_seconds": "LLM generation latency",
    "quivr_query_results_count": "Number of results returned",

    # Resource metrics
    "quivr_active_connections": "Current active connections",
    "quivr_queue_depth": "Ingestion queue depth",
    "quivr_vector_count": "Total vectors stored",
    "quivr_storage_bytes": "Total storage used",

    # LLM metrics
    "quivr_llm_tokens_total": "Total LLM tokens consumed",
    "quivr_llm_cost_dollars": "Estimated LLM cost",
    "quivr_embedding_tokens_total": "Total embedding tokens",
}

Structured Logging

from quivr.monitoring.logging import configure_logging
import structlog

configure_logging(
    level="INFO",
    format="json",
    output="stdout",
    include_fields=[
        "timestamp", "level", "message",
        "request_id", "user_id", "kb_id",
        "latency_ms", "status_code"
    ]
)

logger = structlog.get_logger()

# Example structured log entries
logger.info(
    "query_completed",
    request_id="req-abc123",
    user_id="alice@company.com",
    kb_id="kb-eng",
    query="How do we deploy?",
    latency_ms=1250,
    retrieval_ms=200,
    generation_ms=1000,
    results_count=5,
    tokens_used=1500,
    model="gpt-4-turbo-preview"
)

logger.warning(
    "slow_query",
    request_id="req-def456",
    latency_ms=5000,
    threshold_ms=3000,
    query="complex multi-part question..."
)

logger.error(
    "ingestion_failed",
    filename="corrupt-file.pdf",
    error_type="ExtractionError",
    error_message="PDF file is corrupted or password-protected",
    stack_trace="..."
)
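
If structlog is not available, a comparable JSON log line can be produced with the standard library alone. The sketch below is a stdlib substitute rather than the `configure_logging` helper shown above. The `JsonFormatter` class and the logger name `quivr.sketch` are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render log records as one JSON object per line."""

    # Context fields copied from `extra=` when present on the record
    FIELDS = ("request_id", "user_id", "kb_id", "latency_ms", "status_code")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("quivr.sketch")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False  # avoid duplicate lines from the root logger

log.info("query_completed",
         extra={"request_id": "req-abc123", "latency_ms": 1250})
```

Restricting output to an explicit field list also makes it harder for raw query text or other sensitive values to reach the logs by accident.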

Alerting Rules

# Prometheus alerting rules
alerting_rules = {
    "QuivrHighErrorRate": {
        "expr": "rate(quivr_queries_total{status='error'}[5m]) / "
                "rate(quivr_queries_total[5m]) > 0.05",
        "for": "5m",
        "severity": "critical",
        "summary": "Query error rate exceeds 5%"
    },
    "QuivrSlowQueries": {
        "expr": "histogram_quantile(0.95, sum by (le) ("
                "rate(quivr_query_duration_seconds_bucket[5m]))) > 5",
        "for": "10m",
        "severity": "warning",
        "summary": "P95 query latency exceeds 5 seconds"
    },
    "QuivrQueueBacklog": {
        "expr": "quivr_queue_depth > 1000",
        "for": "15m",
        "severity": "warning",
        "summary": "Ingestion queue depth exceeds 1000"
    },
    "QuivrStorageHigh": {
        "expr": "quivr_storage_bytes / quivr_storage_limit_bytes > 0.85",
        "for": "30m",
        "severity": "warning",
        "summary": "Storage usage exceeds 85%"
    },
    "QuivrVectorDBDown": {
        "expr": "up{job='qdrant'} == 0",
        "for": "1m",
        "severity": "critical",
        "summary": "Vector database is unreachable"
    },
    "QuivrLLMCostSpike": {
        "expr": "increase(quivr_llm_cost_dollars[1h]) > 50",
        "for": "5m",
        "severity": "warning",
        "summary": "LLM cost spike: >$50 in the last hour"
    }
}
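
Prometheus loads alerting rules from rule files whose top-level key is `groups`, with each rule carrying `alert`, `expr`, `for`, `labels`, and `annotations`. The sketch below converts a dictionary shaped like the one above into that structure. The function name `to_rule_group` is hypothetical, and the output is written as JSON on the assumption that your YAML loader accepts it, so validate the result with `promtool check rules` before relying on it.

```python
import json


def to_rule_group(name: str, rules: dict) -> dict:
    """Map {alert_name: spec} onto the Prometheus rule-file layout."""
    return {
        "groups": [{
            "name": name,
            "rules": [
                {
                    "alert": alert_name,
                    "expr": spec["expr"],
                    "for": spec["for"],
                    "labels": {"severity": spec["severity"]},
                    "annotations": {"summary": spec["summary"]},
                }
                for alert_name, spec in rules.items()
            ],
        }]
    }


sample_rules = {
    "QuivrVectorDBDown": {
        "expr": "up{job='qdrant'} == 0",
        "for": "1m",
        "severity": "critical",
        "summary": "Vector database is unreachable",
    }
}

rule_file = to_rule_group("quivr", sample_rules)
print(json.dumps(rule_file, indent=2))
```

Keeping rules as data in the repository and generating the rule file in CI makes alert changes reviewable like any other code change.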

Performance Tuning

Performance Optimization Guide

| Component | Parameter | Default | Production | Impact |
| --- | --- | --- | --- | --- |
| API Workers | WORKERS | 1 | 2-4 per CPU | Throughput |
| DB Pool | DB_POOL_SIZE | 5 | 20 | Connection throughput |
| Embedding Batch | EMBEDDING_BATCH_SIZE | 32 | 100 | Ingestion speed |
| Vector Index | hnsw:ef_search | 50 | 100-200 | Search accuracy |
| Vector Index | hnsw:m | 16 | 32 | Recall vs memory |
| Redis | maxmemory | 256mb | 1-4gb | Cache hit rate |
| Query | top_k | 10 | 5-8 | Latency vs recall |
| Upload | MAX_UPLOAD_SIZE_MB | 100 | 25 | Security / memory |
| Celery | CELERY_CONCURRENCY | 2 | 4-8 | Ingestion throughput |
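
Raising `WORKERS` and `DB_POOL_SIZE` together multiplies the number of database connections, which can exceed what PostgreSQL accepts (its default `max_connections` is typically 100). The arithmetic below is a sketch that assumes each worker process keeps its own pool, which is common for prefork servers but not confirmed for Quivr. Both helper names are hypothetical.

```python
def max_db_connections(replicas, workers_per_replica, pool_size, max_overflow):
    """Worst-case connections if every process fills pool plus overflow."""
    return replicas * workers_per_replica * (pool_size + max_overflow)


def pool_budget_per_process(max_connections, reserved, replicas,
                            workers_per_replica):
    """Largest pool+overflow per process that stays under the server limit."""
    return (max_connections - reserved) // (replicas * workers_per_replica)


POSTGRES_DEFAULT_MAX_CONNECTIONS = 100

# Values taken from this chapter: 3 API pods, WORKERS=4, pool 20 + overflow 10
api_peak = max_db_connections(3, 4, 20, 10)
# The same settings once the HPA has scaled out to maxReplicas=10
hpa_peak = max_db_connections(10, 4, 20, 10)
# Connections each process may hold if 10 are reserved for admin and backups
budget = pool_budget_per_process(POSTGRES_DEFAULT_MAX_CONNECTIONS, 10, 3, 4)

print(api_peak, hpa_peak, budget)
```

If the worst case exceeds the server limit, the usual options are a smaller per-process pool, a higher `max_connections` with matching memory, or a connection pooler in front of the database.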

Connection Pooling and Caching

from quivr.performance import PerformanceConfig

perf = PerformanceConfig(
    # Database connection pool
    db_pool_size=20,
    db_max_overflow=10,
    db_pool_timeout=30,
    db_pool_recycle=3600,

    # Redis caching
    cache_enabled=True,
    cache_ttl_seconds=3600,
    cache_max_size_mb=1024,
    cache_strategy="lru",

    # Embedding cache (avoid re-embedding same text)
    embedding_cache_enabled=True,
    embedding_cache_ttl_hours=168,    # 7 days

    # Query result cache
    query_cache_enabled=True,
    query_cache_ttl_seconds=300,      # 5 minutes
    query_cache_max_entries=10000,

    # Connection reuse
    http_keep_alive=True,
    http_connection_pool_size=100,
    llm_connection_pool_size=20
)
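
The query-cache settings above combine a time-to-live with a maximum entry count. The sketch below shows one way those two limits can interact, using an in-process `OrderedDict`. It is illustrative only: the `TTLCache` class is hypothetical, and a shared deployment would normally keep this state in Redis.

```python
import time
from collections import OrderedDict


class TTLCache:
    """LRU cache whose entries also expire after `ttl_seconds`."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock                 # injectable for testing
        self._entries = OrderedDict()      # key -> (expires_at, value)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._entries[key]         # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)     # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


now = [0.0]
cache = TTLCache(max_entries=2, ttl_seconds=300, clock=lambda: now[0])
cache.put("q1", "answer 1")
cache.put("q2", "answer 2")
cache.get("q1")                  # touch q1, so q2 is now least recently used
cache.put("q3", "answer 3")      # over capacity: q2 is evicted
print(cache.get("q2"), cache.get("q1"))
```

In a multi-tenant setup the cache key should include the knowledge base and the caller's permissions as well as the query text, so that one user's cached answer is never served to another.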

Load Testing

from quivr.testing import LoadTest

load_test = LoadTest(
    base_url="https://api.yourcompany.com",
    api_key="load-test-key"
)

# Configure test scenarios
scenarios = [
    {
        "name": "steady_state",
        "queries_per_second": 10,
        "duration_seconds": 300,
        "query_distribution": {
            "simple": 0.6,       # Simple factual queries
            "complex": 0.3,      # Multi-part queries
            "streaming": 0.1     # Streaming queries
        }
    },
    {
        "name": "peak_load",
        "queries_per_second": 50,
        "duration_seconds": 120,
        "ramp_up_seconds": 30
    },
    {
        "name": "ingestion_load",
        "uploads_per_minute": 100,
        "duration_seconds": 600,
        "file_size_range_kb": [10, 5000]
    }
]

# Run the test
results = load_test.run(scenarios)

for scenario in results:
    print(f"\n{'='*50}")
    print(f"Scenario: {scenario.name}")
    print(f"{'='*50}")
    print(f"Total requests: {scenario.total_requests}")
    print(f"Successful: {scenario.successful} ({scenario.success_rate:.1%})")
    print(f"P50 latency: {scenario.p50_ms:.0f}ms")
    print(f"P95 latency: {scenario.p95_ms:.0f}ms")
    print(f"P99 latency: {scenario.p99_ms:.0f}ms")
    print(f"Max latency: {scenario.max_ms:.0f}ms")
    print(f"Throughput: {scenario.rps:.1f} req/s")
    print(f"Error rate: {scenario.error_rate:.2%}")

Backup and Recovery

flowchart TD
    A[Backup Strategy] --> B[PostgreSQL]
    A --> C[Vector Database]
    A --> D[Object Storage]

    B --> E[Daily Full Backup]
    B --> F[WAL Streaming]
    B --> G[Point-in-Time Recovery]

    C --> H[Collection Snapshots]
    C --> I[Incremental Backups]

    D --> J[Cross-Region Replication]

    E --> K[S3 / GCS Bucket]
    F --> K
    H --> K
    I --> K
    J --> L[Secondary Region]

    classDef strategy fill:#e1f5fe,stroke:#01579b
    classDef method fill:#f3e5f5,stroke:#4a148c
    classDef storage fill:#e8f5e8,stroke:#1b5e20

    class A strategy
    class B,C,D strategy
    class E,F,G,H,I,J method
    class K,L storage

Backup Configuration

from quivr.ops.backup import BackupManager

backup = BackupManager(
    storage_backend="s3",
    bucket="quivr-backups",
    region="us-east-1",
    encryption=True
)

# Schedule automated backups
backup.schedule(
    components={
        "postgresql": {
            "method": "pg_dump",          # Logical dump; PITR also needs a physical base backup
            "schedule": "0 2 * * *",      # Daily at 2 AM
            "retention_days": 30,
            "wal_archiving": True          # Continuous WAL shipping
        },
        "vector_db": {
            "method": "snapshot",
            "schedule": "0 3 * * *",      # Daily at 3 AM
            "retention_days": 14
        },
        "config": {
            "method": "file_copy",
            "schedule": "0 1 * * 0",      # Weekly on Sunday
            "retention_days": 90,
            "include": [".env.production", "nginx.conf", "docker-compose.prod.yml"]
        }
    }
)

# Manual backup
result = backup.run_now(component="postgresql")
print(f"Backup completed: {result.filename}")
print(f"Size: {result.size_mb:.1f} MB")
print(f"Duration: {result.duration_seconds:.1f}s")
print(f"Location: {result.storage_path}")
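
The `retention_days` setting implies a pruning step that removes old backups. The sketch below shows the selection logic with one safeguard added as an assumption: the newest backup is always kept, so a silently failing backup job cannot lead to every copy being deleted. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_backups(backup_times, retention_days, now):
    """Return backups older than the retention window, oldest first."""
    cutoff = now - timedelta(days=retention_days)
    ordered = sorted(backup_times)
    # Skip the newest backup so at least one copy always survives
    return [t for t in ordered[:-1] if t < cutoff]


now = datetime(2026, 3, 2, tzinfo=timezone.utc)
# Backups taken 1, 10, 29, 31, and 45 days ago
backups = [now - timedelta(days=d) for d in (1, 10, 29, 31, 45)]

to_delete = expired_backups(backups, retention_days=30, now=now)
print([t.date().isoformat() for t in to_delete])
```

With a 30-day window, only the 31-day-old and 45-day-old backups are selected for deletion.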

Disaster Recovery

from quivr.ops.recovery import RecoveryManager

recovery = RecoveryManager(
    backup_storage="s3://quivr-backups"
)

# List available backups
backups = recovery.list_backups(
    component="postgresql",
    limit=10
)

for b in backups:
    print(f"{b.timestamp}: {b.filename} ({b.size_mb:.1f} MB)")

# Restore from a specific backup
restore_result = recovery.restore(
    component="postgresql",
    backup_id=backups[0].id,
    target_database="quivr_restored",
    verify_integrity=True
)

print(f"Restore status: {restore_result.status}")
print(f"Tables restored: {restore_result.tables_count}")
print(f"Rows restored: {restore_result.rows_count:,}")
print(f"Duration: {restore_result.duration_seconds:.1f}s")

# Point-in-time recovery
pitr_result = recovery.point_in_time_restore(
    component="postgresql",
    target_time="2024-06-15T14:30:00Z",
    target_database="quivr_pitr"
)
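
PostgreSQL point-in-time recovery works by restoring a physical base backup taken before the target time and replaying archived WAL up to that moment; a `pg_dump` archive cannot serve as that base. The sketch below shows how the base backup might be chosen for the target time used above. The nightly schedule and both helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(text):
    """Parse timestamps such as 2024-06-15T14:30:00Z as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)


def pick_base_backup(backup_times, target_time):
    """Latest base backup that finished at or before the target time."""
    candidates = [t for t in backup_times if t <= target_time]
    if not candidates:
        raise LookupError("no base backup precedes the target time")
    return max(candidates)


# Assumed schedule: one base backup per night at 02:00 UTC
nightly = [parse_utc(f"2024-06-{day:02d}T02:00:00Z") for day in (13, 14, 15, 16)]
target = parse_utc("2024-06-15T14:30:00Z")

base = pick_base_backup(nightly, target)
replay_window = target - base   # how much WAL must be replayed
print(base.isoformat(), replay_window)
```

The replay window is a rough proxy for recovery time, so more frequent base backups shorten restores at the cost of extra storage.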

Cost Management

Cost Breakdown and Optimization

| Cost Category | Typical Range | Optimization Strategy |
| --- | --- | --- |
| LLM API calls | 40-60% of total | Cache responses; use smaller models for simple queries |
| Embedding API | 10-20% of total | Cache embeddings; use local models; incremental sync |
| Vector DB hosting | 10-15% of total | Optimize index params; archive old collections |
| Compute (API/Workers) | 10-20% of total | Auto-scale; right-size instances |
| Database hosting | 5-10% of total | Connection pooling; query optimization |
| Object storage | 1-5% of total | Lifecycle policies; compress uploads |

from quivr.ops.cost import CostDashboard

dashboard = CostDashboard(
    openai_api_key="your-key",
    cloud_provider="aws",
    aws_account_id="123456789"
)

# Get cost breakdown for the last 30 days
report = dashboard.generate_report(period_days=30)

print(f"Total cost: ${report.total_cost:.2f}")
print(f"\nBreakdown:")
print(f"  LLM API:      ${report.llm_cost:.2f} ({report.llm_pct:.0f}%)")
print(f"  Embeddings:   ${report.embedding_cost:.2f} ({report.embedding_pct:.0f}%)")
print(f"  Compute:      ${report.compute_cost:.2f} ({report.compute_pct:.0f}%)")
print(f"  Storage:      ${report.storage_cost:.2f} ({report.storage_pct:.0f}%)")
print(f"  Database:     ${report.database_cost:.2f} ({report.database_pct:.0f}%)")

print(f"\nOptimization suggestions:")
for suggestion in report.suggestions:
    print(f"  - {suggestion.description}")
    print(f"    Estimated savings: ${suggestion.monthly_savings:.2f}/month")

Go-Live Checklist

Pre-Launch Verification

from quivr.ops.preflight import PreflightChecker

checker = PreflightChecker(
    api_url="https://api.yourcompany.com",
    api_key="admin-key"
)

results = checker.run_all()

for check in results:
    status = "PASS" if check.passed else "FAIL"
    print(f"[{status}] {check.name}: {check.message}")

Comprehensive Checklist

| Category | Item | Status |
| --- | --- | --- |
| Infrastructure | TLS certificates installed and auto-renewed | Required |
| Infrastructure | DNS records configured | Required |
| Infrastructure | Load balancer health checks active | Required |
| Infrastructure | Auto-scaling policies configured | Recommended |
| Security | All secrets in secret manager (not env files) | Required |
| Security | API keys scoped and rotated | Required |
| Security | CORS origins restricted to production domains | Required |
| Security | Rate limiting enabled per user/key | Required |
| Security | WAF rules configured (if public-facing) | Recommended |
| Security | Vulnerability scan completed | Required |
| Data | Database backups scheduled and tested | Required |
| Data | Vector database snapshots scheduled | Required |
| Data | Point-in-time recovery tested | Recommended |
| Data | Data retention policies configured | Required |
| Observability | Structured logging to centralized system | Required |
| Observability | Prometheus metrics exposed | Required |
| Observability | Grafana dashboards configured | Recommended |
| Observability | Alert rules for error rate, latency, storage | Required |
| Observability | Error tracking (Sentry) configured | Recommended |
| Performance | Load test completed at 2x expected traffic | Required |
| Performance | P95 latency under 3 seconds | Required |
| Performance | Database connection pooling configured | Required |
| Performance | Caching layer (Redis) configured | Recommended |
| Operations | Runbook for common incidents documented | Required |
| Operations | On-call rotation established | Recommended |
| Operations | Rollback procedure tested | Required |
| Operations | Deployment pipeline (CI/CD) configured | Required |

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| Slow queries under load | Insufficient API workers | Increase WORKERS; add pod replicas |
| Memory pressure on vector DB | Index too large for RAM | Increase memory limit; enable disk-based index |
| Ingestion queue growing | Workers cannot keep up | Add worker replicas; increase CELERY_CONCURRENCY |
| 502 Bad Gateway | API pod crashed or restarting | Check pod logs; increase resource limits |
| Database connection exhausted | Pool size too small | Increase DB_POOL_SIZE; check for connection leaks |
| High LLM costs | No caching; unnecessary re-queries | Enable query cache; reduce max_tokens |
| Backup failures | Insufficient storage or permissions | Check S3 bucket permissions; clean old backups |
| Certificate expiration | Auto-renewal not configured | Use certbot with auto-renewal; set alerts |
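
For the certificate-expiration row, an alert needs the number of days left on the certificate. The sketch below computes that from the `notAfter` string format that Python's `ssl` module reports for peer certificates. The helper names, thresholds, and sample dates are hypothetical.

```python
import ssl


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between `now_epoch` and a certificate's notAfter timestamp."""
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / 86400


def expiry_severity(days_left, warn_days=30, critical_days=7):
    """Map remaining days onto an alert severity."""
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"


# Sample values in the format returned for a certificate's notAfter field
not_after = "Jun  1 00:00:00 2026 GMT"
now_epoch = ssl.cert_time_to_seconds("May 20 00:00:00 2026 GMT")

days_left = days_until_expiry(not_after, now_epoch)
print(days_left, expiry_severity(days_left))
```

Against a live endpoint, the `notAfter` value would come from `getpeercert()` on a connected TLS socket, and `now_epoch` from `time.time()`.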

Summary

Production deployment transforms Quivr from a development tool into an enterprise service. In this chapter you learned:

  • Docker Compose configuration for production with health checks, resource limits, and networking
  • Kubernetes deployment with auto-scaling, rolling updates, and pod management
  • Security Hardening with TLS, OAuth2, RBAC, rate limiting, and input validation
  • Observability with Prometheus metrics, structured logging, and alerting rules
  • Performance Tuning with connection pooling, caching, index optimization, and load testing
  • Backup and Recovery with automated backups, WAL streaming, and point-in-time recovery
  • Cost Management with breakdown analysis and optimization strategies
  • Go-Live Checklist covering infrastructure, security, data, observability, and operations

Key Takeaways

  1. Start with Docker Compose, graduate to Kubernetes -- Docker Compose handles most deployments; Kubernetes is for auto-scaling and multi-region.
  2. Security is not optional -- TLS, scoped API keys, rate limiting, and audit logging are table stakes for production.
  3. Monitor everything -- you cannot optimize what you cannot measure. Track latency, error rates, costs, and queue depths.
  4. Test your backups -- a backup that has never been restored is not a backup. Run recovery drills quarterly.
  5. Plan for cost growth -- LLM costs scale with usage. Cache aggressively and use the smallest model that meets your quality bar.

Built with insights from the Quivr project.

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code but drawing clear boundaries between the API, the workers, and the data stores so that behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

  • coupling core logic too tightly to one implementation path
  • missing the handoff boundaries between setup, execution, and validation
  • shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 8: Production Deployment as an operating subsystem inside Quivr Tutorial: Open-Source RAG Framework for Document Ingestion, with explicit contracts for inputs, state transitions, and outputs.

Use the configuration examples, diagrams, and checklists in this chapter as your starting point when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 8: Production Deployment usually follows a repeatable control path:

  1. Context bootstrap: initialize runtime config and prerequisites for the Quivr services.
  2. Input normalization: shape incoming data so downstream components receive stable contracts.
  3. Core execution: run the main logic branch and propagate intermediate state between stages.
  4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
  5. Output composition: return canonical result payloads for downstream consumers.
  6. Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

  • View Repo (github.com). Why it matters: it is the authoritative upstream reference.
  • AI Codebase Knowledge Builder (github.com). Why it matters: it is the authoritative reference for that project.

Suggested trace strategy:

  • search upstream code for the configuration keys and module names used in this chapter to map concrete implementation paths
  • compare docs claims against actual runtime/config code before reusing patterns in production

Chapter Connections