RAG Evaluation Metrics - Complete Guide

April 2, 2026

🎯 Overview

Dingo's RAG evaluation metrics system is based on best practices from the RAGAS paper, DeepEval, and TruLens, providing comprehensive RAG system evaluation capabilities.

✅ Supported Metrics (5/5)

| Metric | Evaluation Dimension | Required Fields | Source |
|---|---|---|---|
| Faithfulness | Answer Faithfulness | user_input, response, retrieved_contexts | RAGAS |
| Answer Relevancy | Answer Relevance | user_input, response | RAGAS |
| Context Relevancy | Context Relevance | user_input, retrieved_contexts | RAGAS + DeepEval + TruLens |
| Context Recall | Context Recall | user_input, retrieved_contexts, reference | RAGAS |
| Context Precision | Context Precision | user_input, retrieved_contexts, reference | RAGAS |

🚀 Quick Start

1. Run Examples

# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py

# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py

# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py

2. SDK Mode - Single Evaluation

import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness

# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),  # default must be a model served by the default api_url
)

# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming."
    ]
)

# Evaluate
result = LLMRAGFaithfulness.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}")  # status=False means passed
print(f"Reason: {result.reason[0]}")

3. Dataset Mode - Batch Evaluation

from dingo.config import InputArgs
from dingo.exec import Executor

# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}

llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}

input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference"
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config}
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

📋 Data Format

Required Fields

| Metric | user_input | response | retrieved_contexts | reference | Notes |
|---|---|---|---|---|---|
| Faithfulness | ✅ | ✅ | ✅ | - | Measures if answer is based on context |
| Answer Relevancy | ✅ | ✅ | - | - | Measures if answer addresses the question |
| Context Relevancy | ✅ | - | ✅ | - | Measures if retrieved contexts are relevant |
| Context Recall | ✅ | - | ✅ | ✅ | Measures if all needed info is retrieved |
| Context Precision | ✅ | - | ✅ | ✅ | Measures ranking quality of retrieved contexts |

Data Example (JSONL)

{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}
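Before launching a batch run, it can help to check which metrics each record can actually support. The following is a minimal sketch, not part of Dingo: `REQUIRED_FIELDS` and `runnable_metrics` are hypothetical names, and the field sets are copied from the Required Fields table.

```python
import json

# Required fields per metric, copied from the Required Fields table.
REQUIRED_FIELDS = {
    "Faithfulness": {"user_input", "response", "retrieved_contexts"},
    "Answer Relevancy": {"user_input", "response"},
    "Context Relevancy": {"user_input", "retrieved_contexts"},
    "Context Recall": {"user_input", "retrieved_contexts", "reference"},
    "Context Precision": {"user_input", "retrieved_contexts", "reference"},
}


def runnable_metrics(jsonl_line: str) -> list[str]:
    """Return the metrics whose required fields are all present and non-empty."""
    record = json.loads(jsonl_line)
    present = {key for key, value in record.items() if value}
    return [name for name, needed in REQUIRED_FIELDS.items() if needed <= present]


# A record without "reference" supports only the three reference-free metrics.
line = (
    '{"user_input": "What is deep learning?", '
    '"response": "Deep learning uses neural networks...", '
    '"retrieved_contexts": ["Deep learning is a subset of ML..."]}'
)
print(runnable_metrics(line))
# ['Faithfulness', 'Answer Relevancy', 'Context Relevancy']
```

Treating empty values as missing is a design choice of this sketch; an empty `retrieved_contexts` list would otherwise pass the check and only fail during evaluation.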

⚙️ Configuration

Configurable Parameters

| Parameter | Applicable Metrics | Default | Description |
|---|---|---|---|
| threshold | All metrics | 5.0 | Pass threshold (0-10) |
| strictness | Answer Relevancy | 3 | Number of questions to generate (1-5) |
| embedding_config | Answer Relevancy | - | Required: includes model, api_url, key |

Embedding Configuration (Answer Relevancy)

LLMRAGAnswerRelevancy requires embedding_config:

Option 1: Cloud LLM + Cloud Embedding

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Required
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",  # text-embedding-3-large is an OpenAI model
        "key": "YOUR_OPENAI_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}

Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Independent embedding service
        "model": "BAAI/bge-m3",
        "api_url": "http://localhost:8000/v1",  # Local vLLM/Xinference
        "key": "dummy-key"
    },
    "strictness": 3,
    "threshold": 5
}

Deploy Local Embedding (vLLM):

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000 \
  --host 0.0.0.0

What happens if embedding_config is not configured?

LLMRAGAnswerRelevancy raises a runtime exception:

ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
  - model: embedding model name (e.g., 'BAAI/bge-m3')
  - api_url: embedding service URL
  - key: API key (optional for local services)

📊 Metric Details

1️⃣ Faithfulness (Answer Faithfulness)

Evaluation Goal: Measure if the answer is entirely based on retrieved context, avoiding hallucinations

Calculation:

  1. Break down answer into independent statements (claims)
  2. Judge if each statement is supported by context
  3. Faithfulness score = (Supported statements / Total statements) × 10

Formula: $ \text{Faithfulness} = \frac{\text{Context-supported claims}}{\text{Total claims}} \times 10 $
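The scoring step of the calculation above can be sketched in a few lines. This is an illustration only: it assumes the claim extraction and per-claim judgments have already been produced by the LLM, and `faithfulness_score` is a hypothetical helper, not a Dingo function.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Score = supported claims / total claims * 10.

    `verdicts` holds one boolean per claim extracted from the answer:
    True if the judge found the claim supported by the retrieved context.
    """
    if not verdicts:
        # Assumption of this sketch: an answer with no extractable claims scores 0.
        return 0.0
    return sum(verdicts) / len(verdicts) * 10


# Three claims, two of them supported by the context.
print(round(faithfulness_score([True, True, False]), 2))  # 6.67
```

With the recommended threshold of 7, this example answer would fail because one of its three claims is unsupported.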

Recommended Threshold: 7 (out of 10)


2️⃣ Answer Relevancy (Answer Relevance)

Evaluation Goal: Measure if the answer directly addresses the user question

Calculation:

  1. Generate N reverse questions from the answer (questions inferred by LLM from the answer)
  2. Calculate cosine similarity between embeddings of generated questions and original question
  3. Answer Relevancy = Average of all similarities

Formula: $ \text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine\_sim}(E_{g_i}, E_o) $

Where:

  • $N$: Number of generated questions, default 3 (adjustable via the strictness parameter)
  • $E_{g_i}$: Embedding of the i-th generated question
  • $E_o$: Embedding of the original question
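As a worked illustration of the formula, the sketch below averages cosine similarities over toy 2-D vectors. Real embeddings would come from the configured embedding service; the vectors and the helper names `cosine_sim` and `answer_relevancy` are assumptions made for this example. The formula as stated yields a value between -1 and 1, and the sketch does not guess how that value is mapped onto the 0-10 scale.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy(generated: list[list[float]], original: list[float]) -> float:
    """Mean cosine similarity between each generated-question embedding and the original."""
    if not generated:
        raise ValueError("need at least one generated question embedding")
    return sum(cosine_sim(g, original) for g in generated) / len(generated)


# Toy embeddings: N = 3 generated questions compared against the original question.
e_o = [1.0, 0.0]
e_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(round(answer_relevancy(e_g, e_o), 3))  # (1 + 0 + 0.707) / 3 = 0.569
```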

⚠️ Important: This metric requires embedding_config:

  • model: Embedding model name (e.g., text-embedding-3-large, BAAI/bge-m3)
  • api_url: Embedding service address
  • key: API key (optional for local services)

Recommended Threshold: 5 (out of 10)


3️⃣ Context Relevancy (Context Relevance)

Evaluation Goal: Measure if retrieved contexts are relevant to the question

Calculation: Uses a Dual-Judge System from NVIDIA research:

Judge 1 Scoring:

  • 0 = Context completely irrelevant
  • 1 = Context partially relevant
  • 2 = Context fully relevant

Judge 2 Scoring:

  • Uses different prompt wording for another perspective
  • Same 0-2 scoring standard
  • Purpose: Reduce single-prompt bias

Final Score:

Context Relevancy = (Relevant contexts / Total contexts) × 10

Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold
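The aggregation rule above can be sketched as follows, assuming each judge has already returned a 0-2 score per context. The function name and the handling of an empty context list are assumptions of this example, not Dingo's implementation.

```python
def context_relevancy(judge1: list[int], judge2: list[int], threshold: float = 1.0) -> float:
    """Share of contexts whose mean judge score meets the threshold, scaled to 0-10."""
    if len(judge1) != len(judge2):
        raise ValueError("both judges must score every context")
    if not judge1:
        return 0.0  # Assumption of this sketch: no retrieved contexts scores 0.
    relevant = sum(1 for a, b in zip(judge1, judge2) if (a + b) / 2 >= threshold)
    return relevant / len(judge1) * 10


# Four contexts scored 0-2 by each judge; the per-context means are 2.0, 0.5, 0.0, 1.0.
print(context_relevancy([2, 1, 0, 1], [2, 0, 0, 1]))  # 5.0
```

Averaging the two judges means a context rated 1 by one judge and 0 by the other falls below the default threshold, which is how the second prompt dampens single-prompt bias.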

Recommended Threshold: 5 (out of 10)


4️⃣ Context Recall (Context Recall)

Evaluation Goal: Measure if all needed information is retrieved (requires reference answer)

Calculation:

  1. Extract independent statements from reference answer
  2. Judge if each statement can be attributed to the retrieved contexts
  3. Recall = (Context-supported reference statements / Total reference statements) × 10

Formula: $ \text{Context Recall} = \frac{\text{Context-supported reference claims}}{\text{Total reference claims}} \times 10 $
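A small sketch of the final step, assuming the reference has already been split into statements and each one judged against the retrieved contexts. Returning the unattributed statements alongside the score is a choice made for this example, since they show what retrieval missed; `context_recall` and the verdicts are hypothetical.

```python
def context_recall(attributions: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (score out of 10, reference statements not found in the contexts)."""
    if not attributions:
        return 0.0, []  # Assumption of this sketch: an empty reference scores 0.
    missing = [stmt for stmt, found in attributions.items() if not found]
    score = (len(attributions) - len(missing)) / len(attributions) * 10
    return score, missing


# Hypothetical judge verdicts for four statements taken from a reference answer.
verdicts = {
    "Python has clean syntax.": True,
    "Python has a rich ecosystem.": True,
    "Python is dynamically typed.": False,
    "Python supports multiple paradigms.": False,
}
score, missing = context_recall(verdicts)
print(score)    # 5.0
print(missing)  # the two statements the retrieved contexts did not cover
```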

Note: Requires reference answer (reference), typically used in evaluation phase

Recommended Threshold: 5 (out of 10)


5️⃣ Context Precision (Context Precision)

Evaluation Goal: Measure the ranking quality of retrieval results, i.e., whether relevant documents appear near the top (requires reference answer)

Calculation:

  1. For each position k, judge if the context is relevant (supports reference answer)
  2. Calculate Precision@k for each position
  3. Use relevance indicator (v_k) for weighted sum

Formula:

Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K

Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k
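To make the weighting concrete, the sketch below evaluates the formula for two rankings that contain the same two relevant documents in different positions. The relevance indicators v_k are assumed to come from the LLM judge, and `context_precision` is a hypothetical helper. The formula as written yields a value between 0 and 1; how it is mapped to the 0-10 scale is not stated, so the sketch leaves it unscaled.

```python
def context_precision(relevance: list[int]) -> float:
    """relevance[k-1] is v_k: 1 if the context at rank k is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # No relevant context anywhere in the top K.
    weighted = 0.0
    hits = 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k                   # relevant count in the first k documents
        weighted += (hits / k) * v_k  # Precision@k counts only at relevant positions
    return weighted / total_relevant


print(round(context_precision([1, 0, 1]), 3))  # (1/1 + 2/3) / 2 = 0.833
print(round(context_precision([0, 1, 1]), 3))  # (1/2 + 2/3) / 2 = 0.583
```

Both rankings retrieve the same documents, but the one that places a relevant document first scores higher, which is the behavior the metric is meant to reward.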

Note: Requires reference answer (reference) to judge which contexts are relevant

Recommended Threshold: 5 (out of 10)

🌟 Best Practices

1. Metric Combinations

Complete Evaluation (5 metrics):

"evals": [
    {"name": "LLMRAGFaithfulness"},       # Detect hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Check answer relevance
    {"name": "LLMRAGContextRelevancy"},   # Check context noise
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]

Production Environment (no reference needed):

"evals": [
    {"name": "LLMRAGFaithfulness"},       # ⭐ Most important: prevent hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Ensure direct answers
    {"name": "LLMRAGContextRelevancy"}    # Check retrieval noise
]

Evaluation Phase (requires reference):

"evals": [
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]

2. Threshold Adjustment

Adjust thresholds (default 5) based on scenario:

  • Strict scenarios (finance, medical): threshold 7-8
  • General scenarios (Q&A systems): threshold 5-6
  • Loose scenarios (exploratory search): threshold 3-4
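One way to apply these ranges is to derive each evaluator entry from a shared base config and override only `threshold`. The sketch below is an assumption-laden illustration: `SCENARIO_THRESHOLDS` and `with_threshold` are hypothetical names, and the values are simply the lower bound of each range listed above.

```python
# Lower bound of each suggested range; tune per project.
SCENARIO_THRESHOLDS = {"strict": 7, "general": 5, "loose": 3}


def with_threshold(name: str, base_config: dict, scenario: str) -> dict:
    """Build one entry for the "evals" list with a scenario-specific threshold."""
    config = {**base_config, "threshold": SCENARIO_THRESHOLDS[scenario]}
    return {"name": name, "config": config}


llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}
entry = with_threshold("LLMRAGFaithfulness", llm_config, "strict")
print(entry["config"]["threshold"])  # 7
```

Copying the base config rather than mutating it keeps the shared `llm_config` reusable across metrics with different thresholds.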

3. Iterative Optimization

  1. Initial Evaluation: Evaluate current system with all 5 metrics
  2. Identify Issues:
    • Low Faithfulness → Generation model produces hallucinations
      • Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
    • Low Answer Relevancy → Answer off-topic or contains irrelevant info
      • Optimize: Improve generation prompts, limit answer length, enhance question understanding
    • Low Context Relevancy → Retrieval introduces noise
      • Optimize: Improve retrieval algorithm, adjust similarity threshold, improve embedding model
    • Low Context Recall → Retrieval misses important info
      • Optimize: Increase Top-K, improve query rewriting, expand knowledge base
    • Low Context Precision → Relevant docs ranked lower
      • Optimize: Improve ranking algorithm, adjust reranker, improve relevance calculation
  3. Targeted Optimization: Adjust components based on issues
  4. Re-evaluate: Verify optimization effects
  5. Continuous Monitoring: Monitor key metrics in production
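The "Identify Issues" step can be partly automated by mapping low scores to the causes and remedies listed above. This is a rough sketch under stated assumptions: `PLAYBOOK` and `diagnose` are hypothetical names, the entries are condensed from the list, and the scores are invented for illustration.

```python
# Metric -> (likely cause, first things to try), condensed from the list above.
PLAYBOOK = {
    "Faithfulness": ("generation model hallucinates", "adjust generation prompts, use a stronger model"),
    "Answer Relevancy": ("answer is off-topic", "improve generation prompts, limit answer length"),
    "Context Relevancy": ("retrieval introduces noise", "adjust similarity threshold, improve embedding model"),
    "Context Recall": ("retrieval misses information", "increase Top-K, improve query rewriting"),
    "Context Precision": ("relevant docs ranked low", "adjust the reranker"),
}


def diagnose(scores: dict[str, float], threshold: float = 5.0) -> list[str]:
    """List a cause and a suggestion for every metric scoring below the threshold."""
    report = []
    for metric, score in scores.items():
        if metric in PLAYBOOK and score < threshold:
            cause, fix = PLAYBOOK[metric]
            report.append(f"{metric} {score:.1f}: {cause}; try: {fix}")
    return report


# Hypothetical average scores from one evaluation run.
for item in diagnose({"Faithfulness": 8.2, "Context Recall": 3.5, "Context Precision": 4.9}):
    print(item)
```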

4. Important Notes

  • LLM Dependency: All metrics call an LLM API, so each needs a valid API key and endpoint
  • Embedding Dependency:
    • Answer Relevancy requires embedding_config: model, api_url, key
    • The embedding service can be a cloud service (OpenAI, DeepSeek) or a local deployment (vLLM, Xinference)
    • If embedding_config is missing, evaluation raises ValueError: Embedding model not initialized...
  • Cost Considerations: Evaluation incurs API costs. Recommendations:
    • Development: Sample evaluation (50-100 samples)
    • Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
    • Evaluation: Full evaluation of all metrics
  • Reference Requirements:
    • Context Recall and Context Precision require reference
    • Other three metrics don't need reference
    • Reference mainly used in evaluation phase, production usually doesn't need it
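For the development-time recommendation of evaluating 50-100 samples, a reproducible random subset is usually enough. The helper below is a minimal sketch with a hypothetical name, `sample_records`; it works on lines already read from a JSONL file, and the records here are placeholders.

```python
import random


def sample_records(lines: list[str], n: int = 100, seed: int = 0) -> list[str]:
    """Pick up to n non-blank JSONL lines; a fixed seed keeps runs comparable."""
    records = [line for line in lines if line.strip()]
    if len(records) <= n:
        return records
    return random.Random(seed).sample(records, n)


# Placeholder records standing in for the lines of a real JSONL dataset.
lines = [f'{{"user_input": "question {i}"}}' for i in range(500)]
subset = sample_records(lines, n=50)
print(len(subset))  # 50
```

Fixing the seed means successive evaluations score the same records, so a change in the metrics reflects a change in the RAG system rather than in the sample.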

📖 For More Details

See the Chinese version for comprehensive examples and detailed explanations.