RAG Evaluation Metrics - Complete Guide

April 2, 2026

🎯 Overview

Dingo's RAG evaluation metrics follow practices from the RAGAS paper, DeepEval, and TruLens. Together they cover both stages of a RAG system: retrieval (Context Relevancy, Context Recall, Context Precision) and generation (Faithfulness, Answer Relevancy).

✅ Supported Metrics (5/5)

Metric	Evaluation Dimension	Required Fields	Source
Faithfulness	Answer Faithfulness	user_input, response, retrieved_contexts	RAGAS
Answer Relevancy	Answer Relevance	user_input, response	RAGAS
Context Relevancy	Context Relevance	user_input, retrieved_contexts	RAGAS + DeepEval + TruLens
Context Recall	Context Recall	user_input, retrieved_contexts, reference	RAGAS
Context Precision	Context Precision	user_input, retrieved_contexts, reference	RAGAS

🚀 Quick Start

1. Run Examples

# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py

# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py

# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py

2. SDK Mode - Single Evaluation

import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness

# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),  # default matches the default OpenAI endpoint above
)

# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming."
    ]
)

# Evaluate
result = LLMRAGFaithfulness.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}")  # status=False means passed
print(f"Reason: {result.reason[0]}")

3. Dataset Mode - Batch Evaluation

from dingo.config import InputArgs
from dingo.exec import Executor

# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}

llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}

input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference"
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config}
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

📋 Data Format

Required Fields

Metric	user_input	response	retrieved_contexts	reference	Notes
Faithfulness	✅	✅	✅	-	Measures if answer is based on context
Answer Relevancy	✅	✅	-	-	Measures if answer addresses the question
Context Relevancy	✅	-	✅	-	Measures if retrieved contexts are relevant
Context Recall	✅	-	✅	✅	Measures if all needed info is retrieved
Context Precision	✅	-	✅	✅	Measures ranking quality of retrieved contexts

Data Example (JSONL)

{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}
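Before a batch run it can help to check which metrics each record can actually feed. The sketch below is not part of Dingo: `REQUIRED_FIELDS` and `runnable_metrics` are hypothetical names that simply encode the Required Fields table above.

```python
import json

# Required fields per metric, copied from the "Required Fields" table.
REQUIRED_FIELDS = {
    "Faithfulness": {"user_input", "response", "retrieved_contexts"},
    "Answer Relevancy": {"user_input", "response"},
    "Context Relevancy": {"user_input", "retrieved_contexts"},
    "Context Recall": {"user_input", "retrieved_contexts", "reference"},
    "Context Precision": {"user_input", "retrieved_contexts", "reference"},
}


def runnable_metrics(record: dict) -> list[str]:
    """Return the metrics whose required fields are all present and non-empty."""
    present = {key for key, value in record.items() if value}
    return [name for name, fields in REQUIRED_FIELDS.items() if fields <= present]


# A record without "reference": only the three reference-free metrics apply.
line = '{"user_input": "Python features?", "response": "Python is concise.", "retrieved_contexts": ["Python has clean syntax."]}'
print(runnable_metrics(json.loads(line)))
```

Records that lack `reference` can still be scored on Faithfulness, Answer Relevancy, and Context Relevancy.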

⚙️ Configuration

Configurable Parameters

Parameter	Applicable Metrics	Default	Description
`threshold`	All metrics	5.0	Pass threshold (0-10)
`strictness`	Answer Relevancy	3	Number of questions to generate (1-5)
`embedding_config`	Answer Relevancy	-	Required: includes `model`, `api_url`, `key`

Embedding Configuration (Answer Relevancy)

LLMRAGAnswerRelevancy requires embedding_config:

Option 1: Cloud LLM + Cloud Embedding

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Required
        "model": "text-embedding-3-large",
        "api_url": "https://api.deepseek.com",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}

Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Independent embedding service
        "model": "BAAI/bge-m3",
        "api_url": "http://localhost:8000/v1",  # Local vLLM/Xinference
        "key": "dummy-key"
    },
    "strictness": 3,
    "threshold": 5
}

Deploy Local Embedding (vLLM):

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000 \
  --host 0.0.0.0
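Once the server is running, a quick smoke test can confirm that it returns embeddings. The following is a minimal sketch, assuming the server exposes an OpenAI-compatible `/embeddings` route under the `api_url` used in Option 2. `build_embedding_request` is a hypothetical helper, not a Dingo or vLLM function.

```python
import json
import urllib.request


def build_embedding_request(api_url: str, model: str, text: str, key: str = "dummy-key") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /embeddings endpoint."""
    payload = json.dumps({"model": model, "input": [text]}).encode("utf-8")
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/embeddings",
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {key}"},
        method="POST",
    )


request = build_embedding_request("http://localhost:8000/v1", "BAAI/bge-m3", "hello")

# Uncomment to send the request once the local server is up:
# with urllib.request.urlopen(request) as response:
#     vector = json.load(response)["data"][0]["embedding"]
#     print(len(vector))
```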

What happens if it is not configured?

Answer Relevancy raises a runtime exception:

ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
  - model: embedding model name (e.g., 'BAAI/bge-m3')
  - api_url: embedding service URL
  - key: API key (optional for local services)
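To catch this before any API call is made, a config can be checked up front. This is a sketch only: `missing_embedding_keys` is a hypothetical helper, and it treats `key` as optional, following the error message above.

```python
def missing_embedding_keys(config: dict) -> list[str]:
    """List the embedding settings that are absent or empty in an evaluator config."""
    embedding = config.get("embedding_config") or {}
    # `key` is not checked because local services may not need one.
    return [name for name in ("model", "api_url") if not embedding.get(name)]


config = {"model": "deepseek-chat", "key": "YOUR_API_KEY", "api_url": "https://api.deepseek.com"}
missing = missing_embedding_keys(config)
if missing:
    print(f"embedding_config is missing: {missing}")
```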

📊 Metric Details

1️⃣ Faithfulness (Answer Faithfulness)

Evaluation Goal: Measure if the answer is entirely based on retrieved context, avoiding hallucinations

Calculation:

Break down answer into independent statements (claims)
Judge if each statement is supported by context
Faithfulness score = (Supported statements / Total statements) × 10

Formula: $ \text{Faithfulness} = \frac{\text{Context-supported claims}}{\text{Total claims}} \times 10 $

Recommended Threshold: 7 (out of 10)
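The arithmetic can be sketched as follows, assuming the per-claim verdicts have already been produced by the LLM judge. `faithfulness_score` is an illustrative name, not Dingo's internal function.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Score = supported claims / total claims * 10."""
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(verdicts) / len(verdicts) * 10


# Hypothetical verdicts: three claims, two of them supported by the context.
score = faithfulness_score([True, True, False])
passed = score >= 7  # recommended threshold for Faithfulness
print(round(score, 2), passed)
```

With two of three claims supported, the score falls just below the recommended threshold of 7, so a single unsupported claim in a short answer is enough to fail.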

2️⃣ Answer Relevancy (Answer Relevance)

Evaluation Goal: Measure if the answer directly addresses the user question

Calculation:

Generate N reverse questions from the answer (questions inferred by LLM from the answer)
Calculate cosine similarity between embeddings of generated questions and original question
Answer Relevancy = Average of all similarities

Formula: $ \text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine\_sim}(E_{g_i}, E_o) $

Where:

- $N$: Number of generated questions, default 3 (adjustable via the `strictness` parameter)
- $E_{g_i}$: Embedding of the i-th generated question
- $E_o$: Embedding of the original question

⚠️ Important: This metric requires embedding_config:

model: Embedding model name (e.g., text-embedding-3-large, BAAI/bge-m3)
api_url: Embedding service address
key: API key (optional for local services)

Recommended Threshold: 5 (out of 10)
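A minimal sketch of the averaging step, using plain Python lists in place of real embeddings. The function names and the toy vectors are illustrative. How the mean similarity is mapped onto the 0-10 scale is not stated above, so the sketch returns the raw mean.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def answer_relevancy(generated: list[list[float]], original: list[float]) -> float:
    """Mean cosine similarity between generated-question and original-question embeddings."""
    return sum(cosine_similarity(g, original) for g in generated) / len(generated)


# Toy 2-D embeddings, invented for illustration (N = 3 generated questions).
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relevancy = answer_relevancy(generated, original)
print(round(relevancy, 3))
```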

3️⃣ Context Relevancy (Context Relevance)

Evaluation Goal: Measure if retrieved contexts are relevant to the question

Calculation: Uses a Dual-Judge System from NVIDIA research:

Judge 1 Scoring:

0 = Context completely irrelevant
1 = Context partially relevant
2 = Context fully relevant

Judge 2 Scoring:

Uses different prompt wording for another perspective
Same 0-2 scoring standard
Purpose: Reduce single-prompt bias

Final Score:

Context Relevancy = (Relevant contexts / Total contexts) × 10

Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold

Recommended Threshold: 5 (out of 10)
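The aggregation described above can be sketched like this, assuming both judges have already returned a 0-2 score for every context. `context_relevancy` and its parameter names are illustrative, not Dingo's API.

```python
def context_relevancy(judge1: list[int], judge2: list[int], relevance_threshold: float = 1.0) -> float:
    """Share of contexts whose mean judge score reaches the threshold, scaled to 0-10."""
    if len(judge1) != len(judge2) or not judge1:
        raise ValueError("both judges must score the same non-empty list of contexts")
    relevant = sum((a + b) / 2 >= relevance_threshold for a, b in zip(judge1, judge2))
    return relevant / len(judge1) * 10


# Hypothetical 0-2 scores from the two judges for four retrieved contexts.
score = context_relevancy([2, 1, 0, 1], [2, 1, 1, 0])
print(score)
```

In this example the last two contexts average 0.5, so they count as irrelevant even though one judge found each of them partially relevant.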

4️⃣ Context Recall

Evaluation Goal: Measure if all needed information is retrieved (requires reference answer)

Calculation:

Extract independent statements from reference answer
Judge if each statement can be attributed to the retrieved contexts
Recall = (Context-supported reference statements / Total reference statements) × 10

Formula: $ \text{Context Recall} = \frac{\text{Context-supported reference claims}}{\text{Total reference claims}} \times 10 $

Note: Requires a reference answer (the `reference` field), so it is typically used in the evaluation phase

Recommended Threshold: 5 (out of 10)
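As a worked example under assumed judgements, take the reference "Python has clean syntax and a rich ecosystem." from the Data Example section. Suppose it is split into two claims (clean syntax, rich ecosystem) and the retriever returned only "Python has clean syntax.", so only the first claim can be attributed. The claim split and the verdicts are illustrative, not output from Dingo:

```latex
\text{Context Recall} = \frac{1}{2} \times 10 = 5
```

If the second context ("Python has NumPy and other libraries.") were also retrieved and judged to support the second claim, the score would rise to 10.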

5️⃣ Context Precision

Evaluation Goal: Measure the ranking quality of retrieval results, i.e., whether relevant documents appear near the top (requires a reference answer)

Calculation:

For each position k, judge if the context is relevant (supports reference answer)
Calculate Precision@k for each position
Sum Precision@k weighted by the relevance indicator (v_k), then divide by the number of relevant items

Formula:

Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K

Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k

Note: Requires a reference answer (the `reference` field) to judge which contexts are relevant

Recommended Threshold: 5 (out of 10)
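The formula can be sketched as below, assuming the per-position relevance indicators v_k are already known. `context_precision` is an illustrative name. The formula as written has no ×10 factor, so the sketch returns the raw value between 0 and 1.

```python
def context_precision(relevance: list[int]) -> float:
    """Sum of Precision@k * v_k over all positions, divided by the number of relevant items."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    weighted = 0.0
    hits = 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k                   # relevant count in the first k documents
        weighted += (hits / k) * v_k  # Precision@k counts only at relevant positions
    return weighted / total_relevant


# Same two relevant documents among K = 5, ranked first versus last.
good_ranking = context_precision([1, 1, 0, 0, 0])
poor_ranking = context_precision([0, 0, 0, 1, 1])
print(good_ranking, round(poor_ranking, 3))
```

Both rankings contain the same relevant documents. Only their positions differ, which is what separates this metric from Context Recall.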

🌟 Best Practices

1. Metric Combinations

Complete Evaluation (5 metrics):

"evals": [
    {"name": "LLMRAGFaithfulness"},       # Detect hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Check answer relevance
    {"name": "LLMRAGContextRelevancy"},   # Check context noise
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]

Production Environment (no reference needed):

"evals": [
    {"name": "LLMRAGFaithfulness"},       # ⭐ Most important: prevent hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Ensure direct answers
    {"name": "LLMRAGContextRelevancy"}    # Check retrieval noise
]

Evaluation Phase (requires reference):

"evals": [
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]

2. Threshold Adjustment

Adjust thresholds (default 5) based on scenario:

Strict scenarios (finance, medical): threshold 7-8
General scenarios (Q&A systems): threshold 5-6
Loose scenarios (exploratory search): threshold 3-4
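One way to apply these ranges is to derive per-scenario configs from a shared base. This is a sketch: the scenario names and `with_threshold` are hypothetical, and the values are the lower end of each range above. Only the `threshold` key itself comes from the configuration table.

```python
import copy

# Lower end of each range listed above; the scenario names are illustrative.
SCENARIO_THRESHOLDS = {"strict": 7, "general": 5, "loose": 3}


def with_threshold(config: dict, scenario: str) -> dict:
    """Return a copy of an evaluator config with the scenario's threshold applied."""
    updated = copy.deepcopy(config)
    updated["threshold"] = SCENARIO_THRESHOLDS[scenario]
    return updated


llm_config = {"model": "gpt-4o-mini", "key": "YOUR_API_KEY", "api_url": "https://api.openai.com/v1"}
evals = [{"name": "LLMRAGFaithfulness", "config": with_threshold(llm_config, "strict")}]
print(evals[0]["config"]["threshold"])
```

Copying the config keeps the shared base unchanged, so other metrics can continue to use the default threshold.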

3. Iterative Optimization

1. Initial Evaluation: Evaluate current system with all 5 metrics
2. Identify Issues:
   - Low Faithfulness → Generation model produces hallucinations
     - Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
   - Low Answer Relevancy → Answer off-topic or contains irrelevant info
     - Optimize: Improve generation prompts, limit answer length, enhance question understanding
   - Low Context Relevancy → Retrieval introduces noise
     - Optimize: Improve retrieval algorithm, adjust similarity threshold, improve embedding model
   - Low Context Recall → Retrieval misses important info
     - Optimize: Increase Top-K, improve query rewriting, expand knowledge base
   - Low Context Precision → Relevant docs ranked lower
     - Optimize: Improve ranking algorithm, adjust reranker, improve relevance calculation
3. Targeted Optimization: Adjust components based on issues
4. Re-evaluate: Verify optimization effects
5. Continuous Monitoring: Monitor key metrics in production
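The "Identify Issues" step can be turned into a small lookup. This is only a sketch: `diagnose` and `FIRST_REMEDY` are hypothetical names, the scores are invented, and each remedy is the first one listed above for that metric.

```python
# First remedy listed above for each metric.
FIRST_REMEDY = {
    "Faithfulness": "adjust generation prompts",
    "Answer Relevancy": "improve generation prompts",
    "Context Relevancy": "improve retrieval algorithm",
    "Context Recall": "increase Top-K",
    "Context Precision": "improve ranking algorithm",
}


def diagnose(scores: dict[str, float], threshold: float = 5.0) -> dict[str, str]:
    """Map every metric scoring below the threshold to its first suggested remedy."""
    return {name: FIRST_REMEDY[name] for name, score in scores.items() if score < threshold}


# Hypothetical average scores from an initial evaluation run.
issues = diagnose({"Faithfulness": 8.2, "Context Recall": 3.9, "Context Precision": 4.5})
print(issues)
```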

4. Important Notes

- LLM Dependency: All metrics depend on an LLM API and require a correct API key and endpoint
- Embedding Dependency:
  - Answer Relevancy requires embedding_config: model, api_url, key
  - Can use cloud services (OpenAI, DeepSeek) or local deployment (vLLM, Xinference)
  - Omitting it raises an exception: ValueError: Embedding model not initialized...
- Cost Considerations: Evaluation incurs API costs. Recommendations:
  - Development: Sample evaluation (50-100 samples)
  - Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
  - Evaluation: Full evaluation of all metrics
- Reference Requirements:
  - Context Recall and Context Precision require reference
  - The other three metrics don't need reference
  - Reference is mainly used in the evaluation phase; production usually doesn't need it
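For the development-time sampling suggested under Cost Considerations, a reproducible random subset of the JSONL lines is usually enough. This is a sketch: `sample_records` is a hypothetical helper and the synthetic lines stand in for a real dataset file.

```python
import random


def sample_records(lines: list[str], size: int = 100, seed: int = 0) -> list[str]:
    """Pick up to `size` JSONL lines at random, reproducibly for a given seed."""
    if len(lines) <= size:
        return list(lines)
    return random.Random(seed).sample(lines, size)


# Synthetic stand-in for the lines of a JSONL dataset.
lines = [f'{{"user_input": "q{i}"}}' for i in range(500)]
subset = sample_records(lines, size=50)
print(len(subset))
```

Fixing the seed keeps the subset stable between runs, so score changes reflect changes to the RAG system rather than to the sample.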

📖 For More Details

See the Chinese version for comprehensive examples and detailed explanations.