RAG Evaluation Metrics - Complete Guide
April 2, 2026
🎯 Overview
Dingo's RAG evaluation metrics system is based on best practices from the RAGAS paper, DeepEval, and TruLens, providing comprehensive RAG system evaluation capabilities.
✅ Supported Metrics (5/5)
| Metric | Evaluation Dimension | Required Fields | Source |
|---|---|---|---|
| Faithfulness | Answer Faithfulness | user_input, response, retrieved_contexts | RAGAS |
| Answer Relevancy | Answer Relevance | user_input, response | RAGAS |
| Context Relevancy | Context Relevance | user_input, retrieved_contexts | RAGAS + DeepEval + TruLens |
| Context Recall | Context Recall | user_input, retrieved_contexts, reference | RAGAS |
| Context Precision | Context Precision | user_input, retrieved_contexts, reference | RAGAS |
Quick Start
1. Run Examples
# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py
# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py
# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py
2. SDK Mode - Single Evaluation
import os
from dingo.config.input_args import EvaluatorLLMArgs, EmbeddingConfigArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness
# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "deepseek-chat"),
)
# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming.",
    ],
)
# Evaluate
result = LLMRAGFaithfulness.eval(data)
# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}") # status=False means passed
print(f"Reason: {result.reason[0]}")
3. Dataset Mode - Batch Evaluation
from dingo.config import InputArgs
from dingo.exec import Executor
# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}
llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY",
    },
    "strictness": 3,
    "threshold": 5,
}
input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True},
    },
    "evaluator": [
        {
            # Map Dingo's field names to the column names in the dataset
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference",
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config},
            ],
        }
    ],
}
input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()
Data Format
Required Fields
| Metric | user_input | response | retrieved_contexts | reference | Notes |
|---|---|---|---|---|---|
| Faithfulness | ✅ | ✅ | ✅ | - | Measures if answer is based on context |
| Answer Relevancy | ✅ | ✅ | - | - | Measures if answer addresses the question |
| Context Relevancy | ✅ | - | ✅ | - | Measures if retrieved contexts are relevant |
| Context Recall | ✅ | - | ✅ | ✅ | Measures if all needed info is retrieved |
| Context Precision | ✅ | - | ✅ | ✅ | Measures ranking quality of retrieved contexts |
Data Example (JSONL)
{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}
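Before launching a batch run, it can help to check which metrics a given record can actually feed. The sketch below encodes the Required Fields table as a dictionary and reports the runnable metrics for one record. `REQUIRED_FIELDS` and `runnable_metrics` are hypothetical helpers written for this guide, not part of Dingo.

```python
import json

# Required JSONL fields per metric, copied from the Required Fields table.
REQUIRED_FIELDS = {
    "Faithfulness": ("user_input", "response", "retrieved_contexts"),
    "Answer Relevancy": ("user_input", "response"),
    "Context Relevancy": ("user_input", "retrieved_contexts"),
    "Context Recall": ("user_input", "retrieved_contexts", "reference"),
    "Context Precision": ("user_input", "retrieved_contexts", "reference"),
}


def runnable_metrics(record: dict) -> list[str]:
    """Return the metrics whose required fields are all present and non-empty."""
    return [
        metric
        for metric, fields in REQUIRED_FIELDS.items()
        if all(record.get(field) for field in fields)
    ]


# A record without `reference`, shaped like the first JSONL example.
line = (
    '{"user_input": "What is deep learning?", '
    '"response": "Deep learning uses neural networks...", '
    '"retrieved_contexts": ["Deep learning is a subset of ML..."]}'
)
print(runnable_metrics(json.loads(line)))
```

A record that lacks `reference` only qualifies for the three reference-free metrics, which matches the table.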
⚙️ Configuration
Configurable Parameters
| Parameter | Applicable Metrics | Default | Description |
|---|---|---|---|
| threshold | All metrics | 5.0 | Pass threshold (0-10) |
| strictness | Answer Relevancy | 3 | Number of questions to generate (1-5) |
| embedding_config | Answer Relevancy | - | Required: includes model, api_url, key |
Embedding Configuration (Answer Relevancy)
LLMRAGAnswerRelevancy requires embedding_config:
Option 1: Cloud LLM + Cloud Embedding
"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Required
        "model": "text-embedding-3-large",
        "api_url": "https://api.deepseek.com",
        "key": "YOUR_API_KEY"
    },
    "strictness": 3,
    "threshold": 5
}
Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)
"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Independent embedding service
        "model": "BAAI/bge-m3",
        "api_url": "http://localhost:8000/v1",  # Local vLLM/Xinference
        "key": "dummy-key"
    },
    "strictness": 3,
    "threshold": 5
}
Deploy Local Embedding (vLLM):
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model BAAI/bge-m3 \
--port 8000 \
--host 0.0.0.0
What happens if not configured?
Runtime exception:
ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
- model: embedding model name (e.g., 'BAAI/bge-m3')
- api_url: embedding service URL
- key: API key (optional for local services)
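Because this error only appears at runtime, a small pre-flight check on the config dictionary can catch it before any API call is made. The following is a minimal sketch under that assumption. `missing_embedding_fields` is a hypothetical helper, not a Dingo function, and it checks only the fields named in the error message.

```python
# Hypothetical pre-flight helper; not part of Dingo's API.
def missing_embedding_fields(config: dict) -> list[str]:
    """List embedding_config fields that are absent or empty.

    `key` is not checked because the error message marks it as
    optional for local services.
    """
    embedding = config.get("embedding_config") or {}
    return [name for name in ("model", "api_url") if not embedding.get(name)]


# An Answer Relevancy config that forgot embedding_config entirely.
config = {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
}
problems = missing_embedding_fields(config)
if problems:
    print("embedding_config is missing:", ", ".join(problems))
```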
Metric Details
1️⃣ Faithfulness (Answer Faithfulness)
Evaluation Goal: Measure if the answer is entirely based on retrieved context, avoiding hallucinations
Calculation:
- Break down answer into independent statements (claims)
- Judge if each statement is supported by context
- Faithfulness score = (Supported statements / Total statements) × 10
Formula:
$ \text{Faithfulness} = \frac{\text{Context-supported claims}}{\text{Total claims}} \times 10 $
Recommended Threshold: 7 (out of 10)
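To make the arithmetic concrete, here is a minimal sketch of the final scoring step only. It assumes the LLM has already split the answer into claims and returned one supported/unsupported verdict per claim. `faithfulness_score` is an illustrative name, and returning 0.0 for an answer with no claims is an assumption rather than documented behavior.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Score = supported claims / total claims * 10.

    `verdicts` holds one boolean per claim: True if the retrieved
    context supports it. An empty list scores 0.0 (assumption).
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts) * 10


# Three claims, two supported by the context.
score = faithfulness_score([True, True, False])
print(round(score, 2))  # 6.67, below the recommended threshold of 7
```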
2️⃣ Answer Relevancy (Answer Relevance)
Evaluation Goal: Measure if the answer directly addresses the user question
Calculation:
- Generate N reverse questions from the answer (questions inferred by LLM from the answer)
- Calculate cosine similarity between embeddings of generated questions and original question
- Answer Relevancy = Average of all similarities
Formula:
$ \text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine\_sim}(E_{g_i}, E_o) $
Where:
- $N$: Number of generated questions, default 3 (adjustable via the strictness parameter)
- $E_{g_i}$: Embedding of the i-th generated question
- $E_o$: Embedding of the original question
⚠️ Important: This metric requires embedding_config:
- model: Embedding model name (e.g., text-embedding-3-large, BAAI/bge-m3)
- api_url: Embedding service address
- key: API key (optional for local services)
Recommended Threshold: 5 (out of 10)
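The averaging step can be sketched in plain Python. This assumes the embeddings are already available as lists of floats. The toy 2-D vectors below are invented for illustration, and `cosine_sim` and `answer_relevancy` are hypothetical names. The sketch returns the raw mean similarity. How that value maps onto the 0-10 threshold scale is not specified here, so no rescaling is applied.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(generated: list[list[float]], original: list[float]) -> float:
    """Mean cosine similarity between each generated-question embedding and the original."""
    if not generated:
        raise ValueError("at least one generated question is required")
    return sum(cosine_sim(e, original) for e in generated) / len(generated)


# Toy embeddings: two generated questions match the original, one is orthogonal.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
print(round(answer_relevancy(generated, original), 2))  # 0.67
```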
3️⃣ Context Relevancy (Context Relevance)
Evaluation Goal: Measure if retrieved contexts are relevant to the question
Calculation: Uses a Dual-Judge System from NVIDIA research:
Judge 1 Scoring:
- 0 = Context completely irrelevant
- 1 = Context partially relevant
- 2 = Context fully relevant
Judge 2 Scoring:
- Uses different prompt wording for another perspective
- Same 0-2 scoring standard
- Purpose: Reduce single-prompt bias
Final Score:
Context Relevancy = (Relevant contexts / Total contexts) × 10
Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold
Recommended Threshold: 5 (out of 10)
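The aggregation described above can be sketched as follows, assuming each judge has already returned a 0-2 score per context. `context_relevancy` is an illustrative name, and the empty-input result of 0.0 is an assumption.

```python
def context_relevancy(judge1: list[int], judge2: list[int], threshold: float = 1.0) -> float:
    """Share of contexts whose two-judge average meets the threshold, scaled to 0-10."""
    if len(judge1) != len(judge2):
        raise ValueError("both judges must score the same contexts")
    if not judge1:
        return 0.0  # assumption: no contexts means nothing relevant
    relevant = sum(1 for a, b in zip(judge1, judge2) if (a + b) / 2 >= threshold)
    return relevant / len(judge1) * 10


# Four contexts; per-context averages are 2.0, 0.5, 0.0 and 1.5.
print(context_relevancy([2, 1, 0, 1], [2, 0, 0, 2]))  # 5.0
```

Averaging before thresholding means a context that one judge rates 0 and the other rates 2 still counts as relevant at the default threshold of 1.0.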
4️⃣ Context Recall
Evaluation Goal: Measure if all needed information is retrieved (requires reference answer)
Calculation:
- Extract independent statements from reference answer
- Judge if each statement can be attributed to the retrieved contexts
- Recall = (Context-supported reference statements / Total reference statements) × 10
Formula:
$ \text{Context Recall} = \frac{\text{Context-supported reference claims}}{\text{Total reference claims}} \times 10 $
Note: Requires reference answer (reference), typically used in evaluation phase
Recommended Threshold: 5 (out of 10)
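Beyond the score itself, the statements that could not be attributed show what retrieval missed. The sketch below assumes the LLM has already produced one attribution verdict per reference statement. `context_recall` is a hypothetical helper, and the statements are a hand-made split of the Python reference answer from the data example.

```python
def context_recall(attributions: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (score on 0-10, reference statements not found in the contexts).

    `attributions` maps each reference statement to True when it can be
    attributed to the retrieved contexts. Empty input scores 0.0 (assumption).
    """
    if not attributions:
        return 0.0, []
    missed = [stmt for stmt, found in attributions.items() if not found]
    score = (len(attributions) - len(missed)) / len(attributions) * 10
    return score, missed


score, missed = context_recall({
    "Python has clean syntax": True,
    "Python has a rich ecosystem": False,
})
print(score, missed)  # 5.0 ['Python has a rich ecosystem']
```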
5️⃣ Context Precision
Evaluation Goal: Measure ranking quality of retrieval results, whether relevant docs are at the top (requires reference answer)
Calculation:
- For each position k, judge if the context is relevant (supports reference answer)
- Calculate Precision@k for each position
- Use relevance indicator (v_k) for weighted sum
Formula:
Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K
Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k
Note: Requires reference answer (reference) to judge which contexts are relevant
Recommended Threshold: 5 (out of 10)
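A short sketch of the formula, assuming the per-position relevance indicators v_k are already known. `context_precision` is an illustrative name. It returns the value exactly as the formula defines it, in the range 0.0 to 1.0, and applies no rescaling to the 0-10 threshold scale because the formula does not specify one.

```python
def context_precision(relevance: list[int]) -> float:
    """Sum of Precision@k * v_k over all positions, divided by the number of relevant items.

    `relevance[k-1]` is v_k: 1 if the context at rank k is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    weighted_sum = 0.0
    hits = 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        weighted_sum += (hits / k) * v_k  # Precision@k counts only at relevant ranks
    return weighted_sum / total_relevant


# Same two relevant documents, ranked differently.
print(round(context_precision([1, 0, 1]), 2))  # 0.83
print(round(context_precision([0, 1, 1]), 2))  # 0.58
```

The same relevant documents score higher when they appear earlier in the ranking.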
Best Practices
1. Metric Combinations
Complete Evaluation (5 metrics):
"evals": [
    {"name": "LLMRAGFaithfulness"},      # Detect hallucinations
    {"name": "LLMRAGAnswerRelevancy"},   # Check answer relevance
    {"name": "LLMRAGContextRelevancy"},  # Check context noise
    {"name": "LLMRAGContextRecall"},     # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}   # Evaluate retrieval ranking
]
Production Environment (no reference needed):
"evals": [
    {"name": "LLMRAGFaithfulness"},      # ⭐ Most important: prevent hallucinations
    {"name": "LLMRAGAnswerRelevancy"},   # Ensure direct answers
    {"name": "LLMRAGContextRelevancy"}   # Check retrieval noise
]
Evaluation Phase (requires reference):
"evals": [
    {"name": "LLMRAGContextRecall"},     # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}   # Evaluate retrieval ranking
]
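These combinations differ only in whether a reference answer is available, so the list can be built programmatically. The sketch below uses the evaluator names from this guide. `select_evals` is a hypothetical helper and omits the per-metric `config` entries for brevity.

```python
def select_evals(has_reference: bool) -> list[dict]:
    """Build an `evals` list: reference-free metrics always, reference-based ones on demand."""
    names = [
        "LLMRAGFaithfulness",
        "LLMRAGAnswerRelevancy",
        "LLMRAGContextRelevancy",
    ]
    if has_reference:
        names += ["LLMRAGContextRecall", "LLMRAGContextPrecision"]
    return [{"name": name} for name in names]


print([e["name"] for e in select_evals(has_reference=False)])
```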
2. Threshold Adjustment
Adjust thresholds (default 5) based on scenario:
- Strict scenarios (finance, medical): threshold 7-8
- General scenarios (Q&A systems): threshold 5-6
- Loose scenarios (exploratory search): threshold 3-4
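One way to apply these ranges is to keep a scenario-to-threshold mapping next to the LLM config. In the sketch below, the lower bound of each range is an arbitrary illustrative choice, and `SCENARIO_THRESHOLDS` and `with_threshold` are hypothetical names.

```python
# Lower bound of each suggested range; adjust to taste.
SCENARIO_THRESHOLDS = {"strict": 7, "general": 5, "loose": 3}


def with_threshold(llm_config: dict, scenario: str) -> dict:
    """Return a copy of the config with `threshold` set for the given scenario."""
    return {**llm_config, "threshold": SCENARIO_THRESHOLDS[scenario]}


base = {"model": "gpt-4o-mini", "key": "YOUR_API_KEY", "api_url": "https://api.openai.com/v1"}
print(with_threshold(base, "strict")["threshold"])  # 7
```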
3. Iterative Optimization
- Initial Evaluation: Evaluate current system with all 5 metrics
- Identify Issues:
  - Low Faithfulness → Generation model produces hallucinations
    - Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
  - Low Answer Relevancy → Answer off-topic or contains irrelevant info
    - Optimize: Improve generation prompts, limit answer length, enhance question understanding
  - Low Context Relevancy → Retrieval introduces noise
    - Optimize: Improve retrieval algorithm, adjust similarity threshold, improve embedding model
  - Low Context Recall → Retrieval misses important info
    - Optimize: Increase Top-K, improve query rewriting, expand knowledge base
  - Low Context Precision → Relevant docs ranked lower
    - Optimize: Improve ranking algorithm, adjust reranker, improve relevance calculation
- Targeted Optimization: Adjust components based on issues
- Re-evaluate: Verify optimization effects
- Continuous Monitoring: Monitor key metrics in production
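The "Identify Issues" step can be sketched as a lookup from low-scoring metrics to the likely cause listed above. This assumes you have already collected one 0-10 score per metric. `LIKELY_CAUSE` and `diagnose` are hypothetical names, and the sample scores are invented.

```python
# Likely cause per metric, taken from the list above.
LIKELY_CAUSE = {
    "Faithfulness": "Generation model produces hallucinations",
    "Answer Relevancy": "Answer off-topic or contains irrelevant info",
    "Context Relevancy": "Retrieval introduces noise",
    "Context Recall": "Retrieval misses important info",
    "Context Precision": "Relevant docs ranked lower",
}


def diagnose(scores: dict[str, float], threshold: float = 5.0) -> dict[str, str]:
    """Map each metric scoring below the threshold to its likely cause."""
    return {
        metric: LIKELY_CAUSE[metric]
        for metric, score in scores.items()
        if metric in LIKELY_CAUSE and score < threshold
    }


# Invented scores for illustration.
sample = {"Faithfulness": 8.2, "Context Recall": 3.5, "Context Precision": 4.9}
for metric, cause in diagnose(sample).items():
    print(f"{metric}: {cause}")
```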
4. Important Notes
- LLM Dependency: All metrics call an LLM API and require a valid API key and endpoint
- Embedding Dependency:
  - Answer Relevancy requires embedding_config (model, api_url, key)
  - Can use cloud services (OpenAI, DeepSeek) or local deployment (vLLM, Xinference)
  - Leaving it unconfigured raises an exception: ValueError: Embedding model not initialized...
- Cost Considerations: Evaluation incurs API costs. Recommendations:
  - Development: Sample evaluation (50-100 samples)
  - Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
  - Evaluation: Full evaluation of all metrics
- Reference Requirements:
  - Context Recall and Context Precision require reference
  - The other three metrics don't need reference
  - Reference is mainly used in the evaluation phase; production usually doesn't need it
For More Details
See the Chinese version for comprehensive examples and detailed explanations.