Open RAG Eval Metrics Documentation
December 13, 2025
This document provides detailed documentation for all evaluation metrics implemented in Open RAG Eval, particularly focusing on the TREC RAG Track metrics. These metrics are designed to evaluate RAG systems without requiring golden answers or golden chunks.
Table of Contents
- Overview
- Retrieval Metrics
- Generation Metrics
- Golden Answer Metrics
- Consistency Metrics
- Implementation Details
- References
Overview
Open RAG Eval implements state-of-the-art metrics from the TREC 2024 RAG Track, designed to evaluate both retrieval and generation components of RAG systems. The key innovation is that these metrics don't require pre-annotated golden answers or chunks, making evaluation scalable and practical for real-world applications.
Retrieval Metrics
UMBRELA
UMBRELA (UMbrela is the Bing RELevance Assessor) is an open-source reproduction of Microsoft Bing's relevance assessment methodology using LLMs.
Background
- Paper: arXiv:2406.06519
- Purpose: Automate relevance assessment for retrieved passages using LLMs, replacing expensive human judgments
- LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)
Inputs
This metric evaluates the relevance of retrieved passages and requires:
- Query: The user's search question or information need
- Retrieved Passages: A collection of text passages returned by the retrieval system (typically a dictionary mapping passage IDs to passage text)
- K Values: List of integers for calculating Precision@K metrics (e.g., [1, 3, 5])
Scoring System
UMBRELA assigns scores on a 0-3 scale:
- Score 0: Passage has nothing to do with the query
- Score 1: Passage seems related to the query but does not answer it
- Score 2: Passage has some answer for the query, but may be unclear or hidden amongst extraneous information
- Score 3: Passage is dedicated to the query and contains the exact answer
Implementation
The metric uses a structured prompting approach that includes:
- Considering the underlying intent of the search
- Measuring content-intent match (M)
- Measuring passage trustworthiness (T)
- Deciding on a final score (O)
Note: The implementation uses two different prompts:
- Standard prompt for most models
- Modified prompt for GPT-OSS and Qwen models with clearer formatting
Configuration
```python
# Default model kwargs for UMBRELA
model_kwargs = {
    "temperature": 0.0,
    "top_p": 1.0,
    "presence_penalty": 0.5,
    "frequency_penalty": 0.0,
    "seed": 42,
}
```
Traditional Retrieval Metrics
Based on UMBRELA scores (with relevance threshold ≥ 2), the system also calculates:
Inputs
These are derived metrics computed from UMBRELA output and require:
- UMBRELA Scores: The relevance scores (0-3) assigned to each retrieved passage
- Relevance Threshold: Score threshold to determine binary relevance (default: ≥ 2 is considered relevant)
Precision@K
Measures the fraction of relevant documents in the top K results:
Precision@K = (Number of relevant docs in top K) / K
Average Precision (AP@K)
Calculates the average of precision values at each relevant document position:
AP@K = (1 / num_relevant) × Σ Precision@i, summed over every rank i ≤ K at which a relevant document appears
This metric rewards systems that rank relevant documents higher in the result list.
Mean Reciprocal Rank (MRR)
Measures the reciprocal of the rank at which the first relevant document is found:
MRR = 1 / (rank of first relevant document)
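For illustration, the sketch below computes Precision@K, AP@K, and MRR from a ranked list of UMBRELA scores using the ≥ 2 relevance threshold described above. The function names and list layout are illustrative, not the toolkit's actual API.

```python
from typing import List

RELEVANCE_THRESHOLD = 2  # UMBRELA score >= 2 counts as relevant

def precision_at_k(umbrela_scores: List[int], k: int) -> float:
    """Fraction of the top-k retrieved passages that are relevant."""
    top_k = umbrela_scores[:k]
    return sum(1 for s in top_k if s >= RELEVANCE_THRESHOLD) / k

def average_precision_at_k(umbrela_scores: List[int], k: int) -> float:
    """Average of Precision@i over the ranks i (within the top k) holding a relevant passage."""
    relevant_seen = 0
    precision_sum = 0.0
    for i, s in enumerate(umbrela_scores[:k], start=1):
        if s >= RELEVANCE_THRESHOLD:
            relevant_seen += 1
            precision_sum += relevant_seen / i
    return precision_sum / relevant_seen if relevant_seen else 0.0

def mrr(umbrela_scores: List[int]) -> float:
    """Reciprocal rank of the first relevant passage (0.0 if none is relevant)."""
    for rank, s in enumerate(umbrela_scores, start=1):
        if s >= RELEVANCE_THRESHOLD:
            return 1.0 / rank
    return 0.0

# Example: UMBRELA scores for five passages, in retrieval order
scores = [3, 1, 2, 0, 2]
print(precision_at_k(scores, 3))          # 0.667
print(average_precision_at_k(scores, 5))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
print(mrr(scores))                        # 1.0
```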
Generation Metrics
AutoNuggetizer
AutoNuggetizer is a framework for evaluating generated answers by automatically creating and assigning "nuggets" - atomic units of information.
Background
- Paper: arXiv:2411.09607
- Purpose: Evaluate RAG-generated answers by measuring information coverage
- Origin: Based on the nugget evaluation methodology from TREC Question Answering Track (2003)
- LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)
Inputs
This metric evaluates answer quality and requires:
- Query: The original user question
- Retrieved Passages: The text passages retrieved by the system
- UMBRELA Scores: Relevance scores from the UMBRELA metric (used to filter passages - only those with score ≥ 1 are used for nugget creation)
- Generated Answer: The complete answer produced by the generation system
Process
1. Nugget Creation:
   - Inputs: Query + Retrieved passages (filtered by UMBRELA scores ≥ 1)
   - Iteratively extracts atomic information units (1-12 words) from retrieved passages
   - Maximum 30 nuggets created per query
   - Runs up to 5 iterations to refine the nugget list (default)
   - Returns the top 20 nuggets after importance scoring
2. Nugget Importance Scoring:
   - Inputs: Query + List of created nuggets
   - Each nugget is classified into one of two categories:
     - Vital: Must be present in a good answer
     - Okay: Worthwhile but not essential information
   - Nuggets are processed in batches of 10 for LLM scoring
3. Nugget Assignment:
   - Inputs: Query + Generated answer + Scored nuggets
   - Each nugget is assigned one of three categories:
     - Support: Nugget fully captured in the generated answer (1.0 score)
     - Partial Support: Nugget partially captured (0.5 score)
     - Not Support: Nugget not captured (0.0 score)
   - Assignments are also processed in batches of 10
Scoring Formulas
Multiple scores are calculated:
- All Score: Average of all nugget assignment scores
- Vital Score: Average of vital nugget assignment scores
- Weighted Score: (Σ vital_scores + 0.5 × Σ okay_scores) / (num_vital + 0.5 × num_okay)
- Strict Scores: Binary versions (only "support" counts as 1; others count as 0)
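To make these formulas concrete, the following sketch computes the All, Vital, and Weighted scores from labelled nugget assignments; the dictionary layout is illustrative only, not the toolkit's internal representation.

```python
# Each nugget carries an importance label ("vital" or "okay") and an assignment
# score: 1.0 = support, 0.5 = partial support, 0.0 = not support.
nuggets = [
    {"importance": "vital", "assignment": 1.0},
    {"importance": "vital", "assignment": 0.5},
    {"importance": "okay",  "assignment": 1.0},
    {"importance": "okay",  "assignment": 0.0},
]

vital = [n["assignment"] for n in nuggets if n["importance"] == "vital"]
okay = [n["assignment"] for n in nuggets if n["importance"] == "okay"]

all_score = (sum(vital) + sum(okay)) / len(nuggets)                               # 0.625
vital_score = sum(vital) / len(vital) if vital else 0.0                           # 0.75
weighted_score = (sum(vital) + 0.5 * sum(okay)) / (len(vital) + 0.5 * len(okay))  # ≈ 0.667
```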
Citation Metric
Evaluates whether generated statements are properly supported by their cited passages.
LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)
Inputs
This metric validates citation accuracy and requires:
- Generated Answer with Citations: The answer text split into parts, each part with its associated citation references (passage IDs)
- Retrieved Passages: The actual text content of the cited passages that the generated answer references
Scoring Levels
Default scores for each support level:
- Full Support (1.0): All information in the statement is supported by the citation
- Partial Support (0.5): Some parts supported, others missing
- No Support (0.0): Citation doesn't support any part of the statement
Metrics Calculated
- Weighted Precision: Sum of citation average scores / total citations
- Weighted Recall: Sum of part average scores / total parts
- F1 Score: Harmonic mean of precision and recall
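One plausible reading of these formulas is sketched below, assuming each answer part carries a list of per-citation support scores; the data layout and grouping are assumptions, not the toolkit's internal representation.

```python
# support[i] holds the support scores (1.0 / 0.5 / 0.0) for the citations attached
# to answer part i. Hypothetical layout; assumes every part has at least one citation.
support = [
    [1.0, 0.5],  # part 0 cites two passages
    [0.0],       # part 1 cites one passage
    [1.0],       # part 2 cites one passage
]

citation_scores = [s for part in support for s in part]
weighted_precision = sum(citation_scores) / len(citation_scores)  # average over citations
part_averages = [sum(part) / len(part) for part in support]
weighted_recall = sum(part_averages) / len(support)               # average over answer parts
f1 = (2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
      if weighted_precision + weighted_recall else 0.0)
```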
Hallucination Detection
Uses the Vectara Hallucination Evaluation Model (HHEM) to detect hallucinations.
Inputs
This metric checks factual consistency and requires:
- Generated Answer: The complete text produced by the generation system
- Retrieved Passages: All source passages that were retrieved (used as the factual basis to check the answer against)
Implementation
- Model: `vectara/hallucination_evaluation_model` (HuggingFace Transformers)
- Processing: Concatenates source passages and generated answer
- Output: HHEM score between 0 and 1 (higher = more factually consistent, less hallucination)
- Max Input: 8192 characters (truncated if longer)
- CPU Usage: Limited to 2 threads
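A hedged sketch of calling HHEM directly, following one usage pattern shown on the Hugging Face model card at the time of writing; check the card for the current interface. The toolkit handles its own truncation and threading settings internally.

```python
from transformers import AutoModelForSequenceClassification

# Load HHEM; trust_remote_code enables the model's custom predict() helper.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# (premise, hypothesis) = (concatenated retrieved passages, generated answer)
premise = " ".join(["Passage one text ...", "Passage two text ..."])
answer = "The generated answer to check for factual consistency."

scores = model.predict([(premise, answer)])  # scores in [0, 1]
print(float(scores[0]))  # higher = more factually consistent (less hallucination)
```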
No-Answer Detection
Determines if the system attempted to answer the query or returned a "no answer" response.
LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)
Inputs
This metric evaluates answer attempts and requires:
- Query: The original user question
- Generated Answer: The complete answer text produced by the system
Classification
- Yes: System attempted to answer (even if incorrect)
- No: System indicated inability to answer or insufficient information
This metric is crucial for calculating the "Questions Answered" percentage in evaluation reports.
Golden Answer Metrics
When reference/golden answers are available, the GoldenAnswerEvaluator provides metrics to compare generated answers against expected answers. These metrics require an expected_answer column in your queries.csv file.
Semantic Similarity
Purpose: Measures direct semantic similarity between generated and golden answers using embeddings.
Embedding Model Required: Configurable (default: OpenAI text-embedding-3-large)
Inputs
- Generated Answer: The answer produced by the RAG system
- Expected Answer: The golden/reference answer
Process
- Embedding: Both answers are embedded using the configured embedding model
- Cosine Similarity: Direct cosine similarity between the two embeddings
Output
- semantic_similarity: Cosine similarity score (typically 0-1 for text embeddings, though mathematically can be -1 to 1). Higher = more similar.
Note: While cosine similarity mathematically ranges from -1 to 1, modern text embedding models (like OpenAI's text-embedding-3-large) typically produce values in the 0-1 range because embeddings tend to have non-negative components. Negative values (indicating semantic opposition) are rare but theoretically possible.
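A minimal sketch of this computation using the OpenAI Python client with text-embedding-3-large; the toolkit's own embedding wrapper may differ, and the answer strings are placeholders.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(response.data[0].embedding)

generated = embed("Answer produced by the RAG system ...")
expected = embed("Golden reference answer ...")

# Cosine similarity between the two embeddings
semantic_similarity = float(
    np.dot(generated, expected) / (np.linalg.norm(generated) * np.linalg.norm(expected))
)
```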
Factual Correctness
Purpose: Measures factual accuracy by decomposing answers into claims and using NLI to verify them.
LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o-mini)
Inputs
- Generated Answer: The answer produced by the RAG system
- Expected Answer: The golden/reference answer
Process
- Claim Extraction: LLM extracts atomic factual claims from both answers
- Precision Verification: Each generated claim is verified against the expected answer using NLI
- Verdicts: `entailment`, `contradiction`, or `neutral`
- Recall Verification: Each expected claim is verified against the generated answer
- Score Calculation:
- Precision = (entailed generated claims) / (total generated claims)
- Recall = (entailed expected claims) / (total expected claims)
- F1 = 2 × (Precision × Recall) / (Precision + Recall)
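For example, given per-claim NLI verdicts (the claim extraction and NLI calls themselves are made by the LLM judge), the scores reduce to the following; the verdict lists are illustrative.

```python
# Verdicts for generated claims checked against the expected answer (precision side)
generated_verdicts = ["entailment", "neutral", "entailment"]
# Verdicts for expected claims checked against the generated answer (recall side)
expected_verdicts = ["entailment", "contradiction"]

precision = generated_verdicts.count("entailment") / len(generated_verdicts)  # 2/3
recall = expected_verdicts.count("entailment") / len(expected_verdicts)       # 1/2
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0  # ≈ 0.571
```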
Output
- factual_correctness_precision: Fraction of generated claims supported by golden answer (0-1)
- factual_correctness_recall: Fraction of golden claims covered by generated answer (0-1)
- factual_correctness_f1: Harmonic mean of precision and recall (0-1)
- generated_claims: List of claims extracted from generated answer
- expected_claims: List of claims extracted from golden answer
Configuration
```yaml
evaluator:
  - type: "GoldenAnswerEvaluator"
    model:
      type: "OpenAIModel"
      name: "gpt-4o-mini"
      api_key: ${oc.env:OPENAI_API_KEY}
    embedding_model:
      type: "OpenAIEmbeddingModel"
      name: "text-embedding-3-large"
      api_key: ${oc.env:OPENAI_API_KEY}
    options:
      run_consistency: true
      metrics_to_run_consistency:
        - "semantic_similarity"
        - "factual_correctness_f1"
```
Consistency Metrics
When run_consistency is enabled, the system evaluates multiple runs of the same query to measure consistency across answers. The consistency evaluator uses specialized similarity metrics.
Inputs
The consistency metrics require:
- Multiple Generated Answers: Several answer outputs from running the same query multiple times through the RAG system
- Original Metrics: The numeric scores from the primary evaluator (e.g., TREC metrics) for each run, used for statistical analysis
BERTScore Similarity
Purpose: Measures semantic similarity between generated answers using contextual embeddings.
Implementation:
- Model: `xlm-roberta-large` (default)
- Language: English (configurable)
- Baseline Rescaling: Enabled by default for better score calibration
- Calculation: Pairwise F1 scores between all answer combinations
- Output: List of similarity scores for each answer pair
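A short sketch of the pairwise computation using the bert-score package with the settings listed above; the answer strings are placeholders.

```python
from itertools import combinations
from bert_score import score

answers = ["Answer from run 1 ...", "Answer from run 2 ...", "Answer from run 3 ..."]

pairwise_f1 = []
for a, b in combinations(answers, 2):
    # score() returns (precision, recall, F1) tensors for each candidate/reference pair
    _, _, f1 = score([a], [b], lang="en",
                     model_type="xlm-roberta-large",
                     rescale_with_baseline=True)
    pairwise_f1.append(float(f1[0]))
```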
ROUGE Score Similarity
Purpose: Measures lexical overlap between generated answers using n-gram statistics.
Implementation:
- Metrics: ROUGE-1, ROUGE-2, ROUGE-L
- Calculation: Pairwise scores between all answer combinations
- Output: Dictionary with precision, recall, and F1 scores for each ROUGE variant
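A corresponding sketch with the rouge-score package; the exact aggregation the toolkit applies may differ.

```python
from itertools import combinations
from rouge_score import rouge_scorer

answers = ["Answer from run 1 ...", "Answer from run 2 ...", "Answer from run 3 ..."]
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

for a, b in combinations(answers, 2):
    result = scorer.score(a, b)  # maps variant -> Score(precision, recall, fmeasure)
    print({variant: s.fmeasure for variant, s in result.items()})
```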
Statistical Analysis
For numeric metrics from the primary evaluator (TREC), consistency analysis includes:
- Mean: Average score across runs
- Variance: Measure of score dispersion
- Coefficient of Variation: Standard deviation divided by the mean, a scale-normalized measure of consistency (lower = more consistent)
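For a numeric metric collected over several runs, the statistics reduce to the sketch below; whether the toolkit uses the population or sample variance is not specified here, so population values are assumed.

```python
import statistics

# e.g. mean UMBRELA score from three runs of the same query
run_scores = [0.82, 0.78, 0.85]

mean = statistics.mean(run_scores)
variance = statistics.pvariance(run_scores)                           # population variance assumed
cv = statistics.pstdev(run_scores) / mean if mean else float("inf")   # coefficient of variation
```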
Configuration
```yaml
evaluator:
  name: "consistency"
  options:
    metrics:
      - bert_score:
          model_type: "xlm-roberta-large"
          lang: "en"
          rescale_with_baseline: true
      - rouge_score: {}
```
Implementation Details
TREC Evaluator Architecture
The main TRECEvaluator class orchestrates all metrics:
```python
class TRECEvaluator:
    def __init__(self, model: LLMJudgeModel, options: dict):
        self.retrieval_metric = UMBRELAMetric(model)
        self.generation_metric = AutoNuggetMetric(model)
        self.citation_metric = CitationMetric(model)
        self.hallucination_metric = HallucinationMetric()
        self.no_answer_metric = NoAnswerMetric(model)
```
Evaluation Pipeline
- Retrieval Evaluation: UMBRELA scores computed for all retrieved passages
- Generation Evaluation:
- AutoNuggetizer creates nuggets from high-scoring passages
- Nuggets assigned to generated answer
- Citation, hallucination, and no-answer checks performed
- Aggregation: Scores aggregated and saved to CSV
Configuration Options
TREC Evaluator:
```yaml
evaluator:
  name: "trecrag"
  options:
    k_values: [1, 3, 5]              # K values for Precision@K metrics
    run_consistency: false           # Enable consistency evaluation
    metrics_to_run_consistency: []   # Specific metrics for consistency
```
Consistency Evaluator:
```yaml
evaluator:
  name: "consistency"
  options:
    metrics:
      - bert_score:
          model_type: "xlm-roberta-large"
          lang: "en"
          rescale_with_baseline: true
      - rouge_score: {}
```
Output Format
Results are saved as CSV with columns including:
Single Run Columns:
- `query_id`: Unique identifier for the query
- `query`: The actual query text
- `retrieval_score_umbrela_scores`: Per-passage UMBRELA scores (JSON)
- `retrieval_score_mean_umbrela_score`: Average UMBRELA score
- `generation_score_vital_nuggetizer_score`: Vital nugget coverage
- `generation_score_hallucination_score`: HHEM factual consistency score
- `generation_score_citation_f1_score`: Citation support F1 score
- `generation_score_no_answer_score`: Answer attempt classification (JSON)
Multiple Runs (when consistency evaluation is enabled):
- Columns are prefixed with `run_1_`, `run_2_`, etc.
- Each run includes all of the metrics above
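The results CSV can be inspected directly with pandas, for example (column names as listed above; the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("results.csv")

# Distribution of the mean UMBRELA score across queries
print(df["retrieval_score_mean_umbrela_score"].describe())

# With consistency evaluation enabled, per-run columns carry run_N_ prefixes
run_1_cols = [c for c in df.columns if c.startswith("run_1_")]
```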
Visualization
The toolkit provides visualization capabilities through the CLI:
Basic Usage:
open-rag-eval plot results.csv --evaluator trec
Compare Multiple Results:
open-rag-eval plot results1.csv results2.csv --evaluator trec --output-file comparison.png
Consistency Results:
open-rag-eval plot consistency_results.csv --evaluator consistency
Golden Answer Results:
open-rag-eval plot golden_answer_results.csv --evaluator golden_answer
Options:
- `--evaluator`: Required. Specify `trec`, `consistency`, or `golden_answer` based on the evaluator used
- `--output-file`: Output filename (default: metrics_comparison.png)
- `--metrics-to-plot`: Specific metrics to visualize (optional)
Features:
- Creates boxplots showing distribution of metrics across queries
- Supports comparing multiple CSV files (different configurations)
- Shows percentage of questions answered for TREC evaluator
- Displays mean and median values with confidence intervals
- Automatically saves plots as PNG files
References
- Ronak Pradeep et al. "Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework." arXiv:2411.09607, 2024.
- Shivani Upadhyay et al. "UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor." arXiv:2406.06519, 2024.
- TREC 2024 RAG Track: https://trec-rag.github.io/
- Vectara Hallucination Evaluation Model: https://huggingface.co/vectara/hallucination_evaluation_model