Open RAG Eval Metrics Documentation

December 13, 2025

This document provides detailed documentation for all evaluation metrics implemented in Open RAG Eval, particularly focusing on the TREC RAG Track metrics. These metrics are designed to evaluate RAG systems without requiring golden answers or golden chunks.

Table of Contents

  1. Overview
  2. Retrieval Metrics
  3. Generation Metrics
  4. Golden Answer Metrics
  5. Consistency Metrics
  6. Implementation Details
  7. References

Overview

Open RAG Eval implements state-of-the-art metrics from the TREC 2024 RAG Track, designed to evaluate both retrieval and generation components of RAG systems. The key innovation is that these metrics don't require pre-annotated golden answers or chunks, making evaluation scalable and practical for real-world applications.

Retrieval Metrics

UMBRELA

UMBRELA (UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor) is an open-source, LLM-based reproduction of Microsoft Bing's relevance assessment methodology.

Background

  • Paper: arXiv:2406.06519
  • Purpose: Automate relevance assessment for retrieved passages using LLMs, replacing expensive human judgments
  • LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)

Inputs

This metric evaluates the relevance of retrieved passages and requires:

  • Query: The user's search question or information need
  • Retrieved Passages: A collection of text passages returned by the retrieval system (typically a dictionary mapping passage IDs to passage text)
  • K Values: List of integers for calculating Precision@K metrics (e.g., [1, 3, 5])

Scoring System

UMBRELA assigns scores on a 0-3 scale:

  • Score 0: Passage has nothing to do with the query
  • Score 1: Passage seems related to the query but does not answer it
  • Score 2: Passage has some answer for the query, but may be unclear or hidden amongst extraneous information
  • Score 3: Passage is dedicated to the query and contains the exact answer

Implementation

The metric uses a structured prompting approach that includes:

  1. Considering the underlying intent of the search
  2. Measuring content-intent match (M)
  3. Measuring passage trustworthiness (T)
  4. Deciding on a final score (O)

Note: The implementation uses two different prompts:

  • Standard prompt for most models
  • Modified prompt for GPT-OSS and Qwen models with clearer formatting
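To make the four-step structure concrete, here is an illustrative sketch of what such a judging prompt might look like. This is not the exact prompt used by the implementation; the wording and the `{query}`/`{passage}` placeholder names are assumptions for this example.

```python
# Illustrative sketch of an UMBRELA-style judging prompt.
# The exact wording used by Open RAG Eval may differ.
UMBRELA_STYLE_PROMPT = """Given a query and a passage, judge relevance on a 0-3 scale.
0 = the passage has nothing to do with the query.
1 = the passage seems related but does not answer it.
2 = the passage has some answer, but it is unclear or buried.
3 = the passage is dedicated to the query and contains the exact answer.

Consider the underlying intent of the search.
Measure how well the content matches that intent (M).
Measure the trustworthiness of the passage (T).
Decide on a final score (O).

Query: {query}
Passage: {passage}
Final score (O):"""

prompt = UMBRELA_STYLE_PROMPT.format(
    query="what is RAG?",
    passage="RAG combines retrieval with generation...",
)
```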

Configuration

# Default model kwargs for UMBRELA
model_kwargs = {
    "temperature": 0.0,
    "top_p": 1.0,
    "presence_penalty": 0.5,
    "frequency_penalty": 0.0,
    "seed": 42
}

Traditional Retrieval Metrics

Based on UMBRELA scores (with relevance threshold ≥ 2), the system also calculates:

Inputs

These derived metrics require:

  • UMBRELA Scores: The relevance scores (0-3) assigned to each retrieved passage
  • Relevance Threshold: Score threshold to determine binary relevance (default: ≥ 2 is considered relevant)

Precision@K

Measures the fraction of relevant documents in the top K results:

Precision@K = (Number of relevant docs in top K) / K

Average Precision (AP@K)

Calculates the average of the precision values at each relevant document position:

AP@K = (1 / num_relevant) × Σ Precision@i, for each i ≤ K where doc_i is relevant

This metric rewards systems that rank relevant documents higher in the result list.

Mean Reciprocal Rank (MRR)

Measures the reciprocal of the rank at which the first relevant document is found:

MRR = 1 / (rank of first relevant document)
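The three formulas above can be computed directly from a ranked list of UMBRELA scores. A minimal sketch (the function names and list-based input format are illustrative assumptions; the threshold of 2 matches the default described above):

```python
def binarize(umbrela_scores, threshold=2):
    """Convert 0-3 UMBRELA scores to binary relevance (score >= threshold)."""
    return [s >= threshold for s in umbrela_scores]

def precision_at_k(rel, k):
    """Fraction of relevant docs in the top k."""
    return sum(rel[:k]) / k

def average_precision_at_k(rel, k):
    """Average of Precision@i over each relevant position i <= k."""
    relevant_positions = [i + 1 for i, r in enumerate(rel[:k]) if r]
    if not relevant_positions:
        return 0.0
    return sum(precision_at_k(rel, i) for i in relevant_positions) / len(relevant_positions)

def mrr(rel):
    """Reciprocal rank of the first relevant document (0.0 if none found)."""
    for i, r in enumerate(rel, start=1):
        if r:
            return 1.0 / i
    return 0.0

rel = binarize([3, 1, 2, 0, 2])   # ranks 1, 3, 5 are relevant
print(precision_at_k(rel, 3))      # 2/3
print(mrr(rel))                    # 1.0
```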

Generation Metrics

AutoNuggetizer

AutoNuggetizer is a framework for evaluating generated answers by automatically creating and assigning "nuggets" - atomic units of information.

Background

  • Paper: arXiv:2411.09607
  • Purpose: Evaluate RAG-generated answers by measuring information coverage
  • Origin: Based on the nugget evaluation methodology from TREC Question Answering Track (2003)
  • LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)

Inputs

This metric evaluates answer quality and requires:

  • Query: The original user question
  • Retrieved Passages: The text passages retrieved by the system
  • UMBRELA Scores: Relevance scores from the UMBRELA metric (used to filter passages - only those with score ≥ 1 are used for nugget creation)
  • Generated Answer: The complete answer produced by the generation system

Process

  1. Nugget Creation:

    • Inputs: Query + Retrieved passages (filtered by UMBRELA scores ≥ 1)
    • Iteratively extracts atomic information units (1-12 words) from retrieved passages
    • Maximum 30 nuggets created per query
    • Runs up to 5 iterations to refine nugget list (default)
    • Returns top 20 nuggets after importance scoring
  2. Nugget Importance Scoring:

    • Inputs: Query + List of created nuggets
    • Each nugget is classified into one of two categories:
      • Vital: Must be present in a good answer
      • Okay: Worthwhile but not essential information
    • Nuggets are processed in batches of 10 for LLM scoring
  3. Nugget Assignment:

    • Inputs: Query + Generated answer + Scored nuggets
    • Each nugget is assigned one of three categories:
      • Support: Nugget fully captured in generated answer (1.0 score)
      • Partial Support: Nugget partially captured (0.5 score)
      • Not Support: Nugget not captured (0.0 score)
    • Assignments are also processed in batches of 10

Scoring Formulas

Multiple scores are calculated:

  • All Score: Average of all nugget assignment scores
  • Vital Score: Average of vital nugget assignment scores
  • Weighted Score: (Σ vital_scores + 0.5 × Σ okay_scores) / (num_vital + 0.5 × num_okay)
  • Strict Scores: Binary versions (only "support" = 1, others = 0)
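The scoring formulas above can be sketched as follows. The input format, a list of (importance, assignment score) pairs, is an assumption for illustration:

```python
def nugget_scores(assignments):
    """Compute AutoNuggetizer-style scores from (importance, score) pairs,
    where importance is 'vital' or 'okay' and score is 1.0 / 0.5 / 0.0."""
    vital = [s for imp, s in assignments if imp == "vital"]
    okay = [s for imp, s in assignments if imp == "okay"]
    all_scores = vital + okay
    return {
        "all": sum(all_scores) / len(all_scores),
        "vital": sum(vital) / len(vital) if vital else 0.0,
        # Okay nuggets count half as much as vital ones.
        "weighted": (sum(vital) + 0.5 * sum(okay)) / (len(vital) + 0.5 * len(okay)),
        # Strict variant: only full support (1.0) counts.
        "strict_all": sum(1 for s in all_scores if s == 1.0) / len(all_scores),
    }

scores = nugget_scores([("vital", 1.0), ("vital", 0.5), ("okay", 0.0), ("okay", 1.0)])
```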

Citation Metric

Evaluates whether generated statements are properly supported by their cited passages.

LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)

Inputs

This metric validates citation accuracy and requires:

  • Generated Answer with Citations: The answer text split into parts, each part with its associated citation references (passage IDs)
  • Retrieved Passages: The actual text content of the cited passages that the generated answer references

Scoring Levels

Default scores for each support level:

  • Full Support (1.0): All information in the statement is supported by the citation
  • Partial Support (0.5): Some parts supported, others missing
  • No Support (0.0): Citation doesn't support any part of the statement

Metrics Calculated

  • Weighted Precision: Sum of citation average scores / total citations
  • Weighted Recall: Sum of part average scores / total parts
  • F1 Score: Harmonic mean of precision and recall
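To make the precision/recall distinction concrete, here is a hedged sketch. The nested-list input format (one list of citation support scores per answer part) is an assumption for illustration:

```python
def citation_metrics(parts):
    """parts: list of answer parts, each a list of citation support
    scores in {0.0, 0.5, 1.0}. Parts with no citations score 0 for recall."""
    # Precision: average support over all citations across all parts.
    all_citations = [s for part in parts for s in part]
    precision = sum(all_citations) / len(all_citations) if all_citations else 0.0
    # Recall: average per-part score over all parts.
    part_avgs = [sum(p) / len(p) if p else 0.0 for p in parts]
    recall = sum(part_avgs) / len(part_avgs) if part_avgs else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Three answer parts: one with two citations, one with one, one uncited.
p, r, f1 = citation_metrics([[1.0, 0.5], [0.0], []])
```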

Hallucination Detection

Uses the Vectara Hallucination Evaluation Model (HHEM) to detect hallucinations.

Inputs

This metric checks factual consistency and requires:

  • Generated Answer: The complete text produced by the generation system
  • Retrieved Passages: All source passages that were retrieved (used as the factual basis to check the answer against)

Implementation

  • Model: vectara/hallucination_evaluation_model (HuggingFace Transformers)
  • Processing: Concatenates source passages and generated answer
  • Output: HHEM score between 0 and 1 (higher = more factually consistent, less hallucination)
  • Max Input: 8192 characters (truncated if longer)
  • CPU Usage: Limited to 2 threads

No-Answer Detection

Determines if the system attempted to answer the query or returned a "no answer" response.

LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o)

Inputs

This metric evaluates answer attempts and requires:

  • Query: The original user question
  • Generated Answer: The complete answer text produced by the system

Classification

  • Yes: System attempted to answer (even if incorrect)
  • No: System indicated inability to answer or insufficient information

This metric is crucial for calculating the "Questions Answered" percentage in evaluation reports.

Golden Answer Metrics

When reference/golden answers are available, the GoldenAnswerEvaluator provides metrics to compare generated answers against expected answers. These metrics require an expected_answer column in your queries.csv file.

Semantic Similarity

Purpose: Measures direct semantic similarity between generated and golden answers using embeddings.

Embedding Model Required: Configurable (default: OpenAI text-embedding-3-large)

Inputs

  • Generated Answer: The answer produced by the RAG system
  • Expected Answer: The golden/reference answer

Process

  1. Embedding: Both answers are embedded using the configured embedding model
  2. Cosine Similarity: Direct cosine similarity between the two embeddings

Output

  • semantic_similarity: Cosine similarity score (typically 0-1 for text embeddings, though mathematically can be -1 to 1). Higher = more similar.

Note: While cosine similarity mathematically ranges from -1 to 1, modern text embedding models (like OpenAI's text-embedding-3-large) typically produce values in the 0-1 range because embeddings tend to have non-negative components. Negative values (indicating semantic opposition) are rare but theoretically possible.
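A minimal sketch of the cosine-similarity step, in pure Python with no external embedding call (the short vectors stand in for real embedding outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # identical direction -> 1.0
```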

Factual Correctness

Purpose: Measures factual accuracy by decomposing answers into claims and using NLI to verify them.

LLM Required: Configurable via LLMJudgeModel (default: OpenAI GPT-4o-mini)

Inputs

  • Generated Answer: The answer produced by the RAG system
  • Expected Answer: The golden/reference answer

Process

  1. Claim Extraction: LLM extracts atomic factual claims from both answers
  2. Precision Verification: Each generated claim is verified against the expected answer using NLI
    • Verdicts: entailment, contradiction, or neutral
  3. Recall Verification: Each expected claim is verified against the generated answer
  4. Score Calculation:
    • Precision = (entailed generated claims) / (total generated claims)
    • Recall = (entailed expected claims) / (total expected claims)
    • F1 = 2 × (Precision × Recall) / (Precision + Recall)
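The step-4 calculation can be sketched from two lists of NLI verdicts (the verdict strings follow the labels above; the function name is illustrative):

```python
def factual_correctness(generated_verdicts, expected_verdicts):
    """generated_verdicts: NLI verdicts for each generated claim checked
    against the expected answer; expected_verdicts: verdicts for each
    expected claim checked against the generated answer."""
    precision = generated_verdicts.count("entailment") / len(generated_verdicts)
    recall = expected_verdicts.count("entailment") / len(expected_verdicts)
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = factual_correctness(
    ["entailment", "entailment", "neutral", "contradiction"],
    ["entailment", "neutral"],
)
```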

Output

  • factual_correctness_precision: Fraction of generated claims supported by golden answer (0-1)
  • factual_correctness_recall: Fraction of golden claims covered by generated answer (0-1)
  • factual_correctness_f1: Harmonic mean of precision and recall (0-1)
  • generated_claims: List of claims extracted from generated answer
  • expected_claims: List of claims extracted from golden answer

Configuration

evaluator:
  - type: "GoldenAnswerEvaluator"
    model:
      type: "OpenAIModel"
      name: "gpt-4o-mini"
      api_key: ${oc.env:OPENAI_API_KEY}
    embedding_model:
      type: "OpenAIEmbeddingModel"
      name: "text-embedding-3-large"
      api_key: ${oc.env:OPENAI_API_KEY}
    options:
      run_consistency: true
      metrics_to_run_consistency:
        - "semantic_similarity"
        - "factual_correctness_f1"

Consistency Metrics

When run_consistency is enabled, the system evaluates multiple runs of the same query to measure consistency across answers. The consistency evaluator uses specialized similarity metrics.

Inputs

The consistency metrics require:

  • Multiple Generated Answers: Several answer outputs from running the same query multiple times through the RAG system
  • Original Metrics: The numeric scores from the primary evaluator (e.g., TREC metrics) for each run, used for statistical analysis

BERTScore Similarity

Purpose: Measures semantic similarity between generated answers using contextual embeddings.

Implementation:

  • Model: xlm-roberta-large (default)
  • Language: English (configurable)
  • Baseline Rescaling: Enabled by default for better score calibration
  • Calculation: Pairwise F1 scores between all answer combinations
  • Output: List of similarity scores for each answer pair

ROUGE Score Similarity

Purpose: Measures lexical overlap between generated answers using n-gram statistics.

Implementation:

  • Metrics: ROUGE-1, ROUGE-2, ROUGE-L
  • Calculation: Pairwise scores between all answer combinations
  • Output: Dictionary with precision, recall, and F1 scores for each ROUGE variant
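Both metrics score every unordered pair of answers. The pairing logic can be sketched as follows; `score_fn` stands in for a BERTScore or ROUGE call, and the Jaccard toy function below is only a self-contained placeholder, not either metric:

```python
from itertools import combinations

def pairwise_scores(answers, score_fn):
    """Apply score_fn to every unordered pair of answers."""
    return [score_fn(a, b) for a, b in combinations(answers, 2)]

# Toy stand-in similarity: unigram Jaccard overlap (not BERTScore/ROUGE).
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

scores = pairwise_scores(["a b c", "a b d", "a e f"], jaccard)  # 3 answers -> 3 pairs
```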

Statistical Analysis

For numeric metrics from the primary evaluator (TREC), consistency analysis includes:

  • Mean: Average score across runs
  • Variance: Measure of score dispersion
  • Coefficient of Variation: Standard deviation divided by the mean, a scale-independent measure of consistency (lower = more consistent)
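A sketch of the per-metric consistency statistics. This uses population variance; whether the implementation uses population or sample variance is not specified here:

```python
import math
import statistics

def consistency_stats(scores):
    """Mean, variance, and coefficient of variation across runs."""
    mean = statistics.mean(scores)
    variance = statistics.pvariance(scores)
    # CV = standard deviation / mean; undefined when the mean is 0.
    cv = math.sqrt(variance) / mean if mean else float("inf")
    return {"mean": mean, "variance": variance, "cv": cv}

stats = consistency_stats([0.8, 0.9, 1.0])
```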

Configuration

evaluator:
  name: "consistency"
  options:
    metrics:
      - bert_score:
          model_type: "xlm-roberta-large"
          lang: "en"
          rescale_with_baseline: true
      - rouge_score: {}

Implementation Details

TREC Evaluator Architecture

The main TRECEvaluator class orchestrates all metrics:

class TRECEvaluator:
    def __init__(self, model: LLMJudgeModel, options: dict):
        self.retrieval_metric = UMBRELAMetric(model)
        self.generation_metric = AutoNuggetMetric(model)
        self.citation_metric = CitationMetric(model)
        self.hallucination_metric = HallucinationMetric()
        self.no_answer_metric = NoAnswerMetric(model)

Evaluation Pipeline

  1. Retrieval Evaluation: UMBRELA scores computed for all retrieved passages
  2. Generation Evaluation:
    • AutoNuggetizer creates nuggets from high-scoring passages
    • Nuggets assigned to generated answer
    • Citation, hallucination, and no-answer checks performed
  3. Aggregation: Scores aggregated and saved to CSV
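A hedged sketch of how the three pipeline stages might be wired together. The `score` method names and the result structure are assumptions for illustration, not the actual TRECEvaluator API:

```python
def evaluate_query(evaluator, query, passages, answer):
    """Illustrative orchestration of the three pipeline stages;
    the metric method names here are hypothetical."""
    # 1. Retrieval: UMBRELA scores for every retrieved passage.
    umbrela = evaluator.retrieval_metric.score(query, passages)
    # 2. Generation: nuggets from passages with score >= 1, plus checks.
    relevant = {pid: p for pid, p in passages.items() if umbrela[pid] >= 1}
    nuggets = evaluator.generation_metric.score(query, relevant, answer)
    citations = evaluator.citation_metric.score(answer, passages)
    hallucination = evaluator.hallucination_metric.score(answer, passages)
    no_answer = evaluator.no_answer_metric.score(query, answer)
    # 3. Aggregation: one flat record per query, later written to CSV.
    return {"umbrela": umbrela, "nuggets": nuggets, "citations": citations,
            "hallucination": hallucination, "no_answer": no_answer}
```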

Configuration Options

TREC Evaluator:

evaluator:
  name: "trecrag"
  options:
    k_values: [1, 3, 5]  # K values for precision@K metrics
    run_consistency: false  # Enable consistency evaluation
    metrics_to_run_consistency: []  # Specific metrics for consistency

Consistency Evaluator:

evaluator:
  name: "consistency"
  options:
    metrics:
      - bert_score:
          model_type: "xlm-roberta-large"
          lang: "en"
          rescale_with_baseline: true
      - rouge_score: {}

Output Format

Results are saved as CSV with columns including:

Single Run Columns:

  • query_id: Unique identifier for the query
  • query: The actual query text
  • retrieval_score_umbrela_scores: Per-passage UMBRELA scores (JSON)
  • retrieval_score_mean_umbrela_score: Average UMBRELA score
  • generation_score_vital_nuggetizer_score: Vital nugget coverage
  • generation_score_hallucination_score: HHEM factual consistency score
  • generation_score_citation_f1_score: Citation support F1 score
  • generation_score_no_answer_score: Answer attempt classification (JSON)

Multiple Runs (when consistency evaluation is enabled):

  • Columns are prefixed with run_1_, run_2_, etc.
  • Each run includes all the metrics above

Visualization

The toolkit provides visualization capabilities through the CLI:

Basic Usage:

open-rag-eval plot results.csv --evaluator trec

Compare Multiple Results:

open-rag-eval plot results1.csv results2.csv --evaluator trec --output-file comparison.png

Consistency Results:

open-rag-eval plot consistency_results.csv --evaluator consistency

Golden Answer Results:

open-rag-eval plot golden_answer_results.csv --evaluator golden_answer

Options:

  • --evaluator: Required. Specify trec, consistency, or golden_answer based on the evaluator used
  • --output-file: Output filename (default: metrics_comparison.png)
  • --metrics-to-plot: Specific metrics to visualize (optional)

Features:

  • Creates boxplots showing distribution of metrics across queries
  • Supports comparing multiple CSV files (different configurations)
  • Shows percentage of questions answered for TREC evaluator
  • Displays mean and median values with confidence intervals
  • Automatically saves plots as PNG files

References

  1. Ronak Pradeep et al. "Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework." arXiv:2411.09607, 2024.

  2. Shivani Upadhyay et al. "UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor." arXiv:2406.06519, 2024.

  3. TREC 2024 RAG Track: https://trec-rag.github.io/

  4. Vectara Hallucination Evaluation Model: https://huggingface.co/vectara/hallucination_evaluation_model

Further Reading