Data Quality Metrics

May 28, 2026

This document provides comprehensive information about all quality metrics used in Dingo.

Note: Metrics cite an academic source where one exists, to ensure objectivity and scientific rigor; the remaining metrics are marked "Internal Implementation".

RAG Evaluation Metrics

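| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMRAGAnswerRelevancy | LLMRAGAnswerRelevancy | Evaluates whether the answer directly addresses the question, detecting irrelevant and redundant information | RAGAS: Automated Evaluation of Retrieval Augmented Generation | N/A | 📝 View Example |
| LLMRAGContextPrecision | LLMRAGContextPrecision | Evaluates the precision of the retrieved context, including relevance and ranking quality | RAGAS: Automated Evaluation of Retrieval Augmented Generation | N/A | 📝 View Example |
| LLMRAGContextRecall | LLMRAGContextRecall | Evaluates the completeness of the retrieved context, judging whether the context supports every statement in the answer | RAGAS: Automated Evaluation of Retrieval Augmented Generation | N/A | 📝 View Example |
| LLMRAGContextRelevancy | LLMRAGContextRelevancy | Evaluates the relevance of the retrieved context to the question, detecting noisy information | RAGAS: Automated Evaluation of Retrieval Augmented Generation | N/A | 📝 View Example |
| LLMRAGFaithfulness | LLMRAGFaithfulness | Evaluates whether the generated answer is faithful to the given context, detecting hallucinations and fabricated information | RAGAS: Automated Evaluation of Retrieval Augmented Generation | N/A | 📝 View Example |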

Pretrain Text Quality Assessment Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMCodeCompare | LLMCodeCompare | Compares the effectiveness of two tools in extracting code blocks from HTML to Markdown format by evaluating recognit... | Internal Implementation | N/A | N/A |
| LLMDatamanAssessment | LLMDatamanAssessment | Evaluates pre-training data quality using the DataMan methodology (14 standards, 15 domains). Assigns a score (0/1), ... | DataMan: Data Manager for Pre-training Large Language Models (Peng et al., 2025) | N/A | N/A |
| LLMHtmlExtractCompareV2 | LLMHtmlExtractCompareV2 | Compares two HTML main-content extraction tools by computing text diffs and using LLM to judge which preserves more c... | Internal Implementation | N/A | N/A |
| LLMHtmlExtractCompareV3 | LLMHtmlExtractCompareV3 | Compares two HTML extraction tools using LLM pretraining quality dimensions (completeness, effectiveness, similarity,... | Internal Implementation | N/A | N/A |
| LLMMathCompare | LLMMathCompare | Compares the effectiveness of two tools in extracting mathematical formulas from HTML to Markdown format by evaluatin... | Internal Implementation | N/A | N/A |
| LLMSecurityPolitics | LLMSecurityPolitics | Evaluates whether the text contains politics-related content | Internal Implementation | N/A | N/A |
| LLMTableCompare | LLMTableCompare | Compares the effectiveness of two tools in extracting tables from HTML to Markdown format by evaluating recognition r... | Internal Implementation | N/A | N/A |
| LLMTextEquation | LLMTextEquation | Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... | WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages (Yu et al., 2025) | 📊 See Results | 📝 View Example |
| LLMTextQualityV4 | LLMTextQualityV4 | Enhanced text quality evaluation covering completeness (formulas, tables, code), effectiveness (garbled text, spacing... | WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages (Yu et al., 2025) | 📊 See Results | N/A |
| LLMTextQualityV5 | LLMTextQualityV5 | Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... | WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages (Yu et al., 2025) | 📊 See Results | 📝 View Example |
| LLMTextTable | LLMTextTable | Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... | WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages (Yu et al., 2025) | 📊 See Results | 📝 View Example |

SFT Data Assessment Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMFactCheckPublic | LLMFactCheckPublic | Two-stage factuality evaluation pipeline from GPT-5 | GPT-5 System Card (OpenAI) | N/A | N/A |
| LLMHallucination | LLMHallucination | Evaluates whether the response contains factual contradictions or hallucinations against provided context information | TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021) | N/A | N/A |
| LLMInstructionClarity | LLMInstructionClarity | Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness | Internal Implementation | Returns clarity score (0-10) and detailed analysis | 📝 View Example |
| LLMTaskDifficulty | LLMTaskDifficulty | Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density | Internal Implementation | Returns difficulty level (1-10) with detailed breakdown | 📝 View Example |
| LLMText3HHarmless | LLMText3HHarmless | Checks if responses avoid harmful content, discriminatory language, and dangerous assistance | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) | 📊 See Results | N/A |
| LLMText3HHelpful | LLMText3HHelpful | Assesses if responses address questions directly and follow instructions appropriately | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) | 📊 See Results | N/A |
| LLMText3HHonest | LLMText3HHonest | Evaluates if responses provide accurate information without fabrication or deception | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) | 📊 See Results | N/A |
| QUALITY_BAD_HALLUCINATION | RuleHallucinationHHEM | Uses Vectara's HHEM-2.1-Open model for local hallucination detection by evaluating consistency between response and c... | HHEM-2.1-Open (Forrest Bao, Miaoran Li, Rogger Luo, Ofer Mendelevitch) | N/A | N/A |

Classification Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMClassifyTopic | LLMClassifyTopic | Classifies text into categories like language processing, writing, code, mathematics, role-play, or knowledge Q&A. Ba... | BERTopic & INSTAG (Grootendorst, 2022; Wei et al., 2023) | 📊 See Results | N/A |

Multimodality Assessment Metrics

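| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMClassifyQR | LLMClassifyQR | Identifies images as CAPTCHA, QR code, or normal images | Internal Implementation | N/A | N/A |
| VLMOCRUnderstanding | VLMOCRUnderstanding | Evaluates a multimodal model's ability to recognize and understand text in images, using DeepSeek-OCR as ground truth | DeepSeek-OCR: Contexts Optical Compression | Compares VLM output against the OCR ground truth to identify missing text, errors, hallucinations, and similar issues | N/A |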

Rule-Based TEXT Quality Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| QUALITY_BAD_COMPLETENESS | RuleLineEndWithEllipsis, RuleLineEndWithTerminal, RuleSentenceNumber, RuleWordNumber | Checks whether the ratio of lines ending with ellipsis is below threshold; Checks whether the ratio of lines ending w... | RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) | 📊 See Results | N/A |
| QUALITY_BAD_EFFECTIVENESS | RuleDoi, RuleIsbn, RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl, RuleDictConsistency | Check whether the string is in the correct format of the DOI; Check whether the string is in the correct format of th... | Internal Implementation | N/A | N/A |
| QUALITY_BAD_FLUENCY | RuleAbnormalNumber, RuleCharSplit, RuleNoPunc, RuleWordSplit, RuleWordStuck | Checks PDF content for abnormal book page or index numbers that disrupt text flow; Checks PDF content for abnormal ch... | RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) | 📊 See Results | N/A |
| QUALITY_BAD_RELEVANCE | RuleHeadWordAr, RuleHeadWordCs, RuleHeadWordHu, RuleHeadWordKo, RuleHeadWordRu, RuleHeadWordSr, RuleHeadWordTh, RuleHeadWordVi, RulePatternSearch, RuleWatermark | Checks whether Arabic content contains irrelevant tail source information; Checks whether Czech content contains irre... | RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) | 📊 See Results | N/A |
| QUALITY_BAD_SECURITY | RuleIDCard, RuleUnsafeWords, RulePIIDetection | Checks whether content contains ID card information; Checks whether content contains unsafe words; Detects Personal I... | RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) | 📊 See Results | N/A |
| QUALITY_BAD_SIMILARITY | RuleDocRepeat, RuleDocFormulaRepeat | Evaluates text for consecutive repeated content and multiple occurrences of special characters; Evaluates text for co... | RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) | 📊 See Results | N/A |
| QUALITY_BAD_UNDERSTANDABILITY | RuleCapitalWords, RuleCurlyBracket, RuleLineStartWithBulletpoint, RuleUniqueWords | Checks whether the ratio of capital words is above threshold, indicating poor readability; Checks whether the ratio o... | RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) | 📊 See Results | N/A |
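
Several of the rules in this table compare a line-level ratio against a threshold. As a rough illustration only (this is not Dingo's implementation), the idea behind a check such as RuleLineEndWithEllipsis might be sketched as follows. The function names, the set of ellipsis endings, and the 0.3 threshold are all assumptions made for this example.

```python
def ellipsis_line_ratio(text: str) -> float:
    """Return the fraction of non-empty lines that end with an ellipsis."""
    # Ignore blank lines so they do not dilute the ratio.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    # Hypothetical set of endings: three ASCII dots or the single ellipsis character.
    ellipsis_endings = ("...", "…")
    hits = sum(1 for line in lines if line.endswith(ellipsis_endings))
    return hits / len(lines)


def passes_ellipsis_check(text: str, threshold: float = 0.3) -> bool:
    """Pass when the ratio is below the threshold (0.3 is an assumed value)."""
    return ellipsis_line_ratio(text) < threshold
```

A document in which half of the lines trail off with an ellipsis would fail this sketch at the assumed threshold, while ordinary prose with terminal punctuation would pass.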

Rule-Based IMG Quality Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| QUALITY_BAD_IMG_ARTIMUSE | RuleImageArtimuse | Evaluates the aesthetic quality of images using ArtiMuse | Internal Implementation | N/A | N/A |
| QUALITY_BAD_IMG_EFFECTIVENESS | RuleImageValid, RuleImageSizeValid, RuleImageQuality | Checks whether image is not all white or black, ensuring visual content validity; Checks whether image ratio of width... | Internal Implementation | N/A | N/A |
| QUALITY_BAD_IMG_LABEL_OVERLAP | RuleImageLabelOverlap | Detects overlapping bounding boxes in image annotations, marks full/partial overlap and generates visualization images | Internal Implementation | N/A | N/A |
| QUALITY_BAD_IMG_LABEL_VISUALIZATION | RuleImageLabelVisualization | Generates visualization images with bounding boxes and category labels, helping manual check of annotation accuracy | Internal Implementation | N/A | N/A |
| QUALITY_BAD_IMG_RELEVANCE | RuleImageTextSimilarity | Evaluates semantic similarity between image and text content using CLIP model | Learning Transferable Visual Representations with Natural Language Supervision (Radford et al., 2021) | N/A | N/A |
| QUALITY_BAD_IMG_SIMILARITY | RuleImageRepeat | Detects duplicate images using PHash and CNN methods to ensure data diversity | ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012) | N/A | N/A |

Audio Quality Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| QUALITY_BAD_EFFECTIVENESS | RuleAudioDuration | Checks whether the audio duration meets the standard | Internal Implementation | N/A | N/A |
| QUALITY_BAD_EFFECTIVENESS | RuleAudioSnrQuality | Checks whether the audio signal-to-noise ratio meets the standard | Internal Implementation | N/A | N/A |

Job Hunting Strategy Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMScout | LLMScout | Strategic job hunting analysis with industry report parsing and person-job matching | Internal Implementation | N/A | N/A |

Meta Rater Evaluation Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMMetaRaterCleanliness | LLMMetaRaterCleanliness | Evaluates text formatting, content appropriateness, and completeness, assessing whether text appears human-edited and... | Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) | N/A | N/A |
| LLMMetaRaterProfessionalism | LLMMetaRaterProfessionalism | Evaluates the degree of expertise and prerequisite knowledge required to comprehend text on a 5-point scale | Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) | N/A | N/A |
| LLMMetaRaterReadability | LLMMetaRaterReadability | Evaluates the clarity and coherence of text using appropriate vocabulary and sentence structures on a 5-point scale | Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) | N/A | N/A |
| LLMMetaRaterReasoning | LLMMetaRaterReasoning | Evaluates the reasoning complexity and logical depth of text content, from simple logical judgments to complex multid... | Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) | N/A | N/A |

OCR Eval Metric

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMMinerURecognizeQuality | LLMMinerURecognizeQuality | Evaluates the quality of MinerU recognition | Internal Implementation | error_category and error_label | N/A |
| VLMDocumentParsingOCRTrain | VLMDocumentParsingOCRTrain | Evaluates the quality of MinerU recognition | Internal Implementation | error_category and error_label | N/A |

RAG Retrieved Evidence Chunk Quality Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMChunkQuality | LLMChunkQuality | Assesses retrieved citation chunks referenced by LLM answers, detecting start-boundary truncation and duplicated lead... | Internal Implementation | N/A | 📝 View Example |

Resume Quality Assessment Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMKeywordMatcher | LLMKeywordMatcher | Semantic keyword matching between resume and job description | Internal Implementation | N/A | N/A |
| LLMResumeOptimizer | LLMResumeOptimizer | ATS-focused resume optimization with keyword injection and STAR polishing | Internal Implementation | N/A | N/A |
| LLMResumeQuality | LLMResumeQuality | Comprehensive resume quality evaluation covering privacy, contact, format, structure, professionalism, date, and comp... | Internal Implementation | N/A | N/A |

Rule-Based Metadata Quality Metrics

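| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| QUALITY_BAD_EFFECTIVENESS | RuleMetadataSimilarity | Checks how closely metadata fields match the reference data by similarity; the default threshold is 0.6 | Internal Implementation | N/A | N/A |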
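
The RuleMetadataSimilarity entry names a default threshold of 0.6 but not the similarity measure behind it. The sketch below shows one way a field-by-field comparison could work, assuming a character-level ratio from Python's standard `difflib`. The helper names and the choice of `SequenceMatcher` are assumptions for illustration and may differ from what Dingo actually does.

```python
from difflib import SequenceMatcher


def field_similarity(value: str, reference: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not Dingo's)."""
    return SequenceMatcher(None, value, reference).ratio()


def mismatched_fields(metadata: dict, reference: dict, threshold: float = 0.6) -> list:
    """Return the reference keys whose metadata value falls below the threshold."""
    bad = []
    for key, ref_value in reference.items():
        # A missing field is compared as an empty string.
        value = str(metadata.get(key, ""))
        if field_similarity(value, str(ref_value)) < threshold:
            bad.append(key)
    return bad
```

With this sketch, a record whose `title` matches the reference exactly yields an empty list, while a record with an unrelated or missing `title` is reported under that key.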

Rule-Based RESUME Quality Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| RESUME_QUALITY_BAD_COMPLETENESS | RuleResumeEducationMissing, RuleResumeExperienceMissing | Checks if resume contains education background information; Checks if resume contains work experience information | Internal Implementation | N/A | N/A |
| RESUME_QUALITY_BAD_CONTACT | RuleResumeEmailMissing, RuleResumePhoneMissing, RuleResumePhoneFormat | Checks if resume contains a valid email address; Checks if resume contains a valid phone number; Validates phone numb... | Internal Implementation | N/A | N/A |
| RESUME_QUALITY_BAD_DATE | RuleResumeDateFormat | Detects inconsistent date format usage in resume | Internal Implementation | N/A | N/A |
| RESUME_QUALITY_BAD_FORMAT | RuleResumeExcessiveWhitespace, RuleResumeMarkdown | Detects excessive consecutive spaces in resume; Detects common Markdown syntax errors in resume | Internal Implementation | N/A | N/A |
| RESUME_QUALITY_BAD_PRIVACY | RuleResumeIDCard, RuleResumeDetailedAddress | Detects 18-digit Chinese ID card numbers in resume content; Detects detailed address patterns that may leak privacy | Internal Implementation | N/A | N/A |
| RESUME_QUALITY_BAD_PROFESSIONALISM | RuleResumeEmoji, RuleResumeInformal | Detects emoji usage in resume which reduces professionalism; Detects informal or colloquial expressions in resume | Internal Implementation | N/A | N/A |
| RESUME_QUALITY_BAD_STRUCTURE | RuleResumeNameMissing, RuleResumeSectionMissing | Checks if resume contains a name in the first 200 characters; Checks if resume contains required sections like educat... | Internal Implementation | N/A | N/A |
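
Most of the resume rules in this table are pattern checks over plain text. The following is a minimal sketch of three of them (ID card detection, missing email, excessive whitespace) using regular expressions. The patterns, the three-space cutoff, and the `resume_findings` helper are hypothetical and are not taken from Dingo's source.

```python
import re

# 17 digits plus a final digit or X, not embedded in a longer digit run (assumed pattern).
ID_CARD_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
# Deliberately simple email pattern; real-world validation is stricter.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Three or more consecutive spaces counts as "excessive" in this sketch.
EXCESS_SPACE_PATTERN = re.compile(r" {3,}")


def resume_findings(text: str) -> dict:
    """Return a flag per check; True means the issue was found."""
    return {
        "id_card": bool(ID_CARD_PATTERN.search(text)),
        "email_missing": EMAIL_PATTERN.search(text) is None,
        "excessive_whitespace": bool(EXCESS_SPACE_PATTERN.search(text)),
    }
```

Returning one flag per check keeps the sketch close to the table's layout, where each rule reports a single kind of problem.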

SFT Data Assessment Metrics - Agent-Enhanced

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| AgentHallucination | AgentHallucination | Agent-based hallucination detection with automatic web search for missing context | Internal Implementation | N/A | N/A |

Text Generation

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| LLMLongVideoQa | LLMLongVideoQa | Generates video-related question-answer pairs based on the summarized information of the input long video | VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos (Jiashuo Yu et al., 2025) | N/A | N/A |

Other Metrics

| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
| --- | --- | --- | --- | --- | --- |
| AgentFactCheck | AgentFactCheck | Agent-based hallucination detection with autonomous web search | Internal Implementation | N/A | N/A |
| ArticleFactChecker | ArticleFactChecker | Article-level fact checking with autonomous claims extraction and verification | Internal Implementation | N/A | N/A |
| LLMCustomMetric | LLMCustomMetric | Unified metric for user customization | Internal Implementation | N/A | N/A |