
November 24, 2025

Awesome LLM Eval

English | 中文

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of Large Language Models and for exploring the boundaries and limits of Generative AI.

This is the official project of our survey: Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap.

NOTE: Because we cannot update the arXiv paper in real time, please refer to this repo for the latest updates; the paper may be updated later. We welcome pull requests and issues that help us improve this work. Your contributions will be credited in the acknowledgements.

If you find our survey useful, please kindly cite our paper:

@misc{wang2025llmevalroadmap,
      title={Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap}, 
      author={Jun Wang and Ninglun Gu and Kailai Zhang and Zijiao Zhang and Yelun Bao and Jin Yang and Xu Yin and Liwei Liu and Yihuan Liu and Pengyong Li and Gary G. Yen and Junchi Yan},
      year={2025},
      eprint={2508.18646},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18646}, 
}

Table of Contents

News

Anthropomorphic-Taxonomy

Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks

| Name | Year | Task Type | Institution | Evaluation Focus | Datasets | Url |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 2024 | Multi-Choice Knowledge | TIGER-AI-Lab | Subtle Reasoning, Fewer Noise | MMLU-Pro | link |
| DyVal | 2024 | Dynamic Evaluation | Microsoft | Data Pollution, Complexity Control | DyVal | link |
| PertEval | 2024 | General | USTC | Knowledge capacity | PertEval | link |
| LV-Eval | 2024 | Long Text QA | Infinigence-AI | Length Variability, Factuality | 11 Subsets | link |
| LLM-Uncertainty-Bench | 2024 | NLP Tasks | Tencent | Uncertainty Quantification | 5 NLP Tasks | link |
| CommonGen-Eval | 2024 | Generation | AI2 | Common Sense | CommonGen-lite | link |
| MathBench | 2024 | Math | Shanghai AI Lab | Theoretical and practical problem-solving | Various | link |
| AIME | 2024 | Math | MAA | American Invitational Mathematics Examination | Various | link |
| FrontierMath | 2024 | Math | Epoch AI | Original, challenging mathematics problems | Various | link |
| FELM | 2023 | Factuality | HKUST | Factuality | 847 Questions | link |
| Just-Eval-Instruct | 2023 | General | AI2 Mosaic | Helpfulness, Explainability | Various | link |
| MLAgentBench | 2023 | ML Research | snap-stanford | End-to-End ML Tasks | 15 Tasks | link |
| UltraEval | 2023 | General | OpenBMB | Lightweight, Flexible, Fast | Various | link |
| FMTI | 2023 | Transparency | Stanford | Model Transparency | 100 Metrics | link |
| BAMBOO | 2023 | Long Text | RUCAIBox | Long Text Modeling | 10 Datasets | link |
| TRACE | 2023 | Continuous Learning | Fudan University | Continuous Learning | 8 Datasets | link |
| ColossalEval | 2023 | General | Colossal-AI | Unified Evaluation | Various | link |
| LLMEval² | 2023 | General | AlibabaResearch | Wide and Deep Evaluation | 2,553 Samples | link |
| BigBench | 2023 | General | Google | knowledge, language, reasoning | Various | link |
| LucyEval | 2023 | General | Oracle | Maturity Assessment | Various | link |
| Zhujiu | 2023 | General | IACAS | Comprehensive Evaluation | 51 Tasks | link |
| ChatEval | 2023 | Chat | THU-NLP | Human-like Evaluation | Various | link |
| FlagEval | 2023 | General | THU | Subjective and Objective Scoring | Various | link |
| AlpacaEval | 2023 | General | tatsu-lab | Automatic Evaluation | Various | link |
| GPQA | 2023 | General | NYU | Graduate-Level Google-Proof QA | Various | link |
| MuSR | 2023 | Reasoning | Zayne Sprague | Narrative-Based Reasoning | 756 | link |
| FreshQA | 2023 | Knowledge | FreshLLMs | Current World Knowledge | 599 | link |
| AGIEval | 2023 | General | Microsoft | Human-Centric Reasoning | NA | link |
| SummEdits | 2023 | General | Salesforce | Inconsistency Detection | 6,348 | link |
| ScienceQA | 2022 | Reasoning | UCLA | Science Reasoning | 21,208 | link |
| e-CARE | 2022 | Reasoning | HIT | Explainable Causality | 21,000 | link |
| BigBench Hard | 2022 | Reasoning | BigBench | Challenging Subtasks | 6,500 | link |
| PlanBench | 2022 | Reasoning | ASU | Action Planning | 11,113 | link |
| MGSM | 2022 | Math | Google | Grade-school math problems in 10 languages | Various | link |
| MATH | 2021 | Math | UC Berkeley | Mathematical Problem Solving | Various | link |
| GSM8K | 2021 | Math | OpenAI | Diverse grade school math word problems | Various | link |
| SVAMP | 2021 | Math | Microsoft | Arithmetic Reasoning | 1,000 | link |
| SpartQA | 2021 | Reasoning | MSU | Textual Spatial QA | 510 | link |
| MLSUM | 2020 | General | Thomas Scialom | News Summarization | 535,062 | link |
| Natural Questions | 2019 | Language, Reasoning | Google | Search-Based QA | 300,000 | link |
| ANLI | 2019 | Language, Reasoning | Facebook AI | Adversarial Reasoning | 169,265 | link |
| BoolQ | 2019 | Language, Reasoning | Google | Binary QA | 16,000 | link |
| SuperGLUE | 2019 | Language, Reasoning | NYU | Advanced GLUE Tasks | NA | link |
| DROP | 2019 | Language, Reasoning | UCI NLP | Paragraph-Level Reasoning | 96,000 | link |
| HellaSwag | 2019 | Language, Reasoning | AI2 | Commonsense Inference | 59,950 | link |
| Winogrande | 2019 | Language, Reasoning | AI2 | Pronoun Disambiguation | 44,000 | link |
| PIQA | 2019 | Language, Reasoning | AI2 | Physical Interaction QA | 18,000 | link |
| HotpotQA | 2018 | Language, Reasoning | HotpotQA | Explainable QA | 113,000 | link |
| GLUE | 2018 | Language, Reasoning | NYU | Foundational NLU Tasks | NA | link |
| OpenBookQA | 2018 | Language, Reasoning | AI2 | Open Book Exams | 12,000 | link |
| SQuAD2.0 | 2018 | Language, Reasoning | Stanford University | Unanswerable Questions | 150,000 | link |
| ARC | 2018 | Language, Reasoning | AI2 | AI2 Reasoning Challenge | 7,787 | link |
| SWAG | 2018 | Language, Reasoning | AI2 | Adversarial Commonsense | 113,000 | link |
| CommonsenseQA | 2018 | Language, Reasoning | AI2 | Commonsense Reasoning | 12,102 | link |
| RACE | 2017 | Language, Reasoning | CMU | Exam-Style QA | 100,000 | link |
| SciQ | 2017 | Language, Reasoning | AI2 | Crowd-Sourced Science | 13,700 | link |
| TriviaQA | 2017 | Language, Reasoning | AI2 | Distant Supervision | 650,000 | link |
| MultiNLI | 2017 | Language, Reasoning | NYU | Cross-Genre Entailment | 433,000 | link |
| SQuAD | 2016 | Language, Reasoning | Stanford University | Wikipedia-Based QA | 100,000 | link |
| LAMBADA | 2016 | Language, Reasoning | CIMEC | Discourse Context | 12,684 | link |
| MS MARCO | 2016 | Language, Reasoning | Microsoft | Search-Based QA | 1,112,939 | link |

Typical Professional Quotient (PQ)-Professional Expertise evaluation benchmarks

| Domain | Name | Institution | Scope of Tasks | Unique Contributions | Url |
| --- | --- | --- | --- | --- | --- |
| Healthcare | BLURB | Mindrank AI | Six diverse NLP tasks, thirteen datasets | A macro-average score across all tasks | link |
| Healthcare | Seismometer | Epic | Using local data and workflows | patient demographics, clinical interventions, and outcomes | link |
| Healthcare | Medbench | OpenMEDLab | Emphasizes scientific rigor and fairness | 40,041 questions from medical exams and reports | link |
| Healthcare | GenMedicalEval | E | 16 majors, 3 training stages, 6 clinical scenarios | Open-ended metrics and automated assessment models | link |
| Healthcare | PsyEval | SJTU | Six subtasks covering three dimensions | Customized benchmark for mental health LLMs | link |
| Finance | Fin-Eva | Ant Group | Wealth management, insurance, investment research | Both industrial and academic financial evaluations | link |
| Finance | FinEval | SUFE-AIFLM-Lab | Multiple-choice QA on finance, economics, accounting | Focuses on high-quality evaluation questions | link |
| Finance | OpenFinData | Shanghai AI Lab | Multi-scenario financial tasks | First comprehensive finance evaluation dataset | link |
| Finance | FinBen | FinAI | 35 datasets across 23 financial tasks | Inductive reasoning, quantitative reasoning | link |
| Legal | LAiW | Sichuan University | 13 fundamental legal NLP tasks | Divides legal NLP capabilities into three major abilities | link |
| Legal | LawBench | Nanjing University | Legal entity recognition, reading comprehension | Real-world tasks, "abstention rate" metric | link |
| Legal | LegalBench | Stanford University | 162 tasks covering six types of legal reasoning | Enables interdisciplinary conversations | link |
| Legal | LexEval | Tsinghua University | Legal cognitive abilities to organize different tasks | Larger legal evaluation dataset, examining the ethical issues | link |
| Telecom | SPEC5G | Purdue University | Security-related text classification and summarization | 5G protocol analysis automation | link |
| Telecom | TeleQnA | Huawei (Paris) | General telecom inquiries | Proficiency in telecom-related questions | link |
| Telecom | OpsEval | Tsinghua University | Wired network ops, 5G, database ops | Focus on AIOps, evaluates proficiency | link |
| Telecom | TelBench | SK Telecom | Math modeling, open-ended QA, code generation | Holistic evaluation in telecom | link |
| Telecom | TelecomGPT | UAE | Telecom Math Modeling, Open QnA and Code Tasks | Holistic evaluation in telecom | link |
| Telecom | Linguistic | Queen's University | Multiple language-centric tasks | zero-shot evaluation | link |
| Telecom | TelcoLM | Orange | Multiple-choice questionnaires | Domain-specific data (800M tokens, 80K instructions) | link |
| Telecom | ORAN-Bench-13K | GMU | Multiple-choice questions | Open Radio Access Networks (O-RAN) | link |
| Telecom | Open-Telco Benchmarks | GSMA | Multiple language-centric tasks | zero-shot evaluation | link |
| Coding | FullStackBench | ByteDance | Code writing, debugging, code review | Featuring the most recent Stack Overflow QA | link |
| Coding | StackEval | Prosus AI | 11 real-world scenarios, 16 languages | Evaluation across diverse & practical coding environments | link |
| Coding | CodeBenchGen | Various Institutions | Execution-based code generation tasks | Benchmarks scaling with the size and complexity | link |
| Coding | HumanEval | University of Washington | Rigorous testing | Stricter protocol for assessing correctness of generated code | link |
| Coding | APPS | University of California | Coding challenges from competitive platforms | Checking problem-solving of generated code on test cases | link |
| Coding | MBPP | Google Research | Programming problems sourced from various origins | Diverse programming tasks | link |
| Coding | ClassEval | Tsinghua University | Class-level code generation | Manually crafted, object-oriented programming concepts | link |
| Coding | CoderEval | Peking University | Pragmatic code generation | Proficiency to generate functional code patches for described issues | link |
| Coding | MultiPL-E | Princeton University | Neural code generation | Benchmarking neural code generation models | link |
| Coding | CodeXGLUE | Microsoft | Code intelligence | Wide tasks covering: code-code, text-code, code-text and text-text | link |
| Coding | EvoCodeBench | Peking University | Evolving code generation benchmark | Aligned with real-world code repositories, evolving over time | link |
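
Execution-based coding benchmarks like several of those listed above check generated code against test cases, and results are commonly reported as pass@k. The sketch below shows the widely used unbiased pass@k estimator. It assumes you already know, for one problem, how many samples were generated (`n`) and how many passed all tests (`c`). The function name `pass_at_k` is illustrative and is not taken from any benchmark's codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total samples generated for the problem
    c: number of those samples that passed every test case
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, any draw of k must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 correct, budget of 1 sample
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score would typically be the mean of this value over all problems.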

Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks

| Name | Year | Task Type | Institution | Category | Datasets | Url |
| --- | --- | --- | --- | --- | --- | --- |
| DiffAware | 2025 | Bias | Stanford | General Bias | 8 datasets | link |
| CASE-Bench | 2025 | Safety | Cambridge | Context-Aware Safety | CASE-Bench | link |
| Fairness | 2025 | Fairness | PSU | Distributive Fairness | - | - |
| HarmBench | 2024 | Safety | UIUC | Adversarial Behaviors | 510 | link |
| SimpleQA | 2024 | Safety | OpenAI | Factuality | 4,326 | link |
| AgentHarm | 2024 | Safety | BEIS | Malicious Agent Tasks | 110 | link |
| StrongReject | 2024 | Safety | dsbowen | Attack Resistance | n/a | link |
| LLMBar | 2024 | Instruction | Princeton | Instruction Following | 419 Instances | link |
| AIR-Bench | 2024 | Safety | Stanford | Regulatory Alignment | 5,694 | link |
| TrustLLM | 2024 | General | TrustLLM | Trustworthiness | 30+ | link |
| RewardBench | 2024 | Alignment | AIAI | Human preference | RewardBench | link |
| EQ-Bench | 2024 | Emotion | Paech | Emotional intelligence | 171 Questions | link |
| Forbidden | 2023 | Safety | CISPA | Jailbreak Detection | 15,140 | link |
| MaliciousInstruct | 2023 | Safety | Princeton | Malicious Intentions | 100 | link |
| SycophancyEval | 2023 | Safety | Anthropic | Opinion Alignment | n/a | link |
| DecodingTrust | 2023 | Safety | UIUC | Trustworthiness | 243,877 | link |
| AdvBench | 2023 | Safety | CMU | Adversarial Attacks | 1,000 | link |
| XSTest | 2023 | Safety | Bocconi | Safety Overreach | 450 | link |
| OpinionQA | 2023 | Safety | tatsu-lab | Demographic Alignment | 1,498 | link |
| SafetyBench | 2023 | Safety | THU | Content Safety | 11,435 | link |
| HarmfulQA | 2023 | Safety | declare-lab | Harmful Topics | 1,960 | link |
| QHarm | 2023 | Safety | vinid | Safety Sampling | 100 | link |
| BeaverTails | 2023 | Safety | PKU | Red Teaming | 334,000 | link |
| DoNotAnswer | 2023 | Safety | Libr-AI | Safety Mechanisms | 939 | link |
| AlignBench | 2023 | Alignment | THUDM | Alignment, Reliability | Various | link |
| IFEval | 2023 | Instruction | Google | Instruction Following | 500 Prompts | link |
| ToxiGen | 2022 | Safety | Microsoft | Toxicity Detection | 274,000 | link |
| HHH | 2022 | Safety | Anthropic | Human Preferences | 44,849 | link |
| RedTeam | 2022 | Safety | Anthropic | Red Teaming | 38,921 | link |
| BOLD | 2021 | Bias | Amazon | Bias in Generation | 23,679 | link |
| BBQ | 2021 | Bias | NYU | Social Bias | 58,492 | link |
| StereoSet | 2020 | Bias | McGill | Stereotype Detection | 4,229 | link |
| ETHICS | 2020 | Ethics | Berkeley | Moral Judgement | 134,400 | link |
| ToxicityPrompt | 2020 | Safety | AllenAI | Toxicity Assessment | 99,442 | link |
| CrowS-Pairs | 2020 | Bias | NYU | Stereotype Measurement | 1,508 | link |
| SEAT | 2019 | Bias | Princeton | Encoder Bias | n/a | link |
| WinoGender | 2018 | Bias | UMass | Gender Bias | 720 | link |
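
IFEval, listed in the table above, scores instruction following with instructions that a program can verify, such as "write over 400 words" or "mention the AI keyword at least 3 times". The sketch below shows roughly what such checks could look like. The function names and the whole-word, case-insensitive matching rule are assumptions made for illustration; this is not IFEval's actual implementation.

```python
import re
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Return True if the response contains more than `min_words` words."""
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Return True if `keyword` occurs as a whole word at least `min_count` times.

    Matching is case-insensitive, which is an assumption of this sketch.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """Return True only if every verifiable instruction is satisfied."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    response = "AI is everywhere. AI helps people. I said I like AI."
    checks = [
        lambda r: check_min_words(r, 5),
        lambda r: check_keyword_frequency(r, "AI", 3),
    ]
    print(follows_all(response, checks))
```

Because each check is a plain function, the outcome is deterministic and needs neither a human rater nor a judge model.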

Tools

| Name | Organization | Website | Description |
| --- | --- | --- | --- |
| prometheus-eval | prometheus-eval | prometheus-eval | PROMETHEUS 2 is an open-source language model dedicated to evaluation and is more powerful than its predecessor. It can closely imitate the judgments of humans and GPT-4. Additionally, it can handle both direct evaluation and pairwise ranking formats, and can be used with user-defined evaluation criteria. On four direct evaluation benchmarks and four pairwise ranking benchmarks, PROMETHEUS 2 achieves the highest correlation and consistency with human evaluators and proprietary language models among all tested open-source evaluation language models (2024-05-04). |
| athina-evals | athina-ai | athina-ai | Athina-ai is an open-source library that provides plug-and-play preset evaluations and a modular, extensible framework for writing and running evaluations. It helps engineers systematically improve the reliability and performance of their large language models through evaluation-driven development. Athina-ai offers a system for evaluation-driven development, overcoming the limitations of traditional workflows, enabling rapid experimentation, and providing customizable evaluators with consistent metrics. |
| LeaderboardFinder | Huggingface | LeaderboardFinder | LeaderboardFinder helps you find suitable leaderboards for specific scenarios, a leaderboard of leaderboards (2024-04-02). |
| LightEval | Huggingface | lighteval | LightEval is a lightweight framework developed by Hugging Face for evaluating large language models (LLMs). Originally designed as an internal tool for assessing Hugging Face's recently released LLM data processing library datatrove and LLM training library nanotron, it is now open-sourced for community use and improvement. Key features of LightEval include: (1) lightweight design, making it easy to use and integrate; (2) an evaluation suite supporting multiple tasks and models; (3) compatibility with evaluation on CPUs or GPUs, and integration with Hugging Face's acceleration library (Accelerate) and frameworks like Nanotron; (4) support for distributed evaluation, which is particularly useful for evaluating large models; (5) applicability to all benchmarks on the Open LLM Leaderboard; and (6) customizability, allowing users to add new metrics and tasks to meet specific evaluation needs (2024-02-08). |
| LLM Comparator | Google | LLM Comparator | A visual analytical tool for comparing and evaluating large language models (LLMs). Compared to traditional human evaluation methods, this tool offers a scalable automated approach to comparative evaluation. It leverages another LLM as an evaluator to demonstrate quality differences between models and provide reasons for these differences. Through interactive tables and summary visualizations, the LLM Comparator helps users understand why models perform well or poorly in specific contexts, as well as the qualitative differences between model responses. Developed in collaboration with Google researchers and engineers, this tool has been widely used internally at Google, attracting over 400 users and evaluating more than 1,000 experiments within three months (2024-02-16). |
| Arthur Bench | Arthur-AI | Arthur Bench | Arthur Bench is an open-source evaluation tool designed to compare and analyze the performance of large language models (LLMs). It supports various evaluation tasks, including question answering, summarization, translation, and code generation, and provides detailed reports on LLM performance across these tasks. Key features and advantages of Arthur Bench include: (1) model comparison, enabling the evaluation of different suppliers, versions, and training datasets of LLMs; (2) prompt and hyperparameter evaluation, assessing the impact of different prompts on LLM performance and testing the control of model behavior through various hyperparameter settings; (3) task definition and model selection, allowing users to define specific evaluation tasks and select evaluation targets from a range of supported LLM models; (4) parameter configuration, enabling users to adjust prompts and hyperparameters to finely control LLM behavior; (5) automated evaluation workflows, simplifying the execution of evaluation tasks; and (6) application scenarios such as model selection and validation, budget and privacy optimization, and the transformation of academic benchmarks into real-world performance evaluations. Additionally, it offers comprehensive scoring metrics, supports both local and cloud versions, and encourages community collaboration and project development (2023-10-06). |
| llm-benchmarker-suite | FormulaMonks | llm-benchmarker-suite | This open-source initiative aims to address fragmentation and ambiguity in LLM benchmarking. The suite provides a structured methodology, a collection of diverse benchmarks, and toolkits to streamline the assessment of LLM performance. By offering a common platform, this project seeks to promote collaboration, transparency, and high-quality research in NLP. |
| autoevals | braintrust | autoevals | AutoEvals is an AI model output evaluation tool that leverages best practices to quickly and easily assess AI model outputs. It integrates multiple automatic evaluation methods, supports customizable evaluation prompts and custom scorers, and simplifies the evaluation process of model outputs. Autoevals incorporates model-graded evaluation for various subjective tasks, including fact-checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented in a flexible way to allow users to tweak prompts and debug outputs. |
| EVAL | OPENAI | EVAL | EVAL is a tool developed by OpenAI for evaluating large language models (LLMs). It can test the performance and generalization capabilities of models across different tasks and datasets. |
| lm-evaluation-harness | EleutherAI | lm-evaluation-harness | lm-evaluation-harness is a tool developed by EleutherAI for evaluating large language models (LLMs). It can test the performance and generalization capabilities of models across different tasks and datasets. |
| lm-evaluation | AI21Labs | lm-evaluation | Evaluations and reproducing the results from the Jurassic-1 Technical Paper, with current support for running tasks via both the AI21 Studio API and OpenAI's GPT-3 API. |
| OpenCompass | Shanghai AI Lab | OpenCompass | OpenCompass is a one-stop platform for evaluating large models. Its main features include: open-source and reproducible evaluation schemes; comprehensive capability dimensions covering five major areas with over 50 datasets and approximately 300,000 questions to assess model capabilities; support for over 20 Hugging Face and API models; distributed and efficient evaluation with one-line task splitting and distributed evaluation, enabling full evaluation of trillion-parameter models within hours; diverse evaluation paradigms supporting zero-shot, few-shot, and chain-of-thought evaluations with standard or conversational prompt templates to easily elicit peak model performance. |
| Large language model evaluation and workflow framework from Phase AI | wgryc | phasellm | A framework provided by Phase AI for evaluating and managing LLMs, helping users select appropriate models, datasets, and metrics, as well as visualize and analyze results. |
| Evaluation benchmark for LLM | FreedomIntelligence | LLMZoo | LLMZoo is an evaluation benchmark for LLMs developed by FreedomIntelligence, featuring multiple domain and task datasets, metrics, and pre-trained models with results. |
| Holistic Evaluation of Language Models (HELM) | Stanford | HELM | HELM is a comprehensive evaluation method for LLMs proposed by the Stanford research team, considering multiple aspects such as model language ability, knowledge, reasoning, fairness, and safety. |
| A lightweight evaluation tool for question-answering | Langchain | auto-evaluator | auto-evaluator is a lightweight tool developed by Langchain for evaluating question-answering systems. It can automatically generate questions and answers and calculate metrics such as model accuracy, recall, and F1 score. |
| PandaLM | WeOpenML | PandaLM | PandaLM is an LLM assessment tool developed by WeOpenML for automated and reproducible evaluation. It allows users to select appropriate datasets, metrics, and models based on their needs and preferences, and generates reports and charts. |
| FlagEval | Tsinghua University | FlagEval | FlagEval is an evaluation platform for LLMs developed by Tsinghua University, offering multiple tasks and datasets, as well as online testing, leaderboards, and analysis functions. |
| AlpacaEval | tatsu-lab | alpaca_eval | AlpacaEval is an evaluation tool for LLMs developed by tatsu-lab, capable of testing models across various languages, domains, and tasks, and providing explainability, robustness, and credibility metrics. |
| Prompt flow | Microsoft | promptflow | A set of development tools designed by Microsoft to simplify the end-to-end development cycle of AI applications based on LLMs, from conception, prototyping, testing, and evaluation to production deployment and monitoring. It makes prompt engineering easier and enables the development of product-level LLM applications. |
| DeepEval | mr-gpt | DeepEval | DeepEval is a simple-to-use, open-source LLM evaluation framework. Similar to Pytest but specialized for unit testing LLM outputs, it incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevance, RAGAS, etc., utilizing LLMs and various other NLP models that run locally on your machine for evaluation. |
| CONNER | Tencent AI Lab | CONNER | CONNER is a comprehensive large model knowledge evaluation framework designed to systematically and automatically assess the information generated from six critical perspectives: factuality, relevance, coherence, informativeness, usefulness, and validity. |
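
Several tools above report question-answering metrics such as accuracy, recall, and F1. The sketch below shows one common way to compute them for free-text answers, using exact-match accuracy and token-overlap precision, recall, and F1. It is not the code of any tool listed here. The function names and the whitespace, lowercase tokenization are simplifying assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def token_f1(prediction: str, reference: str) -> Dict[str, float]:
    """Token-overlap precision, recall, and F1 between two answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts the tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference, ignoring case and outer spaces."""
    if not references:
        return 0.0
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


if __name__ == "__main__":
    print(token_f1("the cat sat", "the cat"))
    print(exact_match_accuracy(["Paris", "rome"], ["paris", "Berlin"]))
```

Production evaluators usually normalize answers further (punctuation, articles) before comparing, so scores from this sketch are only indicative.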

Datasets or Benchmarks

General

NameOrganizationWebsiteDescription
MMLU-ProTIGER-AI-LabMMLU-ProMMLU-Pro is an improved version of the MMLU dataset. MMLU has long been a reference for multiple-choice knowledge datasets. However, recent studies have shown that it contains noise (some questions are unanswerable) and is too easy (due to the evolution of model capabilities and increased contamination). MMLU-Pro provides ten options instead of four, requires reasoning on more questions, and has undergone expert review to reduce noise. It is of higher quality and more challenging than the original. MMLU-Pro reduces the impact of prompt variations on model performance, a common issue with its predecessor MMLU. Research indicates that models using "Chain of Thought" reasoning perform better on this new benchmark, suggesting that MMLU-Pro is better suited for evaluating the subtle reasoning abilities of AI. (2024-05-20)
TrustLLM BenchmarkTrustLLMTrustLLMTrustLLM is a benchmark for evaluating the trustworthiness of large language models. It covers six dimensions of trustworthiness and includes over 30 datasets to comprehensively assess the functional capabilities of LLMs, ranging from simple classification tasks to complex generative tasks. Each dataset presents unique challenges and has benchmarked 16 mainstream LLMs (including commercial and open-source models).
DyValMicrosoftDyValConcerns have been raised about the potential data contamination in the vast training corpora of LLMs. Additionally, the static nature and fixed complexity of current benchmarks may not adequately measure the evolving capabilities of LLMs. DyVal is a general and flexible protocol for dynamically evaluating LLMs. Leveraging the advantages of directed acyclic graphs, DyVal dynamically generates evaluation samples with controllable complexity. It has created challenging evaluation sets for reasoning tasks such as mathematics, logical reasoning, and algorithmic problems. Various LLMs, from Flan-T5-large to GPT-3.5-Turbo and GPT-4, have been evaluated. Experiments show that LLMs perform worse on DyVal-generated samples of different complexities, highlighting the importance of dynamic evaluation. The authors also analyze failure cases and results of different prompting methods. Furthermore, DyVal-generated samples not only serve as evaluation sets but also aid in fine-tuning to enhance LLM performance on existing benchmarks. (2024-04-20)
RewardBenchAIAIRewardBenchRewardBench is an evaluation benchmark for language model reward models, assessing the strengths and weaknesses of various models. It reveals that existing models still exhibit significant shortcomings in reasoning and instruction following. It includes a Leaderboard, Code, and Dataset (2024-03-20).
LV-EvalInfinigence-AILVEvalLV-Eval is a long-text evaluation benchmark featuring five length tiers (16k, 32k, 64k, 128k, and 256k), with a maximum text test length of 256k. The average text length of LV-Eval is 102,380 characters, with a minimum/maximum text length of 11,896/387,406 characters. LV-Eval primarily consists of two types of evaluation tasks: single-hop QA and multi-hop QA, encompassing 11 sub-datasets in Chinese and English. During its design, LV-Eval introduced three key technologies: Confusion Facts Insertion (CFI) to enhance challenge, Keyword and Phrase Replacement (KPR) to reduce information leakage, and Answer Keywords (AK) based evaluation metrics (combining answer keywords and word blacklists) to improve the objectivity of evaluation results (2024-02-06).
LLM-Uncertainty-BenchTencentLLM-Uncertainty-BenchA new benchmark method for LLMs has been introduced, incorporating uncertainty quantification. Based on nine LLMs tested across five representative NLP tasks, it was found that: I) More accurate LLMs may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty than smaller models; III) Instruction fine-tuning tends to increase the uncertainty of LLMs. These findings underscore the importance of including uncertainty in LLM evaluations (2024-01-22).
Psychometrics EvalMicrosoft Research AsiaPsychometrics EvalMicrosoft Research Asia has proposed a generalized evaluation method for AI based on psychometrics, aiming to address limitations in traditional evaluation methods concerning predictive power, information volume, and test tool quality. This approach draws on psychometric theories to identify key psychological constructs of AI, design targeted tests, and apply Item Response Theory for precise scoring. It also introduces concepts of reliability and validity to ensure evaluation reliability and accuracy. This framework extends psychometric methods to assess AI performance in handling unknown complex tasks but also faces open questions such as distinguishing between AI "individuals" and "populations," addressing prompt sensitivity, and evaluating differences between human and AI constructs (2023-10-19).
CommonGen-EvalAllenAICommonGen-EvalA study using the CommonGen-lite dataset to evaluate LLMs, employing GPT-4 for assessment and comparing the performance of different models, with results listed on the leaderboard (2024-01-04).
felmHKUSTfelmFELM is a meta-benchmark for evaluating the factual assessment of large language models. The benchmark comprises 847 questions spanning five distinct domains: world knowledge, science/technology, writing/recommendation, reasoning, and mathematics. Prompts corresponding to each domain are gathered from various sources, including standard datasets like TruthfulQA, online platforms like GitHub repositories, ChatGPT-generated prompts, or those drafted by authors. For each response, fine-grained annotation at the segment level is employed, including reference links, identified error types, and reasons behind these errors as provided by annotators (2023-10-03).
just-evalAI2 Mosaicjust-evalA GPT-based evaluation tool for multi-faceted and explainable assessment of LLMs, capable of evaluating aspects such as helpfulness, clarity, factuality, depth, and engagement (2023-12-05).
EQ-BenchEQ-BenchEQ-BenchA benchmark for evaluating the emotional intelligence of language models, featuring 171 questions (compared to 60 in v1) and a new scoring system that better distinguishes performance differences among models (2023-12-20).
CRUXEvalMIT CSAILCRUXEvalCRUXEval is a benchmark for evaluating code reasoning, understanding, and execution. It includes 800 Python functions and their input-output pairs, testing input prediction and output prediction tasks. Many models that perform well on HumanEval underperform on CRUXEval, highlighting the need for improved code reasoning capabilities. The best model, GPT-4 with chain-of-thought (CoT), achieved pass@1 rates of 75% and 81% for input prediction and output prediction, respectively. The benchmark exposes gaps between open-source and closed-source models. GPT-4 failed to fully pass CRUXEval, providing insights into its limitations and directions for improvement (2024-01-05).
MLAgentBenchsnap-stanfordMLAgentBenchMLAgentBench is a suite of end-to-end machine learning (ML) research tasks for benchmarking AI research agents. These agents aim to autonomously develop or improve an ML model based on a given dataset and ML task description. Each task represents an interactive environment that directly reflects what human researchers encounter. Agents can read available files, run multiple experiments on compute clusters, and analyze results to achieve the specified research objectives. Specifically, it includes 15 diverse ML engineering tasks that can be accomplished by attempting different ML methods, data processing, architectures, and training processes (2023-10-05).
AlignBenchTHUDMAlignBenchAlignBench is a comprehensive and multi-dimensional benchmark for evaluating the alignment performance of Chinese large language models. It constructs a human-in-the-loop data creation process to ensure dynamic data updates. AlignBench employs a multi-dimensional, rule-based model evaluation method (LLM-as-Judge) and combines chain-of-thought (CoT) to generate multi-dimensional analyses and final comprehensive scores for model responses, enhancing the reliability and explainability of evaluations (2023-12-01).
UltraEvalOpenBMBUltraEvalUltraEval is an open-source foundational model capability evaluation framework offering a lightweight and easy-to-use evaluation system that supports mainstream large model performance assessments. Its key features include: (1) a lightweight and user-friendly evaluation framework with intuitive design, minimal dependencies, easy deployment, and good scalability for various evaluation scenarios; (2) flexible and diverse evaluation methods with unified prompt templates and rich evaluation metrics, supporting customization; (3) efficient and rapid inference deployment supporting multiple model deployment solutions, including torch and vLLM, and enabling multi-instance deployment to accelerate the evaluation process; (4) a transparent and open leaderboard with publicly accessible, traceable, and reproducible evaluation results driven by the community to ensure transparency; and (5) official and authoritative evaluation data using widely recognized official datasets to guarantee evaluation fairness and standardization, ensuring result comparability and reproducibility (2023-11-24).
IFEvalgoogle-researchInstruction Following EvalFollowing natural language instructions is a core capability of large language models. However, the evaluation of this capability lacks standardization: human evaluation is expensive, slow, and lacks objective reproducibility, while automated evaluation based on LLMs may be biased by the evaluator LLM's capabilities or limitations. To address these issues, researchers at Google introduced Instruction Following Evaluation (IFEval), a simple and reproducible benchmark focusing on a set of "verifiable instructions," such as "write over 400 words" and "mention the AI keyword at least 3 times." IFEval identifies 25 such verifiable instructions and constructs approximately 500 prompts, each containing one or more verifiable instructions (2023-11-15).
LLMBarprinceton-nlpLLMBarLLMBar is a challenging meta-evaluation benchmark designed to test the ability of LLM evaluators to identify instruction-following outputs. It contains 419 instances, each consisting of an instruction and two outputs: one faithfully and correctly following the instruction, and the other deviating from it. Each instance also includes a gold label indicating which output is objectively better (2023-10-29).
HalluQAFudan, Shanghai AI LabHalluQAHalluQA is a Chinese LLM hallucination evaluation benchmark, featuring 450 data points including 175 misleading entries, 69 hard misleading entries, and 206 knowledge-based entries. Each question has an average of 2.8 correct and incorrect answers annotated. To enhance the usability of HalluQA, the authors designed a GPT-4-based evaluation method. Specifically, hallucination criteria and correct answers are input as instructions to GPT-4, which evaluates whether the model's response contains hallucinations.
FMTIStanfordFMTIThe Foundation Model Transparency Index (FMTI) evaluates the transparency of developers in model training and deployment across 100 indicators, including data, computational resources, and labor. Evaluations of flagship models from 10 companies reveal an average transparency score of only 37/100, indicating significant room for improvement.
ColossalEvalColossal-AIColossalEvalA project by Colossal-AI offering a unified evaluation workflow for assessing language models on public datasets or custom datasets using traditional metrics and GPT-assisted evaluations.
LLMEval²-WideDeepAlibaba ResearchLLMEval²Constructed as the largest and most diverse English evaluation benchmark for LLM evaluators, featuring 15 tasks, 8 capabilities, and 2,553 samples. Experimental results indicate that a wider network (involving many reviewers) with two layers (one round of discussion) performs best, improving the Kappa correlation coefficient from 0.28 to 0.34. WideDeep is also utilized to assist in evaluating Chinese LLMs, accelerating the evaluation process by 4.6 times and reducing costs by 60%.
AviaryRay ProjectAviaryEnables interaction with various large language models (LLMs) in one place. Direct comparison of different model outputs, ranking by quality, and obtaining cost and latency estimates are supported. It particularly supports models hosted on Hugging Face and in many cases, also supports DeepSpeed inference acceleration.
Do-Not-AnswerLibr-AIDo-Not-AnswerAn open-source dataset designed to evaluate the safety mechanisms of LLMs at a low cost. It consists of prompts that responsible language models should not respond to. In addition to human annotations, it implements model-based evaluation, where a fine-tuned BERT-like evaluator with 600 million parameters achieves results comparable to humans and GPT-4.
LucyEvalOracleLucyEvalChinese LLM maturity evaluation—LucyEval can objectively test various aspects of model capabilities, identify model shortcomings, and help designers and engineers more accurately adjust and train models, aiding LLMs in advancing toward greater intelligence.
ZhujiuInstitute of Automation, CASZhujiuCovers seven capability dimensions and 51 tasks; employs three complementary evaluation methods; offers comprehensive Chinese benchmarking with English evaluation capabilities.
ChatEvalTHU-NLPChatEvalChatEval aims to simplify the human evaluation process of generated text. Given different text fragments, roles (played by LLMs) in ChatEval can autonomously discuss nuances and differences, providing judgments based on their designated roles.
FlagEvalZhiyuan/TsinghuaFlagEvalProduced by Zhiyuan, combining subjective and objective scoring to offer LLM score rankings.
InfoQ Comprehensive LLM EvaluationInfoQInfoQ EvaluationChinese-oriented ranking: ChatGPT > Wenxin Yiyan > Claude > Xinghuo.
Chain-of-Thought EvaluationYao FuCOT EvaluationIncludes rankings for GSM8k and MATH complex problems.
Z-BenchTrue FundZ-BenchFinds that Chinese-developed models have relatively weak coding ability, with minimal performance differences between models. The two versions of ChatGLM show significant improvement.
CMU Chatbot EvaluationCMUzeno-buildIn conversational training scenarios, rankings show ChatGPT > Vicuna > others.
lmsys-arenaBerkeleylmsys RankingUtilizes Elo scoring mechanism, with rankings showing GPT4 > Claude > GPT3.5 > Vicuna > others.
Huggingface Open LLM LeaderboardHuggingfaceHF Open LLM LeaderboardOrganized by Huggingface, this leaderboard evaluates multiple mainstream open-source LLMs. Evaluations focus on four datasets: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA, primarily in English.
AlpacaEvaltatsu-labAlpacaEvalA leaderboard built on LLM-based automatic evaluation; among open-source models, Vicuna, OpenChat, and WizardLM lead.
Chinese-LLM-Benchmarkjeinlee1991llm-benchmarkChinese LLM capability evaluation rankings covering Baidu Ernie Bot, ChatGPT, Alibaba Tongyi Qianwen, iFLYTEK Xinghuo, and open-source models like Belle and ChatGLM6B. It provides capability score rankings and original model output results.
Open LLM LeaderboardHuggingFaceLeaderboardOrganized by HuggingFace to evaluate multiple mainstream open-source LLMs. Evaluations primarily focus on four datasets: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA, mainly in English.
Stanford Question Answering Dataset (SQuAD)Stanford NLP GroupSQuADEvaluates model performance on reading comprehension tasks.
Multi-Genre Natural Language Inference (MultiNLI)New York University, DeepMind, Facebook AI Research, Allen Institute for AI, Google AI LanguageMultiNLIEvaluates the model's ability to understand sentence relationships across different text genres.
LogiQATsinghua University and Microsoft Research AsiaLogiQAEvaluates the model's logical reasoning capabilities.
HellaSwagUniversity of Washington and Allen Institute for AIHellaSwagEvaluates the model's reasoning capabilities.
The LAMBADA DatasetUniversity of Trento and Fondazione Bruno KesslerLAMBADAEvaluates the model's ability to predict the last word of a paragraph, reflecting its understanding of long-range context.
CoQAStanford NLP GroupCoQAEvaluates the model's ability to understand text paragraphs and answer a series of interrelated questions in conversational settings.
ParlAIFacebook AI ResearchParlAIEvaluates model performance in accuracy, F1 score, perplexity (the model's ability to predict the next word in a sequence), human evaluation (relevance, fluency, and coherence), speed and resource utilization, robustness (model performance under varying conditions such as noisy inputs, adversarial attacks, or changes in data quality), and generalization capabilities.
Language Interpretability Tool (LIT)GoogleLITProvides a platform for evaluating models based on user-defined metrics, analyzing model strengths, weaknesses, and potential biases.
Adversarial NLI (ANLI)Facebook AI Research, New York University, Johns Hopkins University, University of Maryland, Allen Institute for AIAdversarial NLI (ANLI)Evaluates the model's robustness, generalization capabilities, reasoning explanation abilities, consistency, and resource efficiency (memory usage, inference time, and training time).
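
The IFEval entry above describes "verifiable instructions" that can be checked by a program rather than by a judge model. A minimal sketch of two such checks, using the two examples quoted in that entry; the function names and the whole-word matching rule are illustrative assumptions, not IFEval's actual implementation:

```python
import re
from typing import Callable, List


def has_more_words_than(response: str, min_words: int) -> bool:
    """Check an instruction such as "write over 400 words"."""
    return len(response.split()) > min_words


def mentions_keyword(response: str, keyword: str, min_count: int) -> bool:
    """Check an instruction such as "mention the AI keyword at least 3 times".

    Assumption: matches are whole words and case-insensitive.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def follows_all(response: str, checks: List[Callable[[str], bool]]) -> bool:
    """A prompt may carry several verifiable instructions; all must pass."""
    return all(check(response) for check in checks)


# Hypothetical prompt carrying both example instructions.
checks = [
    lambda r: has_more_words_than(r, 400),
    lambda r: mentions_keyword(r, "AI", 3),
]
long_response = "AI systems help people. " * 101  # 404 words, "AI" appears 101 times
print(follows_all(long_response, checks))
print(follows_all("AI is useful.", checks))
```

Because each check is a deterministic function of the response text, the same response always receives the same verdict, which is the reproducibility property the entry emphasizes.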

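The lmsys-arena entry above mentions an Elo scoring mechanism. A minimal sketch of the standard Elo update for one pairwise comparison follows; the K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not the settings used by the leaderboard:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is an assumed K-factor chosen for illustration.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; a voter prefers model A.
new_a, new_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

A win against an equally rated opponent moves each rating by half the K-factor; a win against a much stronger opponent moves it by nearly the full K-factor.
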
Domain

NameInstitutionFieldURLIntroduction
SeismometerEpicHealthcareseismometeSeismometer is an AI model performance evaluation tool for the healthcare field, providing standardized evaluation criteria to help make decisions based on local data and workflows. It supports continuous monitoring of model performance. Although it can be used for models in any field, it was designed with a focus on validation for healthcare AI models where local validation requires cross-referencing data about patients (such as demographics, clinical interventions, and patient outcomes) and model performance. (2024-05-22)
MedbenchOpenMEDLabHealthcaremedbenchMedBench is committed to creating a scientific, fair, and rigorous evaluation system and open platform for Chinese medical large models. Based on authoritative medical standards, it continuously updates and maintains high-quality medical datasets to comprehensively and multi-dimensionally quantify the capabilities of models across various medical dimensions. MedBench comprises 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. It is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. (2023-12-20)
Fin-EvaAnt Group, Shanghai University of Finance and EconomicsFinanceFin-EvaFin-Eva Version 1.0, jointly launched by Ant Group and Shanghai University of Finance and Economics, covers multiple financial scenarios such as wealth management, insurance, and investment research, as well as financial specialty disciplines, with a total of over 13,000 evaluation questions. Ant’s data sources include data from various business fields and publicly available internet data. After processes such as data desensitization, text clustering, corpus screening, and data rewriting, it is combined with reviews from financial experts to construct the dataset. Shanghai University of Finance and Economics’ data sources are primarily based on real questions and simulated questions from authoritative exams in relevant fields, following the requirements of knowledge outlines. Ant’s section covers five major capabilities in finance cognition, financial knowledge, financial logic, content generation, and safety compliance, with 33 sub-dimensions and 8,445 evaluation questions; Shanghai University of Finance and Economics’ section covers four major areas: finance, economics, accounting, and certificates, including 4,661 questions across 34 different disciplines. Fin-Eva Version 1.0 adopts multiple-choice questions with fixed answers, accompanied by corresponding instructions to enable models to output in a standard format (2023-12-20)
GenMedicalEvalSJTUHealthcareGenMedicalEval1. Large-scale comprehensive performance evaluation: GenMedicalEval constructs a total of over 100,000 medical evaluation data covering 16 major departments, 3 stages of physician training, 6 medical clinical application scenarios, based on over 40,000 medical examination real questions and over 55,000 patient medical records from top-tier hospitals. This dataset comprehensively evaluates the overall performance of large models in real medical complex scenarios from aspects such as medical basic knowledge, clinical application, and safety standards, addressing the shortcomings of existing evaluation benchmarks that fail to cover many practical challenges in medical practice. 2. In-depth multi-dimensional scenario evaluation: GenMedicalEval integrates physicians’ clinical notes and medical imaging materials, building a series of diverse and theme-rich generative evaluation questions around key medical scenarios such as examination, diagnosis, and treatment. This provides a strong supplement to existing question-and-answer based evaluations that simulate real clinical environments for open diagnostic processes. 3. Innovative open evaluation metrics and automated evaluation models: To address the challenge of lacking effective evaluation metrics for open generative tasks, GenMedicalEval employs advanced structured extraction and terminology alignment techniques to build an innovative generative evaluation metric system. This system accurately measures the medical knowledge accuracy of generated answers. Furthermore, it trains a medical automatic evaluation model based on its self-built knowledge base, which has a high correlation with human evaluations. The model provides multi-dimensional medical scores and evaluation reasons. Its features include no data leakage and being controllable, giving it unique advantages compared to other models like GPT-4 (2023-12-08)
OpenFinDataShanghai Artificial Intelligence LaboratoryFinanceOpenFinDataOpenFinData, the first full-scenario financial evaluation dataset based on the "OpenCompass" framework, released by the Shanghai Artificial Intelligence Laboratory, comprises six modules and nineteen financial task dimensions, covering multi-level data types and diverse financial scenarios. Each piece of data originates from actual financial business scenarios (2024-01-04)
LAiWSichuan UniversityLegalLAiWFrom a legal perspective and feasibility, LAiW categorizes the capabilities of legal NLP into three major abilities, totaling 13 basic tasks: (1) Legal NLP basic abilities: evaluates the capabilities of legal basic tasks, NLP basic tasks, and legal information extraction, comprising five basic tasks: legal clause recommendation, element recognition, named entity recognition, judicial point summarization, and case identification; (2) Basic legal application abilities: evaluates the basic application capabilities of large models in the legal field, comprising five basic tasks: dispute focus mining, case matching, criminal judgment prediction, civil judgment prediction, and legal Q&A; (3) Complex legal application abilities: evaluates the complex application capabilities of large models in the legal field, comprising three basic tasks: judicial reasoning generation, case understanding, and legal consultation (2023-10-08)
LawBenchNanjing UniversityLegalLawBenchLawBench is meticulously designed to precisely evaluate the legal capabilities of large language models. When designing test tasks, it simulates three dimensions of judicial cognition and selects 20 tasks to assess the capabilities of large models. Compared to some existing benchmarks that only have multiple-choice questions, LawBench includes more task types closely related to real-world applications, such as legal entity recognition, reading comprehension, crime amount calculation, and consultation. LawBench recognizes that the current safety strategies of large models may lead to models refusing to respond to certain legal inquiries or encountering difficulties in understanding instructions, resulting in a lack of responses. Therefore, LawBench has developed a separate evaluation metric, the "abstention rate," to measure the frequency of models refusing to provide answers or failing to correctly understand instructions. Researchers have evaluated the performance of 51 large language models on LawBench, including 20 multilingual models, 22 Chinese models, and 9 legal-specific large language models (2023-09-28)
PsyEvalSJTUPsychologicalPsyEvalIn mental health research, the use of large language models (LLMs) is gaining increasing attention, especially their significant capabilities in disease detection. Researchers have custom-designed the first comprehensive benchmark for the mental health field to systematically evaluate the capabilities of LLMs in this domain. This benchmark includes six sub-tasks covering three dimensions to comprehensively assess the capabilities of LLMs in mental health. Corresponding concise prompts have been designed for each sub-task, and eight advanced LLMs have been comprehensively evaluated (2023-11-15)
PPTCMicrosoft, PKUOfficePPTCPPTC is a benchmark for testing the capabilities of large models in PPT generation, comprising 279 multi-turn conversations covering different topics and hundreds of instructions involving multi-modal operations. The research team has also proposed the PPTX-Match evaluation system, which assesses whether large language models have completed instructions based on predicted files rather than label API sequences. Therefore, it supports various LLM-generated API sequences. Currently, PPT generation faces three main challenges: error accumulation in multi-turn conversations, processing long PPT templates, and multi-modal perception issues (2023-11-04)
LLMRecAlibabaRecommendationLLMRecBenchmark testing of popular LLMs (such as ChatGPT, LLaMA, ChatGLM, etc.) has been conducted on five recommendation-related tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Additionally, the effectiveness of supervised fine-tuning to enhance the instruction-following capabilities of LLMs has been studied (2023-10-08)
LAiWDai-shenLegalLAiWIn response to the rapid development of legal large language models, the first Chinese legal large language model benchmark based on legal capabilities has been proposed. Legal capabilities are divided into three levels: basic legal natural language processing capabilities, basic legal application capabilities, and complex legal application capabilities. The first phase of evaluation has been completed, focusing on the assessment of basic legal natural language processing capabilities. The evaluation results show that while some legal large language models perform better than their base models, there is still a gap compared to ChatGPT (2023-10-25)
OpsEvalTsinghua UniversityAIOpsOpsEvalOpsEval is a comprehensive task-oriented AIOps benchmark test for large language models, assessing the proficiency of LLMs in three key scenarios: wired network operations, 5G communication operations, and database operations. These scenarios involve different capability levels, including knowledge recall, analytical thinking, and practical application. The benchmark comprises 7,200 questions in multiple-choice and Q&A formats, supporting both English and Chinese (2023-10-02)
SWE-benchprinceton-nlpSoftwareSWE-benchSWE-bench is a benchmark for evaluating the performance of large language models on real software issues collected from GitHub. Given a code repository and a problem, the task of the language model is to generate a patch that can solve the described problem
BLURBMindrank AIHealthcareBLURBBLURB includes a comprehensive benchmark test for biomedical natural language processing applications based on PubMed, as well as a leaderboard for tracking community progress. BLURB comprises six diverse tasks and thirteen publicly available datasets. To avoid overemphasizing tasks with many available datasets (e.g., named entity recognition NER), BLURB reports the macro-average across all tasks as the primary score. The BLURB leaderboard is model-agnostic; any system that can generate test predictions using the same training and development data can participate. The primary goal of BLURB is to lower the barrier to participation in biomedical natural language processing and help accelerate progress in this important field that has a positive impact on society and humanity
SmartPlaymicrosoftGamingSmartPlaySmartPlay is a large language model (LLM) benchmark designed for ease of use, offering a variety of games for testing
FinEvalSUFE-AIFLM-LabFinanceFinEvalFinEval: A collection of high-quality multiple-choice questions covering fields such as finance, economics, accounting, and certificates
GSM8KOpenAIMathematicsGSM8KGSM8K is a dataset of 8.5K high-quality linguistically diverse elementary school math word problems. GSM8K divides them into 7.5K training problems and 1K test problems. These problems require 2 to 8 steps to solve, with solutions primarily involving performing a series of basic arithmetic operations (+ - / *) to reach the final answer
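
The BLURB entry above reports a macro-average across tasks so that tasks with many datasets do not dominate the primary score. A minimal sketch of that aggregation, with made-up task names and scores used purely for illustration:

```python
from statistics import mean
from typing import Dict


def macro_average(scores_by_task: Dict[str, Dict[str, float]]) -> float:
    """Average dataset scores within each task, then average the task means."""
    task_means = [mean(list(datasets.values())) for datasets in scores_by_task.values()]
    return mean(task_means)


def dataset_level_average(scores_by_task: Dict[str, Dict[str, float]]) -> float:
    """For contrast: a flat average over all datasets, ignoring task grouping."""
    flat = [s for datasets in scores_by_task.values() for s in datasets.values()]
    return mean(flat)


# Hypothetical scores: one task has three datasets, the other only one.
scores = {
    "NER": {"dataset_a": 80.0, "dataset_b": 90.0, "dataset_c": 85.0},
    "QA": {"dataset_d": 60.0},
}
print(macro_average(scores))          # each task counts once
print(dataset_level_average(scores))  # the three NER datasets outweigh QA
```

In this toy example the flat average is pulled toward the task with more datasets, while the macro-average weights both tasks equally.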

RAG-Evaluation

| Name | Institution | URL | Introduction |
| --- | --- | --- | --- |
| BERGEN | NAVER | BERGEN | BERGEN (BEnchmarking Retrieval-augmented GENeration) is a library for benchmarking RAG systems, focusing on question answering (QA). Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in a RAG pipeline. BERGEN was designed to ease reproducibility and the integration of new datasets and models through HuggingFace (2024-05-31) |
| CRAG | Meta Reality Labs | CRAG | CRAG is a factual question-answering benchmark for RAG comprising 4,409 question-answer pairs and mock APIs that simulate web and Knowledge Graph (KG) search. It covers a diverse array of questions across five domains and eight question categories, reflecting entity popularity from popular to long-tail and temporal dynamism ranging from years to seconds, with the aim of inspiring researchers to improve the reliability and accuracy of QA systems (2024-06-07) |
| raga-llm-hub | RAGA-AI | raga-llm-hub | raga-llm-hub is a comprehensive evaluation toolkit for language and learning models (LLMs). With over 100 carefully designed evaluation metrics, it lets developers and organizations evaluate and compare LLMs and establish basic safeguards for LLM and retrieval-augmented generation (RAG) applications. Its tests assess relevance and understanding, content quality, hallucination, safety and bias, context relevance, safeguards, and vulnerability scanning, and it provides a series of metric-based tests for quantitative analysis (2024-03-10) |
| ARES | Stanford | ARES | ARES is an automatic evaluation framework for retrieval-augmented generation systems, comprising three components: (1) A set of annotated query-document-answer triplets with human preference validations for evaluation criteria such as context relevance, answer faithfulness, and/or answer relevance. There should be at least 50 examples, but preferably several hundred. (2) A small set of examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system. (3) A large number of unannotated query-document-answer triplets generated by your RAG system for scoring. The ARES training process includes three steps: (1) Generating synthetic queries and answers from domain-specific paragraphs. (2) Fine-tuning LLM evaluators for scoring RAG systems by training on the synthetic data. (3) Deploying the prepared LLM evaluators to assess the performance of your RAG system on key metrics (2023-09-27) |
| RGB | CAS | RGB | RGB (Retrieval-Augmented Generation Benchmark) is a corpus for evaluating RAG in English and Chinese. It analyzes the performance of different large language models in the four basic capabilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. RGB divides its instances into four independent test sets, one per capability. Six representative LLMs were evaluated on RGB to diagnose the challenges current LLMs face when applying RAG. The evaluation shows that while LLMs demonstrate a certain level of noise robustness, they still face significant difficulties in negative rejection, information integration, and handling false information, indicating that there is still a long way to go in effectively applying RAG to LLMs (2023-09-04) |
| tvalmetrics | TonicAI | tvalmetrics | The metrics in Tonic Validate Metrics use LLM-assisted evaluation, meaning they use an LLM (e.g., gpt-4) to score different aspects of RAG application outputs and answer questions about RAG applications. (1) Answer similarity score: How well does the RAG answer match the reference answer? (2) Retrieval precision: Is the retrieved context relevant to the question? (3) Augmentation precision: Does the answer contain retrieved context relevant to the question? (4) Augmentation accuracy: What is the proportion of retrieved context in the answer? (5) Answer consistency (binary): Does the answer contain any information outside the retrieved context? (6) Retrieval k-recall: For the top k context vectors, is the retrieved context a subset of the top k context vectors, and are all relevant contexts in the retrieved context part of the top k context vectors? (2023-11-11) |
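
Several of the tvalmetrics questions above reduce to simple ratios once each retrieved context has been judged. The sketch below computes three of them from boolean judgments; the `RetrievedChunk` fields are hypothetical stand-ins for verdicts that the real library obtains from an LLM, and this is not the Tonic Validate API:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RetrievedChunk:
    text: str
    relevant: bool        # assumed judgment: is the chunk relevant to the question?
    used_in_answer: bool  # assumed judgment: does the answer draw on the chunk?


def _fraction(flags: Iterable[bool]) -> float:
    """Share of True values; 0.0 when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def retrieval_precision(chunks: List[RetrievedChunk]) -> float:
    """Share of retrieved chunks that are relevant to the question."""
    return _fraction(c.relevant for c in chunks)


def augmentation_precision(chunks: List[RetrievedChunk]) -> float:
    """Share of relevant chunks that actually appear in the answer."""
    return _fraction(c.used_in_answer for c in chunks if c.relevant)


def augmentation_accuracy(chunks: List[RetrievedChunk]) -> float:
    """Share of all retrieved chunks that appear in the answer."""
    return _fraction(c.used_in_answer for c in chunks)


# Hypothetical judgments for four retrieved chunks.
chunks = [
    RetrievedChunk("chunk 1", relevant=True, used_in_answer=True),
    RetrievedChunk("chunk 2", relevant=True, used_in_answer=False),
    RetrievedChunk("chunk 3", relevant=True, used_in_answer=True),
    RetrievedChunk("chunk 4", relevant=False, used_in_answer=False),
]
print(retrieval_precision(chunks), augmentation_precision(chunks), augmentation_accuracy(chunks))
```

Separating the judgment step from the arithmetic makes it easy to swap the judge (human labels, a different LLM) without touching the metric definitions.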

Agent-Capabilities

| Name | Institution | URL | Introduction |
| --- | --- | --- | --- |
| SuperCLUE-Agent | CLUE | SuperCLUE-Agent | SuperCLUE-Agent is a multi-dimensional benchmark focusing on Agent capabilities, covering three core capabilities and ten basic tasks. It can be used to evaluate the performance of large language models in core Agent capabilities, including tool usage, task planning, and long- and short-term memory. An evaluation of 16 LLMs that support Chinese found that GPT-4 leads significantly in core Agent capabilities for Chinese tasks, while representative Chinese models, both open-source and closed-source, are approaching the level of GPT-3.5 (2023-10-20) |
| AgentBench | Tsinghua University | AgentBench | AgentBench is a systematic benchmark evaluation tool for assessing LLMs as intelligent agents, highlighting the performance gap between commercial LLMs and open-source competitors (2023-08-01) |
| AgentBench Reasoning and Decision-making Evaluation Leaderboard | THUDM | AgentBench | Jointly launched by Tsinghua and multiple universities, it covers the reasoning and decision-making capabilities of models in different task environments, such as shopping, home, and operating systems |
| ToolBench Tool Invocation Evaluation | Zhiyuan/Tsinghua | ToolBench | Provides evaluation scripts and compares tool fine-tuned models with ChatGPT |

Code-Capabilities

| Name | Institution | URL | Introduction |
| --- | --- | --- | --- |
| McEval | Beihang | McEval | To explore the code capabilities of large language models more comprehensively, this work proposes McEval, a massively multilingual, multi-task code evaluation benchmark covering 40 programming languages with 16K test samples. It includes challenging code completion, understanding, and generation tasks, together with the finely curated massively multilingual instruction corpus McEval-Instruct. Evaluation results show that open-source models still lag significantly behind GPT-4 in multilingual programming, with most unable to surpass even GPT-3.5, although open-source models such as Codestral, DeepSeek-Coder, CodeQwen, and some derivative models also exhibit excellent multilingual capabilities. The McEval leaderboard can be found here (2024-06-11) |
| HumanEval-XL | FloatAI | HumanEval-XL | Existing benchmarks primarily focus on translating English prompts into multi-lingual code or are limited to very restricted natural languages. They overlook the broad field of large-scale multi-lingual NL to multi-lingual code generation, leaving an important gap in evaluating multi-lingual LLMs. To fill this gap, the authors propose HumanEval-XL, a large-scale multi-lingual code generation benchmark. HumanEval-XL establishes connections between 23 natural languages and 12 programming languages, comprising 22,080 prompts with an average of 8.33 test cases per prompt. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL provides a comprehensive evaluation platform for multi-lingual LLMs, enabling the assessment of understanding of different NLs. This work represents a pioneering step in addressing the gap in NL generalization evaluation for multi-lingual code generation (2024-02-26) |
| DebugBench | Tsinghua University | DebugBench | DebugBench is an LLM debugging benchmark comprising 4,253 instances, covering four major bug categories and 18 minor categories in C++, Java, and Python. To construct DebugBench, the authors collected code snippets from the LeetCode community, implanted bugs into the source data using GPT-4, and applied strict quality checks (2024-01-09) |

Multi-modal/Cross-modal

| Name | Institution | URL | Introduction |
| --- | --- | --- | --- |
| ChartVLM | Shanghai AI Lab | ChartVLM | ChartX is a multi-modal evaluation set comprising 18 types of charts, 7 chart tasks, 22 subject themes, and high-quality chart data. Additionally, the authors of this paper have developed ChartVLM, offering a new perspective for handling multi-modal tasks dependent on explainable patterns, such as reasoning tasks in the fields of charts or geometric images (2024-02-19) |
| ReForm-Eval | FudanDISC | ReForm-Eval | ReForm-Eval is a benchmark dataset for comprehensively evaluating large visual language models. By reconstructing existing multi-modal benchmark datasets with different task formats, ReForm-Eval constructs a benchmark dataset with a unified format suitable for large model evaluation. The constructed ReForm-Eval has the following features: it spans eight evaluation dimensions, providing sufficient evaluation data for each dimension (averaging over 4,000 entries per dimension); it has a unified evaluation question format (including multiple-choice and text generation questions); it is convenient and easy to use, with reliable and efficient evaluation methods that do not rely on external services like ChatGPT; it efficiently utilizes existing data resources without requiring additional manual annotation and can be further expanded to more datasets (2023-10-24) |
| LVLM-eHub | OpenGVLab | LVLM-eHub | "Multi-Modality Arena" is an evaluation platform for large multi-modal models. Following Fastchat, two anonymous models are compared side-by-side on visual question answering tasks. "Multi-Modality Arena" allows side-by-side benchmarking of visual-language models while providing image input. It supports various models such as MiniGPT-4, LLaMA-Adapter V2, LLaVA, and BLIP-2 |

Long Context

| Name | Institution | URL | Introduction |
| --- | --- | --- | --- |
| InfiniteBench | OpenBMB | InfiniteBench | Understanding and processing long text is an essential capability for large models to advance to a deeper level of understanding and interaction. While some large models claim to handle sequences of 100k+ tokens, there is a lack of standardized benchmark datasets. InfiniteBench addresses this as the first LLM benchmark with an average data length surpassing 100K tokens, focusing on five key capabilities of large models in handling long text: retrieval, mathematics, coding, question answering, and summarization. (1) Long Context: The average context length in InfiniteBench test data is 195k, far exceeding existing benchmarks. (2) Multi-domain and Multi-language: The benchmark includes 12 tasks in both Chinese and English, covering the five domains mentioned above. (3) Forward-looking and Challenging: The tasks are designed to match the capabilities of the strongest current models such as GPT-4 and Claude 2. (4) Realistic and Synthetic Scenarios: InfiniteBench incorporates both real-world data to test the model's ability to handle practical problems and synthetic data to facilitate the expansion of context windows for testing. The tasks require a thorough understanding of long dependencies in contexts, so simply retrieving a limited number of passages is insufficient to solve them. (2024-03-19) |

Inference Speed

| Name | Institution | URL | Introduction |
| --- | --- | --- | --- |
| llmperf | Ray | llmperf | A library for inspecting and benchmarking LLM performance. It measures metrics such as Time to First Token (TTFT), Inter-Token Latency (ITL), and the number of requests with no data returned within 3 seconds. It also validates the correctness of LLM outputs, primarily checking for cross-request mix-ups (e.g., Request A receiving the response of Request B). Variations in input and output token lengths are considered in the design to better reflect real-world scenarios. Currently supported endpoints include OpenAI-compatible endpoints (e.g., Anyscale endpoints, private endpoints, OpenAI, Fireworks, etc.), Together, Vertex AI, and SageMaker. (2023-11-03) |
| llm-analysis | Databricks | llm-analysis | Latency and memory analysis of Transformer models for training and inference. |
| llm-inference-benchmark | Nankai University | llm-inference-benchmark | A benchmark of LLM inference frameworks. |
| llm-inference-bench | CentML | llm-inference-bench | This benchmark operates entirely external to any serving framework and can be easily extended and modified. It provides a variety of statistics and profiling modes. Designed as a standalone tool, it enables precise benchmarking with statistically significant results for specific input/output distributions. Each request consists of a single prompt and a single decoding step. |
| GPU-Benchmarks-on-LLM-Inference | UIUC | GPU-Benchmarks-on-LLM-Inference | Uses llama.cpp to test the inference speed of LLaMA models on different GPUs, including RunPod, 16-inch M1 Max MacBook Pro, M2 Ultra Mac Studio, 14-inch M3 MacBook Pro, and 16-inch M3 Max MacBook Pro. |
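
The llmperf entry above names Time to First Token (TTFT), Inter-Token Latency (ITL), and a count of requests with no data within 3 seconds. A minimal sketch of how these can be derived from token arrival timestamps; the function names and the sample timings are illustrative assumptions and do not reflect llmperf's own code:

```python
from typing import List, Sequence, Tuple


def latency_metrics(request_start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for one streamed response.

    token_times holds the arrival time of each token, in increasing order.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


def count_slow_requests(ttfts: List[float], threshold_s: float = 3.0) -> int:
    """Count requests whose first token took longer than the threshold."""
    return sum(1 for t in ttfts if t > threshold_s)


# Hypothetical stream: request sent at t=0.0, four tokens arrive afterwards.
ttft, mean_itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5])
print(ttft, mean_itl)
print(count_slow_requests([0.5, 3.2, 4.0, 1.1]))
```

In practice the timestamps would come from a monotonic clock such as `time.perf_counter()` read as each streamed chunk arrives.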

Quantization-and-Compression

| Name | Institution | URL | Introduction |
|---|---|---|---|
| LLM-QBench | Beihang/SenseTime | LLM-QBench | LLM-QBench is a benchmark for post-training quantization of large language models and serves as an efficient LLM compression tool with various advanced compression methods. It supports multiple inference backends. (2024-05-09) |
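As background for what post-training quantization does, the sketch below applies the simplest scheme, symmetric round-to-nearest int8 quantization with one scale per tensor, to a short list of weights. It is not one of LLM-QBench's methods and does not use its API; the function names and the per-tensor scaling choice are assumptions made for illustration.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map int8 codes back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.8, -1.0, 0.3, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the scale, so a single large outlier weight inflates the error for every other weight in the tensor. More advanced methods mainly differ in how they choose scales and compensate for that error.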

Demos

  • Chat Arena: anonymous models side-by-side and vote for which one is better - An open-source LLM arena in which you act as the judge: you compare and score two model responses without knowing which models produced them, and the models' identities are revealed after you vote. Participants include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.
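Pairwise votes like these are commonly turned into a ranking with an Elo-style rating, and the model comparison table later in this README includes a "Chatbot Arena ELO" row. The sketch below shows the standard Elo update for a single vote. The K-factor of 32, the starting rating of 1000, and the function names are assumptions; this is not the arena's actual scoring code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict[str, float], model_a: str, model_b: str,
               winner: str, k: float = 32.0) -> None:
    """Update ratings in place after one vote. winner is 'a', 'b', or 'tie'."""
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    expected_a = expected_score(ratings[model_a], ratings[model_b])
    # The two updates are symmetric, so the total rating is conserved.
    ratings[model_a] += k * (score_a - expected_a)
    ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))


ratings = {"model-x": 1000.0, "model-y": 1000.0}
update_elo(ratings, "model-x", "model-y", winner="a")
print(ratings)
```

A win against an equally rated opponent moves each rating by half of K; a win against a much stronger opponent moves it by nearly the full K, which is why upsets shift the ranking quickly.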


Leaderboards

| Platform | Access |
|---|---|
| ACLUE | [Source] |
| AgentBench | [Source] |
| AlpacaEval | [Source] |
| ANGO | [Source] |
| BeHonest | [Source] |
| Big Code Models Leaderboard | [Source] |
| Chatbot Arena | [Source] |
| Chinese Large Model Leaderboard | [Source] |
| CLEVA | [Source] |
| CompassRank | [Source] |
| CompMix | [Source] |
| C-Eval | [Source] |
| DreamBench++ | [Source] |
| FELM | [Source] |
| FlagEval | [Source] |
| Hallucination Leaderboard | [Source] |
| HELM | [Source] |
| Huggingface Open LLM Leaderboard | [Source] |
| Huggingface LLM Perf Leaderboard | [Source] |
| Indico LLM Leaderboard | [Source] |
| InfiBench | [Source] |
| InterCode | [Source] |
| LawBench | [Source] |
| LLMEval | [Source] |
| LLM Rankings | [Source] |
| LLM Use Case Leaderboard | [Source] |
| LucyEval | [Source] |
| M3CoT | [Source] |
| MMLU by Task Leaderboard | [Source] |
| MMToM-QA | [Source] |
| MathEval | [Source] |
| OlympicArena | [Source] |
| OpenEval | [Source] |
| Open Multilingual LLM Eval | [Source] |
| PubMedQA | [Source] |
| SafetyBench | [Source] |
| SciBench | [Source] |
| SciKnowEval | [Source] |
| SEED-Bench | [Source] |
| SuperBench | [Source] |
| SuperCLUE | [Source] |
| SuperGLUE | [Source] |
| SuperLim | [Source] |
| TAT-DQA | [Source] |
| TAT-QA | [Source] |
| TheoremOne LLM Benchmarking Metrics | [Source] |
| Toloka | [Source] |
| Toolbench | [Source] |
| VisualWebArena | [Source] |
| We-Math | [Source] |
| WHOOPS! | [Source] |

Provider (link to pricing)OpenAIOpenAIAnthropicGoogleReplicateDeepSeekMistralAnthropicGoogleMistralCohereAnthropicMistralReplicateMistralOpenAIGroqOpenAIMistralAnthropicGroqAnthropicAnthropicMicrosoftMicrosoftMistral
Model nameGPT-4oGPT-4 TurboClaude 3 OpusGemini 1.5 ProLlama 3 70BDeepSeek-V2Mixtral 8x22BClaude 3 SonnetGemini 1.5 FlashMistral LargeCommand R+Claude 3 HaikuMistral SmallLlama 3 8BMixtral 8x7BGPT-3.5 TurboLlama 3 70B (Groq)GPT-4Mistral MediumClaude 2.0Mixtral 8x7B (Groq)Claude 2.1Claude InstantPhi-Medium 4kPhi-3-Small 8kMistral 7B
Column Last Updated5/14/20245/14/20245/14/20245/14/20245/20/20245/20/20245/20/20245/14/20245/14/20245/20/20245/20/20245/14/20245/20/20245/21/20245/14/20245/14/20245/14/20245/14/20245/20/20245/14/20245/14/20245/14/20245/14/20245/21/20245/21/20245/22/2024
CAPABILITY
Artificial Analysis Index1009494888882817876757472716565658883736965636339
LMSys Chatbot Arena ELO13101257125612491208120411581193118211541114110212081189114811261114111511041006
General knowledge:
MMLU88.7086.4086.8081.9082.0078.5077.7579.0078.9081.2075.7075.2072.2068.4070.6070.0082.0086.4075.3078.5070.6073.4078.0075.7062.50
Math:
MATH76.6073.4060.1058.5050.4043.1054.9045.0038.9030.0034.1050.4052.90
MGSM / GSM8K90.5088.6095.0093.0092.3088.9079.6057.1093.0092.0091.0089.60
Reasoning:
GPQA53.6049.1050.4041.5039.5040.4039.5033.3034.2028.1039.5035.70
BIG-BENCH-HARD86.8084.0082.9085.5073.7066.6083.10
DROP, F1 Score83.4085.4083.1078.9078.4064.1080.90
HellaSwag95.4089.0089.2085.9086.9086.7085.5095.3088.0086.7082.4077.00
Code:
HumanEval90.2087.6084.9071.9081.7073.0075.9062.2048.1081.7067.0062.2061.00
Natural2Code77.7077.20
Conversational:
MT Bench93.2083.0083.9086.1080.6083.0081.8078.5068.40
Benchmark Avg Not useful - selection bias has significant impact.80.5080.5380.3169.2569.3278.5077.7572.3367.2071.8075.7068.7879.5554.8880.1059.7269.3274.1683.1379.5580.1081.8075.9578.4075.8365.45
THROUGHPUT
Throughput (median tokens/sec)90.4021.1025.6046.2026.3015.6076.5061.30161.7032.3042.20116.7080.7075.4060.0058.60305.3028.3018.3037.20477.1042.1085.7062.20
Throughput (median seconds per 1K tokens)11.0647.3939.0621.6538.0264.1013.0716.316.1830.9623.708.5712.3913.2616.6717.063.2835.3454.6426.882.1023.7511.6716.08
COST
Cost Input (1M tokens) aka "context window tokens"$5.00$10.00$15.00$7.00$0.65$0.14$2.00$3.00$0.35$4.00$3.00$0.25$1.00$0.05$0.70$0.50$0.59$30.00$2.70$8.00$0.27$8.00$0.80$0.25
Cost Output (1M tokens)$15.00$30.00$75.00$21.00$2.75$0.28$6.00$15.00$0.53$12.00$15.00$1.25$3.00$0.25$0.70$1.50$0.79$60.00$8.10$24.00$0.27$24.00$2.40$0.25
Cost 1M Input + 1M Output tokens$20.00$40.00$90.00$28.00$3.40$0.42$8.00$18.00$0.88$16.00$18.00$1.50$4.00$0.30$1.40$2.00$1.38$90.00$10.80$32.00$0.54$32.00$3.20$0.50
COST VS PERFORMANCE
Cost 1M+1M IO tokens per AA Index point$0.20$0.43$0.96$0.32$0.04$0.01$0.10$0.23$0.01$0.21$0.24$0.02$0.06$0.00$0.02$0.03$0.02$1.08$0.15$0.46$0.01$0.51$0.05$0.01
Cost 1M+1M IO tokens per Chatbot ELO point$0.02$0.03$0.07$0.02$0.00#DIV/0!#DIV/0!$0.01#DIV/0!$0.01$0.02$0.00#DIV/0!$0.00$0.00$0.00$0.00$0.08$0.01$0.03$0.00$0.03$0.00$0.00
Cost 1M+1M IO tokens per Throughput (tokens/sec)$0.22$1.90$3.52$0.61$0.13$0.03$0.10$0.29$0.01$0.50$0.43$0.01$0.05$0.00$0.02$0.03$0.00$3.18$0.59$0.86$0.00$0.76$0.04$0.01
SPECS
Context Window (k)1281282001,000832652001,000321282003283216832100322001004833
Max Output Tokens (k)44484844444
Rate Limit (requests / minute)tieredtieredtiered5600tiered360tiered600tiered30tieredtiered30tieredtiered
Rate Limit (requests / day)tieredtieredtiered2,000tiered10,000tieredtiered14,400tieredtiered14,400tieredtiered
Rate Limit (tokens / minute)tieredtieredtiered10,000,000tiered10,000,000tieredtiered6,000tieredtiered5,000tieredtiered
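The derived rows in the comparison above (combined cost, cost per index point, cost per ELO point, cost per unit of throughput, and seconds per 1K tokens) are simple ratios of the raw rows. The sketch below recomputes them for the first value in each row ($5.00 input, $15.00 output, index 100, ELO 1310, 90.40 tokens/sec), assuming those first values all belong to the first listed model. The function names are hypothetical. Returning `None` for a missing denominator avoids the `#DIV/0!` spreadsheet errors that appear where no ELO score is available.

```python
from typing import Optional


def ratio(numerator: float, denominator: Optional[float]) -> Optional[float]:
    """Divide, returning None when the denominator is missing or zero."""
    if not denominator:
        return None
    return numerator / denominator


def derived_metrics(cost_in: float, cost_out: float, aa_index: Optional[float],
                    arena_elo: Optional[float], tokens_per_s: Optional[float]) -> dict:
    """Recompute the derived cost and throughput rows from the raw ones.

    cost_in / cost_out are USD per 1M input / output tokens.
    """
    combined = cost_in + cost_out
    return {
        "cost_1m_in_plus_1m_out": combined,
        "cost_per_index_point": ratio(combined, aa_index),
        "cost_per_elo_point": ratio(combined, arena_elo),
        "cost_per_token_per_s": ratio(combined, tokens_per_s),
        "seconds_per_1k_tokens": ratio(1000.0, tokens_per_s),
    }


row = derived_metrics(5.00, 15.00, aa_index=100, arena_elo=1310, tokens_per_s=90.40)
for name, value in row.items():
    print(name, None if value is None else round(value, 2))
```

Rounded to two decimals, these reproduce the first entries of the corresponding rows: 20.00, 0.20, 0.02, 0.22, and 11.06.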



Papers



LLM-List

Typical LLM details

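| Model | Parameter | Layers | Atten heads | Dimension | Learning rate | Batch size | Train tokens |
|---|---|---|---|---|---|---|---|
| LLaMA2 | 6.7B | 32 | 32 | 4096 | 3.00E-04 | 4M | 1.0T |
| LLaMA2 | 13.0B | 40 | 40 | 5120 | 3.00E-04 | 4M | 1.0T |
| LLaMA2 | 32.5B | 60 | 52 | 6656 | 1.50E-04 | 4M | 1.4T |
| LLaMA2 | 65.2B | 80 | 64 | 8192 | 1.50E-04 | 4M | 1.4T |
| nano-GPT | 85,584 | 3 | 3 | 768 | 3.00E-04 | | |
| GPT2-small | 0.12B | 12 | 12 | 768 | 2.50E-04 | | |
| GPT2-XL | 1.5B | 48 | 25 | 1600 | 1.50E-04 | | |
| GPT3 | 175B | 96 | 96 | 12288 | 1.50E-04 | | 0.5T |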
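The Layers and Dimension columns above largely determine the Parameter column. A widely used rule of thumb for decoder-only Transformers is roughly 12 × layers × dimension² non-embedding parameters (4·d² for the attention projections plus 8·d² for a 4×-wide MLP in each layer). The sketch below applies it to three rows of the table. It is an approximation that ignores biases and layer norms, the optional `vocab_size` term is an added assumption for embeddings, and architectures with a different MLP layout will deviate somewhat.

```python
def approx_params(layers: int, d_model: int, vocab_size: int = 0) -> int:
    """Rule-of-thumb parameter count for a decoder-only Transformer.

    Per layer: 4*d^2 (Q, K, V, output projections) + 8*d^2 (MLP with 4x hidden width).
    vocab_size adds one token-embedding matrix if given.
    """
    return 12 * layers * d_model ** 2 + vocab_size * d_model


# (layers, dimension, parameter count listed in the table)
rows = {
    "32 layers, d=4096": (32, 4096, "6.7B"),
    "GPT2-XL": (48, 1600, "1.5B"),
    "GPT3": (96, 12288, "175B"),
}
for name, (layers, d_model, listed) in rows.items():
    estimate = approx_params(layers, d_model)
    print(f"{name}: estimated {estimate / 1e9:.2f}B, listed {listed}")
```

The estimates (about 6.44B, 1.47B, and 173.95B) fall slightly below the listed figures, and the gap is mostly the embedding parameters that the rule of thumb leaves out.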

Pre-trained-LLM

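| Model | Size | Architecture | Access | Date | Origin |
|---|---|---|---|---|---|
| Switch Transformer | 1.6T | Decoder(MOE) | - | 2021-01 | Paper |
| GLaM | 1.2T | Decoder(MOE) | - | 2021-12 | Paper |
| PaLM | 540B | Decoder | - | 2022-04 | Paper |
| MT-NLG | 530B | Decoder | - | 2022-01 | Paper |
| J1-Jumbo | 178B | Decoder | api | 2021-08 | Paper |
| OPT | 175B | Decoder | api \| ckpt | 2022-05 | Paper |
| BLOOM | 176B | Decoder | api \| ckpt | 2022-11 | Paper |
| GPT 3.0 | 175B | Decoder | api | 2020-05 | Paper |
| LaMDA | 137B | Decoder | - | 2022-01 | Paper |
| GLM | 130B | Decoder | ckpt | 2022-10 | Paper |
| YaLM | 100B | Decoder | ckpt | 2022-06 | Blog |
| LLaMA | 65B | Decoder | ckpt | 2022-09 | Paper |
| GPT-NeoX | 20B | Decoder | ckpt | 2022-04 | Paper |
| UL2 | 20B | agnostic | ckpt | 2022-05 | Paper |
| T5 | 11B | Encoder-Decoder | ckpt | 2019-10 | Paper |
| CPM-Bee | 10B | Decoder | api | 2022-10 | Paper |
| rwkv-4 | 7B | RWKV | ckpt | 2022-09 | Github |
| GPT-J | 6B | Decoder | ckpt | 2022-09 | Github |
| GPT-Neo | 2.7B | Decoder | ckpt | 2021-03 | Github |
| GPT-Neo | 1.3B | Decoder | ckpt | 2021-03 | Github |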



Instruction-finetuned-LLM

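| Model | Size | Architecture | Access | Date | Origin |
|---|---|---|---|---|---|
| Flan-PaLM | 540B | Decoder | - | 2022-10 | Paper |
| BLOOMZ | 176B | Decoder | ckpt | 2022-11 | Paper |
| InstructGPT | 175B | Decoder | api | 2022-03 | Paper |
| Galactica | 120B | Decoder | ckpt | 2022-11 | Paper |
| OpenChatKit | 20B | - | ckpt | 2023-03 | - |
| Flan-UL2 | 20B | Decoder | ckpt | 2023-03 | Blog |
| Gopher | - | - | - | - | - |
| Chinchilla | - | - | - | - | - |
| Flan-T5 | 11B | Encoder-Decoder | ckpt | 2022-10 | Paper |
| T0 | 11B | Encoder-Decoder | ckpt | 2021-10 | Paper |
| Alpaca | 7B | Decoder | demo | 2023-03 | Github |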



Aligned-LLM

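| Model | Size | Architecture | Access | Date | Origin |
|---|---|---|---|---|---|
| GPT 4 | - | - | - | 2023-03 | Blog |
| ChatGPT | - | Decoder | demo \| api | 2022-11 | Blog |
| Sparrow | 70B | - | - | 2022-09 | Paper |
| Claude | - | - | demo \| api | 2023-03 | Blog |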



Open-LLM

  • LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA

    • Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
    • Flan-Alpaca - Instruction Tuning from Humans and Machines.
    • Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
    • Cabrita - A Portuguese instruction-finetuned LLaMA.
    • Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
    • Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
    • Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
    • GPTQ-for-LLaMA - 4-bit quantization of LLaMA using GPTQ.
    • GPT4All - Demo, data, and code to train open-source assistant-style large language models based on GPT-J and LLaMA.
    • Koala - A Dialogue Model for Academic Research
    • BELLE - Be Everyone's Large Language model Engine
    • StackLLaMA - A hands-on guide to train LLaMA with RLHF.
    • RedPajama - An Open Source Recipe to Reproduce LLaMA training dataset.
    • Chimera - Latin Phoenix.
  • BLOOM - BigScience Large Open-science Open-access Multilingual Language Model BLOOM-LoRA

    • BLOOMZ&mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
    • Phoenix
  • T5 - Text-to-Text Transfer Transformer

    • T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization
  • OPT - Open Pre-trained Transformer Language Models.

  • UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.

  • GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.

  • RWKV - Parallelizable RNN with Transformer-level LLM Performance.

    • ChatRWKV - ChatRWKV is like ChatGPT but powered by my RWKV (100% RNN) language model.
  • StableLM - Stability AI Language Models.

  • YaLM - a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.

  • GPT-Neo - An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.

  • GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.

    • Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
  • Pythia - Interpreting Autoregressive Transformers Across Time and Scale

    • Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
  • OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.

  • Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.

  • GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.

    • GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
  • Palmyra - Palmyra Base was primarily pre-trained with English text.

  • Camel - a state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.

  • h2oGPT

  • PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, MindSpore Team and Peng Cheng Laboratory.

  • Open-Assistant - a project meant to give everyone access to a great chat based large language model.

  • HuggingChat - Powered by Open Assistant's latest model – the best open source chat model right now and @huggingface Inference API.

  • Baichuan - An open-source, commercially available large-scale language model developed by Baichuan Intelligent Technology following Baichuan-7B, containing 13 billion parameters. (20230715)

  • Qwen - Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. (20230803)



| Model | Author | Link | #Parameter | Base Model | #Layer | #Encoder | #Decoder | #Pretrain Tokens | #IFT Sample | RLHF |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-Ada | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 0.35B | - | 24 | - | 24 | - | - | - |
| Pythia-1B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-1b | 1B | - | 16 | - | 16 | 300B tokens | - | - |
| GPT3-Babbage | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 1.3B | - | 24 | - | 24 | - | - | - |
| GPT2-XL | radford2019language | https://huggingface.co/gpt2-xl | 1.5B | - | 48 | - | 48 | 40B tokens | - | - |
| BLOOM-1b7 | scao2022bloom | https://huggingface.co/bigscience/bloom-1b7 | 1.7B | - | 24 | - | 24 | 350B tokens | - | - |
| BLOOMZ-1b7 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-1b7 | 1.7B | BLOOM-1b7 | 24 | - | 24 | - | 8.39B tokens | - |
| Dolly-v2-3b | 2023dolly | https://huggingface.co/databricks/dolly-v2-3b | 2.8B | Pythia-2.8B | 32 | - | 32 | - | 15K | - |
| Pythia-2.8B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-2.8b | 2.8B | - | 32 | - | 32 | 300B tokens | - | - |
| BLOOM-3b | scao2022bloom | https://huggingface.co/bigscience/bloom-3b | 3B | - | 30 | - | 30 | 350B tokens | - | - |
| BLOOMZ-3b | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-3b | 3B | BLOOM-3b | 30 | - | 30 | - | 8.39B tokens | - |
| StableLM-Base-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-3b | 3B | - | 16 | - | 16 | 800B tokens | - | - |
| StableLM-Tuned-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b | 3B | StableLM-Base-Alpha-3B | 16 | - | 16 | - | 632K | - |
| ChatGLM-6B | zeng2023glm-130b,du2022glm | https://huggingface.co/THUDM/chatglm-6b | 6B | - | 28 | 28 | 28 | 1T tokens | ✓ | ✓ |
| DoctorGLM | xiong2023doctorglm | https://github.com/xionghonglin/DoctorGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 6.38M | - |
| ChatGLM-Med | ChatGLM-Med | https://github.com/SCIR-HI/Med-ChatGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 8K | - |
| GPT3-Curie | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 6.7B | - | 32 | - | 32 | - | - | - |
| MPT-7B-Chat | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-chat | 6.7B | MPT-7B | 32 | - | 32 | - | 360K | - |
| MPT-7B-Instruct | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-instruct | 6.7B | MPT-7B | 32 | - | 32 | - | 59.3K | - |
| MPT-7B-StoryWriter-65k+ | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-storywriter | 6.7B | MPT-7B | 32 | - | 32 | - | ✓ | - |
| Dolly-v2-7b | 2023dolly | https://huggingface.co/databricks/dolly-v2-7b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 15K | - |
| h2ogpt-oig-oasst1-512-6.9b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6.9b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 398K | - |
| Pythia-6.9B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-6.9b | 6.9B | - | 32 | - | 32 | 300B tokens | - | - |
| Alpaca-7B | alpaca | https://huggingface.co/tatsu-lab/alpaca-7b-wdiff | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
| Alpaca-LoRA-7B | 2023alpacalora | https://huggingface.co/tloen/alpaca-lora-7b | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
| Baize-7B | xu2023baize | https://huggingface.co/project-baize/baize-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 263K | - |
| Baize Healthcare-7B | xu2023baize | https://huggingface.co/project-baize/baize-healthcare-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 201K | - |
| ChatDoctor | yunxiang2023chatdoctor | https://github.com/Kent0n-Li/ChatDoctor | 7B | LLaMA-7B | 32 | - | 32 | - | 167K | - |
| HuaTuo | wang2023huatuo | https://github.com/scir-hi/huatuo-llama-med-chinese | 7B | LLaMA-7B | 32 | - | 32 | - | 8K | - |
| Koala-7B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 7B | LLaMA-7B | 32 | - | 32 | - | 472K | - |
| LLaMA-7B | touvron2023llama | https://huggingface.co/decapoda-research/llama-7b-hf | 7B | - | 32 | - | 32 | 1T tokens | - | - |
| Luotuo-lora-7b-0.3 | luotuo | https://huggingface.co/silk-road/luotuo-lora-7b-0.3 | 7B | LLaMA-7B | 32 | - | 32 | - | 152K | - |
| StableLM-Base-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-7b | 7B | - | 16 | - | 16 | 800B tokens | - | - |
| StableLM-Tuned-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b | 7B | StableLM-Base-Alpha-7B | 16 | - | 16 | - | 632K | - |
| Vicuna-7b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 7B | LLaMA-7B | 32 | - | 32 | - | 70K | - |
| BELLE-7B-0.2M /0.6M /1M /2M | belle2023exploring | https://huggingface.co/BelleGroup/BELLE-7B-2M | 7.1B | Bloomz-7b1-mt | 30 | - | 30 | - | 0.2M/0.6M/1M/2M | - |
| BLOOM-7b1 | scao2022bloom | https://huggingface.co/bigscience/bloom-7b1 | 7.1B | - | 30 | - | 30 | 350B tokens | - | - |
| BLOOMZ-7b1 /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-7b1-p3 | 7.1B | BLOOM-7b1 | 30 | - | 30 | - | 4.19B tokens | - |
| Dolly-v2-12b | 2023dolly | https://huggingface.co/databricks/dolly-v2-12b | 12B | Pythia-12B | 36 | - | 36 | - | 15K | - |
| h2ogpt-oasst1-512-12b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b | 12B | Pythia-12B | 36 | - | 36 | - | 94.6K | - |
| Open-Assistant-SFT-4-12B | 2023openassistant | https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | 12B | Pythia-12B-deduped | 36 | - | 36 | - | 161K | - |
| Pythia-12B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-12b | 12B | - | 36 | - | 36 | 300B tokens | - | - |
| Baize-13B | xu2023baize | https://huggingface.co/project-baize/baize-lora-13B | 13B | LLaMA-13B | 40 | - | 40 | - | 263K | - |
| Koala-13B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 13B | LLaMA-13B | 40 | - | 40 | - | 472K | - |
| LLaMA-13B | touvron2023llama | https://huggingface.co/decapoda-research/llama-13b-hf | 13B | - | 40 | - | 40 | 1T tokens | - | - |
| StableVicuna-13B | 2023StableLM | https://huggingface.co/CarperAI/stable-vicuna-13b-delta | 13B | Vicuna-13B v0 | 40 | - | 40 | - | 613K | ✓ |
| Vicuna-13b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 13B | LLaMA-13B | 40 | - | 40 | - | 70K | - |
| moss-moon-003-sft | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.1M | - |
| moss-moon-003-sft-plugin | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft-plugin | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.4M | - |
| GPT-NeoX-20B | gptneox | https://huggingface.co/EleutherAI/gpt-neox-20b | 20B | - | 44 | - | 44 | 825GB | - | - |
| h2ogpt-oasst1-512-20b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b | 20B | GPT-NeoX-20B | 44 | - | 44 | - | 94.6K | - |
| Baize-30B | xu2023baize | https://huggingface.co/project-baize/baize-lora-30B | 33B | LLaMA-30B | 60 | - | 60 | - | 263K | - |
| LLaMA-30B | touvron2023llama | https://huggingface.co/decapoda-research/llama-30b-hf | 33B | - | 60 | - | 60 | 1.4T tokens | - | - |
| LLaMA-65B | touvron2023llama | https://huggingface.co/decapoda-research/llama-65b-hf | 65B | - | 80 | - | 80 | 1.4T tokens | - | - |
| GPT3-Davinci | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 175B | - | 96 | - | 96 | 300B tokens | - | - |
| BLOOM | scao2022bloom | https://huggingface.co/bigscience/bloom | 176B | - | 70 | - | 70 | 366B tokens | - | - |
| BLOOMZ /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-p3 | 176B | BLOOM | 70 | - | 70 | - | 2.09B tokens | - |
| ChatGPT (2023.05.01) | openaichatgpt | https://platform.openai.com/docs/models/gpt-3-5 | - | GPT-3.5 | - | - | - | - | ✓ | ✓ |
| GPT-4 (2023.05.01) | openai2023gpt4 | https://platform.openai.com/docs/models/gpt-4 | - | - | - | - | - | - | ✓ | ✓ |



Frameworks-for-Training

  • Accelerate - 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
  • Apache MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
  • Caffe - A fast open framework for deep learning.
  • ColossalAI - An integrated large-scale model training system with efficient parallelization techniques.
  • DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • Jax - Autograd and XLA for high-performance machine learning research.
  • Kedro - Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.
  • Keras - Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow.
  • LightGBM - A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
  • MegEngine - MegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation.
  • metric-learn - Metric Learning Algorithms in Python.
  • MindSpore - MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
  • Oneflow - OneFlow is a performance-centered and open-source deep learning framework.
  • PaddlePaddle - Machine Learning Framework from Industrial Practice.
  • PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
  • PyTorch Lightning - Deep learning framework to train, deploy, and ship AI products Lightning fast.
  • XGBoost - Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library.
  • scikit-learn - Machine Learning in Python.
  • TensorFlow - An Open Source Machine Learning Framework for Everyone.
  • VectorFlow - A minimalist neural network library optimized for sparse data and single machine environments.



LLMOps

| Name | Stars | Description |
|---|---|---|
| Byzer-LLM | | Byzer-LLM is a comprehensive large model infrastructure that supports capabilities related to large models, such as pre-training, fine-tuning, deployment, and serving. Byzer-Retrieval is a storage infrastructure specifically developed for large models, supporting batch import of various data sources, real-time single-item updates, and full-text, vector, and hybrid searches to facilitate data usage for Byzer-LLM. Byzer-SQL/Python offers user-friendly interactive APIs with a low barrier to entry for utilizing the aforementioned products. |
| agenta | | An LLMOps platform for building powerful LLM applications. It allows for easy experimentation and evaluation of different prompts, models, and workflows to construct robust applications. |
| Arize-Phoenix | | ML observability for LLMs, vision, language, and tabular models. |
| BudgetML | | Deploy ML inference services on a limited budget with less than 10 lines of code. |
| CometLLM | | An open-source LLMOps platform for logging, managing, and visualizing LLM prompts and chains. It tracks prompt templates, variables, duration, token usage, and other metadata. It also scores prompt outputs and visualizes chat history in a single UI. |
| deeplake | | Stream large multimodal datasets to achieve near 100% GPU utilization. Query, visualize, and version control data. Access data without recalculating embeddings for model fine-tuning. |
| Dify | | An open-source framework that enables developers (even non-developers) to quickly build useful applications based on large language models, ensuring they are visible, actionable, and improvable. |
| Dstack | | Cost-effective LLM development in any cloud (AWS, GCP, Azure, Lambda, etc.). |
| Embedchain | | A framework for creating ChatGPT-like robots on datasets. |
| GPTCache | | Create semantic caches to store responses to LLM queries. |
| Haystack | | Quickly build applications with LLM agents, semantic search, question answering, and more. |
| langchain | | Build LLM applications through composability. |
| LangFlow | | A hassle-free way to experiment with and prototype LangChain processes using drag-and-drop components and a chat interface. |
| LangKit | | A ready-to-use LLM telemetry collection library that extracts profiles of LLM performance over time, as well as prompts, responses, and metadata, to identify issues at scale. |
| LiteLLM 🚅 | | A simple and lightweight 100-line package for standardizing LLM API calls across OpenAI, Azure, Cohere, Anthropic, Replicate, and other API endpoints. |
| LlamaIndex | | Provides a central interface to connect your LLMs with external data. |
| LLMApp | | LLM App is a Python library that helps you build real-time LLM-enabled data pipelines with just a few lines of code. |
| LLMFlows | | LLMFlows is a framework for building simple, clear, and transparent LLM applications, such as chatbots, question-answering systems, and agents. |
| LLMonitor | | Observability and monitoring for AI applications and agents. Debug agents with robust tracking and logging. Use analytical tools to delve into request history. Developer-friendly modules that can be easily integrated into LangChain. |
| magentic | | Seamlessly integrate LLMs as Python functions. Use type annotations to specify structured outputs. Combine LLM queries and function calls with regular Python code to create complex LLM-driven functionalities. |
| Pezzo 🕹️ | | Pezzo is an open-source LLMOps platform built for developers and teams. With just two lines of code, you can easily troubleshoot AI operations, collaborate on and manage your prompts, and deploy changes instantly from one place. |
| promptfoo | | An open-source tool for testing and evaluating prompt quality. Create test cases, automatically check output quality, and catch regressions to reduce evaluation costs. |
| prompttools | | An open-source tool for testing and trying out prompts. The core idea is to enable developers to evaluate prompts using familiar interfaces such as code and notebooks. With just a few lines of code, you can test prompts and parameters across different models (whether you're using OpenAI, Anthropic, or LLaMA models). You can even evaluate the accuracy of vector database retrievals. |
| TrueFoundry | No GitHub link | Deploy LLMOps tools on your own Kubernetes (EKS, AKS, GKE, On-prem) infrastructure, including Vector DBs, embedded servers, etc. This includes open-source LLM models for deployment, fine-tuning, prompt tracking, and providing complete data security and optimal GPU management. Use best software engineering practices to train and launch your LLM applications at production scale. |
| ReliableGPT 💪 | | Handle OpenAI errors for your production LLM applications (overloaded OpenAI servers, rotated keys, or context window errors). |
| Weights & Biases (Prompts) | No GitHub link | A set of LLMOps tools in the developer-focused W&B MLOps platform. Use W&B Prompts to visualize and inspect LLM execution flows, track inputs and outputs, view intermediate results, and manage prompts and LLM chain configurations. |
| xTuring | | Build and control your personal LLMs using fast and efficient fine-tuning. |
| ZenML | | An open-source framework for orchestrating, experimenting, and deploying production-grade ML solutions, with built-in langchain and llama_index integration. |



Courses



Other-Awesome-Lists



Licenses

MIT license

MIT License.

CC BY-NC-SA 4.0

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.



Citation

@misc{llm-eval-anthropomorphic,
      title={Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap}, 
      author={Jun Wang and Ninglun Gu and Kailai Zhang and Zijiao Zhang and Yelun Bao and Jin Yang and Xu Yin and Liwei Liu and Yihuan Liu and Pengyong Li and Gary G. Yen and Junchi Yan},
      year={2025},
      eprint={2508.18646},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18646}, 
}