FlagEval evaluation platform

May 16, 2025

FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models that covers more than 800 open-source and closed-source models from around the world. It features more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, along with five major tasks and four categories of metrics.

🧠 FlagEval Report

Author: FlagEval

The FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, helping researchers and developers better understand the strengths and limitations of leading AI models.

Issue 2 (2024-12-30 Updated) pdf

Issue 1 (2024-07-13 Updated) pdf

🌟 FlagEval Core

| Project | Scope | GitHub |
| --- | --- | --- |
| FlagEval | General-purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | https://github.com/flageval-baai/FlagEval |

🚀 Satellite Repositories

| Project | Description | GitHub |
| --- | --- | --- |
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75-85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3-5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large-scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI-24 Chinese Multimodal Multi-type Question benchmark (3,603 exam-style Q&A) | https://github.com/flageval-baai/CMMU |

📚 Repository Matrix

| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One-stop hub for model & algorithm benchmarking | Apache-2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT-4o era, supports batch eval | Apache-2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super-aged population | CC BY-NC-SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY-NC-SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large-scale hallucination localization benchmark | Apache-2.0 |
| CMMU | Multimodal Q&A exam | Stress-tests domain knowledge & reasoning | MIT |

🔭 Roadmap (2025‑2026)

  1. Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts.
  2. Community Challenges: quarterly leaderboard sprints to surface emerging research directions.

🤝 Contributing

We welcome issues & PRs! Please check each project’s CONTRIBUTING.md and adhere to its license terms.


📄 Citation

If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.


🛡️ License

This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.


Maintained by the FlagEval team · Last updated: 2025‑04‑23