# FlagEval evaluation platform
May 16, 2025

FlagEval, launched by BAAI in 2023, is a comprehensive large-model evaluation system covering over 800 open-source and closed-source models from around the globe. It evaluates more than 40 capability dimensions, including reasoning, mathematics, and task solving, organized into five major tasks and four categories of metrics.
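To make the multi-dimensional setup concrete, here is a minimal, purely illustrative sketch of how per-dimension scores could be aggregated into a single leaderboard row. The dimension names, score values, and equal-weight averaging below are assumptions for illustration only, not FlagEval's actual API or scoring scheme.

```python
# Illustrative only: a toy aggregator for multi-dimensional evaluation scores.
# Dimension names and the equal-weight averaging are hypothetical, not FlagEval's method.
from statistics import mean


def aggregate_scores(scores_by_dimension: dict[str, list[float]]) -> dict[str, float]:
    """Average each dimension's benchmark scores, then add an overall mean."""
    per_dim = {dim: mean(vals) for dim, vals in scores_by_dimension.items()}
    per_dim["overall"] = mean(per_dim.values())
    return per_dim


# Hypothetical scores for one model across three capability dimensions.
row = aggregate_scores({
    "reasoning": [0.82, 0.78],
    "mathematics": [0.65, 0.71],
    "task_solving": [0.90, 0.88],
})
```

In practice a platform like FlagEval would weight dimensions and normalize benchmarks differently; the sketch only shows the shape of the aggregation step.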
## 🧠 FlagEval Report
Author: FlagEval
The FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, enabling researchers and developers to better understand the strengths and limitations of leading AI models.
- Issue 2 (updated 2024-12-30) · PDF
- Issue 1 (updated 2024-07-13) · PDF
## 🌟 FlagEval Core
| Project | Scope | GitHub |
|---|---|---|
| FlagEval | General‑purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | https://github.com/flageval-baai/FlagEval |
## 🚀 Satellite Repositories
| Project | Description | GitHub |
|---|---|---|
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75‑85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3‑5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large‑scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147 k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI-24 Chinese Multimodal Multi-type Question benchmark (3,603 exam-style Q&A) | https://github.com/flageval-baai/CMMU |
## 📚 Repository Matrix
| Repo | Highlights | Why It Matters | License |
|---|---|---|---|
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT‑4o era, supports batch eval | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |
## 🔭 Roadmap (2025-2026)
- Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts.
- Community Challenges: quarterly leaderboard sprints to surface emerging research directions.
## 🤝 Contributing
We welcome issues & PRs! Please check each project’s CONTRIBUTING.md and adhere to its license terms.
## 📄 Citation
If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.
## 🛡️ License
This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.
Maintained by the FlagEval team · Last updated: 2025‑04‑23