# FlagEval evaluation platform
May 16, 2025

FlagEval, launched by BAAI in 2023, is a comprehensive large-model evaluation system covering over 800 open-source and closed-source models from around the globe. It evaluates more than 40 capability dimensions, including reasoning, mathematics, and task solving, organized into five major tasks and four categories of metrics.
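To make the multi-dimensional setup concrete, here is a minimal, purely illustrative sketch of how per-dimension scores could be aggregated into a single leaderboard row. The dimension names, score values, and equal-weight averaging below are assumptions for illustration only, not FlagEval's actual API or scoring scheme.

```python
# Illustrative only: a toy aggregator for multi-dimensional evaluation scores.
# Dimension names and the equal-weight averaging are hypothetical, not FlagEval's method.
from statistics import mean


def aggregate_scores(scores_by_dimension: dict[str, list[float]]) -> dict[str, float]:
    """Average each dimension's benchmark scores, then add an overall mean."""
    per_dim = {dim: mean(vals) for dim, vals in scores_by_dimension.items()}
    per_dim["overall"] = mean(per_dim.values())
    return per_dim


# Hypothetical scores for one model across three capability dimensions.
row = aggregate_scores({
    "reasoning": [0.82, 0.78],
    "mathematics": [0.65, 0.71],
    "task_solving": [0.90, 0.88],
})
```

In practice a platform like FlagEval would weight dimensions and normalize benchmarks differently; the sketch only shows the shape of the aggregation step.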
## 🧠 FlagEval Report
Author: FlagEval
The FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, enabling researchers and developers to better understand the strengths and limitations of leading AI models.
- Issue 2 (updated 2024-12-30) · PDF
- Issue 1 (updated 2024-07-13) · PDF
## 🌟 FlagEval Core
| Project | Scope | GitHub |
|---|---|---|
| FlagEval | General‑purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | https://github.com/flageval-baai/FlagEval |
## 🚀 Satellite Repositories
| Project | Description | GitHub |
|---|---|---|
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75‑85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3‑5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large‑scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147 k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI-24 Chinese Multimodal Multi-type Question benchmark (3,603 exam-style Q&A) | https://github.com/flageval-baai/CMMU |
## 📚 Repository Matrix
| Repo | Highlights | Why It Matters | License |
|---|---|---|---|
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT‑4o era, supports batch eval | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |
## 🔭 Roadmap (2025-2026)
- Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts.
- Community Challenges: quarterly leaderboard sprints to surface emerging research directions.
## 🤝 Contributing
We welcome issues & PRs! Please check each project’s CONTRIBUTING.md and adhere to its license terms.
## 📄 Citation
If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.
## 🛡️ License
This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.
Maintained by the FlagEval team · Last updated: 2025‑04‑23