BIG-Bench Extra Hard (BBEH) Leaderboard
May 6, 2025
The table below is the BBEH leaderboard, sorted by harmonic mean accuracy (the BBEH column). To contribute a result, email mehrankazemi@google.com with your model name, a link to the paper (shown in the "Contributed by" column), and the scores.
| Model | Contributed by | BBEH | BBEH (Micro Avg) | BBEH Mini (Micro Avg) |
|---|---|---|---|---|
| o3-mini (high) | Original paper | 44.8 | 54.2 | 56.7 |
| Gemini 2.0 Flash | Original paper | 9.8 | 23.9 | 27.0 |
| Gemini 2.0 Flash-Lite | Original paper | 8.0 | 19.7 | 22.2 |
| DeepSeek R1 | Original paper | 6.8 | 34.9 | 37.2 |
| GPT4o | Original paper | 6.0 | 22.3 | 23.5 |
| Distill R1 Qwen 32b | Original paper | 5.2 | 19.2 | 15.4 |
| Gemma3 27b | Gemma3 + Original paper | 4.9 | 18.8 | 17.4 |
| Gemma3 12b | Gemma3 + Original paper | 4.5 | 16.3 | 14.3 |
| Gemma2 27b IT | Original paper | 4.0 | 14.8 | 15.0 |
| Llama 3.1 8b Instruct | Original paper | 3.6 | 10.6 | 11.5 |
| Gemma3 4b | Gemma3 + Original paper | 3.4 | 11.0 | 13.3 |
| Random | Original paper | 2.4 | 8.4 | 8.4 |
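
The harmonic mean and micro average columns can rank models differently (compare DeepSeek R1 and Gemini 2.0 Flash above). The sketch below illustrates why, under stated assumptions: the task names and counts are hypothetical, the helper functions are not taken from the BBEH codebase, and the `offset` parameter is only one possible way to keep a zero-accuracy task from forcing the harmonic mean to zero. It is not necessarily the exact formula used for the leaderboard.

```python
from statistics import harmonic_mean


def micro_average(task_results):
    """Accuracy (%) over all examples pooled across tasks."""
    correct = sum(c for c, _ in task_results.values())
    total = sum(n for _, n in task_results.values())
    return 100.0 * correct / total


def harmonic_mean_accuracy(task_results, offset=0.0):
    """Harmonic mean of per-task accuracies (%).

    `offset` is an illustrative assumption: a constant added to every
    per-task accuracy before averaging.
    """
    per_task = [100.0 * c / n + offset for c, n in task_results.values()]
    return harmonic_mean(per_task)


# Hypothetical (correct, total) counts per task; not real BBEH results.
results = {
    "task_a": (120, 200),  # 60% accuracy
    "task_b": (100, 200),  # 50% accuracy
    "task_c": (4, 200),    # 2% accuracy
}

micro = micro_average(results)
hmean = harmonic_mean_accuracy(results)
print(f"micro average: {micro:.1f}")
print(f"harmonic mean: {hmean:.1f}")
```

In this toy case a single weak task pulls the harmonic mean far below the micro average, which is consistent with the large gaps between the two columns in the table.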