Supported Datasets of LLMBox
November 15, 2024 · View on GitHub
We currently support 59+ commonly used datasets for LLMs.
Understanding Evaluation Types
Each dataset is either a multiple-choice dataset or a generation dataset. You can find the difference between them here.
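The two types call the model differently: a multiple-choice dataset scores each candidate option (for example, by per-option log-likelihood) and picks the highest, while a generation dataset compares free-form model output against reference answers. A minimal sketch of that distinction, where `option_scores` stands in for model-assigned log-likelihoods (this is illustrative, not LLMBox's actual API):

```python
# Sketch of the two evaluation styles. `option_scores` stands in for
# per-option log-likelihoods from a model; LLMBox's real interfaces differ.

def eval_multiple_choice(option_scores: dict[str, float], gold: str) -> bool:
    """Pick the highest-scoring option and check it against the gold label."""
    prediction = max(option_scores, key=option_scores.get)
    return prediction == gold

def eval_generation(generated: str, references: list[str]) -> bool:
    """Match a generated answer against any reference after light normalization."""
    norm = generated.strip().lower()
    return any(norm == ref.strip().lower() for ref in references)

# A COPA-style multiple-choice instance: option "B" has the best score.
print(eval_multiple_choice({"A": -4.2, "B": -1.3}, gold="B"))  # True

# A generation instance with two acceptable references.
print(eval_generation("Paris ", ["paris", "the city of paris"]))  # True
```

Real metrics are richer than this (normalization, F1, Pass@K, GPTEval), but the calling convention is the key difference between the two types.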
Understanding Subsets
Some datasets have multiple subsets. For example, Massive Multitask Language Understanding (mmlu) dataset contains 57 different subsets categorized into four categories: stem, social_sciences, humanities, and other.
Other datasets are themselves subsets of a larger dataset. For example, Choice Of Plausible Alternatives (copa) is a subset of super_glue.
See how to load datasets with subsets.
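When a subset spec is written as a single string such as `mmlu:abstract_algebra`, it has to be split into a dataset name plus zero or more subset names. A hypothetical helper showing that split (the colon/comma syntax here is an assumption for illustration; check the linked docs for LLMBox's actual flag format):

```python
# Hypothetical helper: split a spec like "mmlu:abstract_algebra,anatomy"
# into (dataset, subsets). Illustrative only; not LLMBox's parser.

def parse_dataset_spec(spec: str) -> tuple[str, list[str]]:
    """Return (dataset_name, subset_names); an empty list means all subsets."""
    name, _, subsets = spec.partition(":")
    return name, [s for s in subsets.split(",") if s]

print(parse_dataset_spec("mmlu"))                   # ('mmlu', [])
print(parse_dataset_spec("mmlu:abstract_algebra"))  # ('mmlu', ['abstract_algebra'])
```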
Understanding CoT
Some datasets support Chain-of-Thought reasoning. For example, Grade School Math 8K (gsm8k) supports three types of CoT: base, least_to_most, and pal.
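The three variants differ mainly in how the prompt is built: `base` appends a step-by-step cue, `least_to_most` asks the model to decompose the question into sub-questions first, and `pal` (program-aided) asks for a program whose execution yields the answer. A rough, hypothetical illustration of those prompt shapes (the wording is mine, not LLMBox's actual templates):

```python
# Hypothetical prompt builders for the three gsm8k CoT styles.
# The exact wording is illustrative; LLMBox's real templates differ.

def build_cot_prompt(question: str, style: str) -> str:
    if style == "base":
        return f"Q: {question}\nA: Let's think step by step."
    if style == "least_to_most":
        return (f"Q: {question}\n"
                "First break the problem into simpler sub-questions, "
                "answer each in turn, then combine them into the final answer.")
    if style == "pal":
        return (f"Q: {question}\n"
                "Write a Python program that computes the answer, "
                "then report the value it prints.")
    raise ValueError(f"unknown CoT style: {style}")

prompt = build_cot_prompt("Natalia sold 48 clips. How many are left?", "base")
print(prompt.endswith("step by step."))  # True
```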
Supported Datasets
- 🔥 Recently supported datasets: imbue_code, imbue_public, and imbue_private.
- The evaluation code can be found here: benchmarking llama3.
| Dataset | Subsets / Collections | Evaluation Type | CoT | Notes |
|---|---|---|---|---|
| AGIEval (agieval, alias of agieval_single_choice and agieval_cot) | English: sat-en, sat-math, lsat-ar, lsat-lr, lsat-rc, logiqa-en, aqua-rat, sat-en-without-passage | MultipleChoice | | |
| | gaokao-chinese, gaokao-geography, gaokao-history, gaokao-biology, gaokao-chemistry, gaokao-english, logiqa-zh | | | |
| | jec-qa-kd, jec-qa-ca, math, gaokao-physics, gaokao-mathcloze, gaokao-mathqa | Generation | ✅ | |
| Alpaca Eval (alpaca_eval) | / | Generation | | Single GPTEval |
| Adversarial Natural Language Inference (anli) | Round2 (default) | MultipleChoice | | |
| AI2's Reasoning Challenge (arc) | ARC-Easy, ARC-Challenge | MultipleChoice | | Normalization |
| BIG-Bench Hard (bbh) | boolean_expressions, ... | Generation | ✅ | |
| Boolean Questions (boolq) | super_glue | MultipleChoice | | |
| CommitmentBank (cb) | super_glue | MultipleChoice | | |
| C-Eval (ceval) | stem: advanced_mathematics, college_chemistry, ... | MultipleChoice | | |
| | social science: business_administration, college_economics, ... | | | |
| | humanities: art_studies, chinese_language_and_literature, ... | | | |
| | other: accountant, basic_medicine, ... | | | |
| Massive Multitask Language Understanding in Chinese (cmmlu) | stem: anatomy, astronomy, ... | MultipleChoice | | |
| | social science: ancient_chinese, business_ethics, ... | | | |
| | humanities: arts, chinese_history, ... | | | |
| | other: agronomy, chinese_driving_rule, ... | | | |
| CNN Dailymail (cnn_dailymail) | 3.0.0 (default), ... | Generation | | |
| Reasoning About Colored Objects (color_objects) | bigbench (reasoning_about_colored_objects) | Generation | | |
| Commonsense QA (commonsenseqa) | / | MultipleChoice | | |
| Choice Of Plausible Alternatives (copa) | super_glue | MultipleChoice | | |
| Conversational Question Answering (coqa) | / | Generation | | Download: train, dev |
| CrowS-Pairs (crows_pairs) | / | MultipleChoice | | |
| Discrete Reasoning Over the content of Paragraphs (drop) | / | Generation | | |
| GAOKAO (gaokao) | Chinese: 2010-2022_Chinese_Modern_Lit, 2010-2022_Chinese_Lang_and_Usage_MCQs | Generation | | Metric: Exam scoring |
| | English: 2010-2022_English_Reading_Comp, 2010-2022_English_Fill_in_Blanks, ... | | | |
| | 2010-2022_Math_II_MCQs, 2010-2022_Math_I_MCQs, ... | | | |
| Google-Proof Q&A (gpqa) | gpqa_main (default), gpqa_extended, ... | MultipleChoice | ✅ | Tutorial |
| Grade School Math 8K (gsm8k) | main (default), socratic | Generation | ✅ | Code exec |
| HaluEval (halueval) | dialogue_samples, qa_samples, summarization_samples | Generation | | |
| HellaSWAG (hellaswag) | / | MultipleChoice | | |
| HumanEval (humaneval) | / | Generation | | Pass@K |
| Instruction-Following Evaluation (ifeval) | / | Generation | | |
| Imbue Code Comprehension (imbue_code) | / | MultipleChoice | | |
| Imbue High Quality Private Evaluations (imbue_private) | / | MultipleChoice | | |
| Imbue High Quality Public Evaluations (imbue_public) | / | MultipleChoice | | |
| LAnguage Modeling Broadened to Account for Discourse Aspects (lambada) | default (default), de, ... (source: EleutherAI/lambada_openai) | Generation | | |
| Mathematics Aptitude Test of Heuristics (math) | / | Generation | ✅ | |
| Mostly Basic Python Problems (mbpp) | full (default), sanitized | Generation | | Pass@K |
| Massive Multitask Language Understanding (mmlu) | stem: abstract_algebra, astronomy, ... | MultipleChoice | | |
| | social_sciences: econometrics, high_school_geography, ... | | | |
| | humanities: formal_logic, high_school_european_history, ... | | | |
| | other: anatomy, business_ethics, ... | | | |
| Multi-turn Benchmark (mt_bench) | / | Generation | | Multi-turn GPTEval |
| Natural Questions (nq) | / | Generation | | |
| OpenBookQA (openbookqa) | main (default), additional | MultipleChoice | | Normalization |
| Penguins In A Table (penguins_in_a_table) | bigbench | MultipleChoice | | |
| Physical Interaction: Question Answering (piqa) | / | MultipleChoice | | |
| Question Answering in Context (quac) | / | Generation | | |
| ReAding Comprehension (race) | high, middle | MultipleChoice | | Normalization |
| Real Toxicity Prompts (real_toxicity_prompts) | / | Generation | | Perspective Toxicity |
| Recognizing Textual Entailment (rte) | super_glue | MultipleChoice | | |
| Social Interaction QA (siqa) | / | MultipleChoice | | |
| Stanford Question Answering Dataset (squad, squad_v2) | / | Generation | | |
| Story Cloze Test (story_cloze) | 2016 (default), 2018 | MultipleChoice | | Manually download |
| TL;DR (tldr) | / | Generation | | |
| TriviaQA (triviaqa) | rc.wikipedia.nocontext (default), rc, rc.nocontext, ... | Generation | | |
| TruthfulQA (truthfulqa_mc) | multiple_choice (default), generation (not supported) | MultipleChoice | | |
| Vicuna Bench (vicuna_bench) | / | Generation | | GPTEval |
| WebQuestions (webq) | / | Generation | | |
| Words in Context (wic) | super_glue | MultipleChoice | | |
| Winogender Schemas (winogender) | main, gotcha | MultipleChoice | | Group by gender |
| WSC273 (winograd) | wsc273 (default), wsc285 | MultipleChoice | | |
| WinoGrande (winogrande) | winogrande_debiased (default), ... | MultipleChoice | | |
| Conference on Machine Translation (wmt21, wmt19, ...) | en-ro, ro-en, ... | Generation | | |
| Winograd Schema Challenge (wsc) | super_glue | MultipleChoice | | |
| A Multilingual Dataset for Causal Commonsense Reasoning (xcopa) | et, zh, ... | MultipleChoice | | |
| Large-Scale Multilingual Abstractive Summarization (xlsum) | english, french, ... | Generation | | |
| Cross-lingual Natural Language Inference (xnli) | en, es, ... | MultipleChoice | | |
| Extreme Summarization (xsum) | / | Generation | | |
| Crosslingual Winograd (xwinograd) | en, fr, ... | MultipleChoice | | |