Some evaluation datasets

November 26, 2024

If the task you are interested in is already well studied, chances are that a dataset exists for it.

Below are a number of evaluation datasets which were developed in the last few years.

However, careful:

Some of them can be obsolete, as they were designed pre-LLM and are now easily solved: they aimed to investigate one specific property of text (translation, summarization), which is no longer really how we evaluate models (evaluations are now more general/holistic). (If you've got some bandwidth, this could really benefit from adding the publication dates!) (This will also be updated with post-LLM evals at some point.)
They are likely contaminated, as they have been publicly available on the web for a number of years. However, it doesn't mean they won't hold signal for your task!
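As a rough illustration of the contamination point above, here is a minimal sketch of a word n-gram overlap check between a benchmark sample and some reference documents. The function names, the whitespace tokenization, and the default `n=8` are assumptions made for this example, not a standard method; real contamination studies use more careful normalization and much larger corpora.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the sample's n-grams that also appear in the corpus documents."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        # Sample is shorter than n tokens: nothing to compare.
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    question = "A farmer has twelve sheep and sells five of them at the market"
    web_docs = ["Forum post: A farmer has twelve sheep and sells five of them at the market today"]
    print(overlap_ratio(question, web_docs, n=5))
```

A high ratio only suggests that the sample text is present in the reference documents; a low ratio does not prove the absence of contamination (paraphrases and translations would be missed).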

Math specific datasets

Evaluation name	Task type	Publication date	Data size	Task data	Task/Paper content	Source	Dataset	Comments
AGIEval (SATMath)	Exam dataset + existing datasets	2023	220	Math problems from the SAT	Paper is actually a compilation of a bunch of human-centric exams to use as eval data.	Paper	HuggingFace	- Careful, this paper also includes datasets from other papers! For math, they use AIME & AMC through the MATH dataset, GRE & GMAT through the AQuA-Rat dataset, and GaoKao - Metrics: acc/em/f1
AIME (all)	Olympiad dataset	1983-now	15 x 2 per year	Mathematical problems requiring a combination of arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics	2nd exam to choose the US team for the International Math Olympiads	Blog	Source	Answer is systematically an integer between 0 and 999.
AIME (22, 23 and 24)	Olympiad dataset	2024	90	See AIME (all)		Paper	HuggingFace	Used in the AIMO competition
ALGES (SingleEQ)	Online sources compilations	2015	508	Grade school algebra problems extracted from sources on the web	Paper is about implicitly learning and solving the simple equation behind the problem	Paper	Source	- Web sources: http://math-aids.com, http://k5learning.com, and http://ixl.com - Pre-LLM paper - data sources are probably OK
ALG514 or AllEq	Online forum	2014	514	Algebra word problems extracted from a crowdsourced tutoring website, cleaned with turking, and manually verified	Paper is about extracting the equation template from the problem to solve it	Paper	Source	- Web source: Algebra.com
AMC 12	Olympiad dataset	2000 - now	25 per year	Mathematical word problems requiring arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics.	1st exam to select the US team for the International Math Olympiad, used to be the American High School Math exam	Blog	Source	- Problems are designed to be solvable by students without any background in calculus.
Ape210K	Exam dataset	2020	210K problems, 56K templates	Chinese elementary school-level math word problems written by math teachers	Solve the problems	Paper (withdrawn, but v1 still accessible)	HuggingFace	- Some problems are templated, which could be interesting for contamination issues - Initial dataset is 900K and was manually filtered - Also provides "intermediate equations" (useful to test CoT traces if needed) - Dataset is in Chinese - Intended to be partially used for training
AQUA or AQUA-Rat	Exam dataset + turked dataset	2017	100K	Algebraic word problems constructed from a seed of 34K problems from GMAT and GRE, and extended via turking	Task: Solve the problem	Paper	HuggingFace	- Intended to be partially used for training - Includes the rationale for the problems - Use accuracy, BLEU and perplexity for scoring ...
ASDiv-A	Online sources compilation	2020	2.3K	Grade-school math word problems collected from various websites and normalized	Task: Solve the problem	Paper	Github	- Contains problem type and grade level annotations by a Master’s student annotator - Focused on a high lexical diversity - Used 28 websites
CHAMP	Olympiad dataset	2024	270	Math word problems extracted from a book of olympic competitions examples, rewritten to make the solutions parsable, and annotated	Introduces a math bench. Problems are extended with hints and labeled with concepts, to allow ablation studies on performance	Paper		- Source: Book "Problem-Solving strategies" (Engel, 2008)
DeepMind Math	Exam dataset + synthetic dataset	2019	10K?	Synthetic raw math problems in algebra, arithmetic, calculus, comparison, conversions between units, polynomials, probability, etc.	Task: Solve the problem	Paper	HuggingFace	- Full list of domains in appendix B - Paper first section is quite nice - Provide generation code to generate more examples - Provides additional train set - Synthetic procedural dataset (inspired from/to extend? school exams dataset)
DocMath-Eval	Annotated financial reports + existing Fin math datasets	2023	3.2K	Combine financial reports and existing datasets, read by annotators to generate (or validate) questions, and provide answers as Python programs, then evaluated with domain expert annotators	Solutions should be presented as Python programs which will be run to test their validity	Paper	Source	- Re-uses TAT-QA, FinQA, MultiHiertt, TAT-HQA - Looks quite high quality for math fin data! - Provides additional train set
Dolphin1878	Online sources compilations	2015	1.5K	Number math word problems sampled from online sources and re-annotated if needed	Paper is about extracting the equation (DOL) tree from the problem using semantic parsing	Paper	?	- Sources: algebra.com and answers.yahoo.com
Dolphin18K	Online sources compilations	2016	18K	Math word problems semi automatically extracted from online sources		Paper	Kaggle	- Sources: Math category of Yahoo answers since 2008 - Method: manually annotate 6K problems then use a classifier. Train a model to extract the gold - I'm not sure the quality is amazing there given the amount of automatic extraction (high) vs manual verif (low)
Draw-1K	Online sources compilations	2016	1K	General algebra word problems extracted from online sources	Paper is about evaluating solvers (systems generating equations from mwp), and test template and equation equivalence	Paper	Source	- Label each problem with the template it follows, can be useful for contam - Source: algebra.com
FinQA	Expert annotated financial reports	2021	1.1K	Financial questions linked to tables from earning reports. Annotators provide a question + step by step process + annotations for each page.	Paper introduces the dataset plus a method made of a retriever which extracts relevant facts first to accommodate short context models, then a process generator.	Paper	HuggingFace	- Likely high quality: used paid expert annotators + external experts annotators had high agreement on the task - Total set is 8.2K - Unsure from the paper how the tables are formatted - markdown maybe? - Data source: earnings reports of S&P 500 companies (1999-2019)
FrontierMath	Expert created dataset	2024	100+ (precise number unknown)	Entirely novel math problems created for the paper across most math domains. Solutions are either integer or SymPy objects to be automatically verifiable through unit-test like python programs. Problems are labelled.	Introduces the dataset	Paper	Private	- Nice discussion of contamination in the paper - Experts: 60 mathematicians over 12 countries - All problems are peer reviewed - Probably the highest quality dataset here atm - Data is not public; however, since closed source models have been evaluated, it's likely they'll be contaminated for it in future occurrences :(
GAOKAO-Bench (MathCloze, MathQA)	Exam dataset	2023	~500	Math word problems at high school level from the Chinese college entry exams		Paper	Source, Datasets are only 2023 HuggingFace and HuggingFace	- Mathematical formulas are converted to latex - Problems are in Chinese - Dataset is updated yearly - Paper explores a bunch of grading methods, including LLM as judge - Paper contains surprisingly little info about the dataset
GAOKAO 2023 (MathEn)	Exam dataset, Competition dataset	2023	385	Math word problems at high school level	Compiles questions from the 2023 Chinese National College Entrance Examination, the 2023 American Mathematics Competitions, and the 2023 American College Testing		HuggingFace
GSM1K	Manually created dataset in the style of another dataset	2024	1.2K	Diverse "grade school"-like math word problems, following the solving distribution of GSM8K	Paper does a contamination analysis of models on GSM8K vs GSM1K	Paper	Private	- Paper also seems to suggest that perplexity analysis is not very good at detecting contamination
GSM8K	Manually created dataset in the style of an exam dataset	2021	8.5K	Diverse grade school-level math word problems	Paper is about training verifiers to solve math word problems	Paper	Github Hugging Face	- Best results with an external calculator added - All answers are positive integers, 50% of answers are between 0 and 8 - Annotation used Upwork for 1K problems, then Scale for the next. Problems writers were provided with seed questions from a 175B GPT3 model
iGSM (med and hard sets)	Synthetic dataset	2024	20K	Problems are generated using a combination of dependency graphs between objects and categories (with direct, and implicit dependencies) and number of operations to generate new problems	Paper is about studying actual math reasoning on an extension of GSM8K, including probing internal model states	Paper	HuggingFace	- Idea is theoretically nice but problems generated are very unrealistic with high numbers of operation - Paper focuses on "mental process" of model which I find dubious (though the probing section is nice!) - So much anthropomorphism -_-
GSMHard	Adaptation of existing dataset, numbers replaced	2022	8.5K	GSM8K with bigger/less common numbers to make the problems harder. However, change was done automatically through programs generated, and only 25 changes were checked (+ 50 cases were done manually).	Paper is about using program-aided LMs (= generating CoT alternating equations and reasoning steps, and computing the solution on the last equation with Python)	Paper	Hugging Face	- Described in appendix H1 - Good idea, but not sure of the quality.
GSM-IC	Adaptation of existing dataset with perturbations	2023	58K	100 samples from GSM8K with irrelevant context added (using a template for the irrelevant sentence, plus roles/numbers fillers)	Paper tests how sensitive LLMs are to irrelevant context when reasoning on math tasks	Paper	HuggingFace
GSM-Plus	Adaptation of existing dataset with perturbations	2024	10K	GSM8K with 8 variations per question, added by GPT4 and manually annotated by selected humans (cross annotator agreement checked)	Paper introduces the dataset and compares results on several GSM8K variants across models and prompting formats	Paper	HuggingFace	- Changes include: replacing numbers by other numbers, changing the operations, changing the question, adding distractors, etc (nice typology of changes, I feel it could be extended)
GSM-Symbolic	Adaptation of existing dataset, templated	2024	8.5K	Templated GSM8K problems, which allow generating new evals at will	Paper creates parsable templates from GSM8K to be able to generate new problems at will and analyse contamination on GSM8K	Paper	To be released	- Contains other specific subsets (M1, P1, P2, which are difficulty levels, as well as NoOp, with seemingly relevant but actually irrelevant info added), and some experiments are done with few shot formats - Lacking a dataset description table with all subsets imo
Hungarian HighSchool National Finals Exam	Exam dataset	2023	33	Problems from the 2023 Hungarian national high school finals in math		Source	HuggingFace	- Require grading by hand atm
HMWP	Exam dataset	2020	5.4K	Annotated math word problems from a Chinese school-level (K-12) problems bank	Introduces a new formalism to represent MWP equations uniformly	Paper	HuggingFace	- Dataset is in Chinese - Sources: Chinese K12 problems
Math23K	Online sources compilation	2017	23K	Automatically extracted elementary school level math word problems.	Introduces a RNN to solve MWP	Paper	HuggingFace	- Sources: Chinese math word problems from online education websites for elementary school students. - Dataset is in Chinese - Extraction is rule based, but it's very unclear how much manual validation was done
Math401-LLM	Synthetic dataset	2023	401	Arithmetic expressions combining additions, subtractions, multiplications, exponentiations, logarithms etc	Paper wants to measure strict arithmetic ability of models	Paper	Github	- Models are not that good atm for log/trig problems or big numbers
MATH	Olympiad datasets	2021	12.5K	Mathematical problems from real competitions in natural language and latex, annotated with difficulty levels.		Paper	HuggingFace	- Sources: AMC 10, AMC 12, AIME, "and more" - Also introduces a train set created from scraping Khan Academy and AMPS
MathOdyssey
MathQA	Adaptation of existing dataset, annotated	2019	37K	Annotated solvable problems from the AQuA dataset with formal annotation programs (using humans as annotators and testing their agreement)	Aims to introduce a representation language for math problems, applies the method to AQuA	Paper	HuggingFace	- Sources: AQuA
MAWPS	Existing dataset compilation	2016	3.3K	Math word problems from existing datasets	Framework to create new math problems, notably to remove lexical or template overlap when adding new datasets	Paper	Github	- Sources: ALG514, ALGES, and other pre-LLM datasets
MiniF2F	Olympiad dataset	2022	244	Olympiad math word problems formalized with theorem provers when possible (Lean, Metamath, Isabelle)	Paper is about testing math proof solvers ability to reason on formal logic	Paper	Possibly HuggingFace	- Sources: AIME, AMC, IMO
NPHardEval	Synthetic dataset	2023	900	Complexity math word problems of varied difficulty level built from synthetic graph/linear data	Paper introduces the benchmark and uses it to evaluate reasoning ability of models. Also explores benchmark robustness!	Paper	Github	- Problems: sorted array search, edit distance, shortest path, traveling salesman, graph coloring, knapsack problem, meeting scheduling problem - Can be regenerated as needed
NuminaMATH CoT	Existing dataset compilation	2024	860K	Math word problems (K12 + olympiad levels) combining existing datasets	NA	NA	HuggingFace	- Sources: AOPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, Synthetic AMC and MATH data, and other Olympiads sets - careful if you use this as train set as you will be contaminated on all major math benchmarks
NuminaMATH TiR	Existing dataset compilation	2024	72K	Subset of NuminaMATH CoT focused on problems solvable with tool integrated reasoning	NA	NA	HuggingFace	- Sources: AOPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, Synthetic AMC and MATH data, and other Olympiads sets - careful if you use this as train set as you will be contaminated on all major math benchmarks
OmniMath	Olympiad datasets	2024	2.2K	Olympiad math word problems. Problems are extracted from forums or olympiad websites (using rule based + LLM rephrasing), then annotated and verified by humans.	Paper introduces the benchmark and a judge trained to evaluate the answers (since they are free form)	Paper	HuggingFace	- Sources: IMO, IMC, AoPS forum and wiki - Domain labeling is done with LLMs
OlympiadBench	Olympiad datasets	2024	8.4K	Olympiad/math/physics word problems. Answers are automatically evaluated - either numbers or equations (evaluated with SymPy)		Paper		- Sources: Global Mathematics and Physics Olympiad Problems, Regional and National Chinese Math Competitions, and Gaokao Mock Questions for Mathematics and Physics - Includes a physics subset - VLM evaluation!
OlympicArena	Olympiad datasets	2024	11K			Paper
PRM800K	Synthetic data	2023	800K	Preference data from annotators on 800K solutions generated by a model	Paper introducing process supervision to improve reward models (compares output and process supervision)	Paper	HuggingFace	- More a train set than an evaluation
SVAMP	Adaptation of existing dataset	2021	1K	One-unknown arithmetic word problems of grade level up to 4, created with experts applying variations to ASDiv-A.	Paper wants to assess question sensitivity, reasoning ability, and structure invariance in models for math evaluations.	Paper	Github	- Variations: same object & different structure, opposite, both different, adding relevant or irrelevant information, changing information, inverting operations, changing order of sentences or objects
TabMWP	Online source adaptation	2022	38K	Tabular math word problems requiring multi-hop reasoning, extracted from an online educative website and manually annotated.	Paper wants to test tabular math reasoning and introduces the dataset	Paper	HuggingFace	- Source: IXL learning website - Tabular data is provided as an image, semi-structured text, and a table - Answers are generative or MCQA - Dataset is tested against turkers
TAL-SCQ5K-En	Competitions dataset	2023	4K	Math word problems in MCQA format, with math expressions as latex	NA	None	HuggingFace	- Contains English and Chinese - Also contains 6K train samples and CoT
TemplateGSM	LLM-generated data	2024	7M	GPT4-generated math word problems inspired in shape by GSM8K	Paper uses GPT4 generated meta-template to generate problems by changing parameters. Uses a verifier to ensure usability	Paper	HuggingFace	- Since everything is LLM generated, I would expect stronger proofs of quality
TheoremQA	Online sources adaptations	2023	800	QAs about university level theorems	Protocol: Uses GPT4 to enumerate subfields of relevant domains, then plausible theorems lists, then uses domain experts to actually look for said theorems, then look for QA on the web concerning them	Paper	HuggingFace
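Several rows in the table above (GSM-Symbolic, TemplateGSM, GSM-IC) rely on templates to produce fresh problem variants. The sketch below illustrates the general idea only and is not the procedure of any of those papers: the template text, the name list, and the value ranges are all hypothetical, chosen for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    answer: int


# Hypothetical template and fillers, made up for this sketch.
TEMPLATE = (
    "{name} has {a} apples and buys {b} bags of {c} apples each. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Noah", "Mia", "Liam"]


def generate(seed: int) -> Problem:
    """Instantiate the template with seeded random values and compute the gold answer."""
    rng = random.Random(seed)  # local RNG so each seed gives a reproducible variant
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 9)
    c = rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    # The gold answer is derived from the same sampled values as the question text.
    return Problem(question=question, answer=a + b * c)


if __name__ == "__main__":
    for seed in range(3):
        problem = generate(seed)
        print(problem.question, "->", problem.answer)
```

Because the answer is computed from the sampled values, every generated variant comes with a verifiable gold label, which is what makes this kind of dataset convenient for probing memorization of the original problems.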

Pre-LLM datasets

Evaluation name	Task type	Task data	Task content	Source	Dataset	Comments
DeepFix	Code task, Code-to-code, Correction	7K student-written erroneous C programs	Correct the C programs	Paper
MLSum	Generation, Multilingual, Summarization	1.5M news summary/article pairs from the DailyMail, Le Monde, Süddeutsche Zeitung, El Pais, Moskovskij Komsomolets and Internet Haber (en, fr, de, es, ru, tur)	Summarize the articles	Paper	Hugging Face	Palm: Prefixed with a prompt, truncated article to 2048 tokens
TransCoder	Code task, Code-to-code	852 parallel functions in Python/Java/C++	Translate from a language to another	Paper	From paper
WMT	Multilingual, Translation	Datasets from the WMT conf on machine translation - datasets available depend on the year	Translate from a language to another	Conference Replace the 2 digits by the conference year
Adversarial NLI	Language Inference	10K entailment dataset generated using human in the loop adversarial attacks, looking for predicates which force models to predict wrong entailment labels (uses contexts from StoryCloze, CommonCrawl, Wikipedia, the Open Annotated National Corpus, WikiHow and GLUE)	Predict entailment	Paper	Data Github	R1 to R3 = rounds of data generation
APPS	Text-to-code	10K Python coding problems in natural languages, scraped from leetcode sites, with a suite of test cases.	Solve the Python problem	Paper	Github Data
AQuA	Arithmetic, Reasoning	100K multiple choice problems (GMAT, GRE, other sources) with question/options/rationale	Select the correct MCQA	Paper	Github	Best results obtained with an external calculator added
ARC	Common Sense, Reasoning	8K Grade school science questions: e = easy set, c = challenge set	Select the correct MCQA	Paper	Data	Careful, this is the AI2 Reasoning Challenge, not the Abstraction and Reasoning Corpus
bAbI	Reasoning	20 tasks each with 2K automatically generated questions + short scenarios (successive actions generated with a simulated text adventure game).	Reason over the sentence to select the correct conclusion	Paper	Github Data	See Part 4 for the simulation env and its constraints, it’s quite a fun idea. Probably not too hard to reproduce for other types of reasoning.
BBQ	Bias detection	58K examples with two contexts (ambiguous and explicit about a bias), two questions (negative and non-negative) and possible answers, constructed from manual templates and checked with crowdsourcing.	Predict the correct, non biased answer. Difference between accuracies depending on context/question allows to build a bias score.	Paper	Github
BLiMP	Language Understanding	67 datasets, each of 1K artificially generated minimal pairs testing syntactic, morphological and semantic knowledge, validated with MTurking.	Accuracy measured by checking whether the log-probability the model assigned to the correct sentence is higher.	Paper	Github	Things tested: anaphor agreement, argument structure, binding, control/raising, determiner-noun agreement, ellipsis, filler-gap, irregular forms, island effects, NPI licensing, quantifiers, subject-verb agreement
BOLD	Generation, Toxicity detection	23K prompts extracted from beginning of Wikipedia sentences containing a race/religion/political/gender/profession group member (ex: woman artist for gender=female).	Task: generating end of sentence, and toxicity is evaluated through a range of metrics (sentiment analysis, using classifiers, …). In HELM, toxicity is measured using the Perspective API.	Paper	Github
BooksCorpus	N/A	11K unpublished books of more than 20K words scraped from the web (of 16 different genres).	In the original paper, it's used to train a sentence embedding model - in model papers, it's often used for contamination or perplexity evaluations.	Paper	Hugging Face
BooksCorpus_HELM	Generation, Memorisation	1K randomly sampled books from BooksCorpus.	Task: from a random number of tokens beginning a paragraph, the model must generate a follow up - measure exact and near-exact reproduction.	Paper	Data
BoolQ	Language Inference, Language Understanding	16K sentences of naturally occurring yes no QA, from question + context from Wikipedia	Answer the MCQA	Paper	Website
CB	Language Understanding	1.2K of discourse (news from the Wall Street Journal, fiction from the British National Corpus, dialogue from Switchboard), containing context + target sentence	Predict commitment entailment	Paper	Website
Civil comments	Toxicity detection	1.8M online comments, with crowd sourced human labels for toxicity following the Perspective API guidelines, and among these, 450K labeled with identity terms (crowdsourced, to pick in a list).	Task : toxicity prediction, labels are used to identify areas of biases in models	Paper	Kaggle Hugging Face	Original paper contained a synthetic test set (77K examples generated from templates using 50 identity terms, 50/50 on toxicity) and a human labeled dataset (description in the Task content col) - I suppose the dataset available is the latter
Clean E2E NLG	Description, Generation	50K crowdsourced generated descriptions of restaurants given keys and values (type of food = X, budget = Y, …).		Paper	Hugging Face	Palm: Prefixed with a prompt, truncated article to 2048 tokens
CNN/DailyMail	Cloze/Completion, Summarization	Original dataset: 200K news documents (CNN/DailyMail between 2007 and 2015) converted into Cloze format by removing some of the text’s named entities, and using them as keys.	In HELM: Uses above documents (in complete form) as text to summarize, and their highlights as gold summaries.	Paper	Hugging Face Data	(I suspect this does not produce very good summaries though)
CommonsenseQA	Common Sense, Reasoning	12K turked Q/A (initialized from ConceptNet associations), then filtered by quality, with added context from a Google Search query > Some text likely overlaps with CC data	Answer the MCQA	Paper		Best results with an external calculator added
Contrast Sets	Generation, Robustness	10 contrast sets of up to 1K examples, for their datasets (see comments) made by (often the original paper’s) researchers (increase reasoning steps in questions, replace words by their opposites, change counts …).	Follow original task setup with new examples, and see how/if performance drops. In HELM: use the IMDb and DROP contrast sets	Paper	Data	NLVR2, IMDb sentiment analysis, MATRES Temporal RE, English UD parsing, PERSPECTRUM, DROP, Quoref, ROPES, BoolQ, MC-TACO. Dataset construction is way more detailed in Appendix
COPA	Common Sense, Language Understanding	1K premise + causal questions with alternatives (common sense)		Paper	Website
CoQA	In-Context Reading Comprehension	127K Conversational QA, from a given context (rationale must be provided too) - written by annotators		Paper	v1.0 from Data
DataImputation	Real life task, Reasoning, Structured data	8 structured datasets from several sources.	Task: from a row of attributes with gaps, the model must fill the gaps (ex: extrapolating city from phone number, phone brand from its specs).	Paper	Data restaurant Data Buy	See table 2 for all sources. In HELM: use the subsets Buy and Restaurant, convert input to natural language, test accuracy.
Digits arithmetics (2D+, 2D-, 3D+, 3D-, …)	Arithmetic	Basic arithmetic tasks for n digits addition, subtraction, composite operations with 2K example each	Task: solve the math	Paper	Github	All links come from the lm-evaluation-harness/lm_eval/datasets/arithmetic
DROP	Arithmetic, In-Context Reading Comprehension	55K adversarial questions which require 1) selecting relevant items from the text and 2) computing on them (sorting/counting/…) to get the correct answer	Task: select and count to provide the correct answer	Paper	Data
Dyck language_HELM	Symbolic manipulation	500 D_n words between 52 and 100 characters ("word" made of nested brackets/parentheses) where the last i characters have been removed.	Task: Predict the unique sequence of closing parentheses.	Paper	Github Also has a different version in BigBench
HellaSwag	Cloze/Completion	60K adversarially filtered multiple choice Q/A	Choose the correct next sentence (from captions or WikiHow)	Paper	Github
HumanEval	Code task, Text-to-code	164 hand written programming problems with function signature, docstring, body + unit tests	Aim is to complete function to pass unit tests	Paper	Hugging Face
IMDB	Sentiment Analysis	50K reviews from IMDB, with even positive (score ≥ 7) /negative (score ≤ 4) reviews (no neutral).	Classify positive/negative review.	Paper	Website
LAMBADA	Cloze/Completion	10K Narrative contexts (from the BookCorpus) followed by a sentence where the last word is masked and must be predicted. Specifically built to force use of the context.	Predict the last word.	Paper	Zenodo
Language Modeling Evaluation_HELM	Language Modeling	Compilation in HELM of several datasets: WikiText-103, ThePile (particularly arXiv, BooksCorpus2, Enron Emails, PubMed Central, Wikipedia), TwitterAAE, ICE.	Task: get conditional log probability of the full sequence (perplexity measure)	Paper	The pile website BLIMP Github Wikitext data Twitter AAE data ICE data
LegalSupport	Entailment, Real life task, Reasoning	20K legal entailment scenarios, constructed from state/federal legal opinions (assertion is used as context, and 2 supporting sources ("see X, rule") are selected at random).	Task: finding which rule best supports assertion.	Paper	Data
LinuxKernel_HELM	Generation, Memorisation	2K randomly sampled functions from the Linux Kernel.	Task: from a random number of lines beginning a function, the model must generate a follow up - measure exact and near-exact reproduction.	Paper	Data
LSAT	Analytical Reasoning, In-Context Reading Comprehension, Logical Reasoning	10K questions from the Law School Admission Test (analytical, logical reasoning, and reading comprehension), with context.	Answer the MCQA correctly	Paper	Github
Magellan Benchmark	Real life task, Reasoning, Structured data	23 datasets from several sources containing entities with attributes. Dirty datasets contain deliberate errors, such as attributes being in the wrong column, misspellings, etc.	Task: given two entities from two different tables, determine if they are the same or not.	Paper	Github	It’s likely that Abt-Buy and Buy are the same dataset
MBPP	Code task, Text-to-code	1K entry-level Python crowd-sourced programming problems (description, solution, 3 unit test cases) - (58% mathematical, 43% list processing, 19% string processing, 9% integer sequences, and 2% other).	Solve the Python program	Paper	Github Hugging Face	Also contains an edited version (400 items) with unambiguous prompts and good signatures (can be interesting to look at later to see the impact of prompts on code gen) + a MathQA-Python dataset (adaptation of the MathQA dataset)
MMLU	Language Understanding	15K multi-choice Q/A manually collected from various online sources, on many topics (legal, philosophy, economics, psychology, STEM, medicine, etc, - at high school to professional level)	Answer the MCQA	Paper	Hugging Face Github	Seems like a strong/high-quality baseline
MRF (Misinfo Reaction Frames)	Generation, Misinformation capabilities	200K pairs of claims from news headlines (climate change, covid 19, cancer illness, detailed sources in comments) + label (real/misinformation), the former annotated on veracity, likelihood of spread, writer intent by MTurk workers.	Task: must either predict the gold label or generate likely writer intent/reader perception/…	Paper	Github	(Contains data from NELA-GT-2018-2020, SciDCC, Climate-FEVER, CoAID, CoronaVirusFacts/DatosCoronaVirusAlliance Database, ESOC Covid-19 Misinformation Dataset, DETERRENT)
MS MARCO	Question Answering, Retrieval	1M anonymised questions with free-form human generated answers (from relevant web document extracts), some with added rewriting.	Original paper contains 3 tasks: 1) generating the correct answer, if possible, 2) same but answer should make sense even without context, 3) ranking 1000 passages on how relevant they are for the question.	Paper	Github	Contains extended descriptions of QA datasets in lit review. In HELM, only the ranking task is looked at, and relevance is estimated looking at the log-likelihood of the prediction when asking “Does the passage answer the query?”
MS MARCO TREC, aka TREC 2019	Retrieval	Datasets derived from MS MARCO, edited for either passage or document retrieval tasks, either doing full retrieval or top-n reranking (100 for documents, 1000 for passages). (see MS MARCO)		Paper	Data Github
MultiRC	Language Understanding, Question Answering	6K multiple choice questions over a diversity of topics		Paper	Data
NarrativeQA	Question Answering, Retrieval	47K free-form human generated questions and answers, linked to 1.5K books (Project Gutenberg) and movie scripts (scraped) matched with plot summaries.	Task: from the summary or the story, answer or select the correct answer.	Paper	Github	For long range context testing, we could use this dataset to do QA from the full stories. Could be interesting for anything conversational imo.
Natural Questions	Open domain/Closed book, Question Answering	207K aggregated google search queries + annotated wikipedia sample answer		Paper	Data
NewsQA	Question Answering	100K human generated QA pairs from 12K news articles (CNN). Questions were generated from title + summary, answers from question + article, then kept through a validation mechanism. Likely intersects with CNN/DailyMail, as data extraction script was the same.		Paper	Github
OpenBookQA	Common Sense, Reasoning	6K sentences, science reasoning needing common sense knowledge to extrapolate to new situations		Paper	Data
PIQA	Common Sense, Reasoning	20K physical common sense reasoning situations,	select the correct action to do from a context and answers	Paper	Data
PopularBooksCorpus_HELM	Generation, Memorisation	20 books from BooksCorpus which appear in a list of bestsellers.	Task: from a random number of tokens beginning the first paragraph of the book, the model must generate a follow up - measure exact and near-exact reproduction.	Paper	Data
QuAC	In-Context Reading Comprehension	100K questions in information seeking QA contexts (used Wikipedia to generate dataset)		Paper	Data
RACE	In-Context Reading Comprehension	100K questions from English reading comprehension exams for Chinese middle/high school students		Paper	Data
RAFT	Real life task, Text classification	Compilation of 11 datasets of naturally occurring classification tasks, of between 150 and 5K test items.	Task: in few shot from 50 labeled examples, provide meaningful labels. (Domains: medical, finance, research, English language, law, physics, AI safety, social networks)	Paper	Hugging Face	Corpus: (ADE Corpus v2, Banking77, NeurIPS 2020 impact statement risks, OneStopEnglish, Overruling, Systematic review inclusion, TAI safety research, Terms of Service, TweetEval Hate, Twitter complaints, + Semiconductor org types, created for this)
RealToxicityPrompts	Generation, Toxicity detection	100K naturally occurring sentences (selected from OpenWebText corpus, basically = reddit, and scored for toxicity with the PerspectiveAPI) split in two to create a prompt and continuation.	Task: generate the continuation from the sentence start, then toxicity evaluated with PerspectiveAPI.	Paper	Data Github (the repo lacks a lot of info)
ReCoRD	Language Understanding	120K passage/cloze query/answer examples from news (CNN, DailyMail) with human filtering		Paper	Data
RTE	Language Understanding	3K compilation of competition data on entailment		Paper	Data
SAT analogies	Language Understanding	374 SAT analogy problems from before 2005 (multiple-choice questions of the form “a is to b what c is to ?”; the words are not the most frequent ones)		Paper	Data dev Data test
SIQA	Question Answering
SQuADv2	In-Context Reading Comprehension	Combines SQuAD with 50K unanswerable questions	from a context, give an answer, but only if possible	Paper	Github
StoryCloze	Cloze/Completion, Common Sense	50K 5-sentence commonsense stories	choose the correct ending	Paper	Hugging Face
StrategyQA	Common Sense, Reasoning	2.8K questions needing reasoning from implicit knowledge		Paper		Best results with an external calculator added
Synthetic reasoning (natural)	Logical Reasoning, Reasoning	Synthetic data generated on the fly, containing a set of synthetic rules (conditional statements), facts (attributes), and the logical gold output.		Paper	Can be generated with Github	Also called rule_induct in HELM
Synthetic reasoning (symbolic)_HELM	Logical Reasoning, Symbolic manipulation	Synthetic data generated on the fly using a pattern template.	Either test if the model is able to identify patterns (“beach + beach - pear” has “A + A - B” as pattern) or if the model can substitute strings in a given pattern.	Paper	Can be generated with Github
TriviaQA	Open domain/Closed book, Question Answering	95K trivia QA (compositional questions, syntactic variability)		Paper	Data
TruthfulQA	Question Answering	817 questions about tricky factual claims (common misconceptions, falsehoods, …) over 38 categories, with true and false reference answers + a source to support true answers (+ 380 added questions).		Paper	Github
TyDiQA-GoldP	Multilingual, Question Answering	204K multilingual QA pairs (unconstrained question elicitation from prompts, then Wikipedia article retrieval, and specific answer selection in the article if possible) (en, ar, ben, fin, ind, ja, ko, ru, tel, th and Kiswahili).	MCQA	Paper	Github	The generated dataset can present interesting underspecification of questions and a mismatch between the language level of questions and answers. Might be harder than other datasets.
Web Questions	Open domain/Closed book, Question Answering	Extracted 100K “Wh?” questions from Google Search API, then annotated by MTurkers - I suspect answers are partially out of date	MCQA	Paper	Website
WebNLG	Generation, Verbalization	13K mappings between triples (subject, property, object, constructed from DBPedia, which is a KB from Wikipedia) and sentence verbalization (by crowd workers), about specific topics (astronauts, universities, monuments, buildings, characters from comics, food, airports, sports teams, written works).	Task: verbalize in a grammatical way.	Paper	Hugging Face	Sentences were selected for fluency and the generated sentences are relatively simple, but there is no description of the annotators'/crowd workers' origins, so some data may not be in “standard English”.
WiC	Language Understanding	7K, classification of whether a word occurring in two different contexts has the same meaning or not		Paper	Site
WikiFact_HELM	Cloze/Completion	12 domains with 1K triples (subject, relation, object) sampled from Wikipedia and cleaned.	Task: predict missing item in the sentence made of the relation.	Paper	Codalab Github
WikiLingua	Generation, Multilingual, Summarization	43K article/summary pairs constructed from WikiHow in 18 languages (on the site, articles are written with a summary sentence + detailed paragraph per step: in the dataset, summaries are the concatenation of the summary sentences, and articles of the detailed paragraphs)	Summarization	Paper	Github	PaLM: prefixed with a prompt, article truncated to 2048 tokens. I suspect data creation can lead to very “robotic” language for the summary baseline, which could give lower scores to more fluid summaries (though ROUGE shouldn’t be too prone to that).
Winogender	Bias detection
Winograd	Reasoning, Winograd	273 to 285 examples where one must disambiguate who/what a pronoun is referring to, in sentences specially constructed to be ambiguous to statistical methods but not to humans	Disambiguation of pronoun	Paper	Website	Not sure if GPT-3 was evaluated on this one or the SuperGLUE one
WinoGrande	Reasoning, Winograd	43K adversarial Winograd-style sentences		Paper	Website
WSC	Language Understanding, Winograd	Winograd Schema Challenge (see Winograd)		Paper	Website
XSUM	Summarization	226K news articles (BBC, 2010 to 2017) matched with their single sentence summary (comes from the article). Task: Summarize. (Domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts)		Paper	Github
XSum	Generation, Summarization	226K news summary/article pairs from the BBC (2010 - 2017) extracted from the WayBack machine		Paper	Hugging Face	Could be interesting to manually check if the model recent knowledge creates discrepancies in the summaries of old news.
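
The pattern task described in the Synthetic reasoning (symbolic) row can be illustrated with a small sketch. This is not the HELM generator: the function names `abstract_pattern` and `substitute` are hypothetical, and the code assumes space-separated tokens and at most 26 distinct words.

```python
import string


def abstract_pattern(expression):
    """Replace each distinct word by a letter (A, B, ...) in order of first appearance."""
    mapping = {}
    tokens = []
    for token in expression.split():
        if token.isalpha():
            # Assumption: fewer than 27 distinct words, so a single letter is enough.
            if token not in mapping:
                mapping[token] = string.ascii_uppercase[len(mapping)]
            tokens.append(mapping[token])
        else:
            # Operators such as "+" or "-" are kept unchanged.
            tokens.append(token)
    return " ".join(tokens)


def substitute(pattern, assignment):
    """Fill a pattern such as "A + A - B" with the words given in `assignment`."""
    return " ".join(assignment.get(token, token) for token in pattern.split())


print(abstract_pattern("beach + beach - pear"))
print(substitute("A + A - B", {"A": "beach", "B": "pear"}))
```

Either function can produce the gold output for one of the two task directions (pattern identification or string substitution) when generating examples on the fly.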

Dataset ideas to manually reproduce

Evaluation name	Task type	Task content	Source	Dataset	Comments
✍️ GSM8K-Python	Code task, Text-to-code	Python version of the GSM8K dataset (8.5K grade school math problems)	Paper	N/A
✍️ MRF	Generation, Manual evaluation, Misinformation capabilities	250 headlines extracted from the MRF dataset, grouped in 80 clusters by thesis. Task: from the thesis + 5 headlines, the model must generate plausible headlines that support the thesis. Annotators evaluate if the headline 1) supports the thesis and 2) looks real.	Paper	Data	See report page 6 for a detailed explanation of the original process, plus sections 8.5.2, E.5, and 5.5 in the HELM paper.
✍️ News article generation	Generation	25 articles generated from titles and subtitles; 80 humans had to classify each as generated or original	Paper
✍️ Numeracy Prediction	Symbolic manipulation	“requires the model to perform symbolic regression given a few examples, and apply the number relationship to a new input”		Paper	Github
✍️ SVG datasets		Could construct an SVG dataset to see if models can indeed generate or interpret SVG drawings	Twitter thread
✍️ Theory of mind datasets		Could likely be easy to generate	Paper
✍️ Wedging prompts	Generation, Manual evaluation, Misinformation capabilities	11 prompts with specific intent (ex: influence voting behaviors, target specific groups by generating pro/anti X rhetoric) augmented with 3 examples. Task: generate follow up examples.	Paper	Data	In HELM: use manual evaluation to determine if the generated message 1) addresses the targeted group; 2) supports the desired message; 3) is divisive
✍️ Word scrambling	Symbolic manipulation	10K examples for each of 5 character manipulation tasks (words with cycled letters, anagrammed, with random insertions, reversed). Model needs to recover the original word	Paper		Easy to generate/automate, see Section 3.9.2 of GPT3 paper
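
As a rough illustration of how the word scrambling data could be automated, here is a minimal sketch of the four transformations listed in the table. It is not the GPT3 authors' code: all function names, the `keep` parameter, the `symbols` set, and the output format are assumptions made for this example.

```python
import random


def cycle_letters(word, rng):
    """Rotate the word by a random non-zero offset."""
    if len(word) < 2:
        return word
    offset = rng.randrange(1, len(word))
    return word[offset:] + word[:offset]


def anagram_inner(word, rng, keep=1):
    """Shuffle all letters except the first and last `keep` characters."""
    if len(word) <= 2 * keep + 1:
        return word
    middle = list(word[keep:-keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[-keep:]


def random_insertion(word, rng, symbols=" .,!?;:"):
    """Insert one random punctuation or space character between consecutive letters."""
    return "".join(ch + rng.choice(symbols) for ch in word[:-1]) + word[-1:]


def reverse_word(word, rng):
    """Reverse the word (rng is unused, kept for a uniform signature)."""
    return word[::-1]


TASKS = {
    "cycle": cycle_letters,
    "anagram": anagram_inner,
    "insertion": random_insertion,
    "reversed": reverse_word,
}


def build_examples(words, task, seed=0):
    """Return scrambled/target pairs; the model must recover `target`."""
    rng = random.Random(seed)
    transform = TASKS[task]
    return [{"scrambled": transform(w, rng), "target": w} for w in words]


print(build_examples(["apple", "pear"], "reversed"))
```

Calling `anagram_inner` with `keep=2` gives a second anagram variant that leaves the first and last two characters in place. Scoring is then an exact match between the model output and `target`.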