Some evaluation datasets

November 26, 2024

If the task you are interested in is already well studied, chances are that a dataset exists for it.

Below are a number of evaluation datasets which were developed in the last few years.

However, be careful:

  • Some of them may be obsolete: they were designed pre-LLM and are now easily solved. They aimed to investigate one specific property of text (translation, summarization), which is no longer really how we evaluate models (evaluations are now more general/holistic). (If you've got some bandwidth, this could really benefit from adding the publication dates!) (This will also be updated with post-LLM evals at some point.)
  • They are likely contaminated, as they have been publicly available on the web for a number of years. However, that doesn't mean they won't hold signal for your task!
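
To make the contamination point concrete, here is a minimal sketch of one common heuristic: measuring how many word n-grams of an evaluation sample also appear verbatim in a training document. The function names, the whitespace tokenization and the default n-gram size of 8 are illustrative assumptions, not a method taken from any of the papers listed here.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the sample's n-grams that also occur in `corpus_doc`.

    Hypothetical helper: a ratio close to 1.0 suggests the sample appears
    (almost) verbatim in the document; 0.0 means no shared n-gram.
    """
    sample_ngrams = ngrams(sample, n)
    if not sample_ngrams:
        # Sample shorter than n tokens: nothing to compare.
        return 0.0
    shared = sample_ngrams & ngrams(corpus_doc, n)
    return len(shared) / len(sample_ngrams)
```

A high ratio only flags verbatim leakage; paraphrased or translated copies of a benchmark would not be caught by this kind of check.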

Math specific datasets

Evaluation nameTask typePublication dateData sizeTask dataTask/Paper contentSourceDatasetComments
AGIEval (SATMath)Exam dataset + existing datasets2023220Math problems from the SATPaper is actually a compilation of a bunch of exams designed for humans, to use as eval data.PaperHuggingFace- Careful, this paper also includes datasets from other papers! For math, they use AIME & AMC through the MATH dataset, GRE & GMAT through the AQuA-Rat dataset, and GaoKao
- Metrics: acc/em/f1
AIME (all)Olympiad dataset1983-now15 x 2 per yearMathematical problems requiring a combination of arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics2nd exam to choose the US team for the International Math OlympiadsBlogSourceAnswer is systematically an integer between 0 and 999.
AIME (22, 23 and 24)Olympiad dataset202490
See AIME (all)PaperHuggingFaceUsed in the AIMO competition
ALGES (SingleEQ)Online sources compilations2015508Grade school algebra problems extracted from sources on the webPaper is about implicitly learning and solving the simple equation behind the problemPaperSource- Web sources: http://math-aids.com, http://k5learning.com, and http://ixl.com
- Pre-LLM paper - data sources are probably OK
ALG514 or AllEqOnline forum2014514Algebra word problems extracted from a crowdsourced tutoring website, cleaned with turking, and manually verifiedPaper is about extracting the equation template from the problem to solve itPaperSource- Web source: Algebra.com
AMC 12Olympiad dataset2000 - now25 per yearMathematical word problems requiring arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics.1st exam to select the US team for the International Math Olympiad, used to be the American High School Math examBlogSource- Problems are designed to be solvable by students without any background in calculus.
Ape210KExam dataset2020210K problems, 56K templatesChinese elementary school-level math word problems written by math teachersSolve the problemsPaper (withdrawn, but v1 still accessible)HuggingFace- Some problems are templated, which could be interesting for contamination issues
- Initial dataset is 900K and was manually filtered
- Also provides "intermediate equations" (useful to test CoT traces if needed)
- Dataset is in Chinese
- Intended to be partially used for training
AQUA or AQUA-RatExam dataset + turked dataset2017100KAlgebraic word problems constructed from a seed of 34K problems from GMAT and GRE, and extended via turkingTask: Solve the problemPaperHuggingFace- Intended to be partially used for training
- Includes the rationale for the problems
- Use accuracy, BLEU and perplexity for scoring ...
ASDiv-AOnline sources compilation20202.3KGrade-school math word problems collected from various websites and normalizedTask: Solve the problemPaperGithub- Contains problem type and grade level annotations by a Master’s student annotator
- Focused on a high lexical diversity
- Used 28 websites
CHAMPOlympiad dataset2024270Math word problems extracted from a book of olympic competitions examples, rewritten to make the solutions parsable, and annotatedIntroduces a math bench. Problems are extended with hints and labeled with concepts, to allow ablation studies on performancePaper- Source: Book "Problem-Solving strategies" (Engel, 2008)
DeepMind MathExam dataset + synthetic dataset201910K?Synthetic raw math problems in algebra, arithmetic, calculus, comparison, conversions between units, polynomials, probability, etc.Task: Solve the problemPaperHuggingFace- Full list of domains in appendix B
- Paper first section is quite nice
- Provide generation code to generate more examples
- Provides additional train set
- Synthetic procedural dataset (inspired from/to extend? school exams dataset)
DocMath-EvalAnnotated financial reports + existing Fin math datasets20233.2KCombine financial reports and existing datasets, read by annotators to generate (or validate) questions, and provide answers as Python programs, then evaluated with domain expert annotatorsSolutions should be presented as Python programs which will be run to test their validityPaperSource- Re-uses TAT-QA, FinQA, MultiHiertt, TAT-HQA
- Looks quite high quality for math fin data!
- Provides additional train set
Dolphin1878Online sources compilations20151.5KNumber math word problems sampled from online sources and re-annotated if neededPaper is about extracting the equation (DOL) tree from the problem using semantic parsingPaper?- Sources: algebra.com and answers.yahoo.com
Dolphin18KOnline sources compilations201618KMath word problems semi automatically extracted from online sourcesPaperKaggle- Sources: Math category of Yahoo answers since 2008
- Method: manually annotate 6K problems then use a classifier. Train a model to extract the gold
- I'm not sure the quality is amazing there given the amount of automatic extraction (high) vs manual verif (low)
Draw-1KOnline sources compilations20161KGeneral algebra word problems extracted from online sourcesPaper is about evaluating solvers (systems generating equations from mwp), and test template and equation equivalencePaperSource- Label each problem with the template it follows, can be useful for contam
- Source: algebra.com
FinQAExpert annotated financial reports20211.1KFinancial questions linked to tables from earning reports. Annotators provide a question + step by step process + annotations for each page.Paper introduces the dataset plus a method made of a retriever which extracts relevant facts first to accommodate short context models, then a process generator.PaperHuggingFace- Likely high quality: used paid expert annotators + external experts annotators had high agreement on the task
- Total set is 8.2K
- Unsure from the paper how the tables are formatted - markdown maybe?
- Data source: earnings reports of S&P 500 companies (1999-2019)
FrontierMathExpert created dataset2024100+ (precise number unknown)Entirely novel math problems created for the paper across most math domains. Solutions are either integer or SymPy objects to be automatically verifiable through unit-test like python programs. Problems are labelled.Introduces the datasetPaperPrivate- Nice discussion of contamination in the paper
- Experts: 60 mathematicians over 12 countries
- All problems are peer reviewed
- Probably the highest quality dataset here atm
- Data is not public; however, since closed source models have been evaluated, it's likely they'll be contaminated for it in future occurrences :(
GAOKAO-Bench (MathCloze, MathQA)Exam dataset2023~500Math word problems at high school level from the Chinese college entry examsPaperSource,
Datasets are only 2023
HuggingFace and HuggingFace
- Mathematical formulas are converted to latex
- Problems are in Chinese
- Dataset is updated yearly
- Paper explores a bunch of grading methods, including LLM as judge
- Paper contains surprisingly little info about the dataset
GAOKAO 2023 (MathEn)Exam dataset, Competition dataset2023385Math word problems at high school levelCompiles questions from the 2023 Chinese National College Entrance Examination, the 2023 American Mathematics Competitions, and the 2023 American College TestingHuggingFace
GSM1KManually created dataset in the style of another dataset20241.2KDiverse "grade school"-like math word problems, following the solving distribution of GSM8KPaper does a contamination analysis of models on GSM8K vs GSM1KPaperPrivate- Paper also seems to suggest that perplexity analysis is not very good at detecting contamination
GSM8KManually created dataset in the style of an exam dataset20218.5KDiverse grade school-level math word problemsPaper is about training verifiers to solve math word problemsPaperGithub
Hugging Face
- Best results with an external calculator added
- All answers are positive integers, 50% of answers are between 0 and 8
- Annotation used Upwork for 1K problems, then Scale for the next. Problems writers were provided with seed questions from a 175B GPT3 model
iGSM (med and hard sets)Synthetic dataset202420KProblems are generated using a combination of dependency graphs between objects and categories (with direct, and implicit dependencies) and number of operations to generate new problemsPaper is about studying actual math reasoning on an extension of GSM8K, including probing internal model statesPaperHuggingFace- Idea is theoretically nice but problems generated are very unrealistic with high numbers of operation
- Paper focuses on "mental process" of model which I find dubious (though the probing section is nice!)
- So much anthropomorphism -_-
GSMHardAdaptation of existing dataset, numbers replaced20228.5KGSM8K with bigger/less common numbers to make the problems harder. However, change was done automatically through programs generated, and only 25 changes were checked (+ 50 cases were done manually).Paper is about using program-aided LMs (= generating CoT alternating equations and reasoning steps, and computing the solution on the last equation with Python)PaperHugging Face- Described in appendix H1
- Good idea, but not sure of the quality.
GSM-ICAdaptation of existing dataset with perturbations202358K100 samples from GSM8K with irrelevant context added (using a template for the irrelevant sentence, plus roles/numbers fillers)Paper tests how sensitive LLMs are to irrelevant context when reasoning on math tasksPaperHuggingFace
GSM-PlusAdaptation of existing dataset with perturbations202410KGSM8K with 8 variations per question, added by GPT4 and manually annotated by selected humans (cross annotator agreement checked)Paper introduces the dataset and compares results on several GSM8K variants across models and prompting formatsPaperHuggingFace- Changes include: replacing numbers by other numbers, changing the operations, changing the question, adding distractors, etc (nice typology of changes, I feel it could be extended)
GSM-SymbolicAdaptation of existing dataset, templated20248.5KTemplated GSM8K problems, which allows to generate new evals at willPaper creates parsable templates from GSM8K to be able to generate new problems at will and analyse contamination on GSM8KPaperTo be released- Contains other specific subsets (M1, P1, P2, which are difficulty levels, as well as NoOp, with seemingly relevant but actually irrelevant info added), and some experiments are done with few shot formats
- Lacking a dataset description table with all subsets imo
Hungarian HighSchool National Finals ExamExam dataset202333Problems from the 2023 Hungarian national high school finals in mathSourceHuggingFace- Requires grading by hand atm
HMWPExam dataset20205.4KAnnotated math word problems from a Chinese school-level (K-12) problems bankIntroduces a new formalism to represent MWP equations uniformlyPaperHuggingFace- Dataset is in Chinese
- Sources: Chinese K12 problems
Math23KOnline sources compilation201723KAutomatically extracted elementary school level math word problems.Introduces an RNN to solve MWPPaperHuggingFace- Sources: Chinese math word problems from online education websites for elementary school students.
- Dataset is in Chinese
- Extraction is rule based, but it's very unclear how much manual validation was done
Math401-LLMSynthetic dataset2023401Arithmetic expressions combining additions, subtractions, multiplications, exponentiations, logarithms etcPaper wants to measure strict arithmetic ability of modelsPaperGithub- Models are not that good atm for log/trig problems or big numbers
MATHOlympiad datasets202112.5KMathematical problems from real competitions in natural language and latex, annotated with difficulty levels.PaperHuggingFace- Sources: AMC 10, AMC 12, AIME, "and more"
- Also introduces a train set created from scraping Khan academy and AMPS
MathOdyssey
MathQAAdaptation of existing dataset, annotated201937KAnnotated solvable problems from the AQuA dataset with formal annotation programs (using humans as annotators and testing their agreement)Aims to introduce a representation language for math problems, applies the method to AQuAPaperHuggingFace- Sources: AQuA
MAWPSExisting dataset compilation20163.3KMath word problems from existing datasetsFramework to create new math problems, notably to remove lexical or template overlap when adding new datasetsPaperGithub- Sources: ALG514, ALGES, and other pre-LLM datasets
MiniF2FOlympiad dataset2022244Olympiad math word problems formalized with theorem provers when possible (Lean, Metamath, Isabelle)Paper is about testing math proof solvers' ability to reason on formal logicPaperPossibly HuggingFace- Sources: AIME, AMC, IMO
NPHardEvalSynthetic dataset2023900Complexity math word problems of varied difficulty level built from synthetic graph/linear dataPaper introduces the benchmark and uses it to evaluate reasoning ability of models. Also explores benchmark robustness!PaperGithub- Problems: sorted array search, edit distance, shortest path, traveling salesman, graph coloring, knapsack problem, meeting scheduling problem
- Can be regenerated as needed
NuminaMATH CoTExisting dataset compilation2024860KMath word problems (K12 + olympiad levels) combining existing datasetsNANAHuggingFace- Sources: AOPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, Synthetic AMC and MATH data, and other Olympiads sets
- Careful if you use this as a train set, as you will be contaminated on all major math benchmarks
NuminaMATH TiRExisting dataset compilation202472KSubset of NuminaMATH CoT focused on problems solvable with tool integrated reasoningNANAHuggingFace- Sources: AOPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, Synthetic AMC and MATH data, and other Olympiads sets
- Careful if you use this as a train set, as you will be contaminated on all major math benchmarks
OmniMathOlympiad datasets20242.2KOlympiad math word problems. Problems are extracted from forums or olympiad websites (using rule based + LLM rephrasing), then annotated and verified by humans.Paper introduces the benchmark and a judge trained to evaluate the answers (since they are free form)PaperHuggingFace- Sources: IMO, IMC, AoPS forum and wiki
- Domain labeling is done with LLMs
OlympiadBenchOlympiad datasets20248.4KOlympiad/math/physics word problems. Answers are automatically evaluated - either numbers or equations (evaluated with SymPy)Paper- Sources: Global Mathematics and Physics Olympiad Problems, Regional and National Chinese Math Competitions, and Gaokao Mock Questions for Mathematics and Physics
- Includes a physics subset
- VLM evaluation!
OlympicArenaOlympiad datasets202411KPaper
PRM800KSynthetic data2023800KPreference data from annotators on 800K solutions generated by a modelPaper introducing process supervision to improve reward models (compares output and process supervision)PaperHuggingFace- More a train set than an evaluation
SVAMPAdaptation of existing dataset20211KOne-unknown arithmetic word problems of grade level up to 4, created with experts applying variations to ASDiv-A.Paper wants to assess question sensitivity, reasoning ability, and structure invariance in models for math evaluations.PaperGithub- Variations: same object & different structure, opposite, both different, adding relevant or irrelevant information, changing information, inverting operations, changing order of sentences or objects
TabMWPOnline source adaptation202238KTabular math word problems requiring multi-hop reasoning, extracted from an online educative website and manually annotated.Paper wants to test tabular math reasoning, datastPaperHuggingFace- Source: IXL learning website
- Tabular data is provided as an image, semi-structured text, and a table
- Answers are generative or MCQA
- Dataset is tested against turkers
TAL-SCQ5K-EnCompetitions dataset20234KMath word problems in MCQA format, with math expressions as latexNANoneHuggingFace- Contains English and Chinese
- Also contains 6K train samples and CoT
TemplateGSMLLM-generated data20247MGPT4-generated math word problems inspired in shape by GSM8KPaper uses GPT4-generated meta-templates to generate problems by changing parameters. Uses a verifier to ensure usabilityPaperHuggingFace- Since everything is LLM generated, I would expect stronger proofs of quality
TheoremQAOnline sources adaptations2023800QAs about university level theoremsProtocol: Uses GPT4 to enumerate subfields of relevant domains, then plausible theorems lists, then uses domain experts to actually look for said theorems, then look for QA on the web concerning themPaperHuggingFace
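
Several entries above (GSM-Symbolic, TemplateGSM, Draw-1K) rely on templated problems, and several others (AIME, GSM8K) have integer-only answers. The sketch below illustrates both ideas under loud assumptions: the template text, the name list, the value ranges and the helper names are all hypothetical, and none of this is code from the cited papers.

```python
import random
import re
from typing import Optional, Tuple

# Hypothetical template in the spirit of templated GSM8K variants.
TEMPLATE = (
    "{name} has {a} apples and buys {b} bags of {c} apples each. "
    "How many apples does {name} have now?"
)


def make_problem(rng: random.Random) -> Tuple[str, int]:
    """Sample placeholder values and return (question, gold integer answer)."""
    name = rng.choice(["Ava", "Noor", "Liam"])
    a, b, c = rng.randint(2, 50), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    return question, a + b * c


def extract_last_integer(text: str) -> Optional[int]:
    """Return the last integer found in a model output, or None if absent."""
    matches = re.findall(r"-?\d+", text.replace(",", ""))
    return int(matches[-1]) if matches else None


def is_correct(model_output: str, gold: int) -> bool:
    """Exact match between the last integer in the output and the gold answer."""
    return extract_last_integer(model_output) == gold
```

Seeding the generator (for example `random.Random(0)`) makes a generated eval set reproducible, while changing the seed yields fresh problems that cannot have been seen verbatim during training. Taking the last integer of the output is a crude parsing choice and will misfire on outputs that end with an unrelated number.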

Pre-LLM datasets

Evaluation nameTask typeTask dataTask contentSourceDatasetComments
DeepFixCode task, Code-to-code, Correction7K student-written erroneous C programsCorrect the C programsPaper
MLSumGeneration, Multilingual, Summarization1.5M news summary/article pairs from the DailyMail, Le Monde, Süddeutsche Zeitung, El Pais, Moskovskij Komsomolets and Internet Haber (en, fr, de, es, ru, tur)Summarize the articlesPaperHugging FacePalm: Prefixed with a prompt, truncated article to 2048 tokens
TransCoderCode task, Code-to-code852 parallel functions in Python/Java/C++Translate from a language to anotherPaperFrom paper
WMTMultilingual, TranslationDatasets from the WMT conf on machine translation - datasets available depend on the yearTranslate from a language to anotherConference
Replace the 2 digits by the conference year
Adversarial NLILanguage Inference10K entailment dataset generated using human in the loop adversarial attacks, looking for predicates which force models to predict wrong entailment labels (uses contexts from StoryCloze, CommonCrawl, Wikipedia, the Open Annotated National Corpus, WikiHow and GLUE)Predict entailmentPaperData
Github
R1 to R3 = rounds of data generation
APPSText-to-code10K Python coding problems in natural languages, scraped from leetcode sites, with a suite of test cases.Solve the Python problemPaperGithub
Data
AQuAArithmetic, Reasoning100K multiple choice problems (GMAT, GRE, other sources) with question/options/rationaleSelect the correct MCQAPaperGithubBest results obtained with an external calculator added
ARCCommon Sense, Reasoning8K Grade school science questions: e = easy set, c = challenge setSelect the correct MCQAPaperDataCareful, this is the AI2 Reasoning Challenge, not the Abstraction and Reasoning Corpus
bAbIReasoning20 tasks each with 2K automatically generated questions + short scenarios (successive actions generated with a simulated text adventure game).Reason over the sentence to select the correct conclusionPaperGithub
Data
See Part 4 for the simulation env and its constraints, it’s quite a fun idea. Probably not too hard to reproduce for other types of reasoning.
BBQBias detection58K examples with two contexts (ambiguous and explicit about a bias), two questions (negative and non-negative) and possible answers, constructed from manual templates and checked with crowdsourcing.Predict the correct, non biased answer. Difference between accuracies depending on context/question allows to build a bias score.PaperGithub
BLiMPLanguage Understanding67 datasets, each of 1K artificially generated minimal pairs testing syntactic, morphological and semantic knowledge, validated with MTurking.Accuracy measured by checking whether the log-probability the model assigned to the correct sentence is higher.PaperGithubThings tested: anaphor agreement, argument structure, binding, control/raising, determiner-noun agreement, ellipsis, filler-gap, irregular forms, island effects, NPI licensing, quantifiers, subject-verb agreement
BOLDGeneration, Toxicity detection23K prompts extracted from beginning of Wikipedia sentences containing a race/religion/political/gender/profession group member (ex: woman artist for gender=female).
Task: generating end of sentence, and toxicity is evaluated through a range of metrics (sentiment analysis, using classifiers, …). In HELM, toxicity is measured using the Perspective API.
PaperGithub
BooksCorpusN/A11K unpublished books of more than 20K words scraped from the web (of 16 different genres).

In the original paper, it's used to train a sentence embedding model - in model papers, it's often used for contamination or perplexity evaluations.PaperHugging Face
BooksCorpus_HELMGeneration, Memorisation1K randomly sampled books from BooksCorpus.Task: from a random number of tokens beginning a paragraph, the model must generate a follow up - measure exact and near-exact reproduction.PaperData
BoolQLanguage Inference, Language Understanding16K sentences of naturally occurring yes no QA, from question + context from WikipediaAnswer the MCQAPaperWebsite
CBLanguage Understanding1.2K of discourse (news from the Wall Street Journal, fiction from the British National Corpus, dialogue from Switchboard), containing context + target sentencePredict commitment entailmentPaperWebsite
Civil commentsToxicity detection1.8M online comments, with crowdsourced human labels for toxicity following the Perspective API guidelines, and among these, 450K labeled with identity terms (crowdsourced, to pick in a list).Task: toxicity prediction, labels are used to identify areas of biases in modelsPaperKaggle
Hugging Face
Original paper contained a synthetic test set (77K examples generated from templates using 50 identity terms, 50/50 on toxicity) and a human labeled dataset (description in the Task content col) - I suppose the dataset available is the latter
Clean E2E NLGDescription, Generation50K crowdsourced generated descriptions of restaurants given keys and values (type of food = X, budget = Y, …).PaperHugging FacePalm: Prefixed with a prompt, truncated article to 2048 tokens
CNN/DailyMailCloze/Completion, SummarizationOriginal dataset: 200K news documents (CNN/DailyMail between 2007 and 2015) converted into Cloze format by removing some of the text’s named entities, and using them as keys.
In HELM: Uses above documents (in complete form) as text to summarize, and their highlights as gold summaries.PaperHugging Face
Data
(I suspect this does not produce very good summaries though)
CommonsenseQACommon Sense, Reasoning12K turked Q/A (initialized from ConceptNet associations), then filtered by quality, with added context from a Google Search query > Some text likely overlaps with CC dataAnswer the MCQAPaperBest results with an external calculator added
Contrast SetsGeneration, Robustness10 contrast sets of up to 1K examples, for their datasets (see comments) made by (often the original paper’s) researchers (increase reasoning steps in questions, replace words by their opposites, change counts …).Follow original task setup with new examples, and see how/if performance drops.
In HELM: use the IMDb and DROP contrast sets
PaperDataNLVR2, IMDb sentiment analysis, MATRES Temporal RE, English UD parsing, PERSPECTRUM, DROP, Quoref, ROPES, BoolQ, MC-TACO.

Dataset construction is way more detailed in Appendix
COPACommon Sense, Language Understanding1K premise + causal questions with alternatives (common sense)PaperWebsite
CoQAIn-Context Reading Comprehension127K Conversational QA, from a given context (rationale must be provided too) - written by annotatorsPaperv1.0 from Data
DataImputationReal life task, Reasoning, Structured data8 structured datasets from several sources.Task: from a row of attributes with gaps, the model must fill the gaps (ex: extrapolating city from phone number, phone brand from its specs).
PaperData restaurant
Data Buy
See table 2 for all sources.
In HELM: use the subsets Buy and Restaurant, convert input to natural language, test accuracy.
Digits arithmetics (2D+, 2D-, 3D+, 3D-, …)ArithmeticBasic arithmetic tasks for n digits addition, subtraction, composite operations with 2K example eachTask: solve the mathPaperGithubAll links come from the lm-evaluation-harness/lm_eval/datasets/arithmetic
DROPArithmetic, In-Context Reading Comprehension55K adversarial questions which require 1) selecting relevant items from the text and 2) computing on them (sorting/counting/…) to get the correct answerTask: select and count to provide the correct answerPaperData
Dyck language_HELMSymbolic manipulation500 D_n words between 52 and 100 characters (”word” made of nested brackets/parenthesis) where the last i characters have been removed.Task: Predict the unique sequence of closing parentheses.PaperGithub
Also has a different version in BigBench
HellaSwagCloze/Completion60K adversarially filtered multiple choice Q/AChoose the correct next sentence (from captions or WikiHow)PaperGithub
HumanEvalCode task, Text-to-code164 hand written programming problems with function signature, docstring, body + unit testsAim is to complete function to pass unit testsPaperHugging Face
IMDBSentiment Analysis50K reviews from IMDB, with even positive (score ≥ 7) /negative (score ≤ 4) reviews (no neutral).Classify positive/negative review.PaperWebsite
LAMBADACloze/Completion10K Narrative contexts (from the BookCorpus) followed by a sentence where the last word is masked and must be predicted. Specifically built to force use of the context.Predict the last word.PaperZenodo
Language Modeling Evaluation_HELMLanguage ModelingCompilation in HELM of several datasets: WikiText-103, ThePile (particularly arXiv, BooksCorpus2, Enron Emails, PubMed Central, Wikipedia), TwitterAAE, ICE.Task: get conditional log probability of the full sequence (perplexity measure)PaperThe pile website
BLIMP Github
Wikitext data
Twitter AAE data
ICE data
LegalSupportEntailment, Real life task, Reasoning20K legal entailment scenarios, constructed from state/federal legal opinions (assertion is used as context, and 2 supporting sources (”see X, rule”) are selected at random).Task: finding which rule best supports assertion.PaperData
LinuxKernel_HELMGeneration, Memorisation2K randomly sampled functions from the Linux Kernel.Task: from a random number of lines beginning a function, the model must generate a follow up - measure exact and near-exact reproduction.PaperData
LSATAnalytical Reasoning, In-Context Reading Comprehension, Logical Reasoning10K questions from the Law School Admission Test (analytical, logical reasoning, and reading comprehension), with context.Answer the MCQA correctlyPaperGithub
Magellan BenchmarkReal life task, Reasoning, Structured data23 datasets from several sources containing entities with attributes. Dirty datasets contain deliberate errors, such as attributes being in the wrong column, misspellings, etc.Task: given two entities from two different tables, determine if they are the same or not.PaperGithubIt’s likely that Abt-Buy and Buy are the same dataset
MBPPCode task, Text-to-code1K entry-level Python crowd-sourced programming problems (description, solution, 3 unit test cases) - (58% mathematical, 43% list processing, 19% string processing, 9% integer sequences, and 2% other).Solve the Python programPaperGithub
Hugging Face
Also contains an edited version (400 items) with unambiguous prompts and good signatures (can be interesting to look at later to see the impact of prompts on code gen) + a MathQA-Python dataset (adaptation of the MathQA dataset)
MMLULanguage Understanding15K multi-choice Q/A manually collected from various online sources, on many topics (legal, philosophy, economics, psychology, STEM, medicine, etc, - at high school to professional level)Answer the MCQAPaperHugging Face
Github
Seems like a strong/high-quality baseline
MRF (Misinfo Reaction Frames)Generation, Misinformation capabilities200K pairs of claims from news headlines (climate change, covid 19, cancer illness, detailed sources in comments) + label (real/misinformation), the former annotated on veracity, likelihood of spread, writer intent by MTurk workers.Task: must either predict the gold label or generate likely writer intent/reader perception/…PaperGithub(Contains data from NELA-GT-2018-2020, SciDCC, Climate-FEVER, CoAID, CoronaVirusFacts/DatosCoronaVirusAlliance Database, ESOC Covid-19 Misinformation Dataset, DETERRENT)
MS MARCOQuestion Answering, Retrieval1M anonymised questions with free-form human generated answers (from relevant web document extracts), some with added rewriting.
Original paper contains 3 tasks: 1) generating the correct answer, if possible, 2) same but answer should make sense even without context, 3) ranking 1000 passages on how relevant they are for the question.PaperGithubContains extended descriptions of QA datasets in lit review.
In HELM, only the ranking task is looked at, and relevance is estimated looking at the log-likelihood of the prediction when asking “Does the passage answer the query?”
MS MARCO TREC, aka TREC 2019RetrievalDatasets derived from MS MARCO, edited for either passage or document retrieval tasks, either doing full retrieval or top-n reranking (100 for documents, 1000 for passages). (see MS MARCO)PaperData
Github
MultiRCLanguage Understanding, Question Answering6K multiple choice question over a diversity of topicsPaperData
NarrativeQAQuestion Answering, Retrieval47K free-form human generated questions and answers, linked to 1.5K books (Project Gutenberg) and movie scripts (scraped) matched with plot summaries.
Task: from the summary or the story, answer or select the correct answer.PaperGithubFor long range context testing, we could use this dataset to do QA from the full stories. Could be interesting for anything conversational imo.
Natural QuestionsOpen domain/Closed book, Question Answering207K aggregated google search queries + annotated wikipedia sample answerPaperData
NewsQAQuestion Answering100K human generated QA pairs from 12K news articles (CNN). Questions were generated from title + summary, answers from question + article, then kept through a validation mechanism.
Likely intersects with CNN/DailyMail, as data extraction script was the same.
PaperGithub
OpenBookQACommon Sense, Reasoning6K sentences, science reasoning needing common sense knowledge to extrapolate to new situationsPaperData
PIQACommon Sense, Reasoning20K physical common sense reasoning situations, select the correct action to do from a context and answersPaperData
PopularBooksCorpus_HELMGeneration, Memorisation20 books from BooksCorpus which appear in a list of bestsellers.Task: from a random number of tokens beginning the first paragraph of the book, the model must generate a follow up - measure exact and near-exact reproduction.PaperData
QuACIn-Context Reading Comprehension100K questions in information seeking QA contexts (used Wikipedia to generate dataset)PaperData
RACEIn-Context Reading Comprehension100K questions from English reading comprehension exam for Chinese mid/high school studentsPaperData
RAFTReal life task, Text classificationCompilation of 11 datasets of naturally occurring classification tasks, of between 150 and 5K test items.Task: in few shot from 50 labeled examples, provide meaningful labels. (Domains: medical, finance, research, English language, law, physics, AI safety, social networks)PaperHugging FaceCorpus: (ADE Corpus v2, Banking77, NeurIPS 2020 impact statement risks, OneStopEnglish, Overruling, Systematic review inclusion, TAI safety research, Terms of Service, TweetEval Hate, Twitter complaints, + Semiconductor org types, created for this)
RealToxicityPromptsGeneration, Toxicity detection100K naturally occurring sentences (selected from the OpenWebText corpus, basically = reddit, and scored for toxicity with the PerspectiveAPI), each split in two to create a prompt and a continuation.Task: generate the continuation from the sentence start, then toxicity is evaluated with PerspectiveAPI.PaperData
Github (the repo lacks a lot of info)
ReCoRDLanguage Understanding120K passage/cloze query/answer examples from news (CNN, DailyMail) with human filteringPaperData
RTELanguage Understanding3K compilation of competition data on entailmentPaperData
SAT analogiesLanguage Understanding374 SAT analogy problems from before 2005 (multiple choice questions of the form "a is to b what c is to ?"; the words are not the most frequent ones)PaperData dev
Data test
SIQAQuestion Answering
SQuADv2In-Context Reading ComprehensionCombines SQuAD with 50K unanswerable questionsfrom a context, give an answer, but only if possiblePaperGithub
StoryClozeCloze/Completion, Common Sense50K 5-sentence commonsense storieschoose the correct endingPaperHugging Face
StrategyQACommon Sense, Reasoning2.8K questions needing reasoning from implicit knowledgePaperBest results with an external calculator added
Synthetic reasoning (natural)Logical Reasoning, ReasoningSynthetic data generated on the fly, containing a set of synthetic rules (conditional statements), facts (attributes), and the logical gold output.PaperCan be generated with GithubAlso called rule_induct in HELM
Synthetic reasoning (symbolic)_HELMLogical Reasoning, Symbolic manipulationSynthetic data generated on the fly using a pattern template.Either test if the model is able to identify patterns (“beach + beach - pear” has “A + A - B” as pattern) or if the model can substitute strings in a given pattern.PaperCan be generated with Github
TriviaQAOpen domain/Closed book, Question Answering95K trivia QA (compositional questions, syntactic variability)PaperData
TruthfulQAQuestion Answering817 questions about tricky factual claims (common misconceptions, falsehoods, …) over 38 categories, with true and false reference answers + a source to support true answers (+ 380 added questions).PaperGithub
TyDiQA-GoldPMultilingual, Question Answering204K multilingual QA pairs (unconstrained question elicitation from prompts, then Wikipedia article retrieval, and specific answer selection in the article if possible) (en, ar, ben, fin, ind, ja, ko, ru, tel, th and Kiswahili).MCQAPaperGithubThe generated dataset can present interesting underspecification of questions and mismatches between question and answer language levels. Might be harder than other datasets
Web QuestionsOpen domain/Closed book, Question AnsweringExtracted 100K “Wh?” questions from Google Search API, then annotated by MTurkers - I suspect answers are partially out of dateMCQAPaperWebsite
WebNLGGeneration, Verbalization13K mappings between triples (subject, property, object, constructed from DBPedia, which is a KB from Wikipedia) and sentence verbalization (by crowd workers), about specific topics (astronauts, universities, monuments, buildings, characters from comics, food, airports, sports teams, written works).Task: verbalize in a grammatical way.PaperHugging Face
Sentences were selected for fluency and the generated sentences are relatively simple, but there is no description of the annotators'/crowdworkers' origins, so some data may not be in “standard English”.
WiCLanguage Understanding7K, classification of whether a word occurring in two different contexts has the same meaning or notPaperSite
WikiFact_HELMCloze/Completion12 domains with 1K triples (subject, relation, object) sampled from Wikipedia and cleaned.Task: predict missing item in the sentence made of the relation.PaperCodalab
Github
WikiLinguaGeneration, Multilingual, Summarization43K article/summary pairs constructed from WikiHow in 18 languages (on the site, articles are written with a summary sentence + detailed paragraph per step: in the dataset, summaries are the concatenation of the summary sentences, and articles of the detailed paragraphs)SummarizationPaperGithubPalm: Prefixed with a prompt, truncated article to 2048 tokens
I suspect data creation can lead to very “robotic” language for the summary baseline, which could give lower scores to more fluid summaries (though ROUGE shouldn’t be too prone to that).
WinogenderBias detection
WinogradReasoning, Winograd273 to 285 examples where one must disambiguate who/what a pronoun is referring to, in sentences specially constructed to be ambiguous to statistical methods but not to humansDisambiguation of pronounPaperWebsiteNot sure if GPT3 was evaluated on this one or the SuperGLUE one
WinoGrandeReasoning, Winograd43K sentences Adversarial WinogradPaperWebsite
WSCLanguage Understanding, WinogradWinograd Schema Challenge (see Winograd)PaperWebsite
XSUMSummarization226K news articles (BBC, 2010 to 2017) matched with their single sentence summary (comes from the article). Task: Summarize. (Domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts)PaperGithub
XSumGeneration, Summarization226K news summary/article pairs from the BBC (2010 - 2017) extracted from the WayBack machinePaperHugging FaceCould be interesting to manually check if the model's recent knowledge creates discrepancies in the summaries of old news.
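
To make the Synthetic reasoning (symbolic) row above more concrete, here is a minimal sketch of the two sub-tasks it describes: recovering the pattern behind an expression, and substituting strings into a pattern. The function names `abstract_pattern` and `substitute` are made up for illustration, and this is not the HELM generation code, just one way the idea could be reproduced.

```python
import string


def abstract_pattern(expression, operators=("+", "-", "=")):
    """Replace each distinct word by a letter (A, B, ...), keeping operators.

    Assumes whitespace-separated tokens and at most 26 distinct words.
    """
    mapping = {}
    out = []
    for token in expression.split():
        if token in operators:
            out.append(token)
        else:
            if token not in mapping:
                # First unseen word gets "A", the next one "B", and so on.
                mapping[token] = string.ascii_uppercase[len(mapping)]
            out.append(mapping[token])
    return " ".join(out)


def substitute(pattern, assignment):
    """Fill a pattern such as "A + A - B" with the words in `assignment`."""
    return " ".join(assignment.get(token, token) for token in pattern.split())


print(abstract_pattern("beach + beach - pear"))
print(substitute("A + A - B", {"A": "beach", "B": "pear"}))
```

A gold answer for the first sub-task would then be the output of `abstract_pattern`, and for the second the output of `substitute`, both compared to the model generation with exact match.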

Dataset ideas to manually reproduce

Evaluation nameTask typeTask contentSourceDatasetComments
✍️ GSM8K-PythonCode task, Text-to-codePython version of the GSM8K dataset (8.5K grade school math problems)PaperN/A
✍️ MRFGeneration, Manual evaluation, Misinformation capabilities250 headlines extracted from the MRF dataset, grouped in 80 clusters by thesis. Task: from the thesis + 5 headlines, the model must generate plausible headlines which support the thesis. Annotators evaluate if the headline 1) supports the thesis and 2) looks real.PaperDataSee report page 6 for a detailed explanation of the original process, plus sections 8.5.2, E.5, and 5.5 in the HELM paper.
✍️ News article generationGenerationGenerated 25 articles from titles and subtitles, 80 humans had to classify if generated or originalPaper
✍️ Numeracy PredictionSymbolic manipulation“requires the model to perform symbolic regression given a few examples, and apply the number relationship to a new input”PaperGithub
✍️ SVG datasetsCould construct an SVG dataset to see if models can indeed generate or interpret SVG drawingsTwitter thread
✍️ Theory of mind datasetsCould likely be easy to generatePaper
✍️ Wedging promptsGeneration, Manual evaluation, Misinformation capabilities11 prompts with specific intent (ex: influence voting behaviors, target specific groups by generating pro/anti X rhetoric) augmented with 3 examples. Task: generate follow up examples.
PaperDataIn HELM: use manual evaluation to determine if the generated message 1) addresses the targeted group; 2) supports the desired message; 3) is divisive
✍️ Word scramblingSymbolic manipulation10K examples for each of 5 character manipulation tasks (words with cycled letters, anagrammed, with random insertions, or reversed). The model needs to recover the original wordPaperEasy to generate/automate, see Section 3.9.2 of GPT3 paper
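
Since the word scrambling tasks are easy to automate, here is a minimal sketch of how such examples could be generated. It is not the original GPT3 generation code: the function names, the `TASKS` mapping, the symbol set used for insertions and the word list are all assumptions for illustration. The two anagram variants (keeping the first and last 1 or 2 letters fixed) follow my reading of the task descriptions in the GPT3 paper, so double check them against Section 3.9.2.

```python
import random


def cycle_letters(word, rng):
    """Rotate the word by a random non-zero offset."""
    if len(word) < 2:
        return word
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]


def anagram_inner(word, rng, keep=1):
    """Shuffle all letters except the first and last `keep` letters."""
    if len(word) <= 2 * keep + 1:
        return word  # nothing to shuffle
    middle = list(word[keep:-keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[-keep:]


def random_insertion(word, rng, symbols=" .,!?-"):
    """Insert one random symbol between each pair of letters."""
    return "".join(c + rng.choice(symbols) for c in word[:-1]) + word[-1:]


def reverse_word(word, rng):
    """Spell the word backwards (rng unused, kept for a uniform signature)."""
    return word[::-1]


# Hypothetical task names, one entry per manipulation.
TASKS = {
    "cycle_letters": cycle_letters,
    "anagram_1": lambda w, rng: anagram_inner(w, rng, keep=1),
    "anagram_2": lambda w, rng: anagram_inner(w, rng, keep=2),
    "random_insertion": random_insertion,
    "reversed": reverse_word,
}


def make_examples(words, task, n, seed=0):
    """Build n {"input", "target"} pairs; the model must recover "target"."""
    rng = random.Random(seed)
    fn = TASKS[task]
    examples = []
    for _ in range(n):
        word = rng.choice(words)
        examples.append({"input": fn(word, rng), "target": word})
    return examples


if __name__ == "__main__":
    words = ["inevitably", "succession", "corruption"]  # placeholder word list
    for task in TASKS:
        print(task, make_examples(words, task, n=2))
```

Scoring can then be a simple exact match between the model generation and `target`. Note that a shuffle can occasionally return the original word unchanged, so you may want to filter those examples out.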