VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
June 27, 2024 ยท View on GitHub
VarBench is a new benchmark with dynamically-valued variables to help deal with the problem of benchmark data contamination. Currently, the following tasks are supported: GSM8K, CommonsenseQA, AI2 Reasoning Challenge, and TruthfulQA. We are planning to extend VarBench to include other complex tasks such as AGIEval and MMLU. To keep the results comparable, we use the lm-eval harness from EleutherAI.
Constructing VarBench
The default extracted variables are stored in ./gen_data/${dataset}/${split}_${dataset}_${model}.jsonl, where dataset is one of gsm8k, truthfulqa, csqa, arc_challenge, split is one of test, dev, validation (depending on each dataset), and model is on default gpt4o in this work. In this section we will use them to construct new test sets.
Note: These files has been manually corrected and verified. Therefore, try not to overwrite these files.
Constructing GSM+
we need to sample variable values. By selecting different random seeds, we can create a new set of values for the variables defined in the previous step.
python sample.py \
--data_path ./gen_data/gsm8k/test_gsm8k_gpt4o.jsonl \
--save_dir ./gen_data/gsm8k/sample_42 \
--task generate_test_set_gsm8k \
--seed 42
Constructing TruthfulQA+
python sample.py \
--data_path ./gen_data/truthfulqa/validation_truthfulqa_gpt4o.jsonl \
--save_dir ./gen_data/truthfulqa/sample_42 \
--task generate_dev_set_csqa \
--seed ${seed}
Constructing CommonsenseQA+ (CSQA+)
python sample.py \
--data_path ./gen_data/csqa/dev_csqa_gpt4o.jsonl \
--save_dir ./gen_data/csqa/sample_42 \
--task generate_dev_set_csqa \
--seed ${seed}
Constructing ARC+
python sample.py \
--data_path ./gen_data/arc/challenge/test_arc_challenge_gpt4o.jsonl \
--save_dir ./gen_data/arc/challenge/sample_42 \
--task generate_test_set_arc \
--seed ${seed}
Extracting Variables
This step is optional if you only wish to select new variable values for the benchmark
The first step is to extract variables from the original GSM8K test set to create a delexicalized version, and construct code solutions for each problem.
python generate.py \
--model_name_or_path "gpt-4o" \
--top_p 0.3 \
--save_dir ./gen_data/gsm8k \
--save_filename gsm8k_test_gpt4o.jsonl \
--task generate_gsm8k
This step will add three components for each datapoint, under keys variables, question_delex, func correspondingly.
Note: that the save_filename is different from the default
test_gsm8k_gpt4o.jsonl, avoiding overwrite the original annotation.
Then we prompt gpt again to create value range for each variable. On default, we load variables from gsm8k_test_gpt4o.jsonl, which is generated in the previous step.
python generate.py \
--model_name_or_path "gpt-4o" \
--top_p 0.3 \
--save_dir ./gen_data/gsm8k \
--save_filename gsm8k_test_gpt4o_range.jsonl \
--task generate_input_range
This step will add component under key input_range for each data point.
Alternative Perturbation
Alternative perturbations are created in other_process.py file.
- For
gsm8k, we conduct paraphrasing:
python other_process.py \
--task paraphrase_gsm8k \
--save_dir ./gen_data/gsm8k/paraphrase/ \
--save_filename "test.jsonl" \
--seed 2
- For
arcandcsqawe conduct shuffling:
python other_process.py \
--task shuffle_arc \
--save_dir "./gen_data/arc/challenge/shuffle/" \
--save_filename "test.jsonl" \
--seed 40
But for csqa, we conduct shuffling during sampling by change line 144 in sample.py to be True.
- For
truthfulqa, we rewrite questions by settingnew_questioninline 293insample.pyto beTrue.
Citation
If you end up using this benchmark or the accompanying code, please cite the following paper.
@inproceedings{qian2024varbench,
title = "{VarBench}: Robust Language Model Benchmarking Through Dynamic Variable Perturbation",
author = "Qian, Kun and
Wan, Shunji and
Tang, Claudia and
Wang, Youzhi and
Zhang, Xuanming and
Chen, Maximillian and
Yu, Zhou",
booktitle = "arXiv preprint",
month = june,
year = "2024",
}