eval_math
May 27, 2024
Dataset
datasets/MATH/math.txt: the math test set, containing 458 questions.
In this file, the content before [SEP] is the question, and the content after [SEP] is the reference answer to that question.
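To illustrate this layout, here is a minimal Python sketch that splits one record at the separator. It assumes a record is a single string holding exactly one question/answer pair. `split_record` is a hypothetical helper, not a function from this repository, and the sample record is made up.

```python
from typing import Tuple

SEP = "[SEP]"


def split_record(record: str) -> Tuple[str, str]:
    """Split one record into (question, answer) at the first [SEP]."""
    question, sep, answer = record.partition(SEP)
    if not sep:
        # Assumption: every record must contain the separator.
        raise ValueError("record has no [SEP] separator")
    return question.strip(), answer.strip()


# Made-up record, for illustration only.
question, answer = split_record("What is 1 + 1?[SEP]2")
print(question)
print(answer)
```

The same helper would apply to the model output file, which uses the same [SEP] layout with the model's answer in place of the reference answer.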
Evaluation
Introduction
examples/eval_math_2x32B.sh: running this script produces the evaluation results for math.
The variables in the code should be set as follows:
| Variable name | Description |
|---|---|
| CHECKPOINT_PATH | Path to the checkpoint to be evaluated. |
| TOKENIZER_MODEL_PATH | Path to the tokenizer. |
| MATH_DATA | Path to the evaluation set. |
| OUTPUT_PATH | Path where the evaluation results are saved. |
Usage
Run the following command to evaluate the model's performance on the test set:
bash -x examples/eval_math_2x32B.sh
Result
The evaluation results are saved to OUTPUT_PATH. In the output, the content before [SEP] is the question, and the content after [SEP] is the model's answer to that question.
Accuracy
Introduction
tasks/MATH/score_math.py: running this script computes the accuracy of the evaluation results for math.
The path variables in the code should be set as follows:
| Variable name | Description |
|---|---|
| origin_file_path | Path of the evaluation set file. |
| eval_file_path | Path of the saved evaluation result file. |
| txt_eval_res_dir | Directory for storing the results separated by correctness. Files ending with _true contain correct results, while those ending with _false contain incorrect results. |
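The separation into correct and incorrect results can be sketched roughly as follows. This is not the logic of score_math.py: the comparison rule (exact match after trimming whitespace) is an assumption, and `is_correct` and `partition_results` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple


def is_correct(reference: str, prediction: str) -> bool:
    # Assumption: an answer counts as correct when it equals the
    # reference after trimming surrounding whitespace.
    return reference.strip() == prediction.strip()


def partition_results(
    references: Dict[str, str], predictions: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (correct_questions, incorrect_questions)."""
    correct: List[str] = []
    incorrect: List[str] = []
    for question, reference in references.items():
        # A question with no prediction is treated as incorrect.
        prediction = predictions.get(question, "")
        if is_correct(reference, prediction):
            correct.append(question)
        else:
            incorrect.append(question)
    return correct, incorrect
```

Math answers often need normalization before comparison (for example, equivalent fractions or LaTeX formatting), so a real scorer may use a more lenient rule than the exact match assumed here.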
Usage
Run the following command to compute the accuracy of the evaluation results:
python score_math.py
Result
The output reports "Number of correct answers" and "Number of incorrect answers", the counts of questions the model answered correctly and incorrectly, and "accuracy", the share of all answers that are correct.
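As a small worked illustration of how the three reported values relate, the sketch below derives accuracy from the two counts using the standard definition (correct answers divided by all answers). `accuracy` is a hypothetical helper, and the counts in the example call are made up rather than real results.

```python
def accuracy(num_correct: int, num_incorrect: int) -> float:
    """Fraction of answers that are correct."""
    total = num_correct + num_incorrect
    if total == 0:
        raise ValueError("no answers to score")
    return num_correct / total


# Made-up counts, for illustration only.
print(f"accuracy: {accuracy(3, 1):.2%}")
```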