eval\_humaneval
May 28, 2024 · View on GitHub
Dataset
datasets/HUMANEVAL/HumanEval.jsonl.gz The original English version of the HumanEval dataset containing 164 questions.
datasets/HUMANEVAL/HumanEval-textprompts.jsonl The Chinese version of the HumanEval dataset obtained by translation with the aid of gpt-4 model.
datasets/HUMANEVAL/HumanEval-instructions.jsonl The HumanEval dataset in instruction style.
datasets/HUMANEVAL/HumanEval-instructions-fewshot.jsonl The HumanEval dataset in instruction style with few-shot prompt.
Evaluation
Introduction
examples/eval_humaneval.sh. The evaluation results for HumanEval on Yuan2.0-M32 model could be obtained by running this program.
Before running the evaluation program, you have to specify the following checkpoint_path in bash script by yourself.
| Variable name | Description |
|---|---|
CHECKPOINT_PATH | the path that saves the checkpoint to be evaluated. |
Requirement
Make sure HumanEval program is installed befere running the HumanEval evaluation on Yuan2.0-M32 checkpoint.
$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
After HumanEval program is installed, we shall go to this script,
/usr/local/lib/python3.10/dist-packages/human_eval-1.0-py3.10.egg/human_eval/execution.py
and make the following change on "check_program" variable in "check_correctness" function, to ensure there is no duplicate function signature in generated codes.
check_program = (
#problem["prompt"] +
completion + "\n" +
problem["test"] + "\n" +
f"check({problem['entry_point']})"
)
Also, if you are new to HumanEval, you have to delete the extra "#" in "check_program", right before the line "exec(check_program, exec_globals)".
# WARNING
# This program exists to execute untrusted model-generated code. Although
# it is highly unlikely that model-generated code will do something overtly
# malicious in response to this test suite, model-generated code may act
# destructively due to a lack of model capability or alignment.
# Users are strongly encouraged to sandbox this evaluation suite so that it
# does not perform destructive actions on their host or network. For more
# information on how OpenAI sandboxes its code, see the accompanying paper.
# Once you have read this disclaimer and taken appropriate precautions,
# uncomment the following line and proceed at your own risk:
exec(check_program, exec_globals)
result.append("passed")
except TimeoutException:
result.append("timed out")
except BaseException as e:
result.append(f"failed: {e}")
Usage
Run the following commands to evaluate the Yuan2.0-M32 model's performance on HumanEval dataset. Before running the bash script, you shall change directory to main directory of 'Yuan2.0-M32', and you have to specify the the checkpoit path where you store the checkpoint in the bash script.
Evaluate MOE model on HumanEval dataset.
cd <Specify Path>/Yuan2.0-M32/
bash examples/eval_humaneval_2x32.sh
Results
The evaluation results will be gathered in samples.jsonl in $OUTPUT_PATH. After the generation of all the tasks done, the "evaluate_functional_correctness" function of HumanEval would automatically evaluate the results and return the accuracy.