eval\_humaneval

May 28, 2024 · View on GitHub

Dataset

datasets/HUMANEVAL/HumanEval.jsonl.gz The original English version of the HumanEval dataset containing 164 questions.

datasets/HUMANEVAL/HumanEval-textprompts.jsonl The Chinese version of the HumanEval dataset obtained by translation with the aid of gpt-4 model.

datasets/HUMANEVAL/HumanEval-instructions.jsonl The HumanEval dataset in instruction style.

datasets/HUMANEVAL/HumanEval-instructions-fewshot.jsonl The HumanEval dataset in instruction style with few-shot prompt.

Evaluation

Introduction

examples/eval_humaneval.sh. The evaluation results for HumanEval on Yuan2.0-M32 model could be obtained by running this program.

Before running the evaluation program, you have to specify the following checkpoint_path in bash script by yourself.

Variable name	Description
`CHECKPOINT_PATH`	the path that saves the checkpoint to be evaluated.

Requirement

Make sure HumanEval program is installed befere running the HumanEval evaluation on Yuan2.0-M32 checkpoint.

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval

After HumanEval program is installed, we shall go to this script,

/usr/local/lib/python3.10/dist-packages/human_eval-1.0-py3.10.egg/human_eval/execution.py

and make the following change on "check_program" variable in "check_correctness" function, to ensure there is no duplicate function signature in generated codes.

check_program = (
    #problem["prompt"] +
    completion + "\n" +
    problem["test"] + "\n" +
    f"check({problem['entry_point']})"
)

Also, if you are new to HumanEval, you have to delete the extra "#" in "check_program", right before the line "exec(check_program, exec_globals)".

# WARNING
# This program exists to execute untrusted model-generated code. Although
# it is highly unlikely that model-generated code will do something overtly
# malicious in response to this test suite, model-generated code may act
# destructively due to a lack of model capability or alignment.
# Users are strongly encouraged to sandbox this evaluation suite so that it
# does not perform destructive actions on their host or network. For more
# information on how OpenAI sandboxes its code, see the accompanying paper.
# Once you have read this disclaimer and taken appropriate precautions,
# uncomment the following line and proceed at your own risk:
                         exec(check_program, exec_globals)
                result.append("passed")
            except TimeoutException:
                result.append("timed out")
            except BaseException as e:
                result.append(f"failed: {e}")

Run the following commands to evaluate the Yuan2.0-M32 model's performance on HumanEval dataset. Before running the bash script, you shall change directory to main directory of 'Yuan2.0-M32', and you have to specify the the checkpoit path where you store the checkpoint in the bash script.

Evaluate MOE model on HumanEval dataset.

cd <Specify Path>/Yuan2.0-M32/
bash examples/eval_humaneval_2x32.sh

Results

The evaluation results will be gathered in samples.jsonl in $OUTPUT_PATH. After the generation of all the tasks done, the "evaluate_functional_correctness" function of HumanEval would automatically evaluate the results and return the accuracy.

Dataset

Evaluation

Introduction

Requirement

Usage

Results