README

March 19, 2025

Contents

  • bleu.py: computes the BLEU score;
  • locator_metric.py: computes the locator's performance metrics (EM, Acc, Prec, Recall, F1);
  • model.py: the masked language modelling model;
  • run.py: training, validation, and test script;
  • evaluate.sh: bash script for evaluation;
  • run.sh: bash script for training and validation;
  • model/: folder that stores the models and outputs for each language, available for download from Hugging Face;
    • model/{lang}/test_1.gold: ground truth
    • model/{lang}/test_1.output: prediction
    • model/{lang}/checkpoint-best-bleu/pytorch_model.bin: trained model
  • dataset/: folder that stores the datasets for each language, available for download from Hugging Face.
    • dataset/{lang}/train.jsonl: training dataset
    • dataset/{lang}/dev.jsonl: validation dataset
    • dataset/{lang}/test.jsonl: test dataset

Download

Download the models and datasets by running the download script:

bash download.sh

Usage

# Train
bash run.sh
# Evaluate
bash evaluate.sh

Quick start

See Quick_start.ipynb for a quick-start walkthrough.

Using as a baseline

  1. Prepare a custom_input.jsonl file as input, where each line is a JSON object in the following format:

    Note

    edit_labels and code_window must have the same length.

    {
        "edit_labels": list["replace" | "keep" | "add"]. Each element is an edit operation label,
        "code_window": list[str]. Each str element is a line of code,
        "prompt": str | None. The natural language description of the edit,
        "prior_edits": list[dict]. Each dict has the following format: 
            {
                "code_before": list[str]. Each str element is a line of code before editing,
                "code_after":  list[str]. Each str element is a line of code after editing
            }
    }
    
  2. Run the transform.py script to convert your custom dataset into the CoEdPilot prompt format. The three arguments give the ratio for splitting the dataset into the training, validation, and test sets.

    python transform.py 7 2 1
    
  3. You can now run training or evaluation on your customized dataset.
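
The input format from step 1 and the split ratio from step 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not code from this repository: validate_record, to_jsonl, and split_sizes are hypothetical helper names, the example record is made up, and the rounding in split_sizes is only one plausible reading of a 7 2 1 ratio (transform.py may shuffle or round differently).

```python
import json

VALID_LABELS = {"keep", "replace", "add"}


def validate_record(record):
    """Raise ValueError if a record does not match the format described in step 1."""
    labels = record["edit_labels"]
    window = record["code_window"]
    if len(labels) != len(window):
        raise ValueError("edit_labels and code_window must have the same length")
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown edit labels: {sorted(unknown)}")
    if record["prompt"] is not None and not isinstance(record["prompt"], str):
        raise ValueError("prompt must be a string or None")
    for edit in record["prior_edits"]:
        if not {"code_before", "code_after"} <= set(edit):
            raise ValueError("each prior edit needs code_before and code_after")


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)


def split_sizes(total, ratios):
    """Turn ratios such as (7, 2, 1) into train/dev/test sizes.

    Assumption: floor division, with the remainder going to the test set.
    """
    train = total * ratios[0] // sum(ratios)
    dev = total * ratios[1] // sum(ratios)
    return train, dev, total - train - dev


# Hypothetical example record, for illustration only.
record = {
    "edit_labels": ["keep", "replace", "keep"],
    "code_window": ["def area(w, h):", "    return w + h", ""],
    "prompt": "fix the area computation",
    "prior_edits": [
        {"code_before": ["total = w + h"], "code_after": ["total = w * h"]}
    ],
}

validate_record(record)
jsonl_text = to_jsonl([record])
```

Writing jsonl_text to custom_input.jsonl gives a file in the shape step 1 describes. Validating each record before running transform.py catches length mismatches between edit_labels and code_window early.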

Key results

Language     EM     Acc    Prec   Recall  F1
TypeScript   76.75  95.23  86.21  84.25   85.20
JavaScript   74.62  94.89  86.62  83.88   85.18
Java         78.00  95.37  87.99  85.99   86.96
Python       74.44  94.48  85.03  82.64   83.79
Go           80.18  95.79  88.99  87.32   88.14
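
For readers unfamiliar with these metric names, the sketch below shows one way scores of this kind can be computed from per-line edit labels. It is not the code in locator_metric.py, and the definitions are assumptions: EM is taken as the fraction of code windows whose whole label sequence is predicted exactly, Acc as per-line label accuracy, and Prec/Recall/F1 as binary scores where any non-"keep" label counts as positive. The repository may define or average these differently. locator_scores is a hypothetical name.

```python
def locator_scores(gold, pred):
    """Score predicted edit labels against gold labels.

    gold and pred are lists of label sequences, one sequence per code window.
    Assumption: a line is "positive" when its label is anything but "keep".
    """
    exact = sum(g == p for g, p in zip(gold, pred))
    # Flatten to (gold_label, predicted_label) pairs, one per line of code.
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    correct = sum(g == p for g, p in pairs)
    tp = sum(g != "keep" and p != "keep" for g, p in pairs)
    fp = sum(g == "keep" and p != "keep" for g, p in pairs)
    fn = sum(g != "keep" and p == "keep" for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "em": exact / len(gold),
        "acc": correct / len(pairs),
        "prec": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration; these are not results from the table.
gold = [["keep", "replace", "keep"], ["add", "keep"]]
pred = [["keep", "replace", "keep"], ["keep", "keep"]]
scores = locator_scores(gold, pred)
```

Under this binary definition, predicting "add" where the gold label is "replace" still counts as a true positive for Prec/Recall, while Acc and EM penalize it.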