README
March 19, 2025
Content
- `bleu.py`: calculates the BLEU score
- `generator_metric.py`: calculates the generator's performance (BLEU and exact match)
- `model.py`: Seq2Seq model and beam search
- `run.py`: training, validation, and test script
- `evaluate.sh`: bash script for evaluation
- `run.sh`: bash script for training and validation
- `model/`: folder that stores models and outputs by language, available for download from Hugging Face
  - `model/{lang}/test_0_pred_gold.json`: for each sample, 10 prediction candidates and 1 ground truth
  - `model/{lang}/checkpoint-best-bleu/pytorch_model.bin`: trained model
- `dataset/`: folder that stores datasets by language, available for download from Hugging Face
  - `dataset/{lang}/train.jsonl`: training dataset
  - `dataset/{lang}/dev.jsonl`: validation dataset
  - `dataset/{lang}/test.jsonl`: test dataset
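The dataset files use the JSON Lines layout, with one JSON object per line. As a minimal sketch, a file of this kind could be loaded as shown here. The helper name `read_jsonl` and the demo file are hypothetical and not part of this repository, and no assumption is made about which fields each record contains.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return the list of JSON objects stored one per line in `path`."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Demonstration on a throwaway file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    demo_records = read_jsonl(demo_path)

print(len(demo_records))  # 2
```

After downloading the data, the same helper can be pointed at a path of the form `dataset/{lang}/train.jsonl`, with `{lang}` replaced by the language folder you want to inspect.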
Download
Download the model and dataset by running the download script:

    bash download.sh
Quick start
Please refer to Quick_start.ipynb for quick start.
As a baseline
- Prepare a `custon_input.jsonl` file as input. Each element has the following format:

  ⚠️ `edit_labels` and `code_window` should have the same length.

      {
          "edit_labels": list["replace" | "keep" | "add"]. Each element is an edit operation label,
          "code_window": list[str]. Each str element is a line of code,
          "after_edit": list[str]. If the code window contains a code block to replace, after_edit is the code that replaces that block. If the code window requires an add action, after_edit is the code to insert,
          "prompt": str | None. The natural language description of the edit,
          "prior_edits": list[dict]. Each dict has the following format:
              {
                  "code_before": list[str]. Each str element is a line of code before editing,
                  "code_after": list[str]. Each str element is a line of code after editing
              }
      }

- Run the script `transform.py` to transform your customized dataset into the CoEdPilot prompt format. The three arguments give the ratio for splitting your dataset into the training, validation, and test sets:

      python transform.py 7 2 1

- You can now train (`run.sh`) or evaluate (`evaluate.sh`) on your customized dataset.
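The input format and the split ratio above can be illustrated with a short sketch. It builds one record, checks the two constraints stated above (equal lengths and known labels), and shows how a `7 2 1` ratio could divide ten items. The helpers `validate_record` and `split_by_ratio` and the sample code lines are hypothetical. How `transform.py` actually shuffles, rounds, or validates is not described here, so treat this only as an illustration under those assumptions.

```python
import json

VALID_LABELS = {"replace", "keep", "add"}


def validate_record(record):
    """Raise ValueError if the record violates the constraints stated above."""
    labels = record["edit_labels"]
    window = record["code_window"]
    if len(labels) != len(window):
        raise ValueError(
            f"edit_labels has {len(labels)} items but code_window has {len(window)}"
        )
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown edit labels: {sorted(unknown)}")


def split_by_ratio(items, ratios):
    """Split items, in order, into three parts proportional to ratios."""
    total = sum(ratios)
    n = len(items)
    train_end = n * ratios[0] // total
    dev_end = n * (ratios[0] + ratios[1]) // total
    return items[:train_end], items[train_end:dev_end], items[dev_end:]


# One illustrative record; the code contents are made up for this example.
example = {
    "edit_labels": ["keep", "replace", "keep"],
    "code_window": ["def area(r):", "    return 3.14 * r * r", ""],
    "after_edit": ["    return math.pi * r * r"],
    "prompt": "Use math.pi instead of a hard-coded constant",
    "prior_edits": [
        {"code_before": ["import os"], "code_after": ["import os", "import math"]}
    ],
}

validate_record(example)
line = json.dumps(example)  # one JSON object per line in a .jsonl file

train, dev, test = split_by_ratio(list(range(10)), (7, 2, 1))
print(len(train), len(dev), len(test))  # 7 2 1
```

Running a check like this before `transform.py` catches length mismatches between `edit_labels` and `code_window` early, rather than during training.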
Usage
    # Train
    bash run.sh
    # Evaluate
    bash evaluate.sh
Key results
| Language | EM@1 | EM@3 | EM@5 | EM@10 | BLEU@1 | BLEU@3 | BLEU@5 | BLEU@10 |
|---|---|---|---|---|---|---|---|---|
| TypeScript | 41.58 | 46.86 | 48.75 | 50.65 | 61.75 | 70.31 | 71.99 | 73.68 |
| JavaScript | 41.83 | 47.50 | 49.31 | 50.99 | 60.70 | 69.71 | 71.37 | 73.02 |
| Java | 40.69 | 46.87 | 48.78 | 50.51 | 60.54 | 68.35 | 70.11 | 71.73 |
| Python | 33.48 | 38.52 | 40.41 | 42.09 | 57.59 | 65.65 | 67.47 | 69.11 |
| Go | 48.94 | 55.09 | 57.18 | 59.16 | 65.37 | 71.96 | 73.47 | 74.98 |
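One way to read the EM@k columns is sketched below, assuming that a sample counts as correct when any of its top k candidates matches the ground truth exactly. This definition, the whitespace normalization, and the function names are assumptions made for illustration. The computation in `generator_metric.py` may differ.

```python
def normalize(code):
    """Collapse runs of whitespace so formatting differences do not matter."""
    return " ".join(code.split())


def exact_match_at_k(candidates, gold, k):
    """Return True if any of the first k candidates equals the gold text."""
    target = normalize(gold)
    return any(normalize(c) == target for c in candidates[:k])


def em_at_k(samples, k):
    """Percentage of samples with an exact match among their top k candidates."""
    hits = sum(exact_match_at_k(cands, gold, k) for cands, gold in samples)
    return 100.0 * hits / len(samples)


# Toy data: (ranked candidates, ground truth) pairs, made up for this example.
samples = [
    (["a = 1", "a = 2"], "a = 2"),
    (["b()", "c()"], "d()"),
]

print(em_at_k(samples, 1))  # 0.0
print(em_at_k(samples, 2))  # 50.0
```

Under this reading the score cannot decrease as k grows, which is consistent with the values in the table rising from EM@1 to EM@10 for every language.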