HPC-Code-translation-and-generation
December 1, 2023 ยท View on GitHub
The main task of this github is to test the translation and generation performance of fortran HPC code using some existing open source projects
CodeXGLUE : https://github.com/microsoft/CodeXGLUE
ChatGPT : https://openai.com/blog/chatgpt/
The origianl Fortran HPC dataset can be downloaded from https://github.com/OMPI-X/epcc-mixedmode
Task1: Code to Code Translation
Fortran to C++ translation by using Fine-tuned CodeBERT model
The fine-tuned CodeBERT model is in https://drive.google.com/file/d/177B19VLstHLXQYAdjce29EDZ1AWEUTV-/view?usp=share_link
Test BLEU score
cd path/to/code-to-code-trans/code/
python run.py \
--do_test \
--model_type roberta \
--model_name_or_path roberta-base \
--config_name roberta-base \
--tokenizer_name roberta-base \
--load_model_path /path/to/fine-tuned model \
--dev_filename /path/to/valid.fortran2C.txt.f90,/path/to/valid.fortran2C.txt.C \
--test_filename /path/to/test.fortran2C.txt.f90,/path/to/test.fortran2C.txt.C \
--output_dir /path/to/your output file \
--max_source_length 512 \
--max_target_length 512 \
--beam_size 5 \
--eval_batch_size 16
Test CodeBLEU score
cd path/to/CodeBLEU/
python calc_code_bleu.py --refs /path/to/your output file/test_1.gold --hyp /path/to/your output file/test_1.output --lang c_sharp --params 0.25,0.25,0.25,0.25
Fortran to C++ translation by using ChatGPT
The question provided to chatGPT: Please help me to translate the following C code (The C code in our test dataset) to Fortran code. NOTE: ChatGPT may generate different answers each time. The answer I got is shown in /Code to Code Translation dataset/ChatGPT_test_answer.output
Test BLEU score and CodeBLEU score
python calc_code_bleu.py --refs /path/to/test_1.gold --hyp /path/to/ChatGPT_test_answer.output --lang c_sharp --params 0.25,0.25,0.25,0.25
Java to C# translation by using CodeBERT model
The data is from this paper: https://arxiv.org/abs/2102.04664
Task2: Code Generation Based on Text
Text to Java Code generation by using CodeGPT
The data is from this paper: https://arxiv.org/abs/2102.04664
Text to Fortran HPC Code generation by using fine-tuned CodeGPT model.
Fine tuned model
DATADIR=../dataset/Fortran
OUTPUTDIR=../save/Fortran
PRETRAINDIR=microsoft/CodeGPT-small-java-adaptedGPT2 # will download pre-trained CodeGPT model
LOGFILE=text2code_concode.log
PER_NODE_GPU=1
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=512 \
--do_train \
--node_index 0 \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=6 \
--per_gpu_eval_batch_size=12 \
--gradient_accumulation_steps=2 \
--num_train_epochs=30 \
--logging_steps=100 \
--save_steps=100 \
--overwrite_output_dir \
--seed=42
inference
DATADIR=../dataset/Fortran
OUTPUTDIR=../save/Fortran
PRETRAINDIR=../save/Fortran/checkpoint-last
LOGFILE=text2code_concode_infer.log
python -u run.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=512 \
--do_infer \
--logging_steps=100 \
--seed=42
NOTE: Our fortran HPC is not sufficient to support training this large model, Therefore the generated results are not ideal. If you want to test the text-code generation in a large Java dataset, Check the original project https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code. Although their result is also not ideal :).
Text to Fortran HPC Code generation by using ChatGPT.
Question provided to ChatGPT, take FT calculation as an example:
Please help me to write some Fortran HPC code to implements the time integration of a three-dimensional partial differential equation using the Fast Fourier Transform.
Please add OpenMP (Open Multi-Processing) directives into the code to make it run in parallel.
Please add MPI (Message Passing Interface) calls into the code to make it run in parallel in a cluster.
python calc_code_bleu.py --refs /path/to/ChatGPT_results/result.gold --hyp /path/to/ChatGPT_results/ChatGPT_result.output
Task3: Create our own model for the HPC code translation
Our paper is avaliable at http://arxiv.org/abs/2307.07686.
For the detail reproduce steps, please check this Colab: https://drive.google.com/file/d/1QqkGskaPPUKvjzwn_dmaV9z3yB9z2Vyu/view?usp=sharing
This folder contains training and testing dataset and a simple test script.
We collect data form three different source:
You can also download the dataset from : My Huggingface
Here is one data pair example:

We will add more data pairs in the future and will add a new "nature language" column for code generation task.
Reproduce our results
- Finetune the model by using deepspeed
deepspeed --master_port 12345 main.py \
--data_path Bin12345/HPC_Fortran_CPP \
--model_name_or_path path/to/starcoder_model \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--max_seq_len 128 \
--learning_rate 9.65e-6 \
--weight_decay 0.1 \
--num_train_epochs 3 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage $ZERO_STAGE \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
If you want to finetune the other models (for example OPT models), you just need to change the --model_name_or_path from path/to/starcoder_model to path/to/OPT_models.
- Use the finetuned model to generate the prompts. Change the
model = OPTForCausalLM.from_pretrained("facebook/opt-2.7b").to('cuda:2')
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
Inside the Simple_test_script.py to
model = OPTForCausalLM.from_pretrained("path/to/the/fintuned_model").to('cuda:2')
tokenizer = AutoTokenizer.from_pretrained("path/to/the/fintuned_model")
Then run:
python Simple_test_script.py
You can try our simple test scripts. And for different models, there might be slightly difference.
- Then test the CodeBlue Score
cd CodeBLUE
python calc_code_bleu.py --refs path/to/groundtruth.txt --hyp path/to/the_generated_answers/by_the_finetuned_model
Reference
@article{lu2021codexglue,
title={Codexglue: A machine learning benchmark dataset for code understanding and generation},
author={Lu, Shuai and Guo, Daya and Ren, Shuo and Huang, Junjie and Svyatkovskiy, Alexey and Blanco, Ambrosio and Clement, Colin and Drain, Dawn and Jiang, Daxin and Tang, Duyu and others},
journal={arXiv preprint arXiv:2102.04664},
year={2021}
}
@inproceedings{lei2023creating,
title={Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++},
author={Lei, Bin and Ding, Caiwen and Chen, Le and Lin, Pei-Hung and Liao, Chunhua},
booktitle={High Performance Extreme Computing Conference (HPEC)},
year={2023},
organization={IEEE}
}