README.md
April 29, 2025 ยท View on GitHub
File organization
Download all the data and the fine-tuned model from the following links:
https://drive.google.com/file/d/1h9vl_gb5HXp68S3bkIGs71Zne3M4DiFR/view?usp=sharing
https://drive.google.com/file/d/1Q738E4NhUuyrnlWNhnxt-XpAYGt67UBZ/view?usp=sharing
Experiment replication steps
Step 1: Knowledge guided prompt learning.
- Input: Training data.
- Output: Knowledge guided prompt template.
- Method: SimCSE, DeepSeek-7b.
- Source code: bert-base-cased2 and simcse in Pretrained_LMs, parser, myTokenizers.py listed in the Code section.
- Step 1.1. Text prompt reconstructs the request text into a generative task.
- Method: SimCSE.
- Step 1.2. Code prompt refactors the code snippets in text into the form of data flow graph.
- Source code: parser.
- Step 1.3. Get the output from fine-tuned DeepSeek to get the knowledge guidance.
Step 2: Training.
- Input: The knowledge guided prompt template (output of Step 1).
- Output: Predicted word list.
- Method: Mask Language Model (MLM).
- Source code: model.py, train.py listed in the Code section.
Step 3: Answer engineering.
- Input: Predicted word list.
- Output: Final labels.
- Source code: answer_engineering.py listed in the Code section.
Code
- Pretrained_LMs comprises two pretrained models BERT and SimCSE.
- parser comprises several parsers for converting source code written in programming languages into PDG.
- wandb is a visualization tool designed for machine learning that can be used to track, visualize, and share experimental results.
- answer_engineering.py for getting the final labels.
- codelanguage.py for identifing programming languages.
- config.py for configuring the parameters used to set up and run the training and evaluation process of the machine learning model.
- main.py as a master script for our method.
- model.py for implementing Masked Language Model.
- myDataset.py for building the dataset.
- myTokenizers.py for preprocessing.
- test.py for testing.
- test_main.py for the complete training and evaluation process.
- train.py for training.
- utils.py for setting random seeds.
Experiment environment
- GPU: NVIDIA Titan XP*4
- CPU: IntelR XeonR E5-2650 v4 @2.20GHz
- Memory Capacity: 64G*4
- CUDA Version: 11.0
- Python Version: 3.5
- Pytorch Version: 1.7