README.md

April 29, 2025

File organization

Download all the data and the fine-tuned model from the following links:

Step 1: Knowledge guided prompt learning.

Input: Training data.
Output: Knowledge guided prompt template.
Method: SimCSE, DeepSeek-7b.
Source code: bert-base-cased2 and simcse (both in Pretrained_LMs), parser, and myTokenizers.py, all listed in the Code section.
Step 1.1. The text prompt recasts the request text as a generative task.
- Method: SimCSE.
Step 1.2. The code prompt refactors the code snippets in the text into a data flow graph.
- Source code: parser.
Step 1.3. Query the fine-tuned DeepSeek model and use its output as the knowledge guidance.
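
To make Step 1.2 more concrete, the sketch below extracts simple definition-to-use edges from a Python snippet with the standard `ast` module. It is only an illustration of what a data flow graph captures and is not the repository's parser: the function name `data_flow_edges` is hypothetical, it handles only top-level straight-line statements, and it ignores control flow and augmented assignments.

```python
import ast
from typing import Dict, List, Tuple


def data_flow_edges(source: str) -> List[Tuple[str, int, int]]:
    """Return (variable, definition_line, use_line) edges.

    Hypothetical helper for illustration: only top-level statements are
    considered, and each use is linked to the most recent earlier assignment.
    """
    tree = ast.parse(source)
    last_def: Dict[str, int] = {}  # variable name -> line of latest assignment
    edges: List[Tuple[str, int, int]] = []
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads in this statement see definitions made by earlier statements.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((n.id, last_def[n.id], n.lineno))
        # Record writes only after the reads of the same statement are linked.
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = n.lineno
    return edges


snippet = "a = 1\nb = a + 2\nc = a * b\n"
for var, def_line, use_line in data_flow_edges(snippet):
    print(f"{var}: line {def_line} -> line {use_line}")
```

Each edge records that a value defined on one line flows into a later line, which is the kind of structure a code prompt can expose to the model in place of raw source text.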

Step 2: Training.

Step 3: Answer engineering.
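
In prompt learning, answer engineering usually means mapping the words a masked language model predicts at the mask position onto task labels. The sketch below shows that general idea only. The function name `predict_label`, the label names, the answer words, and the scores are all hypothetical, and the mapping used in answer_engineering.py may differ.

```python
from typing import Dict, List


def predict_label(mask_scores: Dict[str, float],
                  verbalizer: Dict[str, List[str]]) -> str:
    """Pick the label whose answer words have the highest mean score.

    mask_scores: score of each candidate word at the mask position.
    verbalizer:  label -> answer words that stand for that label.
    Both structures are assumptions made for this illustration.
    """
    label_scores = {}
    for label, words in verbalizer.items():
        if not words:
            continue  # skip labels without any answer words
        total = sum(mask_scores.get(word, 0.0) for word in words)
        label_scores[label] = total / len(words)
    return max(label_scores, key=label_scores.get)


# Hypothetical scores and verbalizer, for illustration only.
scores = {"yes": 0.6, "true": 0.2, "no": 0.1, "false": 0.1}
verbalizer = {"positive": ["yes", "true"], "negative": ["no", "false"]}
print(predict_label(scores, verbalizer))
```

Averaging over several answer words per label is one common choice; taking the maximum or a weighted sum are alternatives.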

Pretrained_LMs comprises two pretrained models, BERT and SimCSE.
parser comprises several parsers that convert source code written in various programming languages into a PDG.
wandb is a visualization tool designed for machine learning that can be used to track, visualize, and share experimental results.
answer_engineering.py for getting the final labels.
codelanguage.py for identifying programming languages.
config.py for configuring the parameters of the training and evaluation process.
main.py as the master script for our method.
model.py for implementing the masked language model.
myDataset.py for building the dataset.
myTokenizers.py for preprocessing.
test.py for testing.
test_main.py for the complete training and evaluation process.
train.py for training.
utils.py for setting random seeds.
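
For readers unfamiliar with seed setting, the sketch below shows what a helper of this kind typically looks like. It is not taken from utils.py: the function name `set_seed`, the default seed, and the use of NumPy and PyTorch are assumptions, and both libraries are treated as optional so the snippet runs without them.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np  # optional; assumed, not confirmed by the README
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional; assumed, not confirmed by the README
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(123)
first = random.random()
set_seed(123)
second = random.random()
print(first == second)  # the same seed reproduces the same draw
```

Calling such a helper once at the start of training and evaluation makes runs easier to compare, although full determinism on GPU can require further settings.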