SlimCode
October 16, 2023
This repo provides the code for reproducing the experiments in SlimCode. SlimCode is a program simplification method that takes the nature of the code into greater account.
Requirements
Quick Start
Prepare the dataset
We use the same dataset as CodeBERT and DietCode. We first remove all comments so that more of the code can be converted to an AST, and then discard any code that still cannot be converted to an AST after the comments are removed. The original data can be downloaded from CodeBERT, and our preprocessed data can be downloaded from SlimCode. The final data volume is summarized as follows.
| Dataset Volume | Code Search train | Code Search valid | Code Search test | Code2nl train | Code2nl valid | Code2nl test |
|---|---|---|---|---|---|---|
| CodeBERT/DietCode | 908886 | 30655 | 26909 | 164923 | 5183 | 10955 |
| SlimCode | 904817 | 30437 | 26780 | 164813 | 5183 | 10948 |
Process the dataset
Random process
We remove tokens from the code at random, using code adapted from DietCode. Our modified code can be found here. We use it to remove 10%-50% of the tokens from a given code snippet.
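As a rough illustration of random removal, the sketch below drops a fixed fraction of tokens from a snippet. It is not the repo's modified DietCode script: the function name `remove_random_tokens`, the whitespace tokenization, and the `seed` parameter are all assumptions made for this example.

```python
import random


def remove_random_tokens(code, ratio, seed=None):
    """Drop roughly `ratio` of the whitespace-separated tokens, keeping order.

    Hypothetical helper: the real pipeline may tokenize differently.
    """
    tokens = code.split()
    n_remove = int(len(tokens) * ratio)  # round down to a whole token count
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(tokens)), n_remove))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in drop)


snippet = "public int add ( int a , int b ) { return a + b ; }"
# Sweep the same 10%-50% range of removal ratios mentioned above.
for ratio in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(ratio, remove_random_tokens(snippet, ratio, seed=0))
```

Fixing the seed makes a run repeatable, which helps when comparing removal ratios on the same data.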
Category process
We divide the tokens in the code into 3 levels: lexical, syntactic and semantic. The lexical level includes symbol tokens and identifiers. The syntactic level includes structure tokens, signature tokens and invocation tokens. The semantic level includes the tokens in the PDG. For the first two levels, we recognize identifiers, structure tokens, signature tokens and invocation tokens from the AST. We use JavaParser to convert the code into an AST and then remove the tokens of each category from the code independently. The relationship between our categories and the AST node types is shown in the table.
| level | category | node type |
|---|---|---|
| lexical | symbol tokens | None |
| lexical | identifier | NameExpr node, VariableDeclarationExpr |
| syntactic | structure tokens | TryStmt node, IfStmt node, SwitchStmt node, WhileStmt node, DoStmt node, ForStmt node, ForeachStmt node |
| syntactic | signature tokens | MethodDeclaration node |
| syntactic | invocation tokens | MethodCallExpr node |
| semantic | PDG tokens | javaDependencyGraph |
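The AST-based rows of the table can be read as a lookup from node type to (level, category). The sketch below only illustrates that mapping and is not the repo's Java implementation; the names `CATEGORY_BY_NODE_TYPE` and `categorize` are assumptions. Symbol tokens and PDG tokens are left out because the table gives no AST node type for them.

```python
# Node type names are copied from the table; everything else is illustrative.
CATEGORY_BY_NODE_TYPE = {
    "NameExpr": ("lexical", "identifier"),
    "VariableDeclarationExpr": ("lexical", "identifier"),
    "TryStmt": ("syntactic", "structure tokens"),
    "IfStmt": ("syntactic", "structure tokens"),
    "SwitchStmt": ("syntactic", "structure tokens"),
    "WhileStmt": ("syntactic", "structure tokens"),
    "DoStmt": ("syntactic", "structure tokens"),
    "ForStmt": ("syntactic", "structure tokens"),
    "ForeachStmt": ("syntactic", "structure tokens"),
    "MethodDeclaration": ("syntactic", "signature tokens"),
    "MethodCallExpr": ("syntactic", "invocation tokens"),
}


def categorize(node_type):
    """Return (level, category) for an AST node type, or None if unmapped."""
    return CATEGORY_BY_NODE_TYPE.get(node_type)


print(categorize("IfStmt"))
print(categorize("MethodCallExpr"))
```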
For the last level, we modified javaDependencyGraph to generate PDGs for the large number of functions in our dataset. Our modified code can be found here. Because we remove code line by line at this level, each line of a function in the data must end with "\n" so that lines can be removed according to the PDG. We therefore provide a preprocessed dataset for PDG, which can be found here.
DietCode process
We modified the DietCode code to process the dataset at different removal percentages. Our modified code can be found here. After the dataset is processed by DietCode, we feed it to CodeBERT and CodeT5 for code search and code2nl.
SlimCode process
Based on the results of category removal, we propose SlimCode. Its core idea is to prioritize removing the tokens that, according to our results, have less impact on downstream tasks. Our removal order is: symbol tokens > tokens outside our categories > non-identifier tokens in structure > non-identifier tokens in invocation > identifiers not in structure or invocation > identifiers in invocation > identifiers in structure > signature tokens. As before, we derive the removal order from the AST and then remove tokens from the code at different removal percentages. Our code can be found here. You need JDK 8 on your computer; you can then use the following command to compile the code.
./jdk1.8.0_341/bin/javac -classpath ./javaparser-core-3.6.5.jar -d bin SlimCode.java RemoveAll.java SpanContent.java -Xlint:unchecked
You can then use the following command to run the code.
./jdk1.8.0_341/bin/java -classpath ./javaparser-core-3.6.5.jar:bin/ SlimCode
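The removal order described above can be sketched as a priority-based filter. This is a simplified illustration, not the logic of SlimCode.java. The category labels, the function name `slim`, the hand-assigned labels in the example, and the left-to-right tie-breaking within a category are all assumptions.

```python
# Categories listed from removed-first to removed-last, following the README's
# removal order. The label strings themselves are hypothetical.
REMOVAL_ORDER = [
    "symbol",
    "other",
    "structure_non_identifier",
    "invocation_non_identifier",
    "identifier_other",
    "identifier_invocation",
    "identifier_structure",
    "signature",
]
RANK = {category: i for i, category in enumerate(REMOVAL_ORDER)}


def slim(tokens, ratio):
    """Remove `ratio` of the (text, category) tokens, lowest priority first."""
    budget = int(len(tokens) * ratio)
    # sorted() is stable, so ties within a category keep left-to-right order
    # (an assumption; the source does not say how ties are broken).
    order = sorted(range(len(tokens)), key=lambda i: RANK[tokens[i][1]])
    drop = set(order[:budget])
    return [text for i, (text, _) in enumerate(tokens) if i not in drop]


# Labels below are assigned by hand for illustration only.
tokens = [
    ("int", "signature"), ("add", "signature"), ("(", "symbol"),
    ("int", "signature"), ("a", "signature"), (")", "symbol"),
    ("{", "symbol"), ("return", "other"), ("a", "identifier_other"),
    (";", "symbol"), ("}", "symbol"),
]
print(slim(tokens, 0.5))  # ['int', 'add', 'int', 'a', 'return', 'a']
```

With 11 tokens and a 50% ratio the budget is 5 tokens, which is exactly the five symbol tokens, so the signature and the body identifiers survive.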
Finetune
After processing the dataset, you can feed the data into CodeBERT and CodeT5 for code search and code summarization.
Code Search
CodeBERT
The code for code search with CodeBERT can be found here. It is from CodeBERT, and we did not modify it.
training:
python run_classifier.py --model_type roberta --task_name codesearch --do_train --do_eval --train_file train_no_comment.txt --dev_file valid_no_comment.txt --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir ../data/train_valid/base/ --output_dir ./codebert/base/ --model_name_or_path microsoft/codebert-base
evaluating:
python run_classifier.py --model_type roberta --model_name_or_path microsoft/codebert-base --task_name codesearch --do_predict --output_dir ./codebert/base/ --data_dir ../data/test/base/ --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --test_file batch_0.txt --pred_model_dir ./codebert/base/ --test_result_dir ./results/codebert/base/0_batch_result.txt
CodeT5
The code for code search with CodeT5 can be found here. It is originally from DietCode; we modified it to perform code search without removing tokens from the code.
training:
python run_classifier.py --model_type codet5 --task_name codesearch --do_train --do_eval --train_file train_no_comment.txt --dev_file valid_no_comment.txt --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir ../data/train_valid/base/ --output_dir ./codet5/base/ --model_name_or_path Salesforce/codet5-base --tokenizer_name Salesforce/codet5-base
evaluating:
python run_classifier.py --model_type codet5 --model_name_or_path Salesforce/codet5-base --task_name codesearch --do_predict --output_dir ./codet5/base/ --data_dir ../data/test/base_t5/ --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --test_file batch_0.txt --pred_model_dir ./codet5/base/checkpoint-best/ --test_result_dir ./results/codet5/base/0_batch_result.txt --tokenizer_name Salesforce/codet5-base
Code2nl
CodeBERT
The code for code2nl with CodeBERT can be found here. It is originally from CodeBERT. We modified it to train for a fixed number of epochs and to evaluate only at the end of each epoch, so that training times can be compared.
training:
python run_codebert.py --do_train --do_eval --model_type roberta --model_name_or_path microsoft/codebert-base --train_filename ./data/base/train.txt --dev_filename ../data/base/valid.txt --output_dir ./codebert/base --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 64 --eval_batch_size 64 --learning_rate 5e-5
evaluating:
python run_codebert_three.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path codebert/base/checkpoint-best-bleu/pytorch_model.bin --test_filename ./data/base/test.txt --output_dir codebert/base --max_source_length 256 --max_target_length 128 --beam_size 10 --eval_batch_size 64
CodeT5
The code for code2nl with CodeT5 can be found here. It is originally from CodeT5. We modified it to train for a fixed number of epochs without early stopping.
training and evaluating:
python run_exp.py --model_tag codet5_base --task summarize --sub_task java