SlimCode

October 16, 2023

This repo provides the code for reproducing the experiments in SlimCode. SlimCode is a program simplification method that takes the nature of the code into greater account.

Requirements

Quick Start

Prepare the dataset

We use the same dataset as CodeBERT and DietCode. However, we remove all comments from the code so that more code can be converted to an AST, and we then drop any code that still cannot be converted to an AST after the comments are removed. The original data can be downloaded from CodeBERT, and our preprocessed data can be downloaded from SlimCode. The final data volume is summarized as follows.

Dataset Volume	Code Search (train)	Code Search (valid)	Code Search (test)	Code2nl (train)	Code2nl (valid)	Code2nl (test)
CodeBERT/DietCode	908886	30655	26909	164923	5183	10955
SlimCode	904817	30437	26780	164813	5183	10948
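
The comment-removal step described above can be illustrated with a small Python sketch. This is not the preprocessing script used for the released data; it is a minimal scanner, written under the assumption that the input is Java 8 source (no text blocks), that drops `//` and `/* */` comments while leaving string and character literals untouched. The function name `strip_java_comments` is hypothetical.

```python
def strip_java_comments(source):
    """Remove // and /* */ comments from Java source (illustrative sketch)."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if ch in "\"'":
            # Copy a string or char literal verbatim, honouring backslash escapes.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif ch == "/" and nxt == "/":
            # Line comment: skip up to the newline, which is kept.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: replace with one space so neighbours do not merge.
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = 'int a = 1; // one\nString s = "// not a comment"; /* block */ int b;\n'
print(strip_java_comments(sample))
```

Whether a cleaned function can still be converted to an AST has to be checked separately with a Java parser; this sketch only covers the comment-stripping part.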

Process the dataset

Random process

We remove tokens from the code at random, following the code from DietCode. Our modified code can be found here. We use it to remove 10%-50% of the tokens from a given code snippet.
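
As a rough illustration of random removal, the sketch below drops a fixed fraction of tokens chosen uniformly at random while preserving the order of the survivors. It is not the modified DietCode script; whitespace tokenization, the function name `remove_random_tokens`, and the `seed` parameter are assumptions made for the example.

```python
import random


def remove_random_tokens(tokens, ratio, seed=None):
    """Drop int(len(tokens) * ratio) tokens chosen uniformly at random.

    The surviving tokens keep their original order.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    n_remove = int(len(tokens) * ratio)
    removed = set(rng.sample(range(len(tokens)), n_remove))
    return [tok for i, tok in enumerate(tokens) if i not in removed]


code = "public int add ( int a , int b ) { return a + b ; }"
tokens = code.split()  # assumption: whitespace tokenization
for ratio in (0.1, 0.2, 0.3, 0.4, 0.5):
    kept = remove_random_tokens(tokens, ratio, seed=0)
    print(f"{int(ratio * 100)}% removed: {' '.join(kept)}")
```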

Category process

We divide the tokens in the code into three levels: lexical, syntactic, and semantic. The lexical level includes symbol tokens and identifiers. The syntactic level includes structure tokens, signature tokens, and invocation tokens. The semantic level includes the tokens in the PDG. For the first two levels, we recognize identifiers, structure tokens, signature tokens, and invocation tokens from the AST. We use JavaParser to convert the code into an AST and then remove the tokens from the code independently, based on the AST. The relationship between our categories and the node types is shown in the table.

level	category	node type
lexical	symbol tokens	None
lexical	identifier	NameExpr node, VariableDeclarationExpr
syntactic	structure tokens	TryStmt node, IfStmt node, SwitchStmt node, WhileStmt node, DoStmt node, ForStmt node, ForeachStmt node
syntactic	signature tokens	MethodDeclaration node
syntactic	invocation tokens	MethodCallExpr node
semantic	PDG tokens	javaDependencyGraph
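
For readers who prefer code to tables, the AST-based rows of the table above can be written as a lookup from node type to (level, category). The node-type names are copied from the table; the dictionary layout and the helper `categorize` are illustrative assumptions and do not come from the repository.

```python
# Node-type names are taken from the table; everything else is a sketch.
CATEGORY_BY_NODE_TYPE = {
    "NameExpr": ("lexical", "identifier"),
    "VariableDeclarationExpr": ("lexical", "identifier"),
    "TryStmt": ("syntactic", "structure tokens"),
    "IfStmt": ("syntactic", "structure tokens"),
    "SwitchStmt": ("syntactic", "structure tokens"),
    "WhileStmt": ("syntactic", "structure tokens"),
    "DoStmt": ("syntactic", "structure tokens"),
    "ForStmt": ("syntactic", "structure tokens"),
    "ForeachStmt": ("syntactic", "structure tokens"),
    "MethodDeclaration": ("syntactic", "signature tokens"),
    "MethodCallExpr": ("syntactic", "invocation tokens"),
}


def categorize(node_type):
    """Return (level, category) for a node type name, or None if unlisted."""
    return CATEGORY_BY_NODE_TYPE.get(node_type)


print(categorize("IfStmt"))
print(categorize("MethodCallExpr"))
```

Symbol tokens and PDG tokens are absent from the lookup because, per the table, they are not identified by an AST node type.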

For the last level, we modified javaDependencyGraph to generate PDGs for the large number of functions in our dataset. Our modified code can be found here. Because we remove tokens at this level line by line, every line of a function in the data must end with "\n" so that the code can be removed by PDG one line at a time. We therefore provide a preprocessed dataset for PDG, which can be found here.
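
The need for "\n" line endings can be seen in a small sketch of line-wise removal. The function `remove_lines` and the chosen line index are hypothetical; in the actual pipeline the lines to remove would come from the PDG produced by the modified javaDependencyGraph.

```python
def remove_lines(function_source, lines_to_remove):
    """Delete the given 0-based line indices from a function's source."""
    lines = function_source.splitlines(keepends=True)
    return "".join(
        line for i, line in enumerate(lines) if i not in lines_to_remove
    )


with_newlines = (
    "int f(int a) {\n"
    "    int b = a + 1;\n"
    "    log(a);\n"
    "    return b;\n"
    "}\n"
)
# Assumption for illustration: line 2 was selected for removal.
print(remove_lines(with_newlines, {2}))

# Without line breaks the whole function is a single line, so no
# individual statement can be addressed by a line number.
flattened = "int f(int a) { int b = a + 1; log(a); return b; }"
print(len(flattened.splitlines()))
```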

DietCode process

We modified the code of DietCode to process the dataset at different removal percentages. Our modified code can be found here. After the dataset is processed by DietCode, we feed it to CodeBERT and CodeT5 for code search and code2nl.

SlimCode process

Based on the results of category removal, we propose SlimCode. Its core idea is to first remove the tokens that, according to our results, have less impact on downstream tasks. Our removal order is: symbol tokens > tokens outside our categories > non-identifier tokens in structures > non-identifier tokens in invocations > identifiers outside structures and invocations > identifiers in invocations > identifiers in structures > signature tokens. As before, we derive the removal order from the AST and then remove tokens from the code at different removal percentages. Our code can be found here. You need JDK 8 on your machine; you can then use the following command to compile the code.
./jdk1.8.0_341/bin/javac -classpath ./javaparser-core-3.6.5.jar -d bin SlimCode.java RemoveAll.java SpanContent.java -Xlint:unchecked
You can then use the following command to run the code.
./jdk1.8.0_341/bin/java -classpath ./javaparser-core-3.6.5.jar:bin/ SlimCode
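
The removal order described in this section can be sketched in Python as a priority-based filter. This is not a port of SlimCode.java: the label names, the hand-assigned labels in the example, and the tie-breaking rule (earlier tokens of the same class go first) are all assumptions made for illustration.

```python
# Classes listed from first-removed to last-removed, following the order
# given in the text. The label strings themselves are hypothetical.
REMOVAL_ORDER = [
    "symbol",
    "other",
    "structure_non_identifier",
    "invocation_non_identifier",
    "identifier_elsewhere",
    "invocation_identifier",
    "structure_identifier",
    "signature",
]
PRIORITY = {label: rank for rank, label in enumerate(REMOVAL_ORDER)}


def slim(labeled_tokens, ratio):
    """Remove int(len * ratio) tokens, lowest-priority class first.

    labeled_tokens is a list of (text, label) pairs. Ties inside a class
    are broken by position, which is an assumption of this sketch.
    """
    n_remove = int(len(labeled_tokens) * ratio)
    order = sorted(
        range(len(labeled_tokens)),
        key=lambda i: (PRIORITY[labeled_tokens[i][1]], i),
    )
    removed = set(order[:n_remove])
    return [text for i, (text, _) in enumerate(labeled_tokens) if i not in removed]


# Labels assigned by hand for the example, not produced by an AST.
labeled = [
    ("int", "signature"), ("add", "signature"), ("(", "symbol"),
    ("int", "signature"), ("a", "signature"), (",", "symbol"),
    ("int", "signature"), ("b", "signature"), (")", "symbol"),
    ("{", "symbol"), ("return", "other"), ("a", "identifier_elsewhere"),
    ("+", "symbol"), ("b", "identifier_elsewhere"), (";", "symbol"),
    ("}", "symbol"),
]
print(" ".join(slim(labeled, 0.5)))
```

With a 50% budget, the sketch removes all seven symbol tokens and then the one uncategorized token, leaving the signature and identifier tokens in place.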

Finetune

After processing the dataset, you can feed the data into CodeBERT and CodeT5 for code search and code summarization.

Code Search

CodeBERT

The code for code search with CodeBERT can be found here. It is from CodeBERT, and we did not modify it.
training:

python run_classifier.py --model_type roberta --task_name codesearch --do_train --do_eval --train_file train_no_comment.txt --dev_file valid_no_comment.txt --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir ../data/train_valid/base/ --output_dir ./codebert/base/  --model_name_or_path microsoft/codebert-base

evaluating:

python run_classifier.py --model_type roberta --model_name_or_path microsoft/codebert-base --task_name codesearch --do_predict --output_dir ./codebert/base/ --data_dir ../data/test/base/ --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --test_file batch_0.txt --pred_model_dir ./codebert/base/ --test_result_dir ./results/codebert/base/0_batch_result.txt

CodeT5

The code for code search with CodeT5 can be found here. It is originally from DietCode. We modified it to perform code search without removing tokens from the code.
training:

python run_classifier.py --model_type codet5 --task_name codesearch --do_train --do_eval --train_file train_no_comment.txt --dev_file valid_no_comment.txt --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir ../data/train_valid/base/ --output_dir ./codet5/base/ --model_name_or_path Salesforce/codet5-base --tokenizer_name Salesforce/codet5-base

evaluating:

python run_classifier.py --model_type codet5 --model_name_or_path Salesforce/codet5-base --task_name codesearch --do_predict --output_dir ./codet5/base/ --data_dir ../data/test/base_t5/ --max_seq_length 200 --per_gpu_train_batch_size 320 --per_gpu_eval_batch_size 320 --learning_rate 1e-5 --num_train_epochs 4 --test_file batch_0.txt --pred_model_dir ./codet5/base/checkpoint-best/ --test_result_dir ./results/codet5/base/0_batch_result.txt --tokenizer_name Salesforce/codet5-base

Code2nl

CodeBERT

The code for code2nl with CodeBERT can be found here. It is originally from CodeBERT. We modified it to train for a fixed number of epochs and to evaluate only at the end of every epoch, so that running times can be compared.
training:

python run_codebert.py --do_train --do_eval --model_type roberta --model_name_or_path microsoft/codebert-base --train_filename ./data/base/train.txt --dev_filename ../data/base/valid.txt --output_dir ./codebert/base --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 64 --eval_batch_size 64 --learning_rate 5e-5

evaluating:

python run_codebert_three.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path codebert/base/checkpoint-best-bleu/pytorch_model.bin  --test_filename ./data/base/test.txt --output_dir codebert/base --max_source_length 256 --max_target_length 128 --beam_size 10 --eval_batch_size 64

CodeT5

The code for code2nl with CodeT5 can be found here. It is originally from CodeT5. We modified it to train for a fixed number of epochs without early stopping.
training and evaluating:

python run_exp.py --model_tag codet5_base --task summarize --sub_task java