Rethinking Repetition Problems of LLMs in Code Generation

July 7, 2025 · View on GitHub

Accepted as an oral presentation at ACL 2025 main conference (Acceptance Rate < 20.5%, Oral Rate < 2.94%).

Dataset Description

| File Name | Scenario | # Samples | Description |
| --- | --- | --- | --- |
| ArtificialSynthesis.jsonl | Artificial Synthesis | 512 | Each sample is a correct code snippet concatenated with its last repetition pattern, repeated 5 to 10 times |
| CodeGenerationBenchmarks.jsonl | Code Generation Benchmarks | 512 | Repetitive code selected from the repetitive generations of three LLMs on the HumanEval and MBPP benchmarks |
| Real-worldRepositories.jsonl | Real-world Repositories | 1024 | Partial code picked from real-world repositories |
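All three files are plain JSONL (one JSON object per line), so they can be loaded with the standard library alone. A minimal reader (the field names inside each record are whatever the released files define, and are not assumed here):

```python
import json

def load_jsonl(path):
    """Read a JSONL dataset file: one JSON object per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```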

Usage

python generate_code.py \
  --model_path ./LLMs/CodeLlama-7b-hf/ \
  --data_path ./datasets/ArtificialSynthesis.jsonl \
  --save_path ./results/output.jsonl

Optional Arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--model_path` | Path to the Hugging Face-compatible model | `./LLMs/CodeLlama-7b-hf/` |
| `--data_path` | Path to the JSONL file with input prompts | `./datasets/ArtificialSynthesis.jsonl` |
| `--save_path` | Path to save the generated outputs | `./result.jsonl` |
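These defaults suggest a conventional argparse interface; a sketch of how the documented flags might be declared (illustrative only — the repo's `generate_code.py` is authoritative):

```python
import argparse

def build_parser():
    """Declare the three documented flags with their listed defaults."""
    p = argparse.ArgumentParser(description="Generate code with an HF-compatible model")
    p.add_argument("--model_path", default="./LLMs/CodeLlama-7b-hf/",
                   help="Path to the Hugging Face-compatible model")
    p.add_argument("--data_path", default="./datasets/ArtificialSynthesis.jsonl",
                   help="Path to the JSONL file with input prompts")
    p.add_argument("--save_path", default="./result.jsonl",
                   help="Path to save the generated outputs")
    return p
```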

Metrics

Calculate TR-N/TR-S

python metric/rep.py
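`metric/rep.py` implements the paper's TR-N/TR-S definitions. As a rough, hypothetical illustration of the underlying idea (not the paper's exact formula), the fraction of repeated n-grams in a token sequence can be measured like this:

```python
from collections import Counter

def ngram_repetition_rate(tokens, n=4):
    """Fraction of n-grams that occur more than once -- a crude proxy
    for surface repetition (not the paper's exact TR-N definition)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A fully repetitive sequence scores 1.0, while a sequence with no repeated n-grams scores 0.0.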

Calculate CCP

python metric/compile.py
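Assuming CCP reflects the proportion of generated programs that compile (a plausible reading; `metric/compile.py` defines it precisely), a minimal Python-side check can use the builtin `compile`:

```python
def compiles(code: str) -> bool:
    """True if the snippet parses as Python (syntax check only; never executed)."""
    try:
        compile(code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def compile_pass_rate(snippets):
    """Share of snippets that compile."""
    return sum(compiles(s) for s in snippets) / len(snippets)
```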

Calculate PPL

python metric/ppl.py
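Perplexity is the exponential of the mean negative log-likelihood per token; given per-token log-probabilities from any model, the computation reduces to:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

For example, a model that assigns every token probability 0.5 has perplexity 2.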

Calculate EGP and Time

python "metric/eos&length&time.py"

Citation

@article{dong2025repetition,
  title={Rethinking Repetition Problems of LLMs in Code Generation},
  author={Dong, Yihong and Liu, Yuchen and Jiang, Xue and Jin, Zhi and Li, Ge},
  journal={arXiv preprint arXiv:2505.10402},
  year={2025}
}