FaultyPremise
October 13, 2025
Updates
[2025/08] We released the code for this project.
Introduction
With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code generation hallucinations rises significantly, exposing deficiencies in their self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multidimensional evaluation metrics, it conducts in-depth assessments of 15 representative LLMs.
Contribution
- We are the first to propose a comprehensive benchmark specifically designed to assess the self-scrutiny capabilities of LLMs when confronted with faulty premises in code generation tasks.
- We have developed innovative data construction methods, including those based on importance score analysis, random erasure, and the introduction of irrelevant information perturbations. These approaches enable us to systematically construct and expand a test set targeting faulty premises (comprising 1,800 problems in total) from existing code datasets.
- We have designed a unique set of evaluation dimensions, including "proactive error identification rate", "passive error identification rate", and "self-scrutiny overhead ratio". These metrics aim to comprehensively quantify the model's ability to identify, process, and respond to faulty premises, as well as its resource consumption.
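This README names the three metrics but does not give their formulas, so the following is only a rough sketch of how such scores could be aggregated. It assumes that the two identification rates are the fraction of faulty-premise problems where the model flags the flaw (unprompted versus when explicitly asked to check), and that the overhead ratio compares response length on faulty and original prompts. The `Record` fields and the `summarize` helper are hypothetical names, not part of this repository; refer to the paper for the actual definitions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Record:
    """One evaluated problem. All field names are illustrative assumptions."""
    flagged_unprompted: bool   # model pointed out the faulty premise on its own
    flagged_when_asked: bool   # model pointed it out after being asked to check
    tokens_faulty: int         # response length on the faulty-premise prompt
    tokens_clean: int          # response length on the original prompt


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def summarize(records: Sequence[Record]) -> Dict[str, float]:
    """Aggregate the three scores under the assumptions stated above."""
    total_clean = sum(r.tokens_clean for r in records)
    total_faulty = sum(r.tokens_faulty for r in records)
    return {
        "proactive_rate": rate([r.flagged_unprompted for r in records]),
        "passive_rate": rate([r.flagged_when_asked for r in records]),
        # Assumed definition: total tokens spent on faulty prompts
        # divided by total tokens spent on the original prompts.
        "overhead_ratio": total_faulty / total_clean if total_clean else float("nan"),
    }
```

With these assumed definitions, an overhead ratio above 1 would mean the model writes longer responses when the premise is faulty.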
Data Construction
We randomly collected 600 problems from two datasets, HumanEval and MBPP+, and rebuilt them into FPBench according to the three types of faulty premises we define. Each type is designed to evaluate a different aspect of the model's ability to recognize and reason about flawed input. Constructing 600 base problems for each error type yields a total of 1,800 unique base problems. This structured, scalable design enables rigorous evaluation of how self-scrutiny capabilities are influenced by error type and task complexity.
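To make two of the construction ideas concrete, here is a minimal sketch of random erasure and irrelevant-information perturbation applied to a problem description. It is an illustration under simplifying assumptions (word-level erasure, sentence-level insertion, naive splitting on ". "), not the code used to build FPBench; the function names `random_erasure` and `add_irrelevant_info` are hypothetical. The importance-score-based method is not sketched because it depends on details not given in this README.

```python
import random


def random_erasure(prompt: str, n_erase: int, rng: random.Random) -> str:
    """Drop n_erase randomly chosen words from the prompt (assumed granularity)."""
    words = prompt.split()
    n_erase = max(0, min(n_erase, len(words)))
    drop = set(rng.sample(range(len(words)), n_erase))
    return " ".join(w for i, w in enumerate(words) if i not in drop)


def add_irrelevant_info(prompt: str, distractor: str, rng: random.Random) -> str:
    """Insert one unrelated sentence at a random sentence boundary."""
    # Naive sentence split; real prompts (docstrings, examples) need more care.
    sentences = [s.strip().rstrip(".") for s in prompt.split(". ") if s.strip()]
    position = rng.randint(0, len(sentences))  # inclusive on both ends
    sentences.insert(position, distractor.strip().rstrip("."))
    return ". ".join(sentences) + "."


rng = random.Random(0)  # fixed seed so the perturbed set is reproducible
base = "Return the sum of a list. Ignore empty lists."
erased = random_erasure(base, 2, rng)
perturbed = add_irrelevant_info(base, "The weather was sunny that day.", rng)
```

Passing an explicit `random.Random` instance keeps each perturbation reproducible without touching the global random state.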
Results
Run Code
Inference
Run the following command to get the LLM's responses.
python data_synthesis/inference.py --model_name <model_name>
Evaluation
Run the following command to get o3's evaluation of the corresponding responses.
python evaluation/evaluate.py --model_folder <model_responses> --model_name <model_name>
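If you want to chain the two steps for one model, a small helper along these lines can assemble both command lines. It uses only the script paths and flags shown above; the helper name `build_commands` and the example values are placeholders, and the folder passed to `--model_folder` must be wherever your inference run actually wrote its responses.

```python
from pathlib import Path
from typing import List, Tuple


def build_commands(model_name: str, responses_dir: str) -> Tuple[List[str], List[str]]:
    """Return (inference_cmd, evaluation_cmd) as argument lists."""
    # Forward-slash paths are accepted by Python on Windows, macOS, and Linux.
    inference_script = (Path("data_synthesis") / "inference.py").as_posix()
    evaluation_script = (Path("evaluation") / "evaluate.py").as_posix()
    inference_cmd = ["python", inference_script, "--model_name", model_name]
    evaluation_cmd = [
        "python", evaluation_script,
        "--model_folder", responses_dir,
        "--model_name", model_name,
    ]
    return inference_cmd, evaluation_cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, running inference first and evaluation second.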
Citation
@article{li2025refining,
title={Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework},
author={Li, Jialin and Li, Jinzhe and Li, Gengxu and Chang, Yi and Wu, Yuan},
journal={arXiv preprint arXiv:2508.03622},
year={2025}
}
Please cite our paper if you find our research and code useful.

