LLMs Sensitivity & Consistency (NAACL 2025)

April 14, 2025

This is the official repository of the NAACL 2025 paper "What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering".

⚠️ The LLMetric Python package, integrated with Langfuse, is available in the langfuse_integration branch.

Citing our work

If you find our metrics useful, please cite our work:

@inproceedings{errica_what_2025,
  author    = {Federico Errica and
               Giuseppe Siracusano and
               Davide Sanvito and
               Roberto Bifulco},
  title     = {What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering},
  booktitle = {Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},
  year      = {2025},
}

Installation

After cloning the repository, create a dedicated virtual environment for this project:

python3.10 -m venv .venv/llm
source .venv/llm/bin/activate
pip install -r requirements.txt

Next, update the constants.py file with the URLs of your Llama 3 and Mixtral servers. See utils.py for how this information is used in the classes DefaultNLELlama3_70bChatOpenAI and DefaultNLEMixtral_8x7bChatOpenAI.
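As a rough illustration only, the server entries could look something like the sketch below. The names LLAMA3_URL and MIXTRAL_URL, the placeholder addresses, and the check_server_url helper are all hypothetical and are not taken from this repository; check constants.py and utils.py for the names the code actually expects.

```python
from urllib.parse import urlparse

# Hypothetical constant names and placeholder addresses (assumptions, not the
# repository's actual values): replace them with your own server endpoints.
LLAMA3_URL = "http://localhost:8000/v1"
MIXTRAL_URL = "http://localhost:8001/v1"


def check_server_url(url: str) -> str:
    """Return url unchanged if it has an http(s) scheme and a host, else raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid server URL: {url!r}")
    return url


if __name__ == "__main__":
    # Quick sanity check before launching the notebooks.
    for name, url in (("Llama 3", LLAMA3_URL), ("Mixtral", MIXTRAL_URL)):
        print(f"{name}: {check_server_url(url)}")
```

A check like this only catches malformed addresses, such as a missing scheme. It does not verify that a server is reachable or that it serves the expected model.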

Run the Notebooks

You should now be able to run our notebooks to reproduce our results. There is one notebook for each dataset tested.