RuSentNE-LLM-Benchmark

October 8, 2024

Update 01 November 2024: ⭐ Implemented a separate bulk-chain project for handling large numbers of prompts with CoT. This concept was used in these studies.

Update 06 September 2024: Related information about the project is mentioned on the BU-research-blog.

Update 11 August 2024: 🎤 Announcing the talk on this framework at NLPSummit 2024, with the preliminary announcement and details in the X/Twitter post 🐦.

Update 23 June 2024: All metrics in Development mode have been evaluated under the closest mode, which decides the result class by relying on the first entry of the label.

Update 11 June 2024: Added an evaluation mode that counts the first label entry. See the eval-mode parameter key.
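A minimal sketch of what deciding the class by the first label entry could look like. This is an illustration, not the evaluation code of this repository: the function name `first_entry_label` and the English label vocabulary are assumptions made for the example.

```python
import re

# Hypothetical label vocabulary; the label strings used by the benchmark may differ.
LABELS = ("positive", "negative", "neutral")


def first_entry_label(answer, labels=LABELS):
    """Return the label that occurs earliest in the answer, or None if no label occurs."""
    text = answer.lower()
    best_label, best_pos = None, len(text)
    for label in labels:
        # Whole-word match, so that "positive" is not found inside "nonpositive".
        match = re.search(r"\b" + re.escape(label) + r"\b", text)
        if match and match.start() < best_pos:
            best_label, best_pos = label, match.start()
    return best_label
```

With this rule, an answer such as "The attitude is negative rather than positive" resolves to the negative class, and an answer that mentions no label at all yields `None`, which can then be counted as unanswered.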

This repository assesses the reasoning capabilities of LLMs in Targeted Sentiment Analysis on the RuSentNE dataset, proposed as a part of the self-titled competition.

In particular, we use pre-trained LLMs for the following dataset splits:

🔓 Development
🔒 Final

We use quick-cot to experiment with the following reasoning modes:

Instruction Prompts
Chain-of-Thoughts (THoR)
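To illustrate how the two modes differ, here is a minimal sketch of a three-step chain in the spirit of THoR (aspect, then opinion, then polarity), where each step feeds the previous answer into the next prompt. The prompt wording is paraphrased for illustration and is not the set of templates used in this repository; `thor_style_chain` and the `ask` callable (any function that sends a prompt to an LLM and returns its text) are hypothetical names.

```python
from typing import Callable, List


def thor_style_chain(sentence: str, target: str, ask: Callable[[str], str]) -> List[str]:
    """Run three dependent prompts and return the answers of all three steps."""
    context = f'Given the sentence "{sentence}", '

    # Step 1: which aspect of the target is discussed.
    aspect = ask(context + f'which specific aspect of "{target}" is possibly mentioned?')

    # Step 2: the implied opinion, conditioned on the step 1 answer.
    opinion = ask(
        context + f"{aspect} Based on common sense, what is the implicit opinion "
        f'towards the mentioned aspect of "{target}", and why?'
    )

    # Step 3: the final polarity, conditioned on the step 2 answer.
    polarity = ask(
        context + f"{opinion} Based on such opinion, what is the attitude "
        f'towards "{target}": positive, negative, or neutral?'
    )
    return [aspect, opinion, polarity]
```

An instruction prompt, by contrast, corresponds to a single call of `ask` that requests the label directly.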

🔍 Accessing the results

All SQLite results are stored in the contents table.

Option 1. You may use sqlitebrowser to access the results and export them into CSV.

Option 2. Use the sqlite2csv.py script implemented in this repository.
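For readers who prefer to do the export by hand, the following is a minimal sketch using only the Python standard library. It is not the sqlite2csv.py script from this repository; the function name `export_contents_to_csv` is hypothetical, and the only thing taken from the text above is the table name contents. The column names are read from the database rather than assumed.

```python
import csv
import sqlite3
from contextlib import closing


def export_contents_to_csv(sqlite_path, csv_path, table="contents"):
    """Write all rows of `table` to a CSV file with a header row; return the row count."""
    with closing(sqlite3.connect(sqlite_path)) as conn:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        # Column names come from the query metadata, so no schema is assumed.
        header = [column[0] for column in cursor.description]
        rows = cursor.fetchall()

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Calling `export_contents_to_csv("answers.sqlite", "answers.csv")` (both paths are placeholders) would produce one CSV line per stored row.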

🔓 Development Results

This is an open-access dataset split (sentiment labels available) used for the development stage; anyone can use it for evaluation checks.

Dataset: valiation_data_labeled.csv

* denotes evaluation in first-entry mode (searching for the first label entry).

Model	lang	Mode	F1(P,N)	F1(P,N,0)	N/A %	Answers
GPT-3.5-0613	🇺🇸	CoT THoR	43.46	46.16	0.21	answers
GPT-3.5-1106	🇺🇸	CoT THoR	40.83	39.91	0.49	answers
mistral-7b	🇺🇸	CoT THoR	42.34	51.43	0.04	answers

Model	lang	Mode	F1(P,N)	F1(P,N,0)	N/A %	Answers
Proprietary
GPT-4-turbo-2024-04-09	🇺🇸	zero-shot	50.79	61.19	0.0	answers
GPT-3.5-0613	🇺🇸	zero-shot	47.1	57.76	0.0	answers
GPT-3.5-1106	🇺🇸	zero-shot	45.79	52.55	0.0	answers
mistral-large-latest	🇺🇸	zero-shot	44.48	57.24	0.0	answers
gpt-4o	🇺🇸	zero-shot	42.84	56.19	0.0	answers
Open & Less 100B
llama-3-70b-instruct	🇺🇸	zero-shot	49.79	61.24	0.0	answers
mixtral-8x22b	🇺🇸	zero-shot	46.09	58.24	0.0	answers
Phi-3-small-8k-instruct	🇺🇸	zero-shot	46.87	57.02	0.07	answers
mixtral-8x7b	🇺🇸	zero-shot	47.33	56.36	0.07	answers
llama-2-70b-chat	🇺🇸	zero-shot	42.42	54.25	13.44	answers
Open & Less 10B
Gemma-2-9b-it	🇺🇸	zero-shot	45.57	55.06	0.0	answers
llama-3-8b-instruct	🇺🇸	zero-shot	45.25	54.43	0.0	answers
Mistral-7B-Instruct-v0.3	🇺🇸	zero-shot	45.23	55.5	0.0	answers
Phi-3-mini-4k-instruct	🇺🇸	zero-shot	44.62	54.71	0.0	answers
Qwen1.5-7B-Chat	🇺🇸	zero-shot	44.39	55.55	0.04	answers
google_flan-t5-xl	🇺🇸	zero-shot	43.73	53.72	0.0	answers
mistral-7b	🇺🇸	zero-shot	43.11	53.64	0.11	answers
Qwen2-7B-Instruct	🇺🇸	zero-shot	39.74	48.11	3.87	answers
Qwen2-1.5B-Instruct	🇺🇸	zero-shot	33.88	48.59	0.0	answers
Qwen1.5-1.8B-Chat	🇺🇸	zero-shot	33.65	47.28	0.04	answers
Open & Less 1B
Flan-T5-large	🇺🇸	zero-shot	36.72	24.51	0.0	answers
Qwen2-0.5B-Instruct	🇺🇸	zero-shot	9.52	33.0	0.0	answers

Model	lang	Mode	F1(P,N)	F1(P,N,0)	N/A %	Answers
Proprietary
GPT-3.5-0613	🇷🇺	zero-shot	44.15	53.63	1.51	answers
gpt-4o	🇷🇺	zero-shot	44.15	57.5	0.0	answers
GPT-4-turbo-2024-04-09	🇷🇺	zero-shot	42.21	56.36	0.0	answers
GPT-3.5-1106	🇷🇺	zero-shot	41.34	46.83	0.46	answers
mistral-large-latest	🇷🇺	zero-shot	22.33	43.07	0.04	answers
Open & Less 100B
llama-3-70b-instruct	🇷🇺	zero-shot	45.89	58.73	0.0	answers
mixtral-8x22b	🇷🇺	zero-shot	42.64	54.91	0.0	answers
mixtral-8x7b	🇷🇺	zero-shot	41.11	53.75	0.18	answers
Phi-3-small-8k-instruct	🇷🇺	zero-shot	40.65	49.64	0.14	answers
llama-2-70b-chat	🇷🇺	zero-shot	29.51	27.27	1.65	answers
Open & Less 10B
Gemma-2-9b-it	🇷🇺	zero-shot	46.5	55.9	0.04	answers
Qwen2-7B-Instruct	🇷🇺	zero-shot	42.16	51.13	0.25	answers
mistral-7b	🇷🇺	zero-shot	42.14	47.57	0.18	answers
mistral-7B-Instruct-v0.3	🇷🇺	zero-shot	41.73	44.24	0.18	answers
llama-3-8b-instruct	🇷🇺	zero-shot	40.55	47.81	0.35	answers
Qwen1.5-7B-Chat	🇷🇺	zero-shot	34.1	45.05	0.25	answers
Phi-3-mini-4k-instruct	🇷🇺	zero-shot	33.79	24.33	0.04	answers
Qwen2-1.5B-Instruct	🇷🇺	zero-shot	20.5	33.57	0.35	answers
Qwen1.5-1.8B-Chat	🇷🇺	zero-shot	11.74	8.05	0.42	answers
Open & Less 1B
Qwen2-0.5B-Instruct	🇷🇺	zero-shot	11.76	18.12	0.25	answers
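As a reading aid for the table columns, the sketch below shows one way such scores can be computed. It assumes that F1(P,N) is the macro-averaged F1 over the positive and negative classes, that F1(P,N,0) additionally averages in the neutral class, and that N/A % is the share of answers from which no label could be extracted. These readings, the label strings, and the function names are assumptions of this example, not the official evaluation script.

```python
def f1_for_class(gold, pred, cls):
    """F1 score of a single class, treating every other value as 'not this class'."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 over `classes`, scaled to 0..100."""
    return 100.0 * sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def na_percent(pred):
    """Percentage of predictions with no extracted label (represented here as None)."""
    return 100.0 * sum(p is None for p in pred) / len(pred)


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "neutral", None]
f1_pn = macro_f1(gold, pred, ["positive", "negative"])
f1_pn0 = macro_f1(gold, pred, ["positive", "negative", "neutral"])
```

Under this reading, a model can score lower on F1(P,N,0) than on F1(P,N) when it handles the neutral class poorly, which matches rows where the second column is the smaller one.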

🔒 Final Results

This leaderboard and the obtained LLM answers are part of the experiments in the paper: Large Language Models in Targeted Sentiment Analysis in Russian.

Dataset: final_data.csv

Model	lang	Mode	F1(P,N)	F1(P,N,0)	N/A %	Answers
GPT-4-1106-preview	🇺🇸	CoT THoR	50.13	55.93	-	answers
GPT-3.5-0613	🇺🇸	CoT THoR	44.50	48.17	-	answers
GPT-3.5-1106	🇺🇸	CoT THoR	42.58	42.18	-	answers

GPT-4-1106-preview	🇺🇸	zero-shot (short)	54.59	64.32	-	answers
GPT-3.5-0613	🇺🇸	zero-shot (short)	51.79	61.38	-	answers
GPT-3.5-1106	🇺🇸	zero-shot (short)	47.04	53.19	-	answers

Mistral-7B-instruct-v0.1	🇺🇸	zero-shot	49.46	58.51	-	answers
Mistral-7B-instruct-v0.2	🇺🇸	zero-shot	44.82	56.04	-	answers
DeciLM	🇺🇸	zero-shot	43.85	53.65	1.44	answers
Microsoft-Phi-2	🇺🇸	zero-shot	40.95	42.77	3.13	answers
Gemma-7B-IT	🇺🇸	zero-shot	40.96	44.63	-	answers
Gemma-2B-IT	🇺🇸	zero-shot	31.75	45.96	2.62	answers
Flan-T5-xxl	🇺🇸	zero-shot	36.46	42.63	1.90	answers

Model	lang	Mode	F1(P,N)	F1(P,N,0)	N/A %	Answers
GPT-4-1106-preview	🇷🇺	zero-shot (short)	48.04	60.55	0.0	answers
GPT-3.5-0613	🇷🇺	zero-shot (short)	45.85	57.36	0.0	answers
GPT-3.5-1106	🇷🇺	zero-shot (short)	35.07	48.53	0.0	answers

Mistral-7B-Instruct-v0.2	🇷🇺	zero-shot	42.60	48.05	0.0	answers

References

If you find the results and findings in the Final Results section valuable 💎, feel free to cite the related work as follows:

@misc{rusnachenko2024large,
      title={Large Language Models in Targeted Sentiment Analysis}, 
      author={Nicolay Rusnachenko and Anton Golubev and Natalia Loukachevitch},
      year={2024},
      eprint={2404.12342},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}