RuSentNE-LLM-Benchmark
October 8, 2024
Update November 01 2024: ⭐ Implemented a separate bulk-chain project for handling massive amounts of prompts with CoT. This concept was used in these studies.
Update 06 September 2024: Related information about the project is mentioned at BU-research-blog.
Update 11 August 2024: 🎤 Announcing the talk on this framework @ NLPSummit 2024 with the preliminary ad and details in X/Twitter post 🐦.
Update 23 June 2024: All metrics in Development mode have been evaluated under `closest` mode, which decides the result class by relying on the first entry of the label.
Update 11 June 2024: Added an evaluation mode that counts the first label entry. See the `eval-mode` parameter key.
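The first-entry idea mentioned in the updates above can be illustrated with a minimal sketch. It is not the repository's implementation: the function name `first_label_entry` and the English label vocabulary are assumptions made for illustration only.

```python
# Assumed label vocabulary; the labels actually used by the benchmark may differ.
LABELS = ("positive", "negative", "neutral")


def first_label_entry(response, labels=LABELS):
    """Return the label that occurs earliest in `response`, or None if none occurs."""
    text = response.lower()
    best_label, best_pos = None, len(text)
    for label in labels:
        pos = text.find(label)
        # Keep the label whose first occurrence is closest to the start of the text.
        if pos != -1 and pos < best_pos:
            best_label, best_pos = label, pos
    return best_label
```

Under these assumptions, a verbose response that names several labels is resolved to the one mentioned first, and a response naming no label yields `None`, which could then be counted as N/A.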
This repository assesses the reasoning capabilities of LLMs in Targeted Sentiment Analysis on the RuSentNE dataset, proposed as part of the self-titled competition.
In particular, we use pre-trained LLMs for the following dataset splits:
- 🔓 Development
- 🔒 Final
We use [quick-cot] to experiment with the following reasoning approaches:
- Instruction Prompts
- Chain-of-Thoughts (THoR)
🔍 Accessing the results
All the SQLite results are stored in the `contents` table.
Option 1. You may use sqlitebrowser to access the results and export them into CSV.
Option 2. Use the sqlite2csv.py script implemented in this repository.
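For readers who prefer to script the export themselves, the following is a minimal sketch using only the Python standard library. It is not the repository's sqlite2csv.py. The function name `export_table_to_csv` and the file paths are hypothetical, and the only detail taken from this README is the `contents` table name.

```python
import csv
import sqlite3


def export_table_to_csv(db_path, csv_path, table="contents"):
    """Dump every row of `table` from a SQLite file into a CSV file.

    Returns the number of exported data rows (the header is not counted).
    """
    connection = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so the name is quoted manually.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = connection.execute(f"SELECT * FROM {quoted}")
        header = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        connection.close()

    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

A call such as `export_table_to_csv("answers.sqlite", "answers.csv")` would write one CSV row per stored answer. Both file names here are placeholders.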
🔓 Development Results
This is an open-access dataset split (sentiment labels available) used for the development stage; anyone can use it for evaluation checks.
Dataset: valiation_data_labeled.csv
* -- denotes evaluation in first-entry mode (seeking for the first entry).
| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| GPT-3.5-0613 | 🇺🇸 | CoT THoR | 43.46 | 46.16 | 0.21 | answers |
| GPT-3.5-1106 | 🇺🇸 | CoT THoR | 40.83 | 39.91 | 0.49 | answers |
| mistral-7b | 🇺🇸 | CoT THoR | 42.34 | 51.43 | 0.04 | answers |
| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| Proprietary | ||||||
| GPT-4-turbo-2024-04-09 | 🇺🇸 | zero-shot | 50.79 | 61.19 | 0.0 | answers |
| GPT-3.5-0613 | 🇺🇸 | zero-shot | 47.1 | 57.76 | 0.0 | answers |
| GPT-3.5-1106 | 🇺🇸 | zero-shot | 45.79 | 52.55 | 0.0 | answers |
| mistral-large-latest | 🇺🇸 | zero-shot | 44.48 | 57.24 | 0.0 | answers |
| gpt-4o | 🇺🇸 | zero-shot | 42.84 | 56.19 | 0.0 | answers |
| Open & Less 100B | ||||||
| llama-3-70b-instruct | 🇺🇸 | zero-shot | 49.79 | 61.24 | 0.0 | answers |
| mixtral-8x22b | 🇺🇸 | zero-shot | 46.09 | 58.24 | 0.0 | answers |
| Phi-3-small-8k-instruct | 🇺🇸 | zero-shot | 46.87 | 57.02 | 0.07 | answers |
| mixtral-8x7b | 🇺🇸 | zero-shot | 47.33 | 56.36 | 0.07 | answers |
| llama-2-70b-chat | 🇺🇸 | zero-shot | 42.42 | 54.25 | 13.44 | answers |
| Open & Less 10B | ||||||
| Gemma-2-9b-it | 🇺🇸 | zero-shot | 45.57 | 55.06 | 0.0 | answers |
| llama-3-8b-instruct | 🇺🇸 | zero-shot | 45.25 | 54.43 | 0.0 | answers |
| Mistral-7B-Instruct-v0.3 | 🇺🇸 | zero-shot | 45.23 | 55.5 | 0.0 | answers |
| Phi-3-mini-4k-instruct | 🇺🇸 | zero-shot | 44.62 | 54.71 | 0.0 | answers |
| Qwen1.5-7B-Chat | 🇺🇸 | zero-shot | 44.39 | 55.55 | 0.04 | answers |
| google_flan-t5-xl | 🇺🇸 | zero-shot | 43.73 | 53.72 | 0.0 | answers |
| mistral-7b | 🇺🇸 | zero-shot | 43.11 | 53.64 | 0.11 | answers |
| Qwen2-7B-Instruct | 🇺🇸 | zero-shot | 39.74 | 48.11 | 3.87 | answers |
| Qwen2-1.5B-Instruct | 🇺🇸 | zero-shot | 33.88 | 48.59 | 0.0 | answers |
| Qwen1.5-1.8B-Chat | 🇺🇸 | zero-shot | 33.65 | 47.28 | 0.04 | answers |
| Open & Less 1B | ||||||
| Flan-T5-large | 🇺🇸 | zero-shot | 36.72 | 24.51 | 0.0 | answers |
| Qwen2-0.5B-Instruct | 🇺🇸 | zero-shot | 9.52 | 33.0 | 0.0 | answers |
| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| Proprietary | ||||||
| GPT-3.5-0613 | 🇷🇺 | zero-shot | 44.15 | 53.63 | 1.51 | answers |
| gpt-4o | 🇷🇺 | zero-shot | 44.15 | 57.5 | 0.0 | answers |
| GPT-4-turbo-2024-04-09 | 🇷🇺 | zero-shot | 42.21 | 56.36 | 0.0 | answers |
| GPT-3.5-1106 | 🇷🇺 | zero-shot | 41.34 | 46.83 | 0.46 | answers |
| mistral-large-latest | 🇷🇺 | zero-shot | 22.33 | 43.07 | 0.04 | answers |
| Open & Less 100B | ||||||
| llama-3-70b-instruct | 🇷🇺 | zero-shot | 45.89 | 58.73 | 0.0 | answers |
| mixtral-8x22b | 🇷🇺 | zero-shot | 42.64 | 54.91 | 0.0 | answers |
| mixtral-8x7b | 🇷🇺 | zero-shot | 41.11 | 53.75 | 0.18 | answers |
| Phi-3-small-8k-instruct | 🇷🇺 | zero-shot | 40.65 | 49.64 | 0.14 | answers |
| llama-2-70b-chat | 🇷🇺 | zero-shot | 29.51 | 27.27 | 1.65 | answers |
| Open & Less 10B | ||||||
| Gemma-2-9b-it | 🇷🇺 | zero-shot | 46.5 | 55.9 | 0.04 | answers |
| Qwen2-7B-Instruct | 🇷🇺 | zero-shot | 42.16 | 51.13 | 0.25 | answers |
| mistral-7b | 🇷🇺 | zero-shot | 42.14 | 47.57 | 0.18 | answers |
| mistral-7B-Instruct-v0.3 | 🇷🇺 | zero-shot | 41.73 | 44.24 | 0.18 | answers |
| llama-3-8b-instruct | 🇷🇺 | zero-shot | 40.55 | 47.81 | 0.35 | answers |
| Qwen1.5-7B-Chat | 🇷🇺 | zero-shot | 34.1 | 45.05 | 0.25 | answers |
| Phi-3-mini-4k-instruct | 🇷🇺 | zero-shot | 33.79 | 24.33 | 0.04 | answers |
| Qwen2-1.5B-Instruct | 🇷🇺 | zero-shot | 20.5 | 33.57 | 0.35 | answers |
| Qwen1.5-1.8B-Chat | 🇷🇺 | zero-shot | 11.74 | 8.05 | 0.42 | answers |
| Open & Less 1B | ||||||
| Qwen2-0.5B-Instruct | 🇷🇺 | zero-shot | 11.76 | 18.12 | 0.25 | answers |
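The metric columns in the tables above can be read with the help of a small sketch. It assumes that F1(P,N) is the macro-averaged F1 over the positive and negative classes, that F1(P,N,0) additionally averages in the neutral class, and that N/A % is the share of responses from which no label could be extracted. These readings, the label strings, and all function names are assumptions. The official evaluation script may differ, for example in how N/A answers are treated.

```python
def f1_for_class(gold, predicted, label):
    """F1 score of a single class, computed from paired gold/predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1 over `labels`, scaled to 0..100."""
    scores = [f1_for_class(gold, predicted, label) for label in labels]
    return 100.0 * sum(scores) / len(scores)


def na_percent(predicted):
    """Share of predictions that carry no label (None), scaled to 0..100."""
    return 100.0 * sum(1 for p in predicted if p is None) / len(predicted)


# Toy data for illustration only; these are not benchmark results.
gold = ["positive", "negative", "neutral", "neutral"]
predicted = ["positive", "neutral", "neutral", None]

f1_pn = macro_f1(gold, predicted, ("positive", "negative"))
f1_pn0 = macro_f1(gold, predicted, ("positive", "negative", "neutral"))
na = na_percent(predicted)
```

In this sketch an unanswered item (`None`) counts as a miss for its gold class, which is one possible convention rather than a confirmed one.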
🔒 Final Results
This leaderboard and the obtained LLM answers are part of the experiments in the paper: Large Language Models in Targeted Sentiment Analysis in Russian.
Dataset: final_data.csv
| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| GPT-4-1106-preview | 🇺🇸 | CoT THoR | 50.13 | 55.93 | - | answers |
| GPT-3.5-0613 | 🇺🇸 | CoT THoR | 44.50 | 48.17 | - | answers |
| GPT-3.5-1106 | 🇺🇸 | CoT THoR | 42.58 | 42.18 | - | answers |
| GPT-4-1106-preview | 🇺🇸 | zero-shot (short) | 54.59 | 64.32 | - | answers |
| GPT-3.5-0613 | 🇺🇸 | zero-shot (short) | 51.79 | 61.38 | - | answers |
| GPT-3.5-1106 | 🇺🇸 | zero-shot (short) | 47.04 | 53.19 | - | answers |
| Mistral-7B-instruct-v0.1 | 🇺🇸 | zero-shot | 49.46 | 58.51 | - | answers |
| Mistral-7B-instruct-v0.2 | 🇺🇸 | zero-shot | 44.82 | 56.04 | - | answers |
| DeciLM | 🇺🇸 | zero-shot | 43.85 | 53.65 | 1.44 | answers |
| Microsoft-Phi-2 | 🇺🇸 | zero-shot | 40.95 | 42.77 | 3.13 | answers |
| Gemma-7B-IT | 🇺🇸 | zero-shot | 40.96 | 44.63 | - | answers |
| Gemma-2B-IT | 🇺🇸 | zero-shot | 31.75 | 45.96 | 2.62 | answers |
| Flan-T5-xxl | 🇺🇸 | zero-shot | 36.46 | 42.63 | 1.90 | answers |
| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| GPT-4-1106-preview | 🇷🇺 | zero-shot (short) | 48.04 | 60.55 | 0.0 | answers |
| GPT-3.5-0613 | 🇷🇺 | zero-shot (short) | 45.85 | 57.36 | 0.0 | answers |
| GPT-3.5-1106 | 🇷🇺 | zero-shot (short) | 35.07 | 48.53 | 0.0 | answers |
| Mistral-7B-Instruct-v0.2 | 🇷🇺 | zero-shot | 42.60 | 48.05 | 0.0 | answers |
References
If you find the results and findings in Final Results section valuable 💎, feel free to cite the related work as follows:
@misc{rusnachenko2024large,
  title={Large Language Models in Targeted Sentiment Analysis},
  author={Nicolay Rusnachenko and Anton Golubev and Natalia Loukachevitch},
  year={2024},
  eprint={2404.12342},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}