RuSentNE-LLM-Benchmark • [twitter](https://twitter.com/nicolayr_/status/1781330684289658933)

October 8, 2024

Update November 01 2024: ⭐ Implemented a separate bulk-chain project for handling a massive number of prompts with CoT. This concept was used in these studies.

Update 06 September 2024: Related information about the project is mentioned on the BU-research-blog.

Update 11 August 2024: 🎤 Announcing a talk on this framework at NLPSummit 2024; a preliminary announcement and details are in the X/Twitter post 🐦.

Update 23 June 2024: All metrics in Development mode have been evaluated in closest mode, which decides the result class by relying on the first entry of the label.

Update 11 June 2024: Added an evaluation mode that counts the first label entry. See the eval-mode parameter key.
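
As an illustration of the first-entry idea mentioned in the updates above, the following minimal sketch picks the label that occurs earliest in a model response. It is not the repository's implementation: the function name `first_label_entry` and the English label strings are assumptions made for this example only.

```python
import re

# Hypothetical label vocabulary; the strings used by the benchmark may differ.
LABELS = ("positive", "negative", "neutral")


def first_label_entry(response, labels=LABELS):
    """Return the label that occurs earliest in the response, or None if absent."""
    text = response.lower()
    best_pos, best_label = None, None
    for label in labels:
        # Match the label as a whole word only.
        match = re.search(r"\b" + re.escape(label) + r"\b", text)
        if match and (best_pos is None or match.start() < best_pos):
            best_pos, best_label = match.start(), label
    return best_label


if __name__ == "__main__":
    print(first_label_entry("The attitude is negative, although some see it as positive."))
```

A response that contains no known label yields `None`, which an evaluation script could then count as an N/A answer.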

This repository assesses the reasoning capabilities of LLMs in Targeted Sentiment Analysis on the RuSentNE dataset, proposed as part of the self-titled competition.

In particular, we use pre-trained LLMs for the following dataset splits:

  1. 🔓 Development
  2. 🔒 Final

We use [quick-cot] to experiment with the following reasoning techniques:

  • Instruction Prompts
  • Chain-of-Thoughts (THoR)
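
To make the Chain-of-Thought option more concrete, here is a rough sketch of a three-step THoR-style chain (aspect, then opinion, then polarity), where each step feeds the previous answers into the next prompt. The prompt wording, the `thor_chain` function, and the `ask` callable are hypothetical and are not taken from this repository or from quick-cot.

```python
def thor_chain(ask, sentence, target):
    """Run a three-step THoR-style chain.

    `ask` is any callable that maps a prompt string to a response string,
    for example a thin wrapper around an LLM client (assumed, not provided here).
    """
    context = f'Given the sentence "{sentence}", '
    # Step 1: which aspect of the target is discussed.
    aspect = ask(context + f"which specific aspect of {target} is possibly mentioned?")
    # Step 2: the implicit opinion towards that aspect.
    opinion = ask(
        context + f"{aspect} Based on common sense, what is the implicit opinion "
        f"towards the mentioned aspect of {target}, and why?"
    )
    # Step 3: the final sentiment polarity towards the target.
    polarity = ask(
        context + f"{aspect} {opinion} Based on the opinion, what is the sentiment "
        f"polarity towards {target}: positive, negative or neutral?"
    )
    return {"aspect": aspect, "opinion": opinion, "polarity": polarity}


if __name__ == "__main__":
    # A stub that echoes a fixed reply instead of calling a real model.
    print(thor_chain(lambda prompt: "stub reply", "X criticized Y.", "Y"))
```

An instruction prompt, by contrast, would ask for the polarity in a single request without the intermediate steps.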

🔍 Accessing the results

All the SQLite results are stored in the contents table.

Option 1. You may use sqlitebrowser to access the results and export them into CSV.

Option 2. Use the sqlite2csv.py script implemented in this repository.
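
For readers who prefer to script the export themselves, the sketch below dumps the contents table into a CSV file using only the Python standard library. It is an independent illustration, not the code of sqlite2csv.py; the function name and the file paths are placeholders, and the column names are read from the database rather than assumed.

```python
import csv
import sqlite3
from contextlib import closing


def export_contents_to_csv(sqlite_path, csv_path, table="contents"):
    """Dump every row of `table` into a CSV file with a header row; return the row count."""
    with closing(sqlite3.connect(sqlite_path)) as conn:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        # Column names come from the query metadata, so no schema is hard-coded.
        header = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    import sys

    # Usage: python export_sketch.py results.sqlite results.csv (placeholder file names)
    print(export_contents_to_csv(sys.argv[1], sys.argv[2]))
```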

🔓 Development Results

This is an open-access dataset split (sentiment labels are available) used for the development stage; anyone can use it for evaluation checks.

Dataset: valiation_data_labeled.csv

* denotes evaluation in first-entry mode (searching for the first label entry).

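| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| GPT-3.5-0613 | 🇺🇸 | CoT THoR | 43.46 | 46.16 | 0.21 | answers |
| GPT-3.5-1106 | 🇺🇸 | CoT THoR | 40.83 | 39.91 | 0.49 | answers |
| mistral-7b | 🇺🇸 | CoT THoR | 42.34 | 51.43 | 0.04 | answers |

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| Proprietary | | | | | | |
| GPT-4-turbo-2024-04-09 | 🇺🇸 | zero-shot | 50.79 | 61.19 | 0.0 | answers |
| GPT-3.5-0613 | 🇺🇸 | zero-shot | 47.1 | 57.76 | 0.0 | answers |
| GPT-3.5-1106 | 🇺🇸 | zero-shot | 45.79 | 52.55 | 0.0 | answers |
| mistral-large-latest | 🇺🇸 | zero-shot | 44.48 | 57.24 | 0.0 | answers |
| gpt-4o | 🇺🇸 | zero-shot | 42.84 | 56.19 | 0.0 | answers |
| Open & Less 100B | | | | | | |
| llama-3-70b-instruct | 🇺🇸 | zero-shot | 49.79 | 61.24 | 0.0 | answers |
| mixtral-8x22b | 🇺🇸 | zero-shot | 46.09 | 58.24 | 0.0 | answers |
| Phi-3-small-8k-instruct | 🇺🇸 | zero-shot | 46.87 | 57.02 | 0.07 | answers |
| mixtral-8x7b | 🇺🇸 | zero-shot | 47.33 | 56.36 | 0.07 | answers |
| llama-2-70b-chat | 🇺🇸 | zero-shot | 42.42 | 54.25 | 13.44 | answers |
| Open & Less 10B | | | | | | |
| Gemma-2-9b-it | 🇺🇸 | zero-shot | 45.57 | 55.06 | 0.0 | answers |
| llama-3-8b-instruct | 🇺🇸 | zero-shot | 45.25 | 54.43 | 0.0 | answers |
| Mistral-7B-Instruct-v0.3 | 🇺🇸 | zero-shot | 45.23 | 55.5 | 0.0 | answers |
| Phi-3-mini-4k-instruct | 🇺🇸 | zero-shot | 44.62 | 54.71 | 0.0 | answers |
| Qwen1.5-7B-Chat | 🇺🇸 | zero-shot | 44.39 | 55.55 | 0.04 | answers |
| google_flan-t5-xl | 🇺🇸 | zero-shot | 43.73 | 53.72 | 0.0 | answers |
| mistral-7b | 🇺🇸 | zero-shot | 43.11 | 53.64 | 0.11 | answers |
| Qwen2-7B-Instruct | 🇺🇸 | zero-shot | 39.74 | 48.11 | 3.87 | answers |
| Qwen2-1.5B-Instruct | 🇺🇸 | zero-shot | 33.88 | 48.59 | 0.0 | answers |
| Qwen1.5-1.8B-Chat | 🇺🇸 | zero-shot | 33.65 | 47.28 | 0.04 | answers |
| Open & Less 1B | | | | | | |
| Flan-T5-large | 🇺🇸 | zero-shot | 36.72 | 24.51 | 0.0 | answers |
| Qwen2-0.5B-Instruct | 🇺🇸 | zero-shot | 9.52 | 33.0 | 0.0 | answers |

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| Proprietary | | | | | | |
| GPT-3.5-0613 | 🇷🇺 | zero-shot | 44.15 | 53.63 | 1.51 | answers |
| gpt-4o | 🇷🇺 | zero-shot | 44.15 | 57.5 | 0.0 | answers |
| GPT-4-turbo-2024-04-09 | 🇷🇺 | zero-shot | 42.21 | 56.36 | 0.0 | answers |
| GPT-3.5-1106 | 🇷🇺 | zero-shot | 41.34 | 46.83 | 0.46 | answers |
| mistral-large-latest | 🇷🇺 | zero-shot | 22.33 | 43.07 | 0.04 | answers |
| Open & Less 100B | | | | | | |
| llama-3-70b-instruct | 🇷🇺 | zero-shot | 45.89 | 58.73 | 0.0 | answers |
| mixtral-8x22b | 🇷🇺 | zero-shot | 42.64 | 54.91 | 0.0 | answers |
| mixtral-8x7b | 🇷🇺 | zero-shot | 41.11 | 53.75 | 0.18 | answers |
| Phi-3-small-8k-instruct | 🇷🇺 | zero-shot | 40.65 | 49.64 | 0.14 | answers |
| llama-2-70b-chat | 🇷🇺 | zero-shot | 29.51 | 27.27 | 1.65 | answers |
| Open & Less 10B | | | | | | |
| Gemma-2-9b-it | 🇷🇺 | zero-shot | 46.5 | 55.9 | 0.04 | answers |
| Qwen2-7B-Instruct | 🇷🇺 | zero-shot | 42.16 | 51.13 | 0.25 | answers |
| mistral-7b | 🇷🇺 | zero-shot | 42.14 | 47.57 | 0.18 | answers |
| mistral-7B-Instruct-v0.3 | 🇷🇺 | zero-shot | 41.73 | 44.24 | 0.18 | answers |
| llama-3-8b-instruct | 🇷🇺 | zero-shot | 40.55 | 47.81 | 0.35 | answers |
| Qwen1.5-7B-Chat | 🇷🇺 | zero-shot | 34.1 | 45.05 | 0.25 | answers |
| Phi-3-mini-4k-instruct | 🇷🇺 | zero-shot | 33.79 | 24.33 | 0.04 | answers |
| Qwen2-1.5B-Instruct | 🇷🇺 | zero-shot | 20.5 | 33.57 | 0.35 | answers |
| Qwen1.5-1.8B-Chat | 🇷🇺 | zero-shot | 11.74 | 8.05 | 0.42 | answers |
| Open & Less 1B | | | | | | |
| Qwen2-0.5B-Instruct | 🇷🇺 | zero-shot | 11.76 | 18.12 | 0.25 | answers |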
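
The metric columns in the tables above can be approximated with the sketch below. It assumes that F1(P,N) is the macro-averaged F1 over the positive and negative classes, that F1(P,N,0) additionally includes the neutral class, and that N/A % is the share of responses with no recognizable label. The label symbols `"P"`, `"N"`, `"0"`, the use of `None` for an unrecognized answer, and scoring such an answer as a miss are assumptions of this example, not confirmed details of the official evaluation script.

```python
def f1_per_class(gold, pred, label):
    """F1 score of a single class, in the range [0, 1]."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over the given labels, scaled to percent."""
    return 100.0 * sum(f1_per_class(gold, pred, label) for label in labels) / len(labels)


def na_percent(pred):
    """Percentage of predictions that carry no recognizable label (None)."""
    return 100.0 * sum(1 for p in pred if p is None) / len(pred)


if __name__ == "__main__":
    gold = ["P", "N", "0", "0", "P", "N"]
    pred = ["P", "0", "0", "0", "N", None]  # None marks an unrecognized answer
    print(round(macro_f1(gold, pred, ["P", "N"]), 2))
    print(round(macro_f1(gold, pred, ["P", "N", "0"]), 2))
    print(round(na_percent(pred), 2))
```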

🔒 Final Results

This leaderboard and the obtained LLM answers are part of the experiments in the paper: Large Language Models in Targeted Sentiment Analysis in Russian.

Dataset: final_data.csv

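| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| GPT-4-1106-preview | 🇺🇸 | CoT THoR | 50.13 | 55.93 | - | answers |
| GPT-3.5-0613 | 🇺🇸 | CoT THoR | 44.50 | 48.17 | - | answers |
| GPT-3.5-1106 | 🇺🇸 | CoT THoR | 42.58 | 42.18 | - | answers |
| GPT-4-1106-preview | 🇺🇸 | zero-shot (short) | 54.59 | 64.32 | - | answers |
| GPT-3.5-0613 | 🇺🇸 | zero-shot (short) | 51.79 | 61.38 | - | answers |
| GPT-3.5-1106 | 🇺🇸 | zero-shot (short) | 47.04 | 53.19 | - | answers |
| Mistral-7B-instruct-v0.1 | 🇺🇸 | zero-shot | 49.46 | 58.51 | - | answers |
| Mistral-7B-instruct-v0.2 | 🇺🇸 | zero-shot | 44.82 | 56.04 | - | answers |
| DeciLM | 🇺🇸 | zero-shot | 43.85 | 53.65 | 1.44 | answers |
| Microsoft-Phi-2 | 🇺🇸 | zero-shot | 40.95 | 42.77 | 3.13 | answers |
| Gemma-7B-IT | 🇺🇸 | zero-shot | 40.96 | 44.63 | - | answers |
| Gemma-2B-IT | 🇺🇸 | zero-shot | 31.75 | 45.96 | 2.62 | answers |
| Flan-T5-xxl | 🇺🇸 | zero-shot | 36.46 | 42.63 | 1.90 | answers |

| Model | lang | Mode | F1(P,N) | F1(P,N,0) | N/A % | Answers |
|---|---|---|---|---|---|---|
| GPT-4-1106-preview | 🇷🇺 | zero-shot (short) | 48.04 | 60.55 | 0.0 | answers |
| GPT-3.5-0613 | 🇷🇺 | zero-shot (short) | 45.85 | 57.36 | 0.0 | answers |
| GPT-3.5-1106 | 🇷🇺 | zero-shot (short) | 35.07 | 48.53 | 0.0 | answers |
| Mistral-7B-Instruct-v0.2 | 🇷🇺 | zero-shot | 42.60 | 48.05 | 0.0 | answers |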

References

If you find the results and findings in the Final Results section valuable 💎, feel free to cite the related work as follows:

@misc{rusnachenko2024large,
      title={Large Language Models in Targeted Sentiment Analysis}, 
      author={Nicolay Rusnachenko and Anton Golubev and Natalia Loukachevitch},
      year={2024},
      eprint={2404.12342},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}