SIGHAN 2024 Shared Task for Chinese Dimensional Aspect-Based Sentiment Analysis (dimABSA)

September 4, 2024

Aspect-Based Sentiment Analysis (ABSA) is a critical NLP research topic that aims to identify the aspects of a given entity and analyze the sentiment polarity associated with each aspect. In recent years, numerous research efforts have been devoted to ABSA, which can be categorized into different tasks based on the number of sentiment elements to be extracted. For example, the Aspect Sentiment Triplet Extraction (ASTE) task extracts three elements in a triplet: aspect/target term, opinion term, and sentiment polarity (e.g., positive, neutral, or negative). The Aspect Sentiment Quadruple Prediction (ASQP) task extracts the same three elements plus an additional aspect category to construct a quadruple. However, compared to representing affective states as several discrete classes (i.e., polarity), the dimensional approach represents affective states as continuous numerical values (called intensity) in multiple dimensions, such as the valence-arousal (VA) space, and thus provides more fine-grained emotional information (Lee et al., 2022).

Therefore, we organize a Chinese dimensional ABSA shared task (dimABSA) in the SIGHAN 2024 workshop, providing fine-grained sentiment intensity prediction for each extracted aspect of a restaurant review. The four sentiment elements are defined as follows:

  • Aspect Term (abbreviated as A): This denotes an entity indicating the opinion target. If the aspect is implicit, i.e., not mentioned explicitly in the sentence, we use “NULL” to represent the term.
  • Aspect Category (C): This represents a predefined category for the explicit aspect of the restaurant domain. We use the same categories defined in the SemEval-2016 Restaurant dataset (Pontiki et al., 2016). There are twelve categories in total; each can be split into an entity and an attribute using the symbol “#”. They are: “餐廳#概括” (餐厅#概括, restaurant#general), “餐廳#價格” (餐厅#价格, restaurant#prices), “餐廳#雜項” (餐厅#杂项, restaurant#miscellaneous), “食物#價格” (食物#价格, food#prices), “食物#品質” (食物#品质, food#quality), “食物#份量與款式” (食物#份量与款式, food#style&options), “飲料#價格” (饮料#价格, drinks#prices), “飲料#品質” (饮料#品质, drinks#quality), “飲料#份量與款式” (饮料#份量与款式, drinks#style&options), “氛圍#概括” (氛围#概括, ambience#general), “服務#概括” (服务#概括, services#general), and “地點#概括” (地点#概括, location#general).
  • Opinion Term (O): This describes the sentiment words or phrases towards the aspects.
  • Sentiment Intensity (I): This reflects the sentiment using continuous real-valued scores in the valence-arousal dimensions. Valence represents the degree of pleasant and unpleasant (i.e., positive and negative) feelings, while arousal represents the degree of excitement and calm. Both dimensions use a nine-point scale. A value of 1 denotes extremely negative sentiment on the valence dimension and low arousal on the arousal dimension. In contrast, 9 denotes extremely positive and high-arousal sentiment, and 5 denotes neutral and medium-arousal sentiment. The valence and arousal values are separated by a hash symbol (“#”).
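The four elements defined above can be held in a small record type. The following is a minimal sketch rather than part of any official tooling: the names `Quadruple` and `parse_intensity` are hypothetical, and the range check simply enforces the 1 to 9 scale described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Quadruple:
    """Hypothetical container for one (A, C, O, I) annotation."""
    aspect: str     # aspect term, or "NULL" when the aspect is implicit
    category: str   # "entity#attribute", e.g. "食物#品質"
    opinion: str    # opinion word or phrase
    valence: float  # 1 (extremely negative) to 9 (extremely positive)
    arousal: float  # 1 (low arousal) to 9 (high arousal)


def parse_intensity(text: str) -> Tuple[float, float]:
    """Split a "valence#arousal" string and check the 1-9 scale."""
    valence_str, arousal_str = text.split("#")
    valence, arousal = float(valence_str), float(arousal_str)
    for value in (valence, arousal):
        if not 1.0 <= value <= 9.0:
            raise ValueError(f"intensity {value} is outside the 1-9 scale")
    return valence, arousal


valence, arousal = parse_intensity("2.00#7.88")
quad = Quadruple("拉麵", "食物#品質", "超級無敵霹靂難吃", valence, arousal)

# The category splits into an entity and an attribute on "#".
entity, attribute = quad.category.split("#")
print(entity, attribute, quad.valence, quad.arousal)
```

Because both the category and the intensity use “#” as a separator, the sketch keeps them in separate fields instead of parsing one combined string.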

Task Description

This task aims to evaluate the capability of an automatic system for Chinese dimensional ABSA. This task can be further divided into three subtasks described as follows.

Subtask 1: Intensity Prediction

The first subtask focuses on predicting sentiment intensities in the valence-arousal dimensions. Given a sentence and a specific aspect, the system should predict the valence-arousal ratings. The input consists of the ID, the sentence, and the aspect; when a sentence has several aspects, they are separated by “#”, as in Example 1. The output consists of the ID and the predicted valence-arousal values, separated by a space. Each intensity prediction is a pair of real-valued scores rounded to two decimal places and separated by a hash symbol (“#”), denoting the valence and arousal ratings, respectively.

Example 1


(Traditional Chinese version)

Input: E0001:S001, 檸檬醬也不會太油,塔皮對我而言稍軟。, 檸檬醬#塔皮

Output: E0001:S001 (檸檬醬,5.67#5.5)(塔皮,4.83#5.0)


(Simplified Chinese version)

Input: E0001:S001, 柠檬酱也不会太油,塔皮对我而言稍软。, 柠檬酱#塔皮

Output: E0001:S001 (柠檬酱,5.67#5.5)(塔皮,4.83#5.0)
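As an illustration of the Subtask 1 output format shown in Example 1, the sketch below parses an output line and writes it back. The function names are hypothetical, and the regular expression assumes that aspect terms contain no ASCII commas or parentheses.

```python
import re
from typing import List, Tuple

# One "(aspect,valence#arousal)" group. Assumption: aspect terms contain
# no ASCII commas or parentheses.
PAIR_RE = re.compile(r"\(([^,()]+),\s*([0-9.]+)#([0-9.]+)\)")

Rating = Tuple[str, float, float]  # (aspect, valence, arousal)


def parse_st1_output(line: str) -> Tuple[str, List[Rating]]:
    """Split "ID (aspect,V#A)(aspect,V#A)..." into the ID and its ratings."""
    sent_id, _, rest = line.strip().partition(" ")
    ratings = [
        (aspect.strip(), float(v), float(a))
        for aspect, v, a in PAIR_RE.findall(rest)
    ]
    return sent_id, ratings


def format_st1_output(sent_id: str, ratings: List[Rating]) -> str:
    """Write the ratings back, rounding each score to two decimal places."""
    groups = "".join(f"({aspect},{v:.2f}#{a:.2f})" for aspect, v, a in ratings)
    return f"{sent_id} {groups}"


sent_id, ratings = parse_st1_output("E0001:S001 (檸檬醬,5.67#5.5)(塔皮,4.83#5.0)")
print(format_st1_output(sent_id, ratings))
```

The formatter always writes two decimal places (for example `5.50`), whereas the example above drops trailing zeros; whether an official scorer distinguishes the two forms is not stated here.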

Subtask 2: Triplet Extraction

The second subtask aims to extract sentiment triplets composed of three elements. Given only a sentence, the system should extract all sentiment triplets (aspect, opinion, intensity). The output consists of the ID followed by the extracted triplets, separated by a space.

Example 2


(Traditional Chinese version)

Input: E0002:S002, 不僅餐點美味上菜速度也是飛快耶!!

Output: E0002:S002 (餐點, 美味, 6.63#4.63) (上菜速度, 飛快, 7.25#6.00)


(Simplified Chinese version)

Input: E0002:S002, 不仅餐点美味上菜速度也是飞快耶!!

Output: E0002:S002 (餐点, 美味, 6.63#4.63) (上菜速度, 飞快, 7.25#6.00)

Subtask 3: Quadruple Extraction

The third subtask aims to extract sentiment quadruples composed of four elements. Given only a sentence, the system should extract all sentiment quadruples (aspect, category, opinion, intensity). The output consists of the ID followed by the extracted quadruples, separated by a space.

Example 3


(Traditional Chinese version)

Input: E0003:S003, 這碗拉麵超級無敵霹靂難吃

Output: E0003:S003 (拉麵, 食物#品質, 超級無敵霹靂難吃, 2.00#7.88)


(Simplified Chinese version)

Input: E0003:S003, 这碗拉面超级无敌霹雳难吃

Output: E0003:S003 (拉面, 食物#品质, 超级无敌霹雳难吃, 2.00#7.88)
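The output lines of Subtasks 2 and 3 share one shape: an ID followed by parenthesized groups whose last field is the “valence#arousal” pair. A minimal parsing sketch follows; `parse_tuples` is a hypothetical helper, and it assumes that no field contains an ASCII comma or parenthesis.

```python
import re
from typing import List, Tuple

# Contents of each "(...)" group on the line.
GROUP_RE = re.compile(r"\(([^()]*)\)")


def parse_tuples(line: str) -> Tuple[str, List[tuple]]:
    """Parse "ID (f1, f2, ..., V#A) (...)" into the ID and a list of tuples.

    Each tuple holds the text fields followed by valence and arousal as
    floats, so triplets have 4 entries and quadruples have 5.
    """
    sent_id, _, rest = line.strip().partition(" ")
    tuples = []
    for body in GROUP_RE.findall(rest):
        # The intensity is always the last comma-separated field.
        *fields, intensity = [part.strip() for part in body.split(",")]
        valence, arousal = (float(x) for x in intensity.split("#"))
        tuples.append((*fields, valence, arousal))
    return sent_id, tuples


triplet_id, triplets = parse_tuples(
    "E0002:S002 (餐點, 美味, 6.63#4.63) (上菜速度, 飛快, 7.25#6.00)"
)
quad_id, quads = parse_tuples(
    "E0003:S003 (拉麵, 食物#品質, 超級無敵霹靂難吃, 2.00#7.88)"
)
print(triplets)
print(quads)
```

Only the last field is split on “#”, so a category such as 食物#品質 passes through unchanged.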

Data Sets

We first crawled restaurant reviews from Google Reviews and the online bulletin board system PTT. Then, we removed all HTML tags and multimedia material and split the remaining texts into sentences. Finally, we randomly selected a subset of the sentences for manual annotation, so as to retain content diversity.

The annotation process was conducted in two phases. We first annotated the aspect/category/opinion elements and then the V#A element. In the first phase, three graduate students majoring in computer science were trained to annotate the sentences for aspect/category/opinion. One task organizer led a discussion to clarify annotation differences and seek consensus among the annotators. A majority vote was then used to resolve any remaining disagreements. In the second phase, each sentence, along with its annotated aspect/category/opinion, was presented to five annotators majoring in Chinese language for V#A rating. As in the first phase, one task organizer led a group discussion during annotation. Once the annotation process was finished, a cleanup procedure was performed: ratings that did not fall within 1.5 standard deviations (SD) of the mean were treated as outliers and excluded when calculating the average V#A for each instance.
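The cleanup step can be sketched as follows. This is an illustration under stated assumptions, not the organizers' script: the function name is hypothetical, the rule is applied to one dimension (valence or arousal) at a time, and the population standard deviation is used because the text does not say whether the population or sample form was applied.

```python
from statistics import mean, pstdev
from typing import List


def average_without_outliers(ratings: List[float], k: float = 1.5) -> float:
    """Average the ratings that lie within k standard deviations of the mean."""
    mu = mean(ratings)
    sd = pstdev(ratings)  # assumption: population SD
    kept = [r for r in ratings if abs(r - mu) <= k * sd]
    return mean(kept)


# Five hypothetical valence ratings for one instance; 2.0 lies more than
# 1.5 SD from the mean of 5.8 and is dropped before averaging.
print(average_without_outliers([6.0, 7.0, 7.0, 7.0, 2.0]))  # 6.75
```

When all annotators agree, the standard deviation is zero and every rating is kept, so the function returns the shared value.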

We provided two versions of all datasets with identical content, but one in traditional Chinese characters and the other in simplified Chinese characters. The participating teams could choose their preferred version for the task evaluation. The submitted results were evaluated with the corresponding version of the gold standard and ranked together as the official results.

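Restaurant (REST) Domain

| Subtask | Dataset | #Sent | #Char | #Tuple | Aspect #NULL | Aspect #Unique | Aspect #Repeat | Opinion #Unique | Opinion #Repeat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ST1 | Train | 6,050 | 85,769 | 8,523 | 169 | 6,430 | 1,924 | - | - |
| ST1 | Dev. | 100 | 1,109 | 115 | 0 | 115 | 0 | - | - |
| ST1 | Test | 2,000 | 34,002 | 2,658 | 0 | 2,658 | 0 | - | - |
| ST2 & ST3 | Train | 6,050 | 85,769 | 8,523 | 169 | 6,430 | 1,924 | 7,986 | 537 |
| ST2 & ST3 | Dev. | 100 | 1,280 | 150 | 0 | 78 | 72 | 143 | 7 |
| ST2 & ST3 | Test | 2,000 | 39,014 | 3,566 | 52 | 1,693 | 1,821 | 3,263 | 303 |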

Chinese EmoBank

The Chinese EmoBank (Lee et al., 2022) is a dimensional sentiment resource annotated with real-valued scores for both valence and arousal dimensions. The valence represents the degree of positive and negative sentiment, and arousal represents the degree of calm and excitement. Both dimensions range from 1 (highly negative or calm) to 9 (highly positive or excited). The Chinese EmoBank features various levels of text granularity including two lexicons called Chinese valence-arousal words (CVAW, 5,512 single words) and Chinese valence-arousal phrases (CVAP, 2,998 multi-word phrases) and two corpora called Chinese valence-arousal sentences (CVAS, 2,582 single sentences) and Chinese valence-arousal texts (CVAT, 2,969 multi-sentence texts).

Performance Metrics

For Subtask 1, sentiment intensity prediction performance is evaluated by examining the difference between machine-predicted ratings and human-annotated ratings using two metrics, Mean Absolute Error (MAE) and Pearson Correlation Coefficient (PCC), defined as follows.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| a_i - p_i \right|$$

$$PCC = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{a_i - \mu_A}{\sigma_A} \right) \left( \frac{p_i - \mu_P}{\sigma_P} \right)$$

where $a_i \in A$ and $p_i \in P$ respectively denote the i-th actual value and predicted value, $n$ is the number of test samples, $\mu_A$ and $\sigma_A$ respectively represent the mean value and the standard deviation of $A$, while $\mu_P$ and $\sigma_P$ respectively represent the mean value and the standard deviation of $P$.

Each metric for the valence and arousal dimensions is calculated and ranked independently. The actual and predicted values should range from 1 to 9, so the MAE lies between 0 and 8. The PCC is a value between −1 and 1 that measures the linear correlation between the actual and predicted values. A lower MAE and a higher PCC indicate more accurate prediction performance.
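The two Subtask 1 metrics can be computed directly from their definitions. The sketch below is not the official scorer: the function names are hypothetical, and the standard deviation is taken in its population form (dividing by n) so that it is consistent with the 1/n factor in the PCC formula above. The PCC is undefined when either list is constant.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between gold and predicted ratings."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def pcc(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient, following the 1/n formulation."""
    n = len(actual)
    mu_a = sum(actual) / n
    mu_p = sum(predicted) / n
    # Population standard deviations (divide by n).
    sd_a = math.sqrt(sum((a - mu_a) ** 2 for a in actual) / n)
    sd_p = math.sqrt(sum((p - mu_p) ** 2 for p in predicted) / n)
    return sum(
        ((a - mu_a) / sd_a) * ((p - mu_p) / sd_p)
        for a, p in zip(actual, predicted)
    ) / n


# Hypothetical gold and predicted valence ratings for three test instances.
gold = [2.0, 5.0, 8.0]
pred = [3.0, 5.0, 7.0]
print(round(mae(gold, pred), 3), round(pcc(gold, pred), 3))
```

In this toy case the predictions are a linear function of the gold values, so the PCC is 1 even though the MAE is nonzero, which is why the two metrics are reported together.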

For Subtasks 2 and 3, we use the F1-score as the evaluation metric, defined as:

$$F1 = \frac{2 \times P \times R}{P + R}$$

where Precision (P) is defined as the percentage of triplets/quadruples extracted by the system that are correct. Recall (R) is the percentage of triplets/quadruples present in the test set found by the system. The F1-score is the harmonic mean of precision and recall.
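Precision, recall, and F1 over extracted tuples can be sketched with set operations. This is an illustration, not the official scorer: `precision_recall_f1` is a hypothetical name, tuples are compared by exact equality, and how the real-valued intensity is matched is not specified in this section, so the caller decides which fields to include in each tuple and how to normalize them.

```python
from typing import Set, Tuple


def precision_recall_f1(gold: Set[tuple], predicted: Set[tuple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Each tuple carries the sentence ID so that identical tuples from
# different sentences are not merged. Intensity is left out of this toy case.
gold = {
    ("E0002:S002", "餐點", "美味"),
    ("E0002:S002", "上菜速度", "飛快"),
}
predicted = {
    ("E0002:S002", "餐點", "美味"),
    ("E0002:S002", "上菜速度", "快"),  # opinion span does not match the gold
}
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
```

One of the two predicted tuples matches a gold tuple, so precision and recall are both 0.5 and their harmonic mean is 0.5.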

Official Ranking

Subtask 1: Intensity Prediction

| Team | Sub# | V-MAE | V-PCC | A-MAE | A-PCC | Overall Rank |
| --- | --- | --- | --- | --- | --- | --- |
| HITSZ-HLT | 63885 | 0.279 (1) | 0.933 (1) | 0.309 (1) | 0.777 (1) | 1 |
| CCIIPLab | 63706 | 0.294 (2) | 0.916 (3) | 0.309 (1) | 0.766 (3) | 2 |
| YNU-HPCC | 63756 | 0.294 (2) | 0.917 (2) | 0.318 (3) | 0.771 (2) | 2 |
| DS-Group | 62014 | 0.460 (4) | 0.858 (5) | 0.501 (4) | 0.490 (4) | 4 |
| yangnan | 61884 | 1.032 (5) | 0.877 (4) | 1.095 (5) | 0.097 (5) | 5 |

Subtask 2: Triplet Extraction

| Team | Sub# | V-Tri-F1 | A-Tri-F1 | VA-Tri-F1 | Overall Rank |
| --- | --- | --- | --- | --- | --- |
| HITSZ-HLT | 63885 | 0.589 (1) | 0.545 (1) | 0.433 (1) | 1 |
| CCIIPLab | 63824 | 0.573 (2) | 0.522 (2) | 0.403 (2) | 2 |
| ZZU-NLP | 63737 | 0.542 (3) | 0.507 (3) | 0.389 (3) | 3 |
| BIT-NLP | 63766 | 0.490 (4) | 0.450 (4) | 0.342 (4) | 4 |
| SUDA-NLP | 63827 | 0.475 (5) | 0.448 (5) | 0.326 (5) | 5 |
| TMAK-Plus | 63972 | 0.269 (6) | 0.307 (6) | 0.157 (6) | 6 |

Subtask 3: Quadruple Extraction

| Team | Sub# | V-Quad-F1 | A-Quad-F1 | VA-Quad-F1 | Overall Rank |
| --- | --- | --- | --- | --- | --- |
| HITSZ-HLT | 63885 | 0.567 (1) | 0.526 (1) | 0.417 (1) | 1 |
| CCIIPLab | 63832 | 0.555 (2) | 0.507 (2) | 0.389 (2) | 2 |
| ZZU-NLP | 61868 | 0.522 (3) | 0.489 (3) | 0.376 (3) | 3 |
| SUDA-NLP | 63622 | 0.487 (4) | 0.444 (4) | 0.336 (4) | 4 |
| JN-NLP | 63572 | 0.482 (5) | 0.439 (5) | 0.331 (5) | 5 |
| BIT-NLP | 63766 | 0.470 (6) | 0.434 (7) | 0.329 (6) | 6 |
| USTC-IAT | 63907 | 0.438 (7) | 0.437 (6) | 0.312 (7) | 7 |

Citation

Lung-Hao Lee, Liang-Chih Yu, Suge Wang, and Jian Liao. 2024. Overview of the SIGHAN 2024 Shared Task for Chinese Dimensional Aspect-Based Sentiment Analysis. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing, pp. 165-174. https://aclanthology.org/2024.sighan-1.19/

@INPROCEEDINGS{Lee-SIGHAN-2024,
    author = {Lee, Lung-Hao and Yu, Liang-Chih and Wang, Suge and Liao, Jian},
    title = {Overview of the SIGHAN 2024 Shared Task for Chinese Dimensional Aspect-Based Sentiment Analysis},
    booktitle = {Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing},
    pages = {165--174},
    month = aug,
    year = 2024,
    url = {https://aclanthology.org/2024.sighan-1.19/}
}

Reference

Lung-Hao Lee, Jian-Hong Li and Liang-Chih Yu. 2022. Chinese EmoBank: Building valence-arousal resources for dimensional sentiment analysis. ACM Transactions on Asian and Low-Resource Language Information Processing, 21(4), Article 65, 18 pages. https://doi.org/10.1145/3489141