WYWEB
December 13, 2023 ยท View on GitHub
An evaluation bentchmark for classical Chinese. This work has been accepted by Findings of ACL 2023.
Classical Chinese is a treasure of the entire human cultural history. We contribute this work with the hope of helping the entire community to be more prosperous. Our work will be an open, community-driven project which improves with the advancement of technology.
We hope more people join in to make this benchmark better and more useful.
Leader-board
Online leader-board
See WYWEB on CADAL for the official leader-board.
Main
| Models | Avg. | PUNC | GLNER | GJC | FSPC | TLC | Xuci | WYWRC | IRC |
|---|---|---|---|---|---|---|---|---|---|
| Human | 88.0 | 92.4 | 94.3 | 90.3 | 80.0 | 89.0 | 85.3 | 80.0 | 92.3 |
| DeBERTa-base | 75.9 | 83.3 | 86.7 | 85.2 | 61.1 | 86.7 | 72.4 | 45.1 | 86.7 |
| GuwenBERT-base | 72.9 | 82.5 | 82.8 | 84.8 | 61.3 | 85.1 | 71.7 | 28.0 | 86.8 |
| GuwenBERT-large | 75.6 | 83.1 | 86.1 | 84.9 | 58.5 | 87.6 | 73.4 | 44.4 | 87.8 |
| GuwenBERT-base-fs | 74.6 | 82.9 | 84.8 | 84.2 | 61.0 | 86.7 | 70.0 | 42.1 | 85.3 |
| RoBERTa-CCBC | 74.5 | 82.5 | 84.7 | 84.5 | 59.5 | 85.0 | 73.2 | 40.7 | 86.1 |
| RoBERTa-CCLC | 75.3 | 82.8 | 86.1 | 84.7 | 58.6 | 87.1 | 74.9 | 41.0 | 86.9 |
| SikuBERT | 73.7 | 80.8 | 82.8 | 82.2 | 60.9 | 82.4 | 70.4 | 44.0 | 85.8 |
| SikuRoBERTa | 73.5 | 81.4 | 82.8 | 82.5 | 62.2 | 83.8 | 68.5 | 41.0 | 85.8 |
| RoBERTa-wwm-ext | 72.1 | 78.8 | 79.8 | 81.3 | 59.2 | 78.3 | 71.0 | 42.1 | 86.2 |
WYWMT
| Model | BLEU | chrF2 | TER | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| Human | 45.6 | 44.2 | 34.4 | 77.4 | 50.7 | 76.2 |
| guwenbert-base | 40.1 | 38.1 | 37.5 | 72.5 | 46.0 | 70.3 |
| guwenbert-large | 38.8 | 37.2 | 38.1 | 70.1 | 43.7 | 67.7 |
| guwenbert-base-fs | 36.3 | 35.2 | 39.2 | 68.3 | 41.2 | 65.7 |
| roberta-CCBC | 39.1 | 37.1 | 36.8 | 71.4 | 44.9 | 69.3 |
| roberta-CCLC | 39.8 | 38.0 | 36.4 | 71.6 | 45.3 | 69.3 |
| SikuBERT | 38.8 | 36.2 | 37.9 | 72.0 | 45.5 | 69.8 |
| SikuRoBERTa | 39.1 | 36.5 | 37.7 | 72.2 | 45.7 | 70.0 |
| DeBERTa-base | 39.5 | 37.8 | 35.9 | 71.9 | 44.2 | 68.7 |
| Roberta-wwm-ext | 38.0 | 35.8 | 39.1 | 69.9 | 43.2 | 66.7 |
How to test new models?
This is an evaluation benchmark for classical Chinese NLP providing several tasks. Researchers could quickly evaluate pre-trained language models with a few lines of code using the evaluation toolkit.
Quick Run The Base Line
python run.py \
--tag wywweb \
--do_train \
--max_seq_len 512 \
--dump 1000 \
--task_name GJCTask \
--data_dir data/tasks/gjc \
--output_dir output/deberta/GJCTask \
--num_train_epochs 6 \
--model_dir_or_name bozhou/DeBERTa-base \
--learning_rate 2e-5 \
--train_batch_size 48 \
--fp16 True \
--workers 4 \
--warmup 1000
Test your model and contact us to update the leader board.
- test your model on every task.
- get the best dev set score, use this model to evaluate test set.
- send result of the test set to us.
- maintainers validate the result and then update the leader board.
Task Description
| Task | Train | Dev | Test | Description | Metric | Source |
|---|---|---|---|---|---|---|
| PUNC | 90k | 20k | 20k | Sequence labeling | F1 | Authoritative Texts |
| TLC | 28k | 6k | 6k | Sentence classification | Accuracy | Ancient prose |
| GJC | 100k | 20k | 20k | Sentence classification | Accuracy | Daizhige |
| XuCi | 800 | 200 | 200 | Token similarity | Accuracy | Exam papers |
| WYWRC | 3k | 500 | 500 | Reading comprehension | Accuracy | Exam papers |
| IRC | 3k | 1k | 1k | Reading comprehension | Accuracy | Exam papers |
| WYWMT | 20k | 3k | 3k | Machine Translation | BLEU | online |
| GLNER | 80k | 18k | 18k | Sequence labeling | F1 | \citet{GULIAN2020} |
| FSPC | 3000 | 1000 | 1000 | Sentence classification | Accuracy | THU-FSPC |
Cite us
@inproceedings{zhou-etal-2023-wyweb,
title = "{WYWEB}: A {NLP} Evaluation Benchmark For Classical {C}hinese",
author = "Zhou, Bo and
Chen, Qianglong and
Wang, Tianyu and
Zhong, Xiaomi and
Zhang, Yin",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.204",
doi = "10.18653/v1/2023.findings-acl.204",
pages = "3294--3319"
}