WYWEB

December 13, 2023

An evaluation benchmark for classical Chinese. This work has been accepted to Findings of ACL 2023.

Classical Chinese is a treasure of human cultural history. We contribute this work in the hope that it helps the whole community thrive. WYWEB is an open, community-driven project that will improve as the technology advances.

We hope more people join in to make this benchmark better and more useful.

Leaderboard

Online leaderboard

See WYWEB on CADAL for the official leaderboard.

Main

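| Models | Avg. | PUNC | GLNER | GJC | FSPC | TLC | Xuci | WYWRC | IRC |
|---|---|---|---|---|---|---|---|---|---|
| Human | 88.0 | 92.4 | 94.3 | 90.3 | 80.0 | 89.0 | 85.3 | 80.0 | 92.3 |
| DeBERTa-base | 75.9 | 83.3 | 86.7 | 85.2 | 61.1 | 86.7 | 72.4 | 45.1 | 86.7 |
| GuwenBERT-base | 72.9 | 82.5 | 82.8 | 84.8 | 61.3 | 85.1 | 71.7 | 28.0 | 86.8 |
| GuwenBERT-large | 75.6 | 83.1 | 86.1 | 84.9 | 58.5 | 87.6 | 73.4 | 44.4 | 87.8 |
| GuwenBERT-base-fs | 74.6 | 82.9 | 84.8 | 84.2 | 61.0 | 86.7 | 70.0 | 42.1 | 85.3 |
| RoBERTa-CCBC | 74.5 | 82.5 | 84.7 | 84.5 | 59.5 | 85.0 | 73.2 | 40.7 | 86.1 |
| RoBERTa-CCLC | 75.3 | 82.8 | 86.1 | 84.7 | 58.6 | 87.1 | 74.9 | 41.0 | 86.9 |
| SikuBERT | 73.7 | 80.8 | 82.8 | 82.2 | 60.9 | 82.4 | 70.4 | 44.0 | 85.8 |
| SikuRoBERTa | 73.5 | 81.4 | 82.8 | 82.5 | 62.2 | 83.8 | 68.5 | 41.0 | 85.8 |
| RoBERTa-wwm-ext | 72.1 | 78.8 | 79.8 | 81.3 | 59.2 | 78.3 | 71.0 | 42.1 | 86.2 |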
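
The Avg. column is consistent, up to rounding, with an unweighted mean of the eight task scores in each row. The README does not state the formula, so the sketch below rests on that assumption; the function name `macro_average` is illustrative and not part of the toolkit.

```python
# Task scores for DeBERTa-base, copied from the table above.
DEBERTA_BASE = {
    "PUNC": 83.3, "GLNER": 86.7, "GJC": 85.2, "FSPC": 61.1,
    "TLC": 86.7, "Xuci": 72.4, "WYWRC": 45.1, "IRC": 86.7,
}


def macro_average(scores):
    """Unweighted mean of per-task scores, rounded to one decimal.

    Assumption: Avg. is a plain mean over tasks (not confirmed by the source).
    """
    return round(sum(scores.values()) / len(scores), 1)


print(macro_average(DEBERTA_BASE))
```

For this row the result matches the listed 75.9. A few other rows differ by 0.1 when recomputed this way, presumably because the published averages were taken over unrounded scores.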

WYWMT

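| Model | BLEU | chrF2 | TER | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| Human | 45.6 | 44.2 | 34.4 | 77.4 | 50.7 | 76.2 |
| guwenbert-base | 40.1 | 38.1 | 37.5 | 72.5 | 46.0 | 70.3 |
| guwenbert-large | 38.8 | 37.2 | 38.1 | 70.1 | 43.7 | 67.7 |
| guwenbert-base-fs | 36.3 | 35.2 | 39.2 | 68.3 | 41.2 | 65.7 |
| roberta-CCBC | 39.1 | 37.1 | 36.8 | 71.4 | 44.9 | 69.3 |
| roberta-CCLC | 39.8 | 38.0 | 36.4 | 71.6 | 45.3 | 69.3 |
| SikuBERT | 38.8 | 36.2 | 37.9 | 72.0 | 45.5 | 69.8 |
| SikuRoBERTa | 39.1 | 36.5 | 37.7 | 72.2 | 45.7 | 70.0 |
| DeBERTa-base | 39.5 | 37.8 | 35.9 | 71.9 | 44.2 | 68.7 |
| Roberta-wwm-ext | 38.0 | 35.8 | 39.1 | 69.9 | 43.2 | 66.7 |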

How to test new models?

WYWEB is an evaluation benchmark for classical Chinese NLP that provides several tasks. With the evaluation toolkit, researchers can evaluate pre-trained language models in a few lines of code.

Quick run of the baseline

python run.py  \
                --tag wywweb \
                --do_train \
                --max_seq_len 512 \
                --dump 1000 \
                --task_name GJCTask \
                --data_dir data/tasks/gjc \
                --output_dir output/deberta/GJCTask \
                --num_train_epochs 6 \
                --model_dir_or_name bozhou/DeBERTa-base \
                --learning_rate 2e-5 \
                --train_batch_size 48 \
                --fp16 True \
                --workers 4 \
                --warmup 1000 
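
To script the same run from Python, for example to swap in a different model, the command can be assembled as an argument list. This is a minimal sketch: every flag and value is copied from the command above, while the helper `build_run_command` is hypothetical and not part of the toolkit.

```python
import shlex


def build_run_command(task_name, data_dir, output_dir, model):
    """Return the argv list for one run.py invocation.

    Flags and values mirror the baseline command above; this helper
    itself is an illustration, not a toolkit API.
    """
    options = {
        "--tag": "wywweb",
        "--max_seq_len": 512,
        "--dump": 1000,
        "--task_name": task_name,
        "--data_dir": data_dir,
        "--output_dir": output_dir,
        "--num_train_epochs": 6,
        "--model_dir_or_name": model,
        "--learning_rate": "2e-5",
        "--train_batch_size": 48,
        "--fp16": "True",
        "--workers": 4,
        "--warmup": 1000,
    }
    argv = ["python", "run.py", "--do_train"]
    for flag, value in options.items():
        argv += [flag, str(value)]
    return argv


cmd = build_run_command(
    task_name="GJCTask",
    data_dir="data/tasks/gjc",
    output_dir="output/deberta/GJCTask",
    model="bozhou/DeBERTa-base",
)
# Print the command for inspection; to launch it, pass `cmd` to
# subprocess.run(cmd, check=True) from the repository root.
print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell-quoting problems when paths contain spaces.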

Test your model and contact us to update the leaderboard.

  • Test your model on every task.
  • Select the checkpoint with the best dev set score and use it to evaluate on the test set.
  • Send us the test set results.
  • The maintainers validate the results and then update the leaderboard.
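
The checkpoint-selection step above can be sketched as follows. The checkpoint names and dev scores are made up for illustration, and `select_best_checkpoint` is a hypothetical helper rather than a toolkit function.

```python
def select_best_checkpoint(dev_scores):
    """Return the name of the checkpoint with the highest dev set score."""
    if not dev_scores:
        raise ValueError("no checkpoints were evaluated")
    return max(dev_scores, key=dev_scores.get)


# Made-up dev accuracies for three saved checkpoints (illustration only).
dev_scores = {
    "checkpoint-1000": 0.842,
    "checkpoint-2000": 0.857,
    "checkpoint-3000": 0.851,
}

best = select_best_checkpoint(dev_scores)
print(best)  # evaluate only this checkpoint on the test set
```

Choosing on the dev set and touching the test set once keeps the reported test score from being tuned to the test data.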

Task Description

| Task | Train | Dev | Test | Description | Metric | Source |
|---|---|---|---|---|---|---|
| PUNC | 90k | 20k | 20k | Sequence labeling | F1 | Authoritative Texts |
| TLC | 28k | 6k | 6k | Sentence classification | Accuracy | Ancient prose |
| GJC | 100k | 20k | 20k | Sentence classification | Accuracy | Daizhige |
| XuCi | 800 | 200 | 200 | Token similarity | Accuracy | Exam papers |
| WYWRC | 3k | 500 | 500 | Reading comprehension | Accuracy | Exam papers |
| IRC | 3k | 1k | 1k | Reading comprehension | Accuracy | Exam papers |
| WYWMT | 20k | 3k | 3k | Machine Translation | BLEU | online |
| GLNER | 80k | 18k | 18k | Sequence labeling | F1 | GULIAN2020 |
| FSPC | 3000 | 1000 | 1000 | Sentence classification | Accuracy | THU-FSPC |

Cite us

@inproceedings{zhou-etal-2023-wyweb,
    title = "{WYWEB}: A {NLP} Evaluation Benchmark For Classical {C}hinese",
    author = "Zhou, Bo  and
      Chen, Qianglong  and
      Wang, Tianyu  and
      Zhong, Xiaomi  and
      Zhang, Yin",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.204",
    doi = "10.18653/v1/2023.findings-acl.204",
    pages = "3294--3319"
}