SCUT-C2MChn Dataset
April 21, 2024
The C2MChn dataset for research on machine translation from classical to modern Chinese has been released by the Deep Learning and Visual Computing Lab (DLVC) of South China University of Technology. To the best of our knowledge, this is the first high-quality and comprehensive dataset that covers not only traditional history books but also Buddhist classics, Confucian classics, Taoist classics, and several other domains.
The full dataset can be downloaded through the following link:
Note:
The C2MChn dataset can only be used for non-commercial research purposes.
The right of final interpretation belongs to DLVC.
License
The C2MChn dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.
Statistics
| Domain | Total pairs | Total cla/s | Total mod/s | Training pairs | Training cla/s | Training mod/s | Validation pairs | Validation cla/s | Validation mod/s | Test pairs | Test cla/s | Test mod/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| History | 283600 | 19.8 | 28.7 | 273969 | 19.8 | 28.7 | 4707 | 20.2 | 29.4 | 4924 | 20.5 | 29.9 |
| Buddhism | 223602 | 18.2 | 28.6 | 216842 | 18.1 | 28.5 | 3509 | 18.9 | 29.8 | 3251 | 20.4 | 32.5 |
| Confucianism | 49337 | 18.3 | 30.5 | 47665 | 18.3 | 30.4 | 824 | 18.6 | 30.7 | 858 | 18.9 | 31.6 |
| Taoism | 10433 | 19.9 | 32.3 | 10061 | 19.9 | 32.7 | 187 | 20.1 | 32.6 | 185 | 19.4 | 31.4 |
| Agronomy | 9748 | 12.4 | 18.6 | 9459 | 12.3 | 18.5 | 162 | 12.8 | 19.4 | 127 | 14.0 | 20.8 |
| Short | 6820 | 16.5 | 30.2 | 6623 | 16.5 | 30.2 | 99 | 15.6 | 29.5 | 98 | 17.5 | 31.3 |
| Others | 31183 | 20.1 | 32.1 | 30114 | 20.1 | 32.0 | 512 | 21.3 | 33.8 | 557 | 21.1 | 33.2 |
| All | 614723 | 18.9 | 28.9 | 594723 | 18.9 | 28.8 | 10000 | 19.5 | 29.8 | 10000 | 20.3 | 31.0 |
Note: “pairs”, “cla/s”, and “mod/s” denote the number of parallel sentence pairs, the average number of word tokens per classical Chinese sentence, and the average number of word tokens per modern Chinese sentence, respectively.
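As a rough illustration of how a per-sentence average such as “cla/s” or “mod/s” can be computed, the sketch below divides the total token count by the number of sentences. The function name `average_tokens` is hypothetical, and the tokenizer used to produce the table above is not specified here; the default below simply treats each character as one token, which is an assumption and may not reproduce the reported figures.

```python
from typing import Callable, Iterable, List


def average_tokens(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]] = list,
) -> float:
    """Return the mean number of tokens per sentence.

    `tokenize` defaults to `list`, i.e. one token per character.
    This is a placeholder assumption, not the tokenizer behind the
    published statistics; pass your own word segmenter if you have one.
    """
    total_tokens = 0
    num_sentences = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        num_sentences += 1
    if num_sentences == 0:
        raise ValueError("cannot average over an empty collection")
    return total_tokens / num_sentences
```

To use whitespace-separated tokens instead, pass `str.split` as the `tokenize` argument.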
Directory Format
The dataset is organized in the following directory format:
SCUT-C2MChn
├── train
│   ├── train.cch
│   ├── train.mch
│   └── train.domain
├── valid
│   ├── valid.cch
│   ├── valid.mch
│   └── valid.domain
└── test
    ├── test.cch
    ├── test.mch
    └── test.domain
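A minimal loading sketch for this layout follows. It rests on assumptions that are not stated above: that the `.cch`, `.mch`, and `.domain` files hold classical text, modern text, and a domain label respectively, that they are UTF-8 plain text with one entry per line, and that line *i* of each file describes the same sentence pair. The names `Pair` and `read_split` are hypothetical helpers, not part of the dataset release.

```python
from pathlib import Path
from typing import List, NamedTuple


class Pair(NamedTuple):
    classical: str  # line from <split>.cch (assumed: classical Chinese)
    modern: str     # line from <split>.mch (assumed: modern Chinese)
    domain: str     # line from <split>.domain (assumed: domain label)


def _read_lines(path: Path) -> List[str]:
    # Iterate the file object so only real newlines end a line.
    with path.open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_split(root, split: str) -> List[Pair]:
    """Load one split ("train", "valid", or "test") as aligned triples."""
    split_dir = Path(root) / split
    columns = [
        _read_lines(split_dir / f"{split}.{ext}")
        for ext in ("cch", "mch", "domain")
    ]
    # Assumed line alignment: refuse to pair files of different lengths.
    if len({len(column) for column in columns}) != 1:
        raise ValueError(f"files in {split_dir} have different line counts")
    return [Pair(*row) for row in zip(*columns)]


# Example (the path is a placeholder for wherever the dataset is unpacked):
# pairs = read_split("SCUT-C2MChn", "valid")
# print(len(pairs), pairs[0].classical, pairs[0].modern, pairs[0].domain)
```

The explicit length check is there so that a misaligned file fails loudly instead of silently producing wrong sentence pairs.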
Citation and Contact
Please consider citing our paper if you use our dataset:
@InProceedings{Jiang_2023_NLPCC,
title = {Towards Better Translations from Classical to Modern Chinese: A New Dataset and a New Method},
author = {Jiang, Zongyuan and Wang, Jiapeng and Cao, Jiahuan and Gao, Xue and Jin, Lianwen},
booktitle = {CCF International Conference on Natural Language Processing and Chinese Computing},
pages={387--399},
year={2023},
organization={Springer}
}