SCUT-C2MChn Dataset

April 21, 2024

The C2MChn dataset for research on machine translation from classical to modern Chinese has been released by the Deep Learning and Visual Computing Lab (DLVC) of South China University of Technology. To the best of our knowledge, this is the first high-quality and comprehensive dataset that covers not only traditional history books but also Buddhist, Confucian, and Taoist classics, as well as several other domains.

The complete dataset can be downloaded through the following link:

Note:

The C2MChn dataset can only be used for non-commercial research purposes.

The right of final interpretation belongs to DLVC.

License

The C2MChn dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

Statistics

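| Domains | Total pairs | Total cla/s | Total mod/s | Training pairs | Training cla/s | Training mod/s | Validation pairs | Validation cla/s | Validation mod/s | Test pairs | Test cla/s | Test mod/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| History | 283600 | 19.8 | 28.7 | 273969 | 19.8 | 28.7 | 4707 | 20.2 | 29.4 | 4924 | 20.5 | 29.9 |
| Buddhism | 223602 | 18.2 | 28.6 | 216842 | 18.1 | 28.5 | 3509 | 18.9 | 29.8 | 3251 | 20.4 | 32.5 |
| Confucianism | 49337 | 18.3 | 30.5 | 47665 | 18.3 | 30.4 | 824 | 18.6 | 30.7 | 858 | 18.9 | 31.6 |
| Taoism | 10433 | 19.9 | 32.3 | 10061 | 19.9 | 32.7 | 187 | 20.1 | 32.6 | 185 | 19.4 | 31.4 |
| Agronomy | 9748 | 12.4 | 18.6 | 9459 | 12.3 | 18.5 | 162 | 12.8 | 19.4 | 127 | 14.0 | 20.8 |
| Short | 6820 | 16.5 | 30.2 | 6623 | 16.5 | 30.2 | 99 | 15.6 | 29.5 | 98 | 17.5 | 31.3 |
| Others | 31183 | 20.1 | 32.1 | 30114 | 20.1 | 32.0 | 512 | 21.3 | 33.8 | 557 | 21.1 | 33.2 |
| All | 614723 | 18.9 | 28.9 | 594723 | 18.9 | 28.8 | 10000 | 19.5 | 29.8 | 10000 | 20.3 | 31.0 |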

Note: “pairs”, “cla/s”, and “mod/s” denote the number of parallel sentence pairs, the average number of word tokens per classical Chinese sentence, and the average number of word tokens per modern Chinese sentence, respectively.
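As a rough illustration of how the three columns are defined, the sketch below computes them per domain from a list of sentence pairs. The function name `domain_stats` and the `tokenize` parameter are hypothetical. The default tokenizer splits each sentence into characters, which is an assumption: the tokenization behind the published numbers is not specified here, so the output may not match the table.

```python
from collections import defaultdict


def domain_stats(records, tokenize=list):
    """Compute pairs, cla/s and mod/s for each domain.

    records: iterable of (classical, modern, domain) string triples.
    tokenize: maps a sentence to a list of tokens. The default splits into
    characters (an assumption, not necessarily the dataset's tokenization).
    """
    # Per domain: [pair count, classical token count, modern token count]
    totals = defaultdict(lambda: [0, 0, 0])
    for classical, modern, domain in records:
        row = totals[domain]
        row[0] += 1
        row[1] += len(tokenize(classical))
        row[2] += len(tokenize(modern))
    return {
        domain: {
            "pairs": n,
            "cla/s": round(cla_tokens / n, 1),
            "mod/s": round(mod_tokens / n, 1),
        }
        for domain, (n, cla_tokens, mod_tokens) in totals.items()
    }
```

A word-level tokenizer can be passed as `tokenize` if one is available; the averages depend heavily on that choice.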

Directory Format

The dataset is organized in the following directory format:

SCUT-C2MChn
├── train
│   ├── train.cch
│   ├── train.mch
│   └── train.domain
├── valid
│   ├── valid.cch
│   ├── valid.mch
│   └── valid.domain
└── test
    ├── test.cch
    ├── test.mch
    └── test.domain
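The layout above suggests that each split can be read as three parallel files. The following is a minimal sketch, not an official loader. It assumes that the files are UTF-8 text with one sentence per line, that line i of the `.cch`, `.mch`, and `.domain` files describes the same pair, and that `.cch` holds the classical side and `.mch` the modern side. The names `Pair` and `read_split` are hypothetical.

```python
from pathlib import Path
from typing import Iterator, List, NamedTuple


class Pair(NamedTuple):
    classical: str
    modern: str
    domain: str


def _read_lines(path: Path) -> List[str]:
    # Assumption: UTF-8 encoding, one entry per line.
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_split(root, split: str) -> Iterator[Pair]:
    """Yield aligned records for one split ("train", "valid" or "test").

    Assumption: the three files of a split are line-aligned.
    """
    base = Path(root) / split
    columns = [
        _read_lines(base / f"{split}.{ext}") for ext in ("cch", "mch", "domain")
    ]
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"line counts differ in {base}: {sorted(lengths)}")
    for classical, modern, domain in zip(*columns):
        yield Pair(classical, modern, domain)
```

For example, `pairs = list(read_split("SCUT-C2MChn", "valid"))` would collect the validation split, assuming the archive was extracted into the current directory. The line-count check is there to fail early if the alignment assumption does not hold.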

Citation and Contact

Please consider citing our paper if you use our dataset:

@InProceedings{Jiang_2023_NLPCC,
    title     = {Towards Better Translations from Classical to Modern Chinese: A New Dataset and a New Method},
    author    = {Jiang, Zongyuan and Wang, Jiapeng and Cao, Jiahuan and Gao, Xue and Jin, Lianwen},
    booktitle = {CCF International Conference on Natural Language Processing and Chinese Computing},
    pages={387--399},
    year={2023},
    organization={Springer}
}