SCUT-C2MChn Dataset

April 21, 2024

The C2MChn dataset for research on machine translation from classical to modern Chinese has been released by the Deep Learning and Visual Computing Lab (DLVC) of South China University of Technology. To the best of our knowledge, this is the first high-quality and comprehensive dataset that covers not only traditional history books but also Buddhist, Confucian, and Taoist classics, as well as some other domains.

The full dataset can be downloaded through the following link:

Note:

The C2MChn dataset can only be used for non-commercial research purposes.

The right of final interpretation belongs to DLVC.

License

The C2MChn dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

Statistics

Domain	Total pairs	Total cla/s	Total mod/s	Training pairs	Training cla/s	Training mod/s	Validation pairs	Validation cla/s	Validation mod/s	Test pairs	Test cla/s	Test mod/s
History	283600	19.8	28.7	273969	19.8	28.7	4707	20.2	29.4	4924	20.5	29.9
Buddhism	223602	18.2	28.6	216842	18.1	28.5	3509	18.9	29.8	3251	20.4	32.5
Confucianism	49337	18.3	30.5	47665	18.3	30.4	824	18.6	30.7	858	18.9	31.6
Taoism	10433	19.9	32.3	10061	19.9	32.7	187	20.1	32.6	185	19.4	31.4
Agronomy	9748	12.4	18.6	9459	12.3	18.5	162	12.8	19.4	127	14.0	20.8
Short	6820	16.5	30.2	6623	16.5	30.2	99	15.6	29.5	98	17.5	31.3
Others	31183	20.1	32.1	30114	20.1	32.0	512	21.3	33.8	557	21.1	33.2
All	614723	18.9	28.9	594723	18.9	28.8	10000	19.5	29.8	10000	20.3	31.0

Note: “pairs”, “cla/s”, and “mod/s” denote the number of parallel sentence pairs, the average number of word tokens per classical Chinese sentence, and the average number of word tokens per modern Chinese sentence, respectively.
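As a rough illustration of how these three columns are defined, the sketch below recomputes them from a list of (classical, modern, domain) string triples. The function names `count_chars` and `domain_statistics` are hypothetical, and the tokenizer is an assumption: the tokenization behind the "word tokens" figures is not described here, so the default simply counts non-whitespace characters and will not necessarily reproduce the numbers in the table.

```python
from collections import defaultdict


def count_chars(sentence):
    # Placeholder tokenizer (an assumption): counts non-whitespace characters.
    return sum(1 for ch in sentence if not ch.isspace())


def domain_statistics(triples, count_tokens=count_chars):
    """Map each domain, plus an "All" row, to (pairs, cla/s, mod/s).

    `triples` is an iterable of (classical, modern, domain) strings.
    """
    # Per key: [number of pairs, classical token total, modern token total]
    totals = defaultdict(lambda: [0, 0, 0])
    for classical, modern, domain in triples:
        for key in (domain, "All"):
            totals[key][0] += 1
            totals[key][1] += count_tokens(classical)
            totals[key][2] += count_tokens(modern)
    # Averages are rounded to one decimal place, as in the table.
    return {
        key: (pairs, round(cla / pairs, 1), round(mod / pairs, 1))
        for key, (pairs, cla, mod) in totals.items()
    }
```

Passing a different `count_tokens` callable (for example, one that wraps a word segmenter) changes only how the averages are computed; the pair counts are unaffected.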

Directory Format

The dataset is organized in the following directory format:

SCUT-C2MChn
├── train
│   ├── train.cch
│   ├── train.mch
│   └── train.domain
├── valid
│   ├── valid.cch
│   ├── valid.mch
│   └── valid.domain
└── test
    ├── test.cch
    ├── test.mch
    └── test.domain
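A minimal sketch of reading one split from this layout is shown below. It assumes, without confirmation from the description above, that the `.cch`, `.mch`, and `.domain` files are UTF-8 text and line-aligned, so that line i of each file gives the classical sentence, its modern translation, and the domain label of the same pair. The function name `load_split` is hypothetical.

```python
from pathlib import Path


def load_split(root, split):
    """Return a list of (classical, modern, domain) triples for one split.

    Assumes <root>/<split>/<split>.cch, .mch and .domain are line-aligned
    UTF-8 text files (an assumption, not a documented guarantee).
    """
    split_dir = Path(root) / split
    columns = []
    for ext in ("cch", "mch", "domain"):
        path = split_dir / f"{split}.{ext}"
        with path.open(encoding="utf-8") as f:
            columns.append([line.rstrip("\n") for line in f])

    # Fail early if the files do not have the same number of lines.
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        raise ValueError(f"{split}: files are not line-aligned: {sorted(lengths)}")
    return list(zip(*columns))
```

For example, `load_split("SCUT-C2MChn", "valid")` would return the validation pairs if the dataset has been extracted into the current directory under that name.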

Citation and Contact

Please consider citing our paper if you use our dataset:

@InProceedings{Jiang_2023_NLPCC,
    title        = {Towards Better Translations from Classical to Modern Chinese: A New Dataset and a New Method},
    author       = {Jiang, Zongyuan and Wang, Jiapeng and Cao, Jiahuan and Gao, Xue and Jin, Lianwen},
    booktitle    = {CCF International Conference on Natural Language Processing and Chinese Computing},
    pages        = {387--399},
    year         = {2023},
    organization = {Springer}
}