ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
March 19, 2025 · View on GitHub
⭐ Introduction
This repository contains the ChildMandarin dataset, a comprehensive Mandarin speech dataset specifically designed for young children aged 3 to 5. This dataset aims to address the scarcity of resources in this area and facilitate research in child speech recognition, speaker verification, and related fields.
🚀 Dataset Details
- Age Range: 3-5 years old
- Total Duration: 41.25 hours
- Number of Speakers: 397
- Geographic Coverage: 22 out of 34 provincial-level administrative divisions in China
- Gender Distribution: Balanced across all age groups
- Recording Devices: Smartphones (Android and iPhone)
- Recording Environment: Quiet indoor environments
- Annotation: Character-level manual transcriptions, age, gender, birthplace, device, accent level.
- Content: Unrestricted, focusing on age-appropriate daily communication.
- Data Format: WAV PCM, 16kHz sampling rate, 16-bit precision
Dataset Statistics
| Split | # Speakers | # Utterances | Duration (hrs) | Avg. Utterance Length (s) |
|---|---|---|---|---|
| Train | 317 | 32,658 | 33.35 | 3.68 |
| Dev | 39 | 4,057 | 3.78 | 3.35 |
| Test | 41 | 4,198 | 4.12 | 3.53 |
| Sum | 397 | 40,913 | 41.25 | 3.52 |
More details could be found in our paper ChildMandarin
📐 Experiments
We conducted experiments on Automatic Speech Recognition (ASR) and Speaker Verification (SV) tasks to evaluate the dataset.
1️⃣ ASR Results
Models Trained from Scratch
| Encoder | Loss | # Params | Greedy | Beam | Attention | Attention Rescoring |
|---|---|---|---|---|---|---|
| Transformer | CTC+AED | 29M | 34.55 | 34.4 | 40.61 | 32.15 |
| Conformer | CTC+AED | 31M | 28.73 | 28.72 | 31.60 | 27.38 |
| Conformer | RNN-T+AED | 45M | 37.11 | 37.14 | 33.84 | 37.14 |
| Paraformer | Paraformer | 30M | 31.86 | 28.94 | - | - |
Fine-tuned Pre-trained Models
| Model | # Params | Zero-shot | Fine-tuning |
|---|---|---|---|
| CW | 122M | 18.05 | 13.66 |
| Whisper-tiny | 39M | 67.63 | 28.78 |
| Whisper-base | 74M | 51.49 | 23.33 |
| Whisper-small | 244M | 37.99 | 17.45 |
| Whisper-medium | 769M | 28.55 | 18.97 |
| Whisper-large-v2 | 1,550M | 29.43 | - |
More Pre-trained Models
| Model | # Params | Zero-shot |
|---|---|---|
| Qwen-Audio | 7.7B | 20.39 |
| Qwen2-Audio | 8.2B | 11.54 |
| SenseVoice (Small) | 234M | 11.89 |
2️⃣ SV Results
| Model | # Params | Dim | Dev (%) | EER (%) | minDCF | EER (%) | minDCF |
|---|---|---|---|---|---|---|---|
| x-vector | 4.2M | 512 | 75.4 | 8.91 | 0.7198 | 25.92 | 0.9780 |
| ECAPA-TDNN | 20.8M | 192 | 84.6 | 13.72 | 0.8697 | 27.77 | 0.9490 |
| ResNet-TDNN | 15.5M | 256 | 91.9 | 9.57 | 0.6597 | 22.11 | 0.9044 |
🤗 Dataset Download
You can access the ChildMandarin dataset on HuggingFace Datasets:
https://huggingface.co/datasets/BAAI/ChildMandarin
📚 Cite me
@article{zhou2024childmandarin,
title={ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5},
author={Zhou, Jiaming and Wang, Shiyao and Zhao, Shiwan and He, Jiabei and Sun, Haoqin and Wang, Hui and Liu, Cheng and Kong, Aobo and Guo, Yujie and Qin, Yong},
journal={arXiv preprint arXiv:2409.18584},
year={2024}
}