ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5

March 19, 2025

License: CC BY-NC-SA-4.0

⭐ Introduction

This repository contains ChildMandarin, a Mandarin speech dataset designed specifically for young children aged 3 to 5. It aims to address the scarcity of speech resources for this age group and to support research in child speech recognition, speaker verification, and related fields.

🚀 Dataset Details

  • Age Range: 3-5 years old
  • Total Duration: 41.25 hours
  • Number of Speakers: 397
  • Geographic Coverage: 22 out of 34 provincial-level administrative divisions in China
  • Gender Distribution: Balanced across all age groups
  • Recording Devices: Smartphones (Android and iPhone)
  • Recording Environment: Quiet indoor environments
  • Annotation: Character-level manual transcriptions, age, gender, birthplace, device, accent level.
  • Content: Unrestricted, focusing on age-appropriate daily communication.
  • Data Format: WAV PCM, 16kHz sampling rate, 16-bit precision
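
As a quick sanity check on the stated data format, the sketch below reads a WAV header with Python's standard `wave` module and compares it against 16 kHz, 16-bit PCM. The helper names (`describe_wav`, `matches_stated_format`, `write_test_tone`) are illustrative and not part of the dataset tooling, and the demo file is a synthetic tone so the check can be tried without the dataset. The channel count is not stated above, so it is reported but not checked.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # 16 kHz, per the Data Format bullet
EXPECTED_WIDTH = 2      # 16-bit PCM = 2 bytes per sample


def describe_wav(path):
    """Return (sample_rate, sample_width_bytes, channels, duration_s)."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, width, channels, duration


def matches_stated_format(path):
    """True if the file is 16 kHz with 16-bit samples."""
    rate, width, _channels, _duration = describe_wav(path)
    return rate == EXPECTED_RATE and width == EXPECTED_WIDTH


def write_test_tone(path, seconds=1.0, rate=EXPECTED_RATE):
    """Write a mono 440 Hz tone (synthetic stand-in for a real utterance)."""
    n = int(seconds * rate)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / rate))
        for i in range(n)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(EXPECTED_WIDTH)
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.wav"
    write_test_tone(demo)
    demo_info = describe_wav(demo)
    demo_ok = matches_stated_format(demo)
```

To check real data, call `matches_stated_format` on each downloaded `.wav` path; a file that fails the check would need resampling or conversion before it is mixed with the rest of the corpus.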

Dataset Statistics

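| Split | # Speakers | # Utterances | Duration (hrs) | Avg. Utterance Length (s) |
|-------|------------|--------------|----------------|---------------------------|
| Train | 317 | 32,658 | 33.35 | 3.68 |
| Dev | 39 | 4,057 | 3.78 | 3.35 |
| Test | 41 | 4,198 | 4.12 | 3.53 |
| Sum | 397 | 40,913 | 41.25 | 3.52 |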
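
The average utterance lengths in the table above can be reproduced from the duration and utterance counts. The short sketch below does that arithmetic with the values copied from the table; the variable names are illustrative. It also suggests that the 3.52 s in the Sum row corresponds to the unweighted mean of the three per-split averages, whereas dividing total duration by total utterances gives roughly 3.63 s.

```python
# (speakers, utterances, duration in hours), copied from the table above
SPLITS = {
    "Train": (317, 32_658, 33.35),
    "Dev": (39, 4_057, 3.78),
    "Test": (41, 4_198, 4.12),
}


def avg_utterance_seconds(utterances, hours):
    """Average utterance length in seconds."""
    return hours * 3600 / utterances


per_split = {
    name: round(avg_utterance_seconds(utts, hours), 2)
    for name, (_speakers, utts, hours) in SPLITS.items()
}

total_speakers = sum(s for s, _u, _h in SPLITS.values())
total_utterances = sum(u for _s, u, _h in SPLITS.values())
total_hours = round(sum(h for _s, _u, h in SPLITS.values()), 2)

# Two ways to summarise the whole corpus:
weighted_avg = round(avg_utterance_seconds(total_utterances, total_hours), 2)
unweighted_avg = round(sum(per_split.values()) / len(per_split), 2)

print(per_split)
print(total_speakers, total_utterances, total_hours)
print(weighted_avg, unweighted_avg)
```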

More details can be found in our paper, ChildMandarin (arXiv:2409.18584).

📐 Experiments

We conducted experiments on Automatic Speech Recognition (ASR) and Speaker Verification (SV) tasks to evaluate the dataset.

1️⃣ ASR Results

Models Trained from Scratch

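| Encoder | Loss | # Params | Greedy | Beam | Attention | Attention Rescoring |
|---------|------|----------|--------|------|-----------|---------------------|
| Transformer | CTC+AED | 29M | 34.55 | 34.4 | 40.61 | 32.15 |
| Conformer | CTC+AED | 31M | 28.73 | 28.72 | 31.60 | 27.38 |
| Conformer | RNN-T+AED | 45M | 37.11 | 37.14 | 33.84 | 37.14 |
| Paraformer | Paraformer | 30M | 31.86 | 28.94 | - | - |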
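
This README does not name the metric behind the ASR numbers. Since the transcriptions are character-level, a reasonable assumption is character error rate (CER, in percent). The sketch below shows how a corpus-level CER is commonly computed from edit distance; it is an illustration under that assumption, not the evaluation script used for the results, and the two sentence pairs are made up for the demo.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference character
                    curr[j - 1] + 1,     # insertion of a hypothesis character
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars


# Made-up reference/hypothesis pairs, not taken from the dataset.
refs = ["我想吃苹果", "小猫在睡觉"]
hyps = ["我想吃苹果", "小猫睡觉了"]
demo_cer = cer(refs, hyps)
print(f"CER: {demo_cer:.2f}%")
```

Summing edits and reference lengths over the whole set, rather than averaging per-utterance rates, keeps short utterances from dominating the score, which matters for a corpus whose utterances average only a few seconds.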

Fine-tuned Pre-trained Models

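| Model | # Params | Zero-shot | Fine-tuning |
|-------|----------|-----------|-------------|
| CW | 122M | 18.05 | 13.66 |
| Whisper-tiny | 39M | 67.63 | 28.78 |
| Whisper-base | 74M | 51.49 | 23.33 |
| Whisper-small | 244M | 37.99 | 17.45 |
| Whisper-medium | 769M | 28.55 | 18.97 |
| Whisper-large-v2 | 1,550M | 29.43 | - |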
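
To compare how much each model gains from fine-tuning, one option is the relative reduction between the zero-shot and fine-tuned columns. The sketch below computes it from the values in the table above, assuming the numbers are error rates where lower is better; Whisper-large-v2 is left out because it has no fine-tuned result.

```python
# (zero-shot, fine-tuned) values copied from the table above
RESULTS = {
    "CW": (18.05, 13.66),
    "Whisper-tiny": (67.63, 28.78),
    "Whisper-base": (51.49, 23.33),
    "Whisper-small": (37.99, 17.45),
    "Whisper-medium": (28.55, 18.97),
}


def relative_reduction(zero_shot, fine_tuned):
    """Relative error reduction in percent."""
    return 100.0 * (zero_shot - fine_tuned) / zero_shot


reductions = {
    model: relative_reduction(zs, ft) for model, (zs, ft) in RESULTS.items()
}
largest_gain = max(reductions, key=reductions.get)
lowest_error = min(RESULTS, key=lambda model: RESULTS[model][1])

for model, value in reductions.items():
    print(f"{model}: {value:.1f}% relative reduction")
```

Relative reduction and absolute error answer different questions: under this calculation the smallest Whisper model improves the most in relative terms, while the lowest fine-tuned number in the table belongs to CW.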

More Pre-trained Models

| Model | # Params | Zero-shot |
|-------|----------|-----------|
| Qwen-Audio | 7.7B | 20.39 |
| Qwen2-Audio | 8.2B | 11.54 |
| SenseVoice (Small) | 234M | 11.89 |

2️⃣ SV Results

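| Model | # Params | Dim | Dev (%) | EER (%) | minDCF | EER (%) | minDCF |
|-------|----------|-----|---------|---------|--------|---------|--------|
| x-vector | 4.2M | 512 | 75.4 | 8.91 | 0.7198 | 25.92 | 0.9780 |
| ECAPA-TDNN | 20.8M | 192 | 84.6 | 13.72 | 0.8697 | 27.77 | 0.9490 |
| ResNet-TDNN | 15.5M | 256 | 91.9 | 9.57 | 0.6597 | 22.11 | 0.9044 |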
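
For readers unfamiliar with the EER columns, the equal error rate is the operating point at which the false-accept rate and the false-reject rate of a verification system are equal. The sketch below is a minimal, pure-Python illustration that sweeps over candidate thresholds; it is not the scoring tool used for the table, the scores are made up, and minDCF is not covered because its cost parameters are not stated here.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (eer_percent, threshold) at the point where the false-accept
    and false-reject rates are closest. A trial is accepted when its score
    is greater than or equal to the threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # Same-speaker trials wrongly rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # Different-speaker trials wrongly accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    _gap, eer, threshold = best
    return 100.0 * eer, threshold


# Made-up similarity scores: higher means "more likely the same speaker".
target = [0.9, 0.8, 0.7, 0.4]     # same-speaker trials
nontarget = [0.6, 0.3, 0.2, 0.1]  # different-speaker trials
demo_eer, demo_threshold = compute_eer(target, nontarget)
print(f"EER: {demo_eer:.1f}% at threshold {demo_threshold}")
```

This brute-force sweep is quadratic in the number of trials, which is fine for illustration; for large trial lists a sorted single-pass or ROC-based implementation would be the more practical choice.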

🤗 Dataset Download

You can access the ChildMandarin dataset on HuggingFace Datasets:

https://huggingface.co/datasets/BAAI/ChildMandarin
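
One possible way to fetch the files programmatically is the `huggingface_hub` client, sketched below. It assumes `pip install huggingface_hub`; the function name `download_childmandarin` and the default `local_dir` are illustrative, and the repository layout is not described in this README, so inspect the downloaded folder before relying on specific paths. If the dataset page requires accepting terms or logging in, pass an access token.

```python
REPO_ID = "BAAI/ChildMandarin"  # taken from the URL above


def download_childmandarin(local_dir="ChildMandarin", token=None):
    """Download a snapshot of the dataset repository and return its local path.

    `local_dir` and `token` are caller-supplied; nothing is downloaded until
    this function is called.
    """
    # Imported lazily so the module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
        token=token,
    )
```

Calling `download_childmandarin()` returns the path of the local copy, which can then be walked with `pathlib` to locate the audio and transcription files.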

📚 Cite me

@article{zhou2024childmandarin,
  title={ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5},
  author={Zhou, Jiaming and Wang, Shiyao and Zhao, Shiwan and He, Jiabei and Sun, Haoqin and Wang, Hui and Liu, Cheng and Kong, Aobo and Guo, Yujie and Qin, Yong},
  journal={arXiv preprint arXiv:2409.18584},
  year={2024}
}