ASR-Training-Data-Chunker
April 13, 2025 ยท View on GitHub
A Python tool for chunking text files into smaller segments based on speaking speed and desired duration, optimized for ASR (Automatic Speech Recognition) training data preparation.
Purpose
This tool helps you split large text files into smaller chunks that:
- Take a specific amount of time to read aloud based on your speaking speed
- Break at logical sentence boundaries to maintain context
- Can be used for recording voice samples for ASR training
Installation
No installation required. Simply clone this repository:
git clone https://github.com/danielrosehill/ASR-Training-Data-Chunker.git
cd ASR-Training-Data-Chunker
Usage
- Place your source text file in the
chunker/source/directory assource.txt - Run the chunker script:
python chunker.py
-
Follow the prompts to enter:
- Your speaking speed in words per minute (WPM)
- Desired chunk duration in minutes
-
The script will create chunked files in the
chunker/chunked/directory
Determining Your Speaking Speed
You can measure your speaking speed using online tools like:
Average speaking speeds:
- Slow: ~120 WPM
- Average: ~150 WPM
- Fast: ~180+ WPM
Example
For a text with 9000 words:
- Speaking speed: 150 WPM
- Desired chunk duration: 2 minutes
The script will create approximately 30 chunks, each containing around 300 words and taking about 2 minutes to read aloud.
License
MIT