ted_preprocess
February 13, 2018
Preprocess the prosodically annotated TED corpus. Annotations of talks are prepared using: https://github.com/laic/prosody
Processing a single talk with tedDataToPickle.py:
### Input files:
- `.word.txt` (cmd input as `-w`)
- `.word.txt.norm.align` (cmd input as `-l`)
- `.aggs.alignword.txt` for fundamental frequency and intensity (cmd input as `-f` and `-i`)
### Output file:
- CSV file with word-aligned features (cmd input as `-o`)
Sample run:
python tedDataToPickle.py -w data/raw/txt-sent/0001.word.txt -l data/raw/txt-sent/0001.word.txt.norm.align -f data/raw/derived/segs/f0/0001.aggs.alignword.txt -i data/raw/derived/segs/i0/0001.aggs.alignword.txt -o 0001.csv
Batch process talks:
./processAllTedData.sh data/raw data/compiled
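For illustration, the per-talk command from the sample run can be generated for every talk found under the raw data directory. This is a minimal Python sketch, not a description of what `processAllTedData.sh` actually does. It assumes the directory layout shown in the sample run (`txt-sent/`, `derived/segs/f0/`, `derived/segs/i0/`) and assumes one output CSV per talk named after the talk id. The helper names `build_command`, `talk_ids`, and `process_all` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(raw_dir, talk_id, out_dir):
    """Build the tedDataToPickle.py argument list for one talk."""
    raw = Path(raw_dir)
    segs = raw / "derived" / "segs"
    return [
        "python", "tedDataToPickle.py",
        "-w", str(raw / "txt-sent" / f"{talk_id}.word.txt"),
        "-l", str(raw / "txt-sent" / f"{talk_id}.word.txt.norm.align"),
        "-f", str(segs / "f0" / f"{talk_id}.aggs.alignword.txt"),
        "-i", str(segs / "i0" / f"{talk_id}.aggs.alignword.txt"),
        # Assumption: one CSV per talk, named after the talk id.
        "-o", str(Path(out_dir) / f"{talk_id}.csv"),
    ]


def talk_ids(raw_dir):
    """List talk ids by scanning txt-sent/ for *.word.txt files."""
    suffix = ".word.txt"
    word_files = sorted((Path(raw_dir) / "txt-sent").glob("*" + suffix))
    return [p.name[: -len(suffix)] for p in word_files]


def process_all(raw_dir, out_dir, dry_run=True):
    """Return the commands for all talks; run them unless dry_run is set."""
    commands = [build_command(raw_dir, t, out_dir) for t in talk_ids(raw_dir)]
    if not dry_run:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `process_all("data/raw", "data/compiled")` with the default `dry_run=True` only returns the command lists, which makes it easy to inspect them before running anything.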
Obtaining a punkProse-processable corpus
To collect samples from talks into one corpus partitioned into training/development/testing sets:
python corpusMaker.py -i data/compiled/ -o data/corpus -r 0.7 -v 2 -l 50
(Training and development sets are sampled into sequences of size 50 (-l). The training set constitutes 0.7 (-r) of all data. The word vocabulary is created with a minimum word occurrence of 2 (-v).)
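To make the three parameters concrete, here is a small Python sketch of what a training ratio, a minimum word occurrence, and a fixed sequence length mean in practice. It is an illustration only and does not reproduce the logic of `corpusMaker.py`. The function names are hypothetical, and two details are assumptions: the ratio is applied to a list of talks, and a trailing remainder shorter than the sequence length is dropped.

```python
from collections import Counter


def split_train(talks, train_ratio=0.7):
    """Split a list of talks into a training part and the held-out rest.

    Assumption: the ratio is applied to the number of talks.
    """
    cut = round(len(talks) * train_ratio)
    return talks[:cut], talks[cut:]


def build_vocabulary(words, min_count=2):
    """Keep only words that occur at least min_count times."""
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c >= min_count)


def to_sequences(words, length=50):
    """Cut a word list into consecutive sequences of a fixed length.

    Assumption: a trailing remainder shorter than `length` is dropped.
    """
    return [words[i:i + length]
            for i in range(0, len(words) - length + 1, length)]
```

For example, with 10 talks and a ratio of 0.7, `split_train` puts 7 talks in the training part and leaves 3 for development and testing. How that remainder is divided between the two is not stated here, so the sketch leaves it unsplit.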