README.md

July 22, 2021

Vāksañcayaḥ, a Sanskrit speech corpus, contains more than 78 hours of data: recordings of 45,953 sentences at a sampling rate of 22 kHz. The content is mainly readings of texts spanning many Śāstras of Saṃskṛt literature, and also includes contemporary stories, radio programs, extempore discourse, etc. The summary datasheet associated with this corpus can be accessed here - Link. Please download the corpus from https://www.cse.iitb.ac.in/~asr/.

Environments

Python version: 3.7.3

Model files

- The lists of speakers used in the train, validation, test, and out-of-domain test splits are given in the README file of the corpus.
- SRILM LM link

Results for different models

- In-domain test data WER: 21.94 for the best-performing model (SLP1 as the script and BPE splits as the LM unit).
- Out-of-domain test data WER for different speakers is reported in the paper.
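
For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal Python sketch of that definition, assuming the figures above are percentages. The function name `word_error_rate` is illustrative and is not part of the recipe's scoring scripts.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate (in percent) of hypothesis against reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For example, `word_error_rate("rAmaH vanam gacCati", "rAmaH vanaM gacCati")` counts one substitution over three reference words.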

Recipe

This Kaldi recipe is based on subword units: vowel split and byte pair encoding (BPE). For the word-based model, we used the Wall Street Journal recipe.
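
To illustrate what BPE units are, here is a minimal Python sketch of the generic byte pair encoding procedure: repeatedly merge the most frequent adjacent symbol pair in the training vocabulary, then replay those merges on new words. It is not the code used to build the models in this repository. The function names, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration.

```python
import collections


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = collections.Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = collections.Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = collections.Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

In practice the merges would be learned from the SLP1 training transcripts, and the number of merges controls the size of the subword vocabulary.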

Training

Download the vowel splitter (this requires the text to be in SLP1 format)
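
As a rough illustration of why SLP1 input matters: SLP1 writes every vowel, including long vowels and diphthongs, as a single ASCII character, so a word can be cut after each vowel with a simple scan. The sketch below is an assumption about what a vowel split looks like and is not the linked splitter, which may apply different rules. In particular, attaching anusvāra (`M`) and visarga (`H`) to the preceding vowel, and attaching trailing consonants to the last unit, are illustrative choices.

```python
# SLP1 vowels: a A i I u U f F x X e E o O (one character each).
SLP1_VOWELS = set("aAiIuUfFxXeEoO")
# Anusvara (M) and visarga (H); kept with the preceding vowel in this sketch.
SLP1_VOWEL_MARKS = set("MH")


def vowel_split(word):
    """Split an SLP1 word into units that each end at a vowel (hypothetical rule set)."""
    units = []
    current = ""
    i = 0
    while i < len(word):
        ch = word[i]
        current += ch
        if ch in SLP1_VOWELS:
            # Pull in any anusvara/visarga that directly follows the vowel.
            while i + 1 < len(word) and word[i + 1] in SLP1_VOWEL_MARKS:
                i += 1
                current += word[i]
            units.append(current)
            current = ""
        i += 1
    if current:
        # Leftover consonants at the end of the word.
        if units:
            units[-1] += current
        else:
            units.append(current)
    return units
```

With these rules, `vowel_split("rAmaH")` returns `['rA', 'maH']`, and joining the units always reproduces the input word.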

Download the pre-trained model

Download the processed dataset

Before testing, convert the test audio files from .mp3 to .wav using the script provided with the corpus.
We used our best-performing model (SLP1 as the script and BPE splits as the LM unit) for testing on out-of-domain data.
In-domain test data link (test.zip)
Out-of-domain test data link (truetest.zip)
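
The conversion script shipped with the corpus is the recommended way to produce the .wav files. If it is unavailable, the conversion step can be approximated as in the Python sketch below, which calls `ffmpeg` (assumed to be installed and on the PATH). The function names, the 22050 Hz output rate (a guess based on the "22 kHz" stated above), and the mono output are assumptions and do not describe the corpus script.

```python
import pathlib
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=22050):
    """Build the ffmpeg argument list for one mp3 -> wav conversion."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input file
        "-ar", str(sample_rate),  # output sampling rate in Hz (assumed value)
        "-ac", "1",               # mono output (assumption)
        str(dst),
    ]


def convert_directory(mp3_dir, wav_dir, sample_rate=22050):
    """Convert every .mp3 file in mp3_dir to a .wav file in wav_dir."""
    mp3_dir = pathlib.Path(mp3_dir)
    wav_dir = pathlib.Path(wav_dir)
    wav_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(mp3_dir.glob("*.mp3")):
        dst = wav_dir / (src.stem + ".wav")
        # check=True raises CalledProcessError if ffmpeg fails on a file.
        subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
        converted.append(dst)
    return converted
```

A call such as `convert_directory("test_mp3", "test_wav")` (both directory names hypothetical) would write one .wav file per .mp3 file.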

Evaluate

From pre-trained model (SLP vowel split)

./decode.sh test
# | WER : 18.12
./decode.sh truetest
# | WER : 34.88

Publications

Devaraja Adiga, Rishabh Kumar, Amrith Krishna, Preethi Jyothi, Ganesh Ramakrishnan, and Pawan Goyal. Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights. In ACL 2021.