Vakyansh Open Source Models
September 16, 2022
- Pretrained ASR Models
- Finetuned ASR Models
- Language Models
- Punctuation Models
- TTS Models
- Gender Classification Model
- Language Identification Models
- Interspeech 2021 ASR Models
Pretrained ASR Models
| Pretrained Model | Description | Architecture | Hours |
|---|---|---|---|
| Vakyansh-Conformer-SSL | Pre-trained with the NeMo toolkit on 34,000 hours of unlabeled audio in 39 Indian languages. This includes 15,000 hours of news recordings available on the internet, plus 10,000 hours of YouTube audio and other audio data. In addition, 9,000 hours of Indian English audio were taken from NPTEL lectures open sourced by AI4Bharat. This model was trained in collaboration with NVIDIA (NVIDIA Graphics Pvt Ltd). We thank NVIDIA for providing the compute resources to train this model. | Conformer-Large | 34,000 |
| CLSRIL-23 | Cross Lingual Speech Representations for Indic Languages. Trained on 10,000 hours of data from 23 Indic languages. Citation: https://arxiv.org/abs/2107.07402 | wav2vec2-Base | 10,000 |
| hindi_pretrained_4kh | Trained on 4,200 hours of Hindi data | wav2vec2-Base | 4,200 |
| kannada_pretrained_1400h | Trained on 1,400 hours of Kannada data | wav2vec2-XLSR | 1,400 |
Finetuned ASR Models
Conformer based models
| Language | Pretrained Model | Finetuned Model | Finetuned Hours | Arch |
|---|---|---|---|---|
| Hindi | Vakyansh Conformer SSL | hindi_large_ssl_2500 | 2,500 h | Large |
| Indian English | Vakyansh Conformer SSL | indian_en_large_ssl_700 | 700 h | Large |
| Kannada | Vakyansh Conformer SSL | kannada_large_ssl_1000 | 1,000 h | Large |
| Punjabi | Vakyansh Conformer SSL | punjabi_large_ssl_500 | 500 h | Large |
| Tamil | Vakyansh Conformer SSL | tamil_large_ssl_900 | 900 h | Large |
wav2vec2 based models
Citation: https://arxiv.org/abs/2203.16512
Language Models
These language models are meant to be used together with the finetuned ASR models.
Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.
Domain Specific Language Models
| Language | Type | Domain | Lexicon | LM | Text Corpus |
|---|---|---|---|---|---|
| English | kenlm 5-gram | Biomedical | bio_lexicon | bio_lm | bio_lm_eng_text |
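
As a rough illustration of how an n-gram language model can be combined with a finetuned ASR model at decoding time, the sketch below rescores candidate transcripts with a weighted sum of acoustic and language-model log scores. This is a common shallow-fusion pattern and not necessarily the exact Vakyansh decoder. The function name `rescore`, the weights `lm_weight` and `word_bonus`, and the toy scores are all assumptions made for illustration; a real setup would replace `toy_lm_score` with a trained 5-gram model.

```python
from typing import Callable, List, Tuple


def rescore(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
    word_bonus: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return (text, combined_score) pairs sorted best first.

    combined = acoustic_logp + lm_weight * lm_logp + word_bonus * word_count
    """
    rescored = []
    for text, acoustic_logp in hypotheses:
        n_words = len(text.split())
        total = acoustic_logp + lm_weight * lm_score(text) + word_bonus * n_words
        rescored.append((text, total))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical stand-in for an n-gram LM: log scores for two toy sentences.
TOY_LM = {
    "the patient has fever": -4.0,
    "the patient has fewer": -9.0,
}


def toy_lm_score(text: str) -> float:
    # Unknown sentences get a low default score (an arbitrary choice).
    return TOY_LM.get(text, -20.0)


# The acoustic model slightly prefers the wrong word; the LM corrects it.
candidates = [
    ("the patient has fewer", -10.0),
    ("the patient has fever", -10.5),
]
best_text, best_score = rescore(candidates, toy_lm_score)[0]
print(best_text, best_score)
```

A domain-specific model such as the biomedical one is useful here because it assigns higher scores to in-domain wording that a general-purpose model would consider unlikely.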
Punctuation Models
| Language | Model | Data |
|---|---|---|
| Hindi | hi.zip | hindi_data |
| Assamese | as.zip | assamese_data |
| Bengali | bn.zip | bengali_data |
| Gujarati | gu.zip | gujarati_data |
| Kannada | kn.zip | kannada_data |
| Malayalam | ml.zip | malayalam_data |
| Marathi | mr.zip | marathi_data |
| Odia | or.zip | odia_data |
| Punjabi | pa.zip | punjabi_data |
| Tamil | ta.zip | tamil_data |
| Telugu | te.zip | telugu_data |
Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.
TTS Models
The models below were trained using a combination of Glow-TTS and HiFi-GAN.
| Language | Language Code | Gender | glow ckpt | hifi-gan ckpt |
|---|---|---|---|---|
| Hindi | hi | Female | hi_0_glow | hi_0_hifi |
| Hindi | hi | Male | hi_1_glow | hi_1_hifi |
| Kannada | kn | Female | kn_0_glow | kn_0_1_hifi |
| Kannada | kn | Male | kn_1_glow | kn_0_1_hifi |
| Tamil | ta | Female | ta_0_glow | ta_0_1_hifi |
| Tamil | ta | Male | ta_1_glow | ta_0_1_hifi |
| Telugu | te | Female | te_0_glow | te_0_1_hifi |
| Telugu | te | Male | te_1_glow | te_0_1_hifi |
| Odia | or | Female | or_0_glow | or_0_1_hifi |
| Odia | or | Male | or_1_glow | or_0_1_hifi |
| Malayalam | ml | Female | ml_0_glow | ml_0_hifi |
| Malayalam | ml | Male | ml_1_glow | ml_1_hifi |
| Marathi | mr | Female | mr_0_glow | mr_1_hifi |
| Gujarati | gu | Male | gu_0_glow | gu_0_hifi |
| Bengali | bn | Female | bn_0_glow | bn_0_1_hifi |
| Bengali | bn | Male | bn_1_glow | bn_0_1_hifi |
| English | en | Female | en_0_glow | en_0_hifi |
| English | en | Male | en_1_glow | en_1_hifi |
Dataset Credits: We thank IITM for open sourcing the Indic-TTS dataset. Link
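
Because several voices share a HiFi-GAN checkpoint and not every language has both genders, it can help to look up the checkpoint pair explicitly instead of deriving names from a pattern. The sketch below copies the table above into a dictionary. The helper name `get_tts_checkpoints` and the lowercase gender keys are assumptions for illustration, and the values are checkpoint names only, not download paths.

```python
from typing import Dict, Tuple

# (language code, gender) -> (Glow-TTS checkpoint, HiFi-GAN checkpoint),
# copied row by row from the TTS table.
TTS_CHECKPOINTS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("hi", "female"): ("hi_0_glow", "hi_0_hifi"),
    ("hi", "male"): ("hi_1_glow", "hi_1_hifi"),
    ("kn", "female"): ("kn_0_glow", "kn_0_1_hifi"),
    ("kn", "male"): ("kn_1_glow", "kn_0_1_hifi"),
    ("ta", "female"): ("ta_0_glow", "ta_0_1_hifi"),
    ("ta", "male"): ("ta_1_glow", "ta_0_1_hifi"),
    ("te", "female"): ("te_0_glow", "te_0_1_hifi"),
    ("te", "male"): ("te_1_glow", "te_0_1_hifi"),
    ("or", "female"): ("or_0_glow", "or_0_1_hifi"),
    ("or", "male"): ("or_1_glow", "or_0_1_hifi"),
    ("ml", "female"): ("ml_0_glow", "ml_0_hifi"),
    ("ml", "male"): ("ml_1_glow", "ml_1_hifi"),
    ("mr", "female"): ("mr_0_glow", "mr_1_hifi"),
    ("gu", "male"): ("gu_0_glow", "gu_0_hifi"),
    ("bn", "female"): ("bn_0_glow", "bn_0_1_hifi"),
    ("bn", "male"): ("bn_1_glow", "bn_0_1_hifi"),
    ("en", "female"): ("en_0_glow", "en_0_hifi"),
    ("en", "male"): ("en_1_glow", "en_1_hifi"),
}


def get_tts_checkpoints(lang_code: str, gender: str) -> Tuple[str, str]:
    """Return the (glow, hifi-gan) checkpoint names for a voice.

    Raises KeyError when the table has no entry for that combination.
    """
    key = (lang_code.lower(), gender.lower())
    if key not in TTS_CHECKPOINTS:
        raise KeyError(f"No TTS voice listed for {lang_code!r} / {gender!r}")
    return TTS_CHECKPOINTS[key]


glow_ckpt, hifi_ckpt = get_tts_checkpoints("kn", "Male")
print(glow_ckpt, hifi_ckpt)
```

Combinations the table does not list, such as a male Marathi voice, raise `KeyError` rather than returning a guessed name.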
Gender Classification Model
| Type | Model Type | Model |
|---|---|---|
| Gender Classification | SVC | Model |
Language Identification Models
| Type | Model |
|---|---|
| Hindi_vs_Others | Model |
| Tamil_vs_Others | Model |
Interspeech 2021 ASR Models
| Language | Pretrained Model | Finetuned Model | Dictionary | Single Model for Inference |
|---|---|---|---|---|
| Telugu | CLSRIL-23 | te_40h_interspeech | dict | telugu_infer_interspeech |
| Tamil | CLSRIL-23 | ta_40h_interspeech | dict | tamil_infer_interspeech |
| Gujarati | CLSRIL-23 | gu_40h_interspeech | dict | gujarati_infer_interspeech |
| Hinglish | CLSRIL-23 | hinglish_interspeech | dict | hinglish_infer_interspeech |
Citation
If you use any of our resources, please cite the following article:
@misc{chadha2022vakyansh,
title={Vakyansh: ASR Toolkit for Low Resource Indic languages},
author={Harveen Singh Chadha and Anirudh Gupta and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
year={2022},
eprint={2203.16512},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
If you use the pretrained model (CLSRIL-23), please cite the following article:
@misc{gupta2021clsril23,
title={CLSRIL-23: Cross Lingual Speech Representations for Indic Languages},
author={Anirudh Gupta and Harveen Singh Chadha and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
year={2021},
eprint={2107.07402},
archivePrefix={arXiv},
primaryClass={cs.CL}
}