Vakyansh Open Source Models

September 16, 2022

Pretrained ASR Models
Finetuned ASR Models
Language Models
Punctuation Models
TTS Models
Gender Classification Model
Language Identification Models
Interspeech 2021 ASR Models

Pretrained ASR Models

Pretrained Model	Description	Architecture	Hours
Vakyansh-Conformer-SSL	This model was pretrained with the NeMo toolkit on 34,000 hours of unlabeled audio in 39 Indian languages. This includes 15,000 hours of news recordings available on the internet and 10,000 hours of YouTube audio and other audio data. In addition, 9,000 hours of Indian English audio data were taken from NPTEL lectures open sourced by AI4Bharat. This model was trained in collaboration with NVIDIA (NVIDIA Graphics Pvt Ltd). We thank NVIDIA for providing the compute resources to train this model.	Conformer-Large	34,000
CLSRIL-23	Cross Lingual Speech Representations for Indic Languages. Pretrained on 10,000 hours of data from 23 Indic languages. Citation: https://arxiv.org/abs/2107.07402	wav2vec2-Base	10,000
hindi_pretrained_4kh	Trained on 4,200 hours of Hindi data	wav2vec2-Base	4,200
kannada_pretrained_1400h	Trained on 1,400 hours of Kannada data	wav2vec2-XLSR	1,400

Finetuned ASR Models

Conformer-based models

Repo

Language	Pretrained Model	Finetuned Model	Finetuned Hours	Arch
Hindi	Vakyansh Conformer SSL	hindi_large_ssl_2500	2,500 h	Large
Indian English	Vakyansh Conformer SSL	indian_en_large_ssl_700	700 h	Large
Kannada	Vakyansh Conformer SSL	kannada_large_ssl_1000	1,000 h	Large
Punjabi	Vakyansh Conformer SSL	punjabi_large_ssl_500	500 h	Large
Tamil	Vakyansh Conformer SSL	tamil_large_ssl_900	900 h	Large

wav2vec2-based models

Repo

Citation: https://arxiv.org/abs/2203.16512

Language	Pretrained Model	Finetuned Model	Dictionary	Single Model for Inference	Finetuned Hours	TS model
Hindi	CLSRIL-23	him_4200	dict	hindi_infer	4,200 h	hindi_ts
Indian English	CLSRIL-23	enm_700	dict	english_infer	700 h	english_ts
Kannada	CLSRIL-23	knm_560	dict	kannada_infer	560 h	kannada_ts
Tamil	CLSRIL-23	tam_250	dict	tamil_infer	250 h	tamil_ts
Bengali	CLSRIL-23	bnm_200	dict	bengali_infer	200 h	bengali_ts
Nepali	CLSRIL-23	nem_130	dict	nepali_infer	130 h	nepali_ts
Telugu	CLSRIL-23	tem_100	dict	telugu_infer	100 h	telugu_ts
Gujarati	CLSRIL-23	gum_100	dict	gujarati_infer	100 h	gujarati_ts
Marathi	CLSRIL-23	mrm_100	dict	marathi_infer	100 h	marathi_ts
Odia	CLSRIL-23	orm_100	dict	odia_infer	100 h	odia_ts
Sanskrit	CLSRIL-23	sam_60	dict	sanskrit_infer	60 h	sanskrit_ts
Maithili	CLSRIL-23	maim_50	dict	maithili_infer	50 h	maithili_ts
Urdu	CLSRIL-23	urm_60h	dict	urdu_infer	60 h	urdu_ts
Punjabi	CLSRIL-23	pam_10h	dict	punjabi_infer	10 h	punjabi_ts
Dogri	CLSRIL-23	doi_55h	dict	dogri_infer	55 h	dogri_ts
Malayalam	CLSRIL-23	mlm_8h	dict	malayalam_infer	8 h	malayalam_ts
Bhojpuri	CLSRIL-23	bhom_60h	dict	bhojpuri_infer	60 h	bhojpuri_ts
Assamese	CLSRIL-23	asm_8h	dict	assamese_infer	8 h	assamese_ts
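
The Dictionary column pairs each finetuned checkpoint with the character vocabulary needed to turn frame-level model outputs into text. The sketch below illustrates that step with plain greedy CTC decoding. It assumes a fairseq-style dictionary file (one `symbol count` pair per line, `|` as the word separator) and fairseq-style special symbols occupying the first four indices, with `<s>` acting as the CTC blank. These conventions, the function names, and the toy dictionary are assumptions for illustration, not something this page states about the released files.

```python
from itertools import groupby

# Assumption: fairseq-style special symbols come before the symbols
# read from the dictionary file.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]


def load_vocab(dict_lines):
    """Build an index -> symbol list from 'symbol count' lines."""
    vocab = list(SPECIALS)
    for line in dict_lines:
        line = line.strip()
        if line:
            vocab.append(line.split()[0])
    return vocab


def ctc_greedy_decode(frame_ids, vocab, blank="<s>", word_sep="|"):
    """Collapse repeated frame predictions, drop blanks, join characters."""
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    chars = [vocab[i] for i in collapsed if vocab[i] != blank]
    return "".join(chars).replace(word_sep, " ").strip()


# Toy example with a made-up four-symbol dictionary (hypothetical data).
toy_vocab = load_vocab(["| 100", "a 50", "b 30", "c 20"])
toy_frames = [5, 5, 0, 5, 6, 4, 4, 7, 0, 0]  # per-frame argmax indices
decoded = ctc_greedy_decode(toy_frames, toy_vocab)
```

In practice the frame indices would come from the argmax over the finetuned model's output logits. Greedy decoding uses no language model, so it is the simplest baseline rather than the best-performing setup.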

Language Models

Repo

These language models are meant to be used together with the finetuned ASR models during decoding.

Language	Type	Lexicon	LM	Text Corpus
Hindi	kenlm 5-gram	hindi_lexicon	hindi_lm	hindi_text
Indian English	kenlm 5-gram	english_lexicon	english_lm	english_text
Kannada	kenlm 5-gram	kannada_lexicon	kannada_lm	kannada_text
Tamil	kenlm 5-gram	tamil_lexicon	tamil_lm	tamil_text
Bengali	kenlm 5-gram	bengali_lexicon	bengali_lm	bengali_text
Nepali	kenlm 5-gram	nepali_lexicon	nepali_lm	nepali_text
Telugu	kenlm 5-gram	telugu_lexicon	telugu_lm	telugu_text
Gujarati	kenlm 5-gram	gujarati_lexicon	gujarati_lm	gujarati_text
Marathi	kenlm 5-gram	marathi_lexicon	marathi_lm	marathi_text
Odia	kenlm 5-gram	odia_lexicon	odia_lm	odia_text
Sanskrit	kenlm 5-gram	sanskrit_lexicon	sanskrit_lm	sanskrit_text
Maithili	kenlm 5-gram	maithili_lexicon	maithili_lm	maithili_text
Urdu	kenlm 5-gram	urdu_lexicon	urdu_lm	urdu_text
Punjabi	kenlm 5-gram	punjabi_lexicon	punjabi_lm	punjabi_text
Dogri	kenlm 5-gram	dogri_lexicon	dogri_lm	dogri_text
Malayalam	kenlm 5-gram	malayalam_lexicon	malayalam_lm	malayalam_text
Bhojpuri	kenlm 5-gram	bhojpuri_lexicon	bhojpuri_lm	bhojpuri_text
Rajasthani	kenlm 5-gram	rajasthani_lexicon	rajasthani_lm	rajasthani_text
Assamese	kenlm 5-gram	assamese_lexicon	assamese_lm	assamese_text
Hinglish	kenlm 5-gram	hinglish_lexicon	hinglish_lm	hinglish_text
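
One simple way to combine a KenLM model with ASR output is to rescore a list of candidate transcripts, adding a weighted language model score to each acoustic score. The sketch below shows that idea only. It is n-best rescoring, which is simpler than lexicon-based beam search decoding, and it is not claimed to be the decoding setup used by Vakyansh. The function names, the weight, and the toy scores are hypothetical. `kenlm.Model(path).score(...)` is the KenLM Python module's call for a log10 sentence probability.

```python
def rescore(hypotheses, lm_score, lm_weight=0.5):
    """Return the hypothesis text with the best combined score.

    hypotheses: list of (text, acoustic_log_score) pairs.
    lm_score:   callable mapping text to a log score.
    """
    def total(item):
        text, acoustic = item
        return acoustic + lm_weight * lm_score(text)

    return max(hypotheses, key=total)[0]


def load_kenlm_scorer(path):
    """Return a scoring function backed by a KenLM ARPA or binary file."""
    import kenlm  # third-party package, only needed for a real model

    model = kenlm.Model(path)
    return lambda text: model.score(text, bos=True, eos=True)


# Toy stand-in for a language model so the sketch runs without kenlm.
toy_lm = {"the cat sat": -2.0, "the cat sad": -6.0}
candidates = [("the cat sad", -1.0), ("the cat sat", -1.5)]
best = rescore(candidates, lambda t: toy_lm.get(t, -10.0), lm_weight=1.0)
```

With a downloaded model, `load_kenlm_scorer` would replace the toy dictionary. The weight would normally be tuned on held-out data.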

Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.
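
The preprocessing mentioned above (tokenizing and removing duplicates) can be pictured with a minimal sketch. The exact tokenizer used on the corpus is not stated here, so whitespace splitting stands in for it, and the function name is hypothetical.

```python
def clean_corpus(lines):
    """Whitespace-tokenize each line and drop duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for line in lines:
        # Splitting and re-joining normalizes runs of whitespace.
        normalized = " ".join(line.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


sample = ["a  b", "a b", "", "c"]
result = clean_corpus(sample)
```

Keeping the first occurrence preserves the original line order, which makes the cleaned corpus easier to compare against its source.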

Domain Specific Language Models

Language	Type	Domain	Lexicon	LM	Text Corpus
English	kenlm 5-gram	Biomedical	bio_lexicon	bio_lm	bio_lm_eng_text

Punctuation Models

Training Repo

Inference Repo

Language	Model	Data
Hindi	hi.zip	hindi_data
Assamese	as.zip	assamese_data
Bengali	bn.zip	bengali_data
Gujarati	gu.zip	gujarati_data
Kannada	kn.zip	kannada_data
Malayalam	ml.zip	malayalam_data
Marathi	mr.zip	marathi_data
Odia	or.zip	odia_data
Punjabi	pa.zip	punjabi_data
Tamil	ta.zip	tamil_data
Telugu	te.zip	telugu_data

Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.

TTS Models

The models below were trained using a combination of Glow-TTS and HiFi-GAN.

Repo

Language	Language Code	Gender	glow ckpt	hifi-gan ckpt
Hindi	hi	Female	hi_0_glow	hi_0_hifi
Hindi	hi	Male	hi_1_glow	hi_1_hifi
Kannada	kn	Female	kn_0_glow	kn_0_1_hifi
Kannada	kn	Male	kn_1_glow	kn_0_1_hifi
Tamil	ta	Female	ta_0_glow	ta_0_1_hifi
Tamil	ta	Male	ta_1_glow	ta_0_1_hifi
Telugu	te	Female	te_0_glow	te_0_1_hifi
Telugu	te	Male	te_1_glow	te_0_1_hifi
Odia	or	Female	or_0_glow	or_0_1_hifi
Odia	or	Male	or_1_glow	or_0_1_hifi
Malayalam	ml	Female	ml_0_glow	ml_0_hifi
Malayalam	ml	Male	ml_1_glow	ml_1_hifi
Marathi	mr	Female	mr_0_glow	mr_1_hifi
Gujarati	gu	Male	gu_0_glow	gu_0_hifi
Bengali	bn	Female	bn_0_glow	bn_0_1_hifi
Bengali	bn	Male	bn_1_glow	bn_0_1_hifi
English	en	Female	en_0_glow	en_0_hifi
English	en	Male	en_1_glow	en_1_hifi
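
Each voice in the table above needs two checkpoints: a Glow-TTS model that maps text to a mel spectrogram and a HiFi-GAN vocoder that turns the spectrogram into a waveform. For several languages, both voices share one vocoder checkpoint (for example `kn_0_1_hifi`). The sketch below shows how the pair might be looked up and chained. The loader arguments and function name are hypothetical placeholders, not the API of the TTS repo, and only a few table rows are copied in.

```python
# (language code, gender) -> (Glow-TTS checkpoint, HiFi-GAN checkpoint),
# copied from a few rows of the table above.
CHECKPOINTS = {
    ("hi", "female"): ("hi_0_glow", "hi_0_hifi"),
    ("hi", "male"): ("hi_1_glow", "hi_1_hifi"),
    ("kn", "female"): ("kn_0_glow", "kn_0_1_hifi"),
    ("kn", "male"): ("kn_1_glow", "kn_0_1_hifi"),
}


def synthesize(text, lang, gender, load_glow, load_hifi):
    """Two-stage synthesis: text -> mel spectrogram -> waveform.

    load_glow and load_hifi are hypothetical loaders that take a
    checkpoint name and return a callable model.
    """
    glow_name, hifi_name = CHECKPOINTS[(lang, gender)]
    text_to_mel = load_glow(glow_name)
    mel_to_audio = load_hifi(hifi_name)
    return mel_to_audio(text_to_mel(text))


# Stub loaders so the control flow can be exercised without real models.
def stub_glow(name):
    return lambda text: f"mel[{name}:{text}]"


def stub_hifi(name):
    return lambda mel: f"wav[{name}:{mel}]"


output = synthesize("x", "kn", "male", stub_glow, stub_hifi)
```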

Dataset Credits: We thank IITM for open sourcing the Indic-TTS dataset. Link

Gender Classification Model

Repo

Type	Model Type	Model
Gender Classification	SVC	Model

Language Identification Models

Repo

Type	Model
Hindi_vs_Others	Model
Tamil_vs_Others	Model

Interspeech 2021 ASR Models

Competition Link

Language	Pretrained Model	Finetuned Model	Dictionary	Single Model for Inference
Telugu	CLSRIL-23	te_40h_interspeech	dict	telugu_infer_interspeech
Tamil	CLSRIL-23	ta_40h_interspeech	dict	tamil_infer_interspeech
Gujarati	CLSRIL-23	gu_40h_interspeech	dict	gujarati_infer_interspeech
Hinglish	CLSRIL-23	hinglish_interspeech	dict	hinglish_infer_interspeech

Citation

If you use any of our resources, please cite the following article:

@misc{chadha2022vakyansh,
    title={Vakyansh: ASR Toolkit for Low Resource Indic languages},
    author={Harveen Singh Chadha and Anirudh Gupta and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
    year={2022},
    eprint={2203.16512},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

If you use the pretrained model (CLSRIL-23), please cite the following article:

@misc{gupta2021clsril23,
      title={CLSRIL-23: Cross Lingual Speech Representations for Indic Languages}, 
      author={Anirudh Gupta and Harveen Singh Chadha and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
      year={2021},
      eprint={2107.07402},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}