L3Cube-MahaNLP

June 7, 2025

Despite being the third most popular language in India, Marathi lacks useful NLP resources. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We have contributed unsupervised and supervised datasets as well as Transformer models for Marathi. The supervised datasets cover Marathi sentiment analysis, named entity recognition, and hate speech detection. With this, we at L3Cube-Pune aim to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!

[Update] The library is now available as a Python package:

pip install mahaNLP

Usage examples are provided in this demo Colab.

[Update] We have released a new code-mixed Marathi-English unsupervised dataset, MeCorpus, and the supervised datasets MeSent, MeHate, and MeLID.
[Update] We have released a new multi-domain sentiment analysis dataset, MahaSent-MD, with 60k samples across four diverse domains. A new sentiment analysis model is also released on Hugging Face.

L3Cube-MahaCorpus and Marathi BERT

L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, the Fast Text word embeddings. All of these are trained on the full Marathi corpus with 752M tokens. The evaluation details are given in our paper (link).

Dataset Statistics

L3Cube-MahaCorpus(full) = L3Cube-MahaCorpus(news) + L3Cube-MahaCorpus(non-news)

The Full Marathi Corpus incorporates all existing sources.

| Dataset | #tokens (M) | #sentences (M) | Link |
|---|---|---|---|
| L3Cube-MahaCorpus (news) | 212 | 17.6 | link |
| L3Cube-MahaCorpus (non-news) | 76.4 | 7.2 | link |
| L3Cube-MahaCorpus (full) | 289 | 24.8 | link |
| Full Marathi Corpus (all sources) | 752 | 57.2 | link |
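
The relationship between the news, non-news, and full corpora can be checked against the reported figures. The following is a minimal sketch in Python with the numbers copied from the table (in millions). The variable and function names are illustrative, and the small gap in the token count is assumed to come from rounding in the reported figures.

```python
# Corpus sizes in millions, copied from the statistics table.
# Each entry is (tokens_M, sentences_M).
corpus = {
    "news": (212.0, 17.6),
    "non-news": (76.4, 7.2),
    "full": (289.0, 24.8),
}


def combined(parts):
    """Sum token and sentence counts over the named corpus parts."""
    tokens = sum(corpus[p][0] for p in parts)
    sentences = sum(corpus[p][1] for p in parts)
    # Round to one decimal place to match the precision of the table.
    return round(tokens, 1), round(sentences, 1)


tokens, sentences = combined(["news", "non-news"])
print(tokens, sentences)
```

The sentence counts add up exactly to the full corpus, while the summed token count lands within one million of the reported 289M.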

L3Cube-MeCorpus and code-mixed MeBERT

L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, released in the accompanying paper.

| Dataset | #tokens (M) | #sentences (M) | Link |
|---|---|---|---|
| L3Cube-MeCorpus (Roman) | 70.9 | 5 | link |
| L3Cube-MeCorpus (Devanagari) | 68.6 | 5 | link |
| L3Cube-MeCorpus (Roman + Devanagari) | 139.5 | 10 | link |

Marathi BERT models and Marathi Fast Text model

The full Marathi Corpus is used to train BERT language models, which are made available on the Hugging Face model hub.

| Model | Description | Link |
|---|---|---|
| MahaGemma-7B | Gemma-7B | v1 |
| MahaGemma-2B | Gemma-2B | v1 |
| MahaBERT | Base-BERT | v1, v2, paper |
| MahaRoBERTa | RoBERTa | link |
| MahaAlBERT | AlBERT | v1, v2 |
| MahaGPT | GPT2 | link |
| MahaFT | Fast Text | bin, vec |
| MahaTweetBERT | MahaBERT + Tweets | model, paper |
| MahaSBERT | Sentence-BERT | MahaSBERT-STS, MahaSBERT, paper |
| IndicSBERT | Sentence-BERT (for cross-language) | IndicSBERT-STS, IndicSBERT, paper |
| MeBERT | Codemixed Marathi-English BERT (Roman) | me-bert, paper |
| MeRoBERTa | Codemixed Marathi-English RoBERTa (Roman) | me-roberta, paper |
| MeBERT-Mixed | Codemixed Marathi-English BERT (Roman + Devanagari) | me-bert-mixed, me-bert-mixed-v2, paper |
| MeRoBERTa-Mixed | Codemixed Marathi-English RoBERTa (Roman + Devanagari) | me-roberta-mixed, paper |
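
Because the models are hosted on the Hugging Face model hub, the masked language models can presumably be loaded with the `transformers` library. The following is a minimal sketch, not the project's documented usage. The repository ID `l3cube-pune/marathi-bert-v2` is an assumption, since the link targets are not reproduced here, so check the linked model card for the exact name. The helper names `build_fill_mask` and `top_predictions` are hypothetical. Running `build_fill_mask` requires `transformers` with a backend such as PyTorch, plus network access on first use.

```python
def build_fill_mask(model_id="l3cube-pune/marathi-bert-v2"):
    """Return a fill-mask pipeline for a masked language model.

    model_id is an ASSUMED Hub repository ID, not confirmed by this README.
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text, k=3):
    """Return (token, score) pairs for the single mask position in `text`."""
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

To try it, build the pipeline once with `fill_mask = build_fill_mask()`, insert `fill_mask.tokenizer.mask_token` at the position to predict in a Marathi sentence, and pass the sentence to `top_predictions`.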

Supervised Datasets

| Dataset | Description | Samples (train, valid, test) | Link | Model | Paper |
|---|---|---|---|---|---|
| MahaSQuAD | Marathi Question Answering dataset | 142k (118516, 11873, 11803) | data | MahaSQuAD-BERT | link |
| MahaNews | Marathi long, medium, and short document classification dataset with 12 target classes | 53k (42k, 5k, 5k) | data | MahaNews-All-BERT | link |
| MahaNER | Marathi Named Entity Recognition dataset with 8 entity classes | 25k (21.5k, 1.5k, 2k) | data | MahaNER-BERT | link |
| MahaSocialNER | Social media based Marathi Named Entity Recognition dataset with 8 entity classes | 18k (12k, 1.5k, 2.2k) | data | MahaSocialNER-BERT | link |
| MahaHate | Marathi Hate Speech Detection dataset with 4-class (hate, offensive, profane, and not) and 2-class (hate and not) labels | 4-class: 25k (21.5k, 1.5k, 2k), 2-class: 37500 | data | 4-class, 2-class | link |
| MahaSent | Marathi Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0) | 18,378 (12114, 1500, 2250); extra (2,514 = 2355(+1) + 159(-1)) | data | MarathiSentiment | link |
| HateEval-Mr | Another dataset for evaluation of Hate Speech models with two classes - Hate(1) and None(0) | 2k samples | data | | link |
| MahaSent-MD | A multi-domain Marathi Sentiment Analysis dataset (4 domains - Marathi Movie Reviews, TV Subtitles, Generic Tweets, and Political Tweets) with three classes - Positive(1), Negative(-1) and Neutral(0) | 60k samples | data | MahaSent-MD | link |
| MeSent | A code-mixed Marathi-English Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0) | 12k samples | data | me-sent-roberta | link |
| MeHate | A code-mixed Marathi-English Hate Speech identification dataset with two classes - Hate(1) and None(0) | 2768 samples | data | me-hate-bert | link |
| MeLID | A code-mixed Marathi-English language identification (LID) dataset with three classes - Marathi, English, and Undefined | 12k samples | data | me-lid-bert | link |
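
Several of the sentiment datasets listed here encode their classes as Positive(1), Negative(-1), and Neutral(0). Classification heads usually expect contiguous class indices starting at 0, so a remapping step is often needed before training. The following is a minimal sketch, assuming the labels arrive as integers. The index order chosen here is arbitrary and is not taken from the released models.

```python
# Dataset labels as described for the sentiment datasets: -1, 0, 1.
LABEL_NAMES = {-1: "Negative", 0: "Neutral", 1: "Positive"}

# Arbitrary contiguous indices for training (an assumption, not the
# mapping used by the released models): -1 -> 0, 0 -> 1, 1 -> 2.
LABEL_TO_INDEX = {label: i for i, label in enumerate(sorted(LABEL_NAMES))}
INDEX_TO_LABEL = {i: label for label, i in LABEL_TO_INDEX.items()}


def encode(labels):
    """Map dataset labels (-1, 0, 1) to class indices (0, 1, 2)."""
    return [LABEL_TO_INDEX[int(label)] for label in labels]


def decode(indices):
    """Map class indices back to human-readable sentiment names."""
    return [LABEL_NAMES[INDEX_TO_LABEL[i]] for i in indices]


print(encode([1, -1, 0]))  # [2, 0, 1]
print(decode([2, 0, 1]))   # ['Positive', 'Negative', 'Neutral']
```

Keeping both directions of the mapping in one place makes it easier to report predictions using the original dataset labels.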

License

L3Cube-MahaCorpus, L3Cube-MahaNER, L3Cube-MahaHate, L3Cube-HateEval-Mr, L3Cube-MahaSent-MD, L3CubeMahaSent, L3Cube-MeCorpus, L3Cube-MeSent, L3Cube-MeHate, and L3Cube-MeLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The datasets are released to the community for research purposes only, and the group is not responsible for any misuse of these datasets.

Citing

@article{joshi2022l3cube,
  title={L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library},
  author={Joshi, Raviraj},
  journal={arXiv preprint arXiv:2205.14728},
  year={2022}
}
@inproceedings{joshi-2022-l3cube,
    title = "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources",
    author = "Joshi, Raviraj",
    booktitle = "Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.wildre-1.17",
    pages = "97--101",
}

Publications

Joshi, Raviraj. "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources." LREC 2022 Workshop Language Resources and Evaluation Conference 20-25 June 2022. 2022.

Shirke, Mayur, et al. "On Importance of Layer Pruning for Smaller BERT Models and Low Resource Languages." arXiv preprint arXiv:2501.00733 (2025).

Jadhav, Suramya, et al. "On Limitations of LLM as Annotator for Low Resource Languages." arXiv preprint arXiv:2411.17637 (2024).

Shanbhag, Abhay, et al. "Non-Contextual BERT or FastText? A Comparative Analysis." arXiv preprint arXiv:2411.17661 (2024).

Kowtal, Nidhi, and Raviraj Joshi. "L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models." arXiv preprint arXiv:2506.00863 (2025).

Mittal, Saloni, et al. "L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi." International Conference on Speech and Language Technologies for Low-resource Languages. Cham: Springer Nature Switzerland, 2023.

Chavan, Tanmay, et al. "My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks." arXiv preprint arXiv:2306.14030 (2023).

Pingle, Aabha, et al. "L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models." arXiv preprint arXiv:2306.13888 (2023).

Pingle, Aabha, et al. "Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi." arXiv preprint arXiv:2310.00734 (2023).

Deode, Samruddhi, et al. "L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT." arXiv preprint arXiv:2304.11434 (2023).

Joshi, Ananya, et al. "L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi." arXiv preprint arXiv:2211.11187 (2022).

Gokhale, Omkar Bhushan, et al. "Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection." I Can't Believe It's Not Better Workshop: Understanding Deep Learning Through Empirical Falsification.

Sabane, Maithili, et al. "Enhancing Low Resource NER using Assisting Language and Transfer Learning." 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC). IEEE, 2023.

Litake, Onkar, et al. "L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models." Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference. 2022.

Litake, Onkar, et al. "Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition." arXiv preprint arXiv:2203.12907 (2022).

Velankar, Abhishek, Hrushikesh Patil, and Raviraj Joshi. "Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi." IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, Cham, 2023.

Patil, Hrushikesh, Abhishek Velankar, and Raviraj Joshi. "L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models." Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022). 2022.

Velankar, Abhishek, et al. "Hate and offensive speech detection in Hindi and Marathi." arXiv preprint arXiv:2110.12200 (2021).

Kulkarni, Atharva, et al. "L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset." Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2021.

Kulkarni, Atharva, et al. "Experimental Evaluation of Deep Learning Models for Marathi Text Classification." Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Springer, Singapore, 2022.

This project is led by Raviraj Joshi under L3Cube Labs, Pune. For any queries, contact ravirajoshi@gmail.com.