nlpaug
June 12, 2026
This Python library helps you augment NLP, audio, and spectrogram data for machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic unit of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters.
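As a rough sketch of how an augmenter and a flow fit together, the snippet below assumes the 1.x-style modules (`nlpaug.augmenter.char`, `nlpaug.augmenter.word`, `nlpaug.flow`) are still available in the installed release. The helper name `augment_text` is made up for this example, and it returns `None` instead of failing when nlpaug is not installed.

```python
try:
    import nlpaug.augmenter.char as nac
    import nlpaug.augmenter.word as naw
    import nlpaug.flow as naf
except ImportError:  # nlpaug is not installed in this environment
    nac = naw = naf = None


def augment_text(text):
    """Hypothetical helper: run one augmenter, then a two-step flow."""
    if nac is None:
        return None

    # A single augmenter: randomly substitute characters.
    char_aug = nac.RandomCharAug(action="substitute")
    single = char_aug.augment(text)

    # A flow: insert random characters, then apply RandomWordAug
    # with its default action.
    flow = naf.Sequential([
        nac.RandomCharAug(action="insert"),
        naw.RandomWordAug(),
    ])
    chained = flow.augment(text)
    return single, chained


result = augment_text("The quick brown fox jumps over the lazy dog")
print(result)
```

The output is random, so it differs between runs.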
Features
- Generate synthetic data for improving model performance without manual effort
- Simple, easy-to-use and lightweight library. Augment data in a few lines of code
- Plug and play with common machine learning and neural network frameworks
- Support textual, audio, and spectrogram inputs
- Python 3.12-ready V2 baseline with offline-first tests and GitHub Actions coverage
Textual Data Augmentation Example

Acoustic Data Augmentation Example

| Section | Description |
|---|---|
| Quick Demo | How to use this library |
| Augmenter | Introduce all available augmentation methods |
| Installation | How to install this library |
| Recent Changes | Latest enhancements |
| Extension Reading | More real-life examples and research |
| Reference | References to external resources such as data and models |
Quick Demo
- Quick Example
- Example of Augmentation for Textual Inputs
- Example of Augmentation for Multilingual Textual Inputs
- Example of Augmentation for Spectrogram Inputs
- Example of Augmentation for Audio Inputs
- Example of Orchestrating Multiple Augmenters
- Example of Showing Augmentation History
- How to train TF-IDF model
- How to train LAMBADA model
- How to create custom augmentation
- API Documentation
Augmenter
| Type | Target | Augmenter | Action | Description |
|---|---|---|---|---|
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | Character | OcrAug | substitute | Simulate OCR engine error |
| Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its WordNet antonym |
| Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
| Textual | Word | RandomWordAug | swap, crop, delete | Apply augmentation randomly |
| Textual | Word | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
| Textual | Word | SplitAug | split | Split one word into two words randomly |
| Textual | Word | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
| Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
| Textual | Word | BackTranslationAug | substitute | Leverage two translation models for augmentation |
| Textual | Word | ReservedAug | substitute | Replace reserved words |
| Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
| Textual | Sentence | AbstSummAug | substitute | Summarize an article with an abstractive summarization method |
| Textual | Sentence | LambadaAug | substitute | Use a language model to generate text, then a classification model to retain high-quality results |
| Signal | Audio | CropAug | delete | Delete an audio segment |
| Signal | Audio | LoudnessAug | substitute | Adjust the audio's volume |
| Signal | Audio | MaskAug | substitute | Mask an audio segment |
| Signal | Audio | NoiseAug | substitute | Inject noise |
| Signal | Audio | PitchAug | substitute | Adjust the audio's pitch |
| Signal | Audio | ShiftAug | substitute | Shift the time dimension forward/backward |
| Signal | Audio | SpeedAug | substitute | Adjust the audio's speed |
| Signal | Audio | VtlpAug | substitute | Change the vocal tract |
| Signal | Audio | NormalizeAug | substitute | Normalize audio |
| Signal | Audio | PolarityInverseAug | substitute | Swap positive and negative values of the audio |
| Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
| Signal | Spectrogram | LoudnessAug | substitute | Adjust volume |
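To make the "keyboard distance error" idea from the table concrete, here is a minimal standard-library sketch. It is not how KeyboardAug is implemented; the neighbour map, the function name `keyboard_substitute`, and the parameters `p` and `seed` are all assumptions made for illustration.

```python
import random

# Tiny, illustrative neighbour map for a QWERTY layout.
# This is NOT the table the library ships with.
QWERTY_NEIGHBOURS = {
    "a": "qwsz",
    "e": "wsdr",
    "i": "ujko",
    "o": "iklp",
    "s": "awedxz",
    "t": "rfgy",
}


def keyboard_substitute(text, p=0.3, seed=None):
    """Replace some characters with a physically adjacent key."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch)
        # Only characters present in the map can be substituted.
        if neighbours and rng.random() < p:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)


original = "this is a test sentence"
augmented = keyboard_substitute(original, p=0.5, seed=0)
print(augmented)
```

The result always has the same length as the input, and every changed character is a neighbour of the original key, which mimics a plausible typo rather than arbitrary noise.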
Flow
| Type | Flow | Description |
|---|---|---|
| Pipeline | Sequential | Apply list of augmentation functions sequentially |
| Pipeline | Sometimes | Apply some augmentation functions randomly |
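The difference between the two flows can be sketched without the library. The functions below are illustrative stand-ins, not nlpaug's classes: `sequential` applies every augmenter in order, while `sometimes` applies each one with probability `p`, which is one plausible reading of "apply some augmentation functions randomly". The two toy signal augmenters loosely mirror PolarityInverseAug and LoudnessAug.

```python
import random


def polarity_inverse(samples):
    """Flip the sign of every sample (cf. PolarityInverseAug)."""
    return [-s for s in samples]


def loudness(samples, factor=0.5):
    """Scale the amplitude by a constant factor (cf. LoudnessAug)."""
    return [s * factor for s in samples]


def sequential(augmenters):
    """Build a flow that applies every augmenter, in order."""
    def run(data):
        for aug in augmenters:
            data = aug(data)
        return data
    return run


def sometimes(augmenters, p=0.5, seed=None):
    """Build a flow that applies each augmenter with probability p."""
    rng = random.Random(seed)

    def run(data):
        for aug in augmenters:
            if rng.random() < p:
                data = aug(data)
        return data
    return run


signal = [0.0, 0.5, -1.0, 0.25]
print(sequential([polarity_inverse, loudness])(signal))
print(sometimes([polarity_inverse, loudness], p=0.5, seed=1)(signal))
```

With `p=1.0` the `sometimes` flow behaves like `sequential`, and with `p=0.0` it returns the input unchanged.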
Installation
The library targets Python 3.12+.
Install the core package:
pip install nlpaug
Install feature extras as needed:
pip install "nlpaug[transformers]"
pip install "nlpaug[nltk]"
pip install "nlpaug[word-embs]"
pip install "nlpaug[audio]"
pip install "nlpaug[lambada]"
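After installing extras, it can help to check which optional dependencies are actually importable. The sketch below uses only the standard library. The mapping from extra name to module name is an assumption (based on the dependencies this README mentions) and may not match the real dependency list; the `lambada` extra is left out because its modules are not stated here.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to one importable module.
# Adjust it to the dependency list of the installed release.
ASSUMED_EXTRA_MODULES = {
    "transformers": "transformers",
    "nltk": "nltk",
    "word-embs": "gensim",
    "audio": "librosa",
}


def available_extras(mapping=ASSUMED_EXTRA_MODULES):
    """Return {extra_name: bool} telling whether each module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in mapping.items()
    }


for extra, ok in available_extras().items():
    print(f"{extra:<12} {'installed' if ok else 'missing'}")
```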
Install the latest GitHub version:
pip install "git+https://github.com/makcedward/nlpaug.git"
If you use WordEmbsAug (word2vec, glove or fasttext), download the pretrained assets first:
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.')
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')
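For intuition about what the downloaded embeddings are used for, here is a small standard-library sketch of embedding-based substitution. It parses lines in the GloVe text format (`word v1 v2 ...`) and replaces a word with its nearest neighbour by cosine similarity. This illustrates the general idea only, not WordEmbsAug's implementation; the toy vectors and function names are invented for the example.

```python
import math


def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines (GloVe text format) into a dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Return the other word whose vector is closest by cosine similarity."""
    target = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Toy 3-dimensional vectors, invented for this example.
toy = parse_glove_lines([
    "cat 0.9 0.1 0.0",
    "kitten 0.8 0.2 0.1",
    "car 0.0 0.1 0.9",
])
sentence = "the cat sat".split()
augmented = [most_similar(t, toy) if t in toy else t for t in sentence]
print(" ".join(augmented))  # the kitten sat
```

Real embedding files hold hundreds of thousands of words, so a practical implementation would use vectorized similarity search instead of this linear scan.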
If you use SynonymAug with PPDB, download the language pack from:
http://paraphrase.org/#/download
Testing
Run the default offline suite:
python -m pytest
Run the optional integration suite:
python -m pytest -m integration
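An offline suite implies that model-backed code is exercised without downloading anything. One common way to do that is to replace the model call with a mock. The sketch below uses `unittest.mock` from the standard library; `MaskFillAugmenter` and the test function are hypothetical and are not taken from this repository.

```python
from unittest.mock import Mock


class MaskFillAugmenter:
    """Hypothetical augmenter that asks a model to fill a [MASK] token."""

    def __init__(self, predict):
        self.predict = predict  # callable: masked text -> replacement word

    def augment(self, text, target):
        # Mask the first occurrence of the target, then fill it in.
        masked = text.replace(target, "[MASK]", 1)
        return masked.replace("[MASK]", self.predict(masked), 1)


def test_augment_uses_model_prediction():
    # The mock stands in for a language model, so nothing is downloaded.
    fake_predict = Mock(return_value="fast")
    aug = MaskFillAugmenter(fake_predict)

    assert aug.augment("the quick fox", "quick") == "the fast fox"
    fake_predict.assert_called_once_with("the [MASK] fox")


# pytest would discover the test by name; calling it directly also works.
test_augment_uses_model_prediction()
```

Tests that need real models can then be kept behind a marker such as `integration`, so the default run stays fast and network-free.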
uv workflow
Create a Python 3.12 environment and run the default suite:
make test
Install the common optional extras and run the fuller local suite:
make test-full
Install the heaviest optional extras and run only integration tests:
make test-integration
If you prefer direct uv commands:
uv venv --python 3.12
uv pip install -p .venv/bin/python -e ".[dev]"
uv run --python .venv/bin/python pytest
Or use the repo scripts directly:
./scripts/setup_uv.sh core
./scripts/test_uv.sh core
./scripts/setup_uv.sh full
./scripts/test_uv.sh full
./scripts/setup_uv.sh integration
./scripts/test_uv.sh integration
Recent Changes
2.0.0 Jun 2026
- Upgrade runtime baseline to Python 3.12+
- Refresh major optional dependencies, including transformers 5.9, gensim 4.4, librosa 0.11, and NumPy 2.x
- Add uv-based setup and test scripts for core, full, and integration workflows
- Modernize offline-first tests so the default suite runs without downloading real models
- Mock transformer-backed augmenters in tests and add broader regression coverage
- Add GitHub Actions coverage reporting and local coverage scripts
- Refactor shared augmenter hot paths and sentence generation internals for better readability and performance
1.1.11 Jul 6, 2022
- Return list of output
- Fix download util
- Fix lambda label misalignment
- Add language pack reference link for SynonymAug
See changelog for more details.
Extension Reading
- Data Augmentation library for Text
- Does your NLP model able to prevent adversarial attack?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
- A Visual Survey of Data Augmentation in NLP
Reference
This library uses data (e.g. captured from the internet), research (e.g. ideas behind the augmenters), and models (e.g. pre-trained models). See data source for more details.
Citation
@misc{ma2019nlpaug,
title={NLP Augmentation},
author={Edward Ma},
howpublished={https://github.com/makcedward/nlpaug},
year={2019}
}
This package is cited by many books, workshops, and academic research papers (70+). Here are some examples; you may visit here to get the full list.
Workshops citing nlpaug
- S. Vajjala. NLP without a readymade labeled dataset at Toronto Machine Learning Summit. 2021
Books citing nlpaug
- S. Vajjala, B. Majumder, A. Gupta and H. Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. 2020
- A. Bartoli and A. Fusiello. Computer Vision–ECCV 2020 Workshops. 2020
- L. Werra, L. Tunstall, and T. Wolf. Natural Language Processing with Transformers. 2022
Research papers citing nlpaug
- Google: M. Raghu and E. Schmidt. A Survey of Deep Learning for Scientific Discovery. 2020
- Sirius XM: E. Jing, K. Schneck, D. Egan and S. A. Waterman. Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts. 2021
- Salesforce Research: B. Newman, P. K. Choubey and N. Rajani. P-adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts. 2021
- Salesforce Research: L. Xue, M. Gao, Z. Chen, C. Xiong and R. Xu. Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks. 2021
Contributions
- sakares saengkaew
- Binoy Dalal
- Emrecan Çelik