nlpaug

June 12, 2026

This Python library helps you augment text, audio, and spectrogram data for machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic building block of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters.

Features

Generate synthetic data for improving model performance without manual effort
Simple, easy-to-use and lightweight library. Augment data in a few lines of code
Plug and play with common machine learning and neural network frameworks
Support textual, audio, and spectrogram inputs
Python 3.12-ready V2 baseline with offline-first tests and GitHub Actions coverage

Textual Data Augmentation Example

Acoustic Data Augmentation Example

Section	Description
Quick Demo	How to use this library
Augmenter	Introduce all available augmentation methods
Installation	How to install this library
Recent Changes	Latest enhancements
Extension Reading	More real-life examples and research
Reference	Reference of external resources such as data or model

Quick Demo

Quick Example
Example of Augmentation for Textual Inputs
Example of Augmentation for Multilingual Textual Inputs
Example of Augmentation for Spectrogram Inputs
Example of Augmentation for Audio Inputs
Example of Orchestrating Multiple Augmenters
Example of Showing Augmentation History
How to train a TF-IDF model
How to train a LAMBADA model
How to create a custom augmentation
API Documentation

Augmenter

Category	Target	Augmenter	Action	Description
Textual	Character	KeyboardAug	substitute	Simulate keyboard distance error
Textual		OcrAug	substitute	Simulate OCR engine error
Textual		RandomAug	insert, substitute, swap, delete	Apply augmentation randomly
Textual	Word	AntonymAug	substitute	Substitute a word with its opposite according to WordNet antonyms
Textual		ContextualWordEmbsAug	insert, substitute	Feed the surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation
Textual		RandomWordAug	swap, crop, delete	Apply augmentation randomly
Textual		SpellingAug	substitute	Substitute word according to spelling mistake dictionary
Textual		SplitAug	split	Split one word into two words randomly
Textual		SynonymAug	substitute	Substitute a similar word according to WordNet/PPDB synonyms
Textual		TfIdfAug	insert, substitute	Use TF-IDF to find out how word should be augmented
Textual		WordEmbsAug	insert, substitute	Leverage word2vec, GloVe or fasttext embeddings to apply augmentation
Textual		BackTranslationAug	substitute	Leverage two translation models for augmentation
Textual		ReservedAug	substitute	Replace reserved words
Textual	Sentence	ContextualWordEmbsForSentenceAug	insert	Insert sentence according to XLNet, GPT2 or DistilGPT2 prediction
Textual		AbstSummAug	substitute	Summarize article by abstractive summarization method
Textual		LambadaAug	substitute	Use a language model to generate text, then use a classification model to retain high-quality results
Signal	Audio	CropAug	delete	Delete a segment of the audio
Signal		LoudnessAug	substitute	Adjust audio's volume
Signal		MaskAug	substitute	Mask a segment of the audio
Signal		NoiseAug	substitute	Inject noise
Signal		PitchAug	substitute	Adjust audio's pitch
Signal		ShiftAug	substitute	Shift time dimension forward/backward
Signal		SpeedAug	substitute	Adjust audio's speed
Signal		VtlpAug	substitute	Change vocal tract
Signal		NormalizeAug	substitute	Normalize audio
Signal		PolarityInverseAug	substitute	Swap positive and negative for audio
Signal	Spectrogram	FrequencyMaskingAug	substitute	Set block of values to zero according to frequency dimension
Signal		TimeMaskingAug	substitute	Set block of values to zero according to time dimension
Signal		LoudnessAug	substitute	Adjust volume
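
The two spectrogram masking rows above describe zeroing a block of values along one axis. A minimal pure-Python sketch of that idea follows. It is not nlpaug's implementation: the function names, the `max_width` parameter, and the `spec[frequency][time]` layout are assumptions made only for illustration.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Zero a random band of consecutive frequency rows.

    `spec` is a list of rows indexed as spec[frequency][time] (assumed layout).
    """
    n_freq = len(spec)
    width = rng.randint(1, min(max_width, n_freq))
    start = rng.randint(0, n_freq - width)
    out = [row[:] for row in spec]  # copy so the input stays untouched
    for f in range(start, start + width):
        out[f] = [0.0] * len(out[f])
    return out


def time_mask(spec, max_width, rng):
    """Zero a random band of consecutive time columns."""
    n_time = len(spec[0])
    width = rng.randint(1, min(max_width, n_time))
    start = rng.randint(0, n_time - width)
    out = [row[:] for row in spec]
    for row in out:
        for t in range(start, start + width):
            row[t] = 0.0
    return out


rng = random.Random(0)                # seeded for repeatable results
spec = [[1.0] * 6 for _ in range(4)]  # toy 4 x 6 "spectrogram" of ones
masked_f = frequency_mask(spec, max_width=2, rng=rng)
masked_t = time_mask(spec, max_width=3, rng=rng)
```

Both helpers return a copy, so the same input can be reused to produce several augmented variants.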

Flow

Category	Flow	Description
Pipeline	Sequential	Apply list of augmentation functions sequentially
Pipeline	Sometimes	Apply some augmentation functions randomly
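
The difference between the two flows can be sketched in plain Python. This is a toy illustration of the idea, not the library's code: `sequential`, `sometimes`, the `prob` parameter, and the two toy augmenters are hypothetical names, and applying each step independently with probability `prob` is an assumption about what "randomly" means here.

```python
import random


def sequential(augmenters):
    """Return a function that applies every augmenter, in order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters, prob, rng):
    """Return a function that applies each augmenter with probability `prob`."""
    def run(text):
        for aug in augmenters:
            if rng.random() < prob:
                text = aug(text)
        return text
    return run


# Two deterministic toy augmenters, used only to make the flow visible.
def upper_first_word(text):
    words = text.split()
    if words:
        words[0] = words[0].upper()
    return " ".join(words)


def drop_last_word(text):
    return " ".join(text.split()[:-1])


steps = [upper_first_word, drop_last_word]
text = "the quick brown fox"

always = sequential(steps)(text)                     # "THE quick brown"
never = sometimes(steps, 0.0, random.Random(0))(text)  # unchanged
maybe = sometimes(steps, 0.5, random.Random(0))(text)  # depends on the draws
```

With `prob=1.0` the random flow behaves like the sequential one, and with `prob=0.0` it returns the input unchanged.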

Installation

The library targets Python 3.12+.

Install the core package:

pip install nlpaug

Install feature extras as needed:

pip install "nlpaug[transformers]"
pip install "nlpaug[nltk]"
pip install "nlpaug[word-embs]"
pip install "nlpaug[audio]"
pip install "nlpaug[lambada]"
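
To see which optional dependencies are importable in the current environment, a small check like the following may help. The mapping from extra name to module is an assumption based on the dependencies named in the changelog (transformers, gensim, librosa); it is not read from the package metadata, and the `lambada` extra is left out because its dependencies are not listed here.

```python
import importlib.util
import sys

# Assumed mapping from extra name to one module that extra would provide.
EXTRA_MODULES = {
    "transformers": "transformers",
    "nltk": "nltk",
    "word-embs": "gensim",
    "audio": "librosa",
}


def available_extras(mapping=EXTRA_MODULES):
    """Return {extra_name: True/False} depending on whether the module can be found."""
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in mapping.items()
    }


# The README states that the library targets Python 3.12+.
python_ok = sys.version_info >= (3, 12)
status = available_extras()

for extra, found in status.items():
    print(f"nlpaug[{extra}]: {'available' if found else 'missing'}")
print(f"Python 3.12+: {python_ok}")
```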

Install the latest GitHub version:

pip install "git+https://github.com/makcedward/nlpaug.git"

If you use WordEmbsAug (word2vec, glove or fasttext), download the pretrained assets first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.')
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')

If you use SynonymAug with PPDB, download the language pack from:

http://paraphrase.org/#/download

Testing

Run the default offline suite:

python -m pytest

Run the optional integration suite:

python -m pytest -m integration
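
The idea behind an offline default suite can be illustrated with the standard library's `unittest.mock`: the model loader is patched so no weights are downloaded. This is a sketch under stated assumptions, not a test from the repository. `ToyContextualAug`, `load_model`, and `predict` are hypothetical names, not nlpaug classes or methods.

```python
import unittest
from unittest import mock


class ToyContextualAug:
    """Hypothetical stand-in for a model-backed augmenter (not an nlpaug class)."""

    def __init__(self):
        self.model = self.load_model()

    def load_model(self):
        # A real augmenter would load or download pretrained weights here.
        raise RuntimeError("model loading is not available in offline tests")

    def augment(self, text):
        # Replace the last word with the model's prediction for the prefix.
        words = text.split()
        if words:
            words[-1] = self.model.predict(words[:-1])
        return " ".join(words)


class OfflineAugTest(unittest.TestCase):
    def test_substitutes_last_word_without_real_model(self):
        fake_model = mock.Mock()
        fake_model.predict.return_value = "dog"
        # Patch the loader so constructing the augmenter needs no network.
        with mock.patch.object(ToyContextualAug, "load_model", return_value=fake_model):
            aug = ToyContextualAug()
            augmented = aug.augment("the quick brown fox")
        self.assertEqual(augmented, "the quick brown dog")
        fake_model.predict.assert_called_once_with(["the", "quick", "brown"])


# Run the single test case programmatically instead of via the command line.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OfflineAugTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A test that needs the real model would instead belong in the integration suite selected with `-m integration`.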

uv workflow

Create a Python 3.12 environment and run the default suite:

make test

Install the common optional extras and run the fuller local suite:

make test-full

Install the heaviest optional extras and run only integration tests:

make test-integration

If you prefer direct uv commands:

uv venv --python 3.12
uv pip install -p .venv/bin/python -e ".[dev]"
uv run --python .venv/bin/python pytest

Or use the repo scripts directly:

./scripts/setup_uv.sh core
./scripts/test_uv.sh core
./scripts/setup_uv.sh full
./scripts/test_uv.sh full
./scripts/setup_uv.sh integration
./scripts/test_uv.sh integration

Recent Changes

2.0.0 Jun 2026

Upgrade runtime baseline to Python 3.12+
Refresh major optional dependencies, including transformers 5.9, gensim 4.4, librosa 0.11, and NumPy 2.x
Add uv-based setup and test scripts for core, full, and integration workflows
Modernize offline-first tests so the default suite runs without downloading real models
Mock transformer-backed augmenters in tests and add broader regression coverage
Add GitHub Actions coverage reporting and local coverage scripts
Refactor shared augmenter hot paths and sentence generation internals for better readability and performance

1.1.11 Jul 6, 2022

See changelog for more details.

Extension Reading

Reference

This library uses data (e.g. captured from the internet), research (e.g. augmenter ideas from published work), and models (e.g. pre-trained models). See data source for more details.

Citation

@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={https://github.com/makcedward/nlpaug},
  year={2019}
}

This package is cited by many books, workshops, and academic research papers (70+). Here are some examples; visit here for the full list.

Contributions

sakares saengkaew
Binoy Dalal
Emrecan Çelik