README.md
June 1, 2026

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.
Note: Quick legend on available resource types:
⭐ - open source project, usually a GitHub repository with its number of stars
📙 - resource you can read, usually a blog post or a paper
🗂️ - a collection of additional resources
🔱 - non-open source tool, framework or paid service
🎥️ - a resource you can watch
🎙️ - a resource you can listen to
Table of Contents
Note: Section keywords: paper summaries, compendium, awesome list
Compendiums and awesome lists on the topic of NLP:
- 🗂️ The NLP Index - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
- ⭐ Awesome NLP by keon [GitHub, 18674 stars]
- ⭐ Speech and Natural Language Processing Awesome List by elaboshira [GitHub, 2224 stars]
- ⭐ Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1307 stars]
- ⭐ Text Mining and Natural Language Processing Resources by stepthom [GitHub, 598 stars]
- 🗂️ Brainsources for #NLP enthusiasts by Philip Vollet
- ⭐ Awesome AI/ML/DL - NLP Section [GitHub, 1668 stars]
- 🗂️ NLP articles by Devopedia
- ⭐ Awesome LLM Apps [GitHub, 112502 stars]
NLP Conferences, Paper Summaries and Paper Compendiums:
Papers and Paper Summaries
- ⭐ 100 Must-Read NLP Papers [GitHub, 3846 stars]
- ⭐ NLP Paper Summaries by dair-ai [GitHub, 1477 stars]
- ⭐ Curated collection of papers for the NLP practitioner [GitHub, 1072 stars]
- ⭐ Papers on Textual Adversarial Attack and Defense [GitHub, 1574 stars]
- ⭐ Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 297 stars]
- ⭐ A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 2031 stars]
- ⭐ A Paper List for Style Transfer in Text [GitHub, 1623 stars]
- 🎥 Video recordings index for papers
Conference Summaries
- ⭐ NLP top 10 conferences Compendium by soulbliss [GitHub, 459 stars]
- 📙 ICLR 2020 Trends
- 📙 SpacyIRL 2019 Conference in Overview
- 📙 Paper Digest - Conferences and Papers in Overview
NLP Progress and NLP Tasks:
- ⭐ NLP Progress by sebastianruder [GitHub, 22957 stars]
- ⭐ NLP Tasks by Kyubyong [GitHub, 3013 stars]
NLP Datasets:
- ⭐ NLP Datasets by niderhoff [GitHub, 5982 stars]
- ⭐ Datasets by Huggingface [GitHub, 21559 stars]
- 🗂️ Big Bad NLP Database
- ⭐ UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
- ⭐ MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 153 stars]
Word and Sentence embeddings:
- ⭐ Awesome Embedding Models by Hironsan [GitHub, 1840 stars]
- ⭐ Awesome list of Sentence Embeddings by Separius [GitHub, 2289 stars]
- ⭐ Awesome BERT by Jiakui [GitHub, 1842 stars]
- ⭐ FlagEmbedding - Retrieval and Retrieval-augmented LLMs [GitHub, 11753 stars]
Notebooks, Scripts and Repositories
- ⭐ The Super Duper NLP Repo [Website, 2020]
Non-English resources and Compendiums
- ⭐ NLP Resources for Bahasa Indonesian [GitHub, 572 stars]
- ⭐ Indic NLP Catalog [GitHub, 632 stars]
- ⭐ Pre-trained language models for Vietnamese [GitHub, 788 stars]
- ⭐ Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 840 stars]
- ⭐ Indic NLP Library [GitHub, 638 stars]
- ⭐ AI4Bharat-IndicNLP Portal
- ⭐ ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 421 stars]
- ⭐ zemberek-nlp - NLP tools for Turkish [GitHub, 1331 stars]
- ⭐ TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
- ⭐ KLUE - Korean Language Understanding Evaluation [GitHub, 595 stars]
- ⭐ Persian NLP Benchmark - benchmark for evaluating and comparing models on various NLP tasks in the Persian language [GitHub, 76 stars]
- ⭐ nlp-greek - Greek language sources [GitHub, 5 stars]
- ⭐ Awesome NLP Resources for Hungarian [GitHub, 278 stars]
Pre-trained NLP models
- ⭐ List of pre-trained NLP models [GitHub, 170 stars]
- ⭐ Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3160 stars]
- ⭐ Spanish Language Models and resources [GitHub, 263 stars]
NLP History
General
- ⭐ Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1322 stars]
2020 Year in Review
- 📙 Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
- 📙 ML and NLP Research Highlights of 2020 [Blog, January 2021]
🔙 Back to the Table of Contents
NLP-only podcasts
- 🎙️ NLP Highlights [Years: 2017 - now, Status: active]
- 🎙️ The NLP Zone [Years: 2021 - now, Status: active]
Many NLP episodes
- 🎙️ TWIML AI [Years: 2016 - now, Status: active]
- 🎙️ Practical AI [Years: 2018 - now, Status: active]
- 🎙️ The Data Exchange [Years: 2019 - now, Status: active]
- 🎙️ Gradient Dissent [Years: 2020 - now, Status: active]
- 🎙️ Machine Learning Street Talk [Years: 2020 - now, Status: active]
- 🎙️ DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]
Some NLP episodes
- 🎙️ The Super Data Science Podcast [Years: 2016 - now, Status: active]
- 🎙️ Data Hack Radio [Years: 2018 - now, Status: active]
- 🎙️ AI Game Changers [Years: 2020, Status: active]
- 🎙️ The Analytics Show [Years: 2019 - now, Status: active]
- 📙 NLP News by Sebastian Ruder
- 📙 This Week in NLP by Robert Dale
- 📙 Papers with Code
- 📙 The Batch by deeplearning.ai
- 📙 Paper Digest by PaperDigest
- 📙 NLP Cypher by QuantumStat
- 🎥 NLP Zurich [YouTube Recordings]
- 🎥 Hacking-Machine-Learning [YouTube Recordings]
- 🎥 NY-NLP (New York)
- 🎥 Yannic Kilcher
- 🎥 HuggingFace
- 🎥 Kaggle Reading Group
- 🎥 Rasa Paper Reading
- 🎥 Stanford CS224N: NLP with Deep Learning
- 🎥 NLPxing
- 🎥 ML Explained - A.I. Socratic Circles - AISC
- 🎥 Deeplearning.ai
- 🎥 Machine Learning Street Talk
🔙 Back to the Table of Contents
General NLU
- ⭐ GLUE - General Language Understanding Evaluation (GLUE) benchmark
- ⭐ SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
- ⭐ decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- ⭐ dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 287 stars]
- ⭐ DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
- ⭐ Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 3244 stars]
Summarization
- ⭐ WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
- ⭐ WikiLingua - A Multilingual Abstractive Summarization Dataset
Question Answering
- ⭐ SQuAD - Stanford Question Answering Dataset (SQuAD)
- ⭐ XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
- ⭐ GrailQA - Strongly Generalizable Question Answering (GrailQA)
- ⭐ CSQA - Complex Sequential Question Answering
Multilingual and Non-English Benchmarks
- 📙 XTREME - Massively Multilingual Multi-task Benchmark
- ⭐ GLUECoS - A benchmark for code-switched NLP
- ⭐ IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- ⭐ LinCE - Linguistic Code-Switching Evaluation Benchmark
- ⭐ Russian SuperGlue - Russian SuperGlue Benchmark
Bio, Law, and other scientific domains
- ⭐ BLURB - Biomedical Language Understanding and Reasoning Benchmark
- ⭐ BLUE - Biomedical Language Understanding Evaluation benchmark
- ⭐ LexGLUE - A Benchmark Dataset for Legal Language Understanding in English
Transformer Efficiency
- ⭐ Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 788 stars]
Other
- ⭐ CodeXGLUE - A benchmark dataset for code intelligence
- ⭐ CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
- ⭐ MultiNLI - Multi-Genre Natural Language Inference corpus
- ⭐ iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic
- ⭐ SLTev - tool for comprehensive evaluation of (simultaneous) spoken language translation [GitHub, 12 stars]
🔙 Back to the Table of Contents
General
- 📙 A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
- 📙 Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]
Embeddings
Repositories
- ⭐ Pre-trained ELMo Representations for Many Languages [GitHub, 1461 stars]
- ⭐ sense2vec - Contextually-keyed word vectors [GitHub, 1673 stars]
- ⭐ wikipedia2vec [GitHub, 966 stars]
- ⭐ StarSpace [GitHub, 3955 stars]
- ⭐ fastText [GitHub, 26531 stars]
Blogs
- 📙 Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- 📙 An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- 📙 Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- 📙 The Illustrated Word2vec by Jay Alammar [Blog, 2019]
Cross-lingual Word and Sentence Embeddings
- ⭐ vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 654 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 18765 stars]
Byte Pair Encoding
- ⭐ bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1220 stars]
- ⭐ subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2273 stars]
- ⭐ python-bpe - Byte Pair Encoding for Python [GitHub, 232 stars]
Transformer-based Architectures
General
- 📙 The Transformer Family by Lilian Weng [Blog, 2020]
- 📙 Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- 📙 Attention? Attention! by Lilian Weng [Blog, 2018]
- 📙 the transformer … “explained”? [Blog, 2019]
- 🎥️ Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- 📙 Attention Is Off By One [July, 2023]
- 🎥️ Understanding and Applying Self-Attention for NLP [Talk, 2018]
- 📙 The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- 📙 Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- 📙 A Survey of Transformers [Paper, June 2021]
Transformer
- 📙 The Annotated Transformer by Harvard NLP [Blog, 2018]
- 📙 The Illustrated Transformer by Jay Alammar [Blog, 2018]
- 📙 Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- 📙 Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- 📙 Reformer: The Efficient Transformer [Blog, 2020]
- 📙 Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- 📙 TRANSFORMERS FROM SCRATCH [Blog, 2019]
- 📙 Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
- ⭐ Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 611 stars]
- 📙 Transformers from Scratch [Blog, Oct 2021]
BERT
- 📙 A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- 📙 The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- 📙 Understanding searches better than ever before [Blog, 2019]
- 📙 Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- ⭐ SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 288 stars]
- ⭐ BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 609 stars]
- ⭐ Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
- ⭐ CharacterBERT: Reconciling ELMo and BERT [GitHub, 199 stars]
- 📙 When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
- ⭐ BERT-related Papers - a list of BERT-related papers [GitHub, 2036 stars]
Other Transformer Variants
T5
- 📙 T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- 📙 T5: the Text-To-Text Transfer Transformer [Blog, 2020]
- ⭐ multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 1294 stars]
BigBird
- 📙 Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]
Reformer / Linformer / Longformer / Performers
- 🎥️ Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- 🎥️ Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- 🎥️ Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- 🎥️ Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- ⭐ performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1176 stars]
Switch Transformer
- 📙 Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]
GPT-family
General
- 📙 The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- 📙 The Annotated GPT-2 by Aman Arora
- 📙 OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- 📙 How to generate text by Patrick von Platen [Blog, 2020]
GPT-3
Learning Resources
- 📙 Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- 📙 GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
- 📙 GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
- 📙 GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- 📙 Is it possible for language models to achieve language understanding? by Christopher Potts
Applications
- ⭐ Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4536 stars]
- 🗂️ GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- 🗂️ GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
- 🔱 OpenAI API - API Demo to use OpenAI GPT for commercial applications
Open-source Efforts
- 📙 GPT-Neo - in-progress GPT-3 open source replication [HuggingFace Hub]
- ⭐ GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
- 📙 Effectively using GPT-J with few-shot learning [Blog, July 2021]
Other
- 📙 What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
- 📙 Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- 📙 Turing NLG by Microsoft
- 📙 Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
- ⭐ ELECTRA [GitHub, 2370 stars]
- ⭐ Performer - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1176 stars]
Distillation, Pruning and Quantization
Reading Material
- 📙 Compression of Deep Learning Models for Text: A Survey [Paper, April 2021]
Tools
- ⭐ Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 85 stars]
- ⭐ XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 157 stars]
Automated Summarization
- 📙 PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- ⭐ CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 150 stars]
- ⭐ XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 277 stars]
- ⭐ SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 280 stars]
- ⭐ PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 157 stars]
- ⭐ summarus - Models for automatic abstractive summarization [GitHub, 172 stars]
Knowledge Graphs and NLP
- 📙 Fusing Knowledge into Language Model [Presentation, Oct 2021]
Model Generation
- ⭐ smolmodels - agentic framework for building ML models from natural language
Small LLMs
- ⭐ smollm - 3B parameter language model designed to push the boundaries of small models [GitHub, 3797 stars]
Note: Section keywords: best practices, MLOps
🔙 Back to the Table of Contents
Best Practices for building NLP Projects
- 🎥 In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- 🎥 EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, Nov. 2020]
- 📙 Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
- 📙 How to Structure and Manage NLP Projects [Blog, May 2021]
- 📙 Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
- 🎥 Introduction to NLP for Industry Use - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
- 📙 Measuring Embedding Drift - Best practices for monitoring drift of NLP models [Blog, December 2022]
- 📙 Drift in Machine Learning - How to Identify Issues Before You Have a Problem [Blog, January 2022]
MLOps for NLP
MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place:
- Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
- Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- Model Registry - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
- Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as blue/green or canary deployments
- Data and Model Observability - track data drift, model accuracy drift, etc.
Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:
- Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
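As an illustration of the behavioral testing item in the list above, here is a minimal sketch of an invariance test: the prediction should stay the same when only a person's name is swapped. The `predict_sentiment` function is a hypothetical toy keyword classifier standing in for a real model, and the template and names are made up for this example. Dedicated tools such as CheckList cover this kind of testing far more thoroughly.

```python
from typing import Callable, List

# Hypothetical toy model: a keyword lexicon standing in for a real NLP classifier.
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "terrible", "awful"}


def predict_sentiment(text: str) -> str:
    """Return 'positive', 'negative' or 'neutral' based on keyword counts."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(
    predict: Callable[[str], str], template: str, names: List[str]
) -> List[str]:
    """Fill the template with each name and return the names whose
    prediction differs from the prediction for the first name."""
    baseline = predict(template.format(name=names[0]))
    return [n for n in names[1:] if predict(template.format(name=n)) != baseline]


if __name__ == "__main__":
    failures = invariance_failures(
        predict_sentiment,
        "{name} said the service was terrible.",
        ["Anna", "Mehmet", "Chen", "Olu"],
    )
    # An empty list means the model passed this invariance check.
    print("names that changed the prediction:", failures)
```

A check like this can run next to regular unit tests in CI, so that a retrained model which starts reacting to names is caught before deployment.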
MLOps Compilations & Awesome Lists
- ⭐ awesome-mlops [GitHub, 13923 stars]
- ⭐ best-of-ml-python [GitHub, 23609 stars]
Running LLMs locally or self-hosted
- ⭐ vLLM [GitHub, 81616 stars]
- ⭐ llama.cpp [GitHub, 114160 stars]
- 🔱 ollama [Free Local & Paid Cloud Service]
Reading Material
- 📙 Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
- 📙 Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
- 📙 MLOps: What It Is, Why it Matters, and How To Implement It by Neptune AI [Blog, July 2021]
- 📙 Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- 📙 State of MLOps 2021 by Valohai [Blog, August 2021]
- 📙 The MLOps Stack by Valohai [Blog, October 2020]
- 📙 The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- 📙 MLOps: Comprehensive Beginner’s Guide [Blog, March 2021]
- 📙 What I’ve learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
- 📙 DataRobot Challenger Models - MLOps Champion/Challenger Models
- 📙 State of MLOps Blog by Dr. Ori Cohen
- 📙 MLOps Ecosystem Overview [Blog, 2021]
- 📙 Metrics vs. Inferences - Which should you observe? [Blog, February 2024]
Learning Material
- 🗂 MLOps course by Made With ML
- 🗂 GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub
- 🗂 ML Observability Fundamentals Course - learn how to monitor and root-cause issues with production NLP models
MLOps Communities
- The MLOps Community - blogs, Slack group, newsletter and more, all about MLOps
Data Versioning
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Experiment Tracking
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 14280 stars]
- ⭐ Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 10111 stars]
Model Registry
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- ⭐ ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1747 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
Automated Testing and Behavioral Testing
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2050 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 3427 stars]
- ⭐ WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 76 stars]
- ⭐ Great Expectations - Write tests for your data [GitHub, 11532 stars]
- ⭐ Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 4017 stars]
Model Deployment and Serving
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Amazon SageMaker [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 NLP Cloud - Production-ready NLP API [Paid Service]
- 🔱 Saturn Cloud [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- ⭐ TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 4359 stars]
- ⭐ Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
- ⭐ KFServing - Serverless Inferencing on Kubernetes [GitHub, 5534 stars]
- ⭐ TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Free and Open Source]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 Cortex - containers as a service on AWS [Paid Service]
- 🔱 Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
- ⭐ End2End Serverless Transformers On AWS Lambda [GitHub, 122 stars]
- ⭐ NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- ⭐ Dagster - data orchestrator for machine learning [Free and Open Source]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 10111 stars]
- ⭐ flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 7056 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1670 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
Model Debugging
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1592 stars]
- ⭐ Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 488 stars]
Model Accuracy Prediction
- ⭐ WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1751 stars]
Data and Model Observability
General
- ⭐ Arize AI - embedding drift monitoring for NLP models
- ⭐ Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
- ⭐ whylogs - open source standard for data and ML logging [GitHub, 2819 stars]
- ⭐ Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 4992 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1670 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- 🔱 Cortex - containers as a service on AWS [Paid Service]
Model Centric
- 🔱 Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- 🔱 Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
- ⭐ Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- 🔱 Fiddler - All-in-one ML and LLM observability. Fastest LLM Guardrails. [Paid Service]
- 🔱 Hydrosphere - open-source platform for managing ML models [Paid Service]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- 🔱 Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
Data Centric
- 🔱 Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- 🔱 acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- 🔱 Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
- 🔱 datakin - end-to-end, real-time data lineage solution [Paid Service]
- 🔱 Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- 🔱 SODA - data monitoring, testing and validation [Paid Service]
Feature Stores
- 🔱 Tecton - enterprise feature store for machine learning [Paid Service]
- ⭐ FEAST - open source feature store for machine learning Website [GitHub, 7063 stars]
- 🔱 Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]
Metadata Management
- ⭐ ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 678 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
MLOps Frameworks
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 10111 stars]
- ⭐ kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 10867 stars]
- ⭐ Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4752 stars]
- ⭐ ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 5429 stars]
- 🔱 Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
- ⭐ Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 1904 stars]
- 🔱 Continual.ai - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]
Transformer-based Architectures
🔙 Back to the Table of Contents
General
- 📙 Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- 📙 Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
- ⭐ Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 265 stars]
- 🎥️ Practical NLP for the Real World [Presentation, 2019]
- 🎥️ From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]
Multi-GPU Transformers
- ⭐ Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 788 stars]
Training Transformers Effectively
- ⭐ Training BERT with Compute/Time (Academic) Budget [GitHub, 315 stars]
Embeddings as a Service
- ⭐ embedding-as-service [GitHub, 210 stars]
- ⭐ Bert-as-service [GitHub, 12830 stars]
NLP Recipes Industrial Applications:
- ⭐ NLP Recipes by microsoft [GitHub, 6438 stars]
- ⭐ NLP with Python by susanli2016 [GitHub, 2790 stars]
- ⭐ Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2229 stars]
NLP Applications in Bio, Finance, Legal and other industries
- ⭐ Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 689 stars]
- ⭐ Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1960 stars]
- ⭐ FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 204 stars]
- ⭐ LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 781 stars]
- ⭐ NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- ⭐ Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 720 stars]
- ⭐ BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 441 stars]
Note: Section keywords: speech recognition
🔙 Back to the Table of Contents
General Speech Recognition
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6444 stars]
- ⭐ DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 26750 stars]
- 📙 Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- ⭐ kaldi - Kaldi is a toolkit for speech recognition [GitHub, 15401 stars]
- ⭐ awesome-kaldi - resources for using Kaldi [GitHub, 538 stars]
- ⭐ ESPnet - End-to-End Speech Processing Toolkit [GitHub, 9850 stars]
- 📙 HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]
Text to Speech / Speech Generation
- ⭐ FastSpeech - An implementation of FastSpeech based on PyTorch [GitHub, 880 stars]
- ⭐ TTS - a deep learning toolkit for Text-to-Speech [GitHub, 45454 stars]
- 🔱 NotebookLM - Google Gemini powered personal assistant / podcast generator
Speech to Text
- ⭐ whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 101153 stars]
- ⭐ vibe - GUI tool for working with whisper, with multilingual and CUDA support [GitHub, 6317 stars]
Datasets
- ⭐ VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 573 stars]
Note: Section keywords: topic modeling
🔙 Back to the Table of Contents
Blogs
- 📙 Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
- 📙 A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]
Frameworks for Topic Modeling
Repositories
- ⭐ Top2Vec [GitHub, 3107 stars]
- ⭐ Anchored Correlation Explanation Topic Modeling [GitHub, 307 stars]
- ⭐ Topic Modeling in Embedding Spaces [GitHub, 561 stars] Paper
- ⭐ TopicNet - A high-level interface for BigARTM library [GitHub, 143 stars]
- ⭐ BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 7655 stars]
- ⭐ OCTIS - A Python package to optimize and evaluate topic models [GitHub, 802 stars]
- ⭐ Contextualized Topic Models [GitHub, 1269 stars]
- ⭐ GSDMM - GSDMM: Short text clustering [GitHub, 359 stars]
Note: Section keywords: keyword extraction
🔙 Back to the Table of Contents
Text Rank
- ⭐ PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2213 stars]
- ⭐ textrank - TextRank implementation for Python 3 [GitHub, 1269 stars]
RAKE - Rapid Automatic Keyword Extraction
- ⭐ rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1084 stars]
- ⭐ yake - Single-document unsupervised keyword extraction [GitHub, 1861 stars]
- ⭐ RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 372 stars]
Other Approaches
- ⭐ flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5714 stars]
- ⭐ BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 261 stars]
- ⭐ keyBERT - Minimal keyword extraction with BERT [GitHub, 4182 stars]
- ⭐ KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 267 stars]
Further Reading
- 📙 Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
- 📙 How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]
Note Section keywords: ethics, responsible NLP
🔙 Back to the Table of Contents
NLP and ML Interpretability
NLP-centric
- Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
- ⭐ ecco - Tools to visualize and explore NLP language models [GitHub, 2102 stars]
- ⭐ NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub, 244 stars]
- ⭐ transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 1412 stars]
- ⭐ Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 1643 stars]
- ⭐ LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1387 stars]
General
- ⭐ Language Interpretability Tool (LIT) [GitHub, 3654 stars]
- ⭐ WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 480 stars]
- ⭐ Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 432 stars]
- ⭐ InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6866 stars]
- ⭐ thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
- ⭐ Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 372 stars]
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1592 stars]
Ethics, Bias, and Equality in NLP
- 📙 Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
- 🎥️ Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
- 🗂️ Ethics in NLP - resources from ACL's Ethics in NLP track
- 🗂️ The Institute for Ethical AI & Machine Learning
- 📙 Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
- ⭐ Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 101 stars]
- ⭐ nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 72 stars]
- 🗂️ bias-in-nlp - list of papers related to bias in NLP [GitHub, 11 stars]
Adversarial Attacks for NLP
- 📙 Privacy Considerations in Large Language Models [Blog, Dec 2020]
- ⭐ DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 85 stars]
- ⭐ Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 64 stars]
Hate Speech Analysis
- ⭐ HateXplain - BERT for detecting abusive language [GitHub, 240 stars]
NLP & Security
- ⭐ vuln2vc - domain-specific Word2Vec model for cybersecurity text mining and NLP research [GitHub, 2 stars]
Note Section keywords: frameworks
🔙 Back to the Table of Contents
General Purpose
- ⭐ spaCy by Explosion AI [GitHub, 33624 stars]
- ⭐ flair by Zalando [GitHub, 14377 stars]
- ⭐ AllenNLP by AI2 [GitHub, 11897 stars]
- ⭐ stanza (formerly Stanford NLP) [GitHub, 7806 stars]
- ⭐ spaCy stanza [GitHub, 748 stars]
- ⭐ nltk [GitHub, 14634 stars]
- ⭐ gensim - framework for topic modeling [GitHub, 16421 stars]
- ⭐ pororo - Platform of neural models for natural language processing [GitHub, 1307 stars]
- ⭐ NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2933 stars]
- ⭐ FARM [GitHub, 1754 stars]
- ⭐ gobbli by RTI International [GitHub, 274 stars]
- ⭐ headliner - training and deployment of seq2seq models [GitHub, 228 stars]
- ⭐ SyferText - A privacy preserving NLP framework [GitHub, 198 stars]
- ⭐ DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1265 stars]
- ⭐ TextHero - Text preprocessing, representation and visualization [GitHub, 2911 stars]
- ⭐ textblob - TextBlob: Simplified Text Processing [GitHub, 9536 stars]
- ⭐ AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
- ⭐ textacy - NLP, before and after spaCy [GitHub, 2241 stars]
- ⭐ texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2391 stars]
- ⭐ jiant - jiant is an NLP toolkit [GitHub, 1675 stars]
Data Augmentation
- ⭐ WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- ⭐ snorkel - Framework to generate training data [GitHub, 5970 stars]
- ⭐ NLPAug - Data augmentation for NLP [GitHub, 4657 stars]
- ⭐ SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 358 stars]
- ⭐ faker - Python package that generates fake data for you [GitHub, 19253 stars]
- ⭐ textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 651 stars]
- ⭐ Parrot - Practical and feature-rich paraphrasing framework [GitHub, 919 stars]
- ⭐ AugLy - data augmentations library for audio, image, text, and video [GitHub, 5084 stars]
- ⭐ TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 438 stars]
Adversarial NLP Attacks & Behavioral Testing
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 3427 stars]
- ⭐ CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6438 stars]
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2050 stars]
Transformer-oriented
- ⭐ transformers by HuggingFace [GitHub, 161166 stars]
- ⭐ Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2812 stars]
- ⭐ haystack - Transformers at scale for question answering & neural search. [GitHub, 25432 stars]
Dialogue Systems and Speech, Voice Agents
- ⭐ DeepPavlov by MIPT [GitHub, 6986 stars]
- ⭐ ParlAI by FAIR [GitHub, 10627 stars]
- ⭐ rasa - Framework for Conversational Agents [GitHub, 21190 stars]
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6444 stars]
- ⭐ ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14488 stars]
- ⭐ SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 11581 stars]
- ⭐ dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
- ⭐ gabber - AI applications that can see, hear, and speak using your screens and microphones [GitHub, 1103 stars]
Word/Sentence-embeddings oriented
- ⭐ MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3246 stars]
- ⭐ vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 654 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 18765 stars]
Social Media Oriented
- ⭐ Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 676 stars]
Phonetics
- ⭐ DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 426 stars]
Morphology
- ⭐ LemmInflect - python module for English lemmatization and inflection [GitHub, 280 stars]
- ⭐ Inflect - generate plurals, ordinals, indefinite articles [GitHub, 1076 stars]
- ⭐ simplemma - simple multilingual lemmatizer for Python [GitHub, 1076 stars]
Multi-lingual tools
- ⭐ polyglot - Multi-lingual NLP Framework [GitHub, 2367 stars]
- ⭐ trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 795 stars]
Distributed NLP / Multi-GPU NLP
- ⭐ Spark NLP [GitHub, 4131 stars]
- ⭐ Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 788 stars]
Machine Translation
- ⭐ COMET - A Neural Framework for MT Evaluation [GitHub, 756 stars]
- ⭐ marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1449 stars]
- ⭐ argos-translate - Open source neural machine translation in Python [GitHub, 6092 stars]
- ⭐ Opus-MT - Open neural machine translation models and web services [GitHub, 819 stars]
- ⭐ dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 497 stars]
- ⭐ CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 4507 stars]
Entity and String Matching
- ⭐ PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 797 stars]
- ⭐ pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 1104 stars]
- ⭐ fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9259 stars]
- ⭐ jellyfish - approximate and phonetic matching of strings [GitHub, 2215 stars]
- ⭐ textdistance - Compute distance between sequences [GitHub, 3534 stars]
- ⭐ DeepMatcher - Python package for entity and text matching using deep learning [GitHub, 617 stars]
- ⭐ RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 341 stars]
- ⭐ Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 21 stars]
- ⭐ bge-m3 - BGE-M3 hybrid retrieval + re-ranking [GitHub, 11753 stars]
- 📙 Cosine_Similarity_Explainer - Semantic Similarity Explainer with AI
Discourse Analysis
- ⭐ ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 635 stars]
PII scrubbing
- ⭐ scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 425 stars]
Hashtag Segmentation
- ⭐ hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 77 stars]
Books Analysis / Literary Analysis / Semantic Search
- ⭐ booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 918 stars]
- ⭐ bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
- ⭐ SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 327 stars]
Non-English oriented
Japanese
- ⭐ fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 521 stars]
- ⭐ SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 437 stars]
- ⭐ Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 261 stars]
- ⭐ jProcessing - Japanese Natural Language Processing Libraries [GitHub, 147 stars]
- ⭐ Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 854 stars]
- ⭐ kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 1043 stars]
- ⭐ nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 417 stars]
- ⭐ KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 213 stars]
- ⭐ Jigg - Pipeline framework for easy natural language processing [GitHub, 77 stars]
- ⭐ Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 414 stars]
- ⭐ RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 472 stars]
- ⭐ toiro - a comparison tool of Japanese tokenizers [GitHub, 123 stars]
Thai
- ⭐ AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 96 stars]
- ⭐ ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 17 stars]
Chinese
- ⭐ Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 70 stars]
Ukrainian
- ⭐ recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)
Other
- ⭐ textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
- ⭐ Kashgari - Transfer Learning with focus on Chinese [GitHub, 2384 stars]
- ⭐ Underthesea - Vietnamese NLP Toolkit [GitHub, 1737 stars]
- ⭐ PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 91 stars]
Text Data Labelling & Classification
- ⭐ Small-Text - Active Learning for Text Classification in Python [GitHub, 643 stars]
- ⭐ Doccano - open source annotation tool for machine learning practitioners [GitHub, 10659 stars]
- ⭐ Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 1593 stars]
- ⭐ EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1652 stars]
- 🔱 Prodigy - annotation tool powered by active learning [Paid Service]
Note Section keywords: learn NLP
🔙 Back to the Table of Contents
General
- 📙 Learn NLP the practical way [Blog, Nov. 2019]
- 📙 Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
- 📙 Choosing the right course for a Practical NLP Engineer
- 📙 12 Best Natural Language Processing Courses & Tutorials to Learn Online
- ⭐ Treasure of Transformers - Natural Language Processing papers, videos, blogs, and official repos, along with Colab notebooks [GitHub, 1143 stars]
- 🎥️ Rasa Algorithm Whiteboard - YouTube series by Rasa explaining various Data Science and NLP Algorithms
- 🎥️ ExplosionAI Videos - YouTube series by ExplosionAI teaching you how to use spacy and apply it for NLP
Courses
- 🎥️ CS25: Transformers United Stanford - Fall 2021 [Course, Fall 2021]
- 📙 NLP Course | For You - Great and interactive course on NLP
- 📙 Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
- 📙 Transformer models for NLP by HuggingFace
- 🎥️ Stanford NLP Seminar - slides from the Stanford NLP course
Books
- 📙 Applied Natural Language Processing in the Enterprise - [Book, May 2021]
- 📙 Practical Natural Language Processing - [Book, June 2020]
- 📙 Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
- 📙 Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
Tutorials
- ⭐ nlp-tutorial - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub, 1374 stars]
- ⭐ nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14901 stars]
- ⭐ Hands-On NLTK Tutorial [GitHub, 572 stars]
- ⭐ Modern Practical Natural Language Processing [GitHub, 266 stars]
- ⭐ Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 11633 stars]
- 🗂️ CalmCode Tutorials - Set of Python Data Science Tutorials
- r/LanguageTechnology - NLP Reddit forum
🔙 Back to the Table of Contents
Tokenization
- ⭐ tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 10782 stars]
- ⭐ SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 11870 stars]
- ⭐ SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 152 stars]
Data Augmentation and Weak Supervision
Libraries and Frameworks
- ⭐ WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- ⭐ NLPAug - Data augmentation for NLP [GitHub, 4657 stars]
- ⭐ SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 358 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 3427 stars]
- ⭐ skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 927 stars]
- ⭐ NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 787 stars]
- ⭐ EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1652 stars]
- ⭐ snorkel - Framework to generate training data [GitHub, 5970 stars]
- ⭐ dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Reading Material and Tutorials
- ⭐ A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
- 📙 A Visual Survey of Data Augmentation in NLP [Blog, 2020]
- 📙 Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]
Named Entity Recognition (NER)
- ⭐ Datasets for Entity Recognition [GitHub, 1573 stars]
- ⭐ Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 344 stars]
- ⭐ Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 218 stars]
- ⭐ Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 400 stars]
Relation Extraction
- ⭐ tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 359 stars]
- ⭐ tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 70 stars]
- ⭐ tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 66 stars]
- ⭐ Re-TACRED - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 55 stars]
Coreference Resolution
- ⭐ NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2889 stars]
- ⭐ coref - BERT and SpanBERT for Coreference Resolution [GitHub, 453 stars]
Sentiment Analysis
- ⭐ Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 538 stars]
- ⭐ Awesome Sentiment Analysis by xiamx [GitHub, 931 stars]
Domain Adaptation
- ⭐ Neural Adaptation in Natural Language Processing - curated list [GitHub, 266 stars]
Low Resource NLP
- ⭐ CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 606 stars]
Spell Correction / Error Correction
- ⭐ Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1581 stars]
- ⭐ NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 712 stars]
- ⭐ SymSpellPy - Python port of SymSpell [GitHub, 871 stars]
- 📙 Speller100 by Microsoft [Blog, Feb 2021]
- ⭐ JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 662 stars]
- ⭐ pycorrector - spell correction for Chinese [GitHub, 6452 stars]
- ⭐ contractions - Fixes contractions such as `you're` to `you are` [GitHub, 318 stars]
- 📙 Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]
PDF Parsing
- ⭐ spacy-layout - Process PDFs, Word documents and more with spaCy [GitHub, 902 stars]
- ⭐ bentopdf - A Privacy First PDF Toolkit [GitHub, 13552 stars]
Style Transfer for NLP
- ⭐ Styleformer - Neural Language Style Transfer framework [GitHub, 494 stars]
- ⭐ StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 64 stars]
Automata Theory for NLP
- ⭐ pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 1104 stars]
Obscene words detection
- ⭐ LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 3376 stars]
Reddit Analysis
- ⭐ Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 499 stars]
Skill Detection
- ⭐ SkillNER - rule based NLP module to extract job skills from text [GitHub, 209 stars]
Reinforcement Learning for NLP
- ⭐ nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 201 stars]
AutoML / AutoNLP
- ⭐ AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 4574 stars]
- ⭐ TPOT - Python Automated Machine Learning tool [GitHub, 10047 stars]
- ⭐ Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2534 stars]
- ⭐ HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 0 stars]
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 14280 stars]
- ⭐ FLAML - fast and lightweight AutoML library [GitHub, 4360 stars]
- ⭐ Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 307 stars]
OCR - Optical Character Recognition
- 🎥️ A framework for designing document processing solutions [Blog, June 2022]
Document AI
Text Generation
- ⭐ keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 451 stars]
- 📙 Controllable Neural Text Generation [Blog, Jan 2021]
- ⭐ BARTScore - Evaluating Generated Text as Text Generation [GitHub, 369 stars]
Title / Headlines Generation
- ⭐ TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 78 stars]
NLP research reproducibility
- 📙 A Systematic Review of Reproducibility Research in Natural Language Processing [Paper, March 2021]
License CC0
Attributions
Resources
- All linked resources belong to original authors
Icons
- Akropolis by parkjisun from the Noun Project
- Book of Ester by Gilad Sotil from the Noun Project
- quill by Juan Pablo Bravo from the Noun Project
- acting by Flatart from the Noun Project
- olympic by supalerk laipawat from the Noun Project
- aristocracy by Eucalyp from the Noun Project
- Horn by Eucalyp from the Noun Project
- temple by Eucalyp from the Noun Project
- constellation by Eucalyp from the Noun Project
- ancient greek round pattern by Olena Panasovska from the Noun Project
- Harp by Vectors Point from the Noun Project
- Atlas by parkjisun from the Noun Project
- Parthenon by Eucalyp from the Noun Project
- papyrus by IconMark from the Noun Project
- papyrus by Smalllike from the Noun Project
- pegasus by Saeful Muslim from the Noun Project















