Stop Words
October 29, 2025
List of common stop words in various languages.
The words are normalized to Unicode Normalization Form C (NFC).
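As a rough illustration of what NFC normalization means for a word list, the sketch below uses Python's standard `unicodedata` module. The helper name `to_nfc` is hypothetical and is not taken from this repository.

```python
import unicodedata


def to_nfc(word: str) -> str:
    """Return the word in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", word)


# "été" written with combining acute accents (decomposed, 5 code points)
decomposed = "e\u0301te\u0301"
# The same word with precomposed characters (3 code points)
composed = "\u00e9t\u00e9"

# The two strings look identical but compare unequal until normalized.
print(decomposed == composed)          # False
print(to_nfc(decomposed) == composed)  # True
```

Normalizing both the word list and the input text to the same form keeps visually identical words from being missed during lookups.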
Maintaining the lists
The manage.py script helps with maintaining the word lists.
To merge the English word list with new lists, you can use the following:
python -m manage merge en /tmp/new_list.txt /tmp/another_new_list.txt
The language code above is used for two purposes:
- Determining the source file based on languages.json
- Determining the libICU locale to use when comparing words
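Conceptually, a merge combines several lists, drops duplicates and blank entries, and restores the sort order. The following is a minimal sketch of that idea, not the code in manage.py: the function name `merge_word_lists` is hypothetical, and it sorts by plain code point order as a stand-in for the locale-aware libICU comparison the script uses.

```python
import unicodedata
from typing import Iterable, List


def merge_word_lists(existing: Iterable[str], *new_lists: Iterable[str]) -> List[str]:
    """Merge word lists into one sorted list without duplicates.

    Assumption: plain code point ordering is used here for simplicity;
    the real script compares words with a libICU locale instead.
    """
    merged = set()
    for words in (existing, *new_lists):
        for word in words:
            # Trim whitespace and normalize to NFC so equal words collapse.
            cleaned = unicodedata.normalize("NFC", word.strip())
            if cleaned:  # skip blank lines
                merged.add(cleaned)
    return sorted(merged)


print(merge_word_lists(["the", "a"], ["an", "the ", ""], ["a"]))
```

For languages with accented or non-Latin characters, code point order can differ from the order a native reader expects, which is why a locale-aware collator is the better choice in practice.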
If new words are added manually, you can use the following to maintain the sorting order:
python -m manage sort en
or, to sort every word list at once:
python -m manage sort-all
The management script contains code that can be used as a library. See the LanguageDataIndex class and the sort_word_list function for more details.
Available languages
- Arabic
- Bulgarian
- Catalan
- Chinese
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Gujarati
- Hindi
- Hebrew
- Hungarian
- Indonesian
- Malaysian
- Italian
- Japanese
- Korean
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Spanish
- Swedish
- Turkish
- Ukrainian
- Vietnamese
- Persian/Farsi
Contributing
You know how ;)
Programming language support
- Python: https://github.com/Alir3z4/python-stop-words
- dotnet: https://github.com/hklemp/dotnet-stop-words
- rust: https://github.com/cmccomb/rust-stop-words