Python Stop Words

November 3, 2025 · View on GitHub

================= Python Stop Words

.. image:: https://img.shields.io/pypi/v/stop-words.svg :target: https://pypi.org/project/stop-words/ :alt: PyPI version

.. image:: https://img.shields.io/pypi/pyversions/stop-words.svg :target: https://pypi.org/project/stop-words/ :alt: Python versions

.. image:: https://img.shields.io/pypi/l/stop-words.svg :target: https://github.com/Alir3z4/python-stop-words/blob/master/LICENSE :alt: License

.. contents:: Table of Contents :depth: 2 :local:

Overview

A Python library providing curated lists of stop words across 34+ languages. Stop words are common words (like "the", "is", "at") that are typically filtered out in natural language processing and text analysis tasks.

Key Features:

  • 34+ Languages - Extensive language support.
  • Performance - Built-in caching for fast repeated access.
  • Flexible - Custom filtering system for advanced use cases.
  • Zero Dependencies - Lightweight with no external requirements.

Available Languages

All the available languages supported by https://github.com/Alir3z4/stop-words

Each language is identified by both its ISO 639-1 language code (e.g., en) and full name (e.g., english).

Installation

Via pip (Recommended):

.. code-block:: bash

$ pip install stop-words

Via Git:

.. code-block:: bash

$ git clone --recursive https://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ pip install -e .

Requirements:

  • Usually any version of Python that supports type hints and probably has not been marked as EOL.

Quick Start

Basic Usage


.. code-block:: python

    from stop_words import get_stop_words

    # Get English stop words using language code
    stop_words = get_stop_words('en')
    
    # Or use the full language name
    stop_words = get_stop_words('english')
    
    # Use in text processing
    text = "The quick brown fox jumps over the lazy dog"
    words = text.lower().split()
    filtered_words = [word for word in words if word not in stop_words]
    print(filtered_words)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']


Safe Loading

Use safe_get_stop_words() when you're not sure if a language is supported:

.. code-block:: python

from stop_words import safe_get_stop_words

# Returns empty list instead of raising an exception
stop_words = safe_get_stop_words('klingon')  # Returns []

# Works normally with supported languages
stop_words = safe_get_stop_words('fr')  # Returns French stop words

Advanced Usage

Checking Available Languages


.. code-block:: python

    from stop_words import AVAILABLE_LANGUAGES, LANGUAGE_MAPPING

    # List all available languages
    print(AVAILABLE_LANGUAGES)
    # ['arabic', 'bulgarian', 'catalan', ...]

    # View language code mappings
    print(LANGUAGE_MAPPING)
    # {'en': 'english', 'fr': 'french', ...}


Caching Control
~~~~~~~~~~~~~~~

By default, stop words are cached for performance. You can control this behavior:

.. code-block:: python

    from stop_words import get_stop_words, STOP_WORDS_CACHE

    # Disable caching for this call
    stop_words = get_stop_words('en', cache=False)
    
    # Clear the cache manually
    STOP_WORDS_CACHE.clear()
    
    # Check what's cached
    print(STOP_WORDS_CACHE.keys())  # ['english', 'french', ...]


Custom Filters
~~~~~~~~~~~~~~

Apply custom transformations to stop words using the filter system:

.. code-block:: python

    from stop_words import get_stop_words, add_filter, remove_filter

    # Add a global filter (applies to all languages)
    def remove_short_words(words, language):
        """Remove words shorter than 3 characters."""
        return [w for w in words if len(w) >= 3]
    
    add_filter(remove_short_words)
    stop_words = get_stop_words('en', cache=False)
    
    # Add a language-specific filter
    def uppercase_words(words):
        """Convert all words to uppercase."""
        return [w.upper() for w in words]
    
    add_filter(uppercase_words, language='english')
    stop_words = get_stop_words('en', cache=False)
    
    # Remove a filter when done
    remove_filter(uppercase_words, language='english')

**Note:** Filters only apply to newly loaded stop words, not cached ones. Use ``cache=False`` or clear the cache to apply new filters.


Practical Examples
------------------

Text Preprocessing
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from stop_words import get_stop_words
    import re

    def preprocess_text(text, language='en'):
        """Clean and filter text for NLP tasks."""
        stop_words = set(get_stop_words(language))
        
        # Convert to lowercase and extract words
        words = re.findall(r'\b\w+\b', text.lower())
        
        # Remove stop words
        filtered_words = [w for w in words if w not in stop_words]
        
        return filtered_words

    text = "The quick brown fox jumps over the lazy dog"
    print(preprocess_text(text))
    # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']


Multilingual Processing
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from stop_words import get_stop_words

    def filter_multilingual_text(texts_dict):
        """Process texts in multiple languages.
        
        Args:
            texts_dict: Dictionary mapping language codes to text strings
        
        Returns:
            Dictionary with filtered words for each language
        """
        results = {}
        
        for lang_code, text in texts_dict.items():
            stop_words = set(get_stop_words(lang_code))
            words = text.lower().split()
            filtered = [w for w in words if w not in stop_words]
            results[lang_code] = filtered
        
        return results

    texts = {
        'en': 'The cat is on the table',
        'fr': 'Le chat est sur la table',
        'es': 'El gato está en la mesa'
    }
    
    print(filter_multilingual_text(texts))


Keyword Extraction
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from stop_words import get_stop_words
    from collections import Counter
    import re

    def extract_keywords(text, language='en', top_n=10):
        """Extract the most common meaningful words from text."""
        stop_words = set(get_stop_words(language))
        
        # Extract words and filter
        words = re.findall(r'\b\w+\b', text.lower())
        meaningful_words = [w for w in words if w not in stop_words and len(w) > 2]
        
        # Count and return top keywords
        word_counts = Counter(meaningful_words)
        return word_counts.most_common(top_n)

    article = """
    Python is a high-level programming language. Python is known for its 
    simplicity and readability. Many developers choose Python for data science.
    """
    
    keywords = extract_keywords(article)
    print(keywords)
    # [('python', 3), ('language', 1), ('high-level', 1), ...]


API Reference
-------------

Functions
~~~~~~~~~

``get_stop_words(language, *, cache=True)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Load stop words for a specified language.

**Parameters:**

* ``language`` (str): Language code (e.g., 'en') or full name (e.g., 'english')
* ``cache`` (bool, optional): Enable caching. Defaults to True.

**Returns:**

* ``list[str]``: List of stop words

**Raises:**

* ``StopWordError``: If language is unavailable or files are unreadable

**Example:**

.. code-block:: python

    stop_words = get_stop_words('en')
    stop_words = get_stop_words('french', cache=False)


``safe_get_stop_words(language)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Safely load stop words, returning empty list on error.

**Parameters:**

* ``language`` (str): Language code or full name

**Returns:**

* ``list[str]``: Stop words, or empty list if unavailable

**Example:**

.. code-block:: python

    stop_words = safe_get_stop_words('unknown')  # Returns []


``add_filter(func, language=None)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Register a filter function for stop word post-processing.

**Parameters:**

* ``func`` (Callable): Filter function
* ``language`` (str | None, optional): Language code or None for global filter

**Filter Signatures:**

* Language-specific: ``func(stopwords: list[str]) -> list[str]``
* Global: ``func(stopwords: list[str], language: str) -> list[str]``

**Example:**

.. code-block:: python

    def remove_short(words, lang):
        return [w for w in words if len(w) > 3]
    
    add_filter(remove_short)  # Global filter


``remove_filter(func, language=None)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove a previously registered filter.

**Parameters:**

* ``func`` (Callable): The filter function to remove
* ``language`` (str | None, optional): Language code or None

**Returns:**

* ``bool``: True if removed, False if not found

**Example:**

.. code-block:: python

    success = remove_filter(my_filter, language='english')


Constants
~~~~~~~~~

``AVAILABLE_LANGUAGES``
^^^^^^^^^^^^^^^^^^^^^^^^

List of all supported language names.

.. code-block:: python

    ['arabic', 'bulgarian', 'catalan', ...]


``LANGUAGE_MAPPING``
^^^^^^^^^^^^^^^^^^^^

Dictionary mapping language codes to full names.

.. code-block:: python

    {'en': 'english', 'fr': 'french', 'de': 'german', ...}


``STOP_WORDS_CACHE``
^^^^^^^^^^^^^^^^^^^^^

Dictionary storing cached stop words. Can be manually cleared.

.. code-block:: python

    STOP_WORDS_CACHE.clear()  # Clear all cached data


Exceptions
~~~~~~~~~~

``StopWordError``
^^^^^^^^^^^^^^^^^

Raised when a language is unavailable or files cannot be read.

.. code-block:: python

    try:
        stop_words = get_stop_words('invalid')
    except StopWordError as e:
        print(f"Error: {e}")


Performance Tips
----------------

1. **Use caching** - Keep ``cache=True`` (default) for repeated access to the same language
2. **Reuse stop word sets** - Convert to ``set()`` once for O(1) lookup performance:

   .. code-block:: python

       stop_words_set = set(get_stop_words('en'))
       # Fast membership testing
       is_stop_word = 'the' in stop_words_set

3. **Preload languages** - Load stop words during initialization, not in tight loops
4. **Use safe_get_stop_words** - Avoid try/except overhead when language availability is uncertain


Troubleshooting
---------------

**"Language unavailable" error**

* Check spelling and use either the language code or full name
* Verify the language is in ``AVAILABLE_LANGUAGES``
* See the `Available Languages`_ table above

**"File is unreadable" error**

* Ensure the package installed correctly: ``pip install --force-reinstall stop-words``
* Check file permissions in the installation directory
* Verify the ``stop-words`` subdirectory exists in the package

**Filters not applying**

* Filters only affect newly loaded stop words
* Clear the cache: ``STOP_WORDS_CACHE.clear()``
* Use ``cache=False`` when testing filters

**Performance issues**

* Ensure caching is enabled (default behavior)
* Convert stop word lists to sets for faster lookups
* Preload stop words outside of loops


Contributing
------------

Contributions are welcome! Here's how you can help:

1. **Add new languages** - Submit stop word lists for unsupported languages via https://github.com/Alir3z4/stop-words
2. **Improve existing lists** - Suggest additions or removals for existing languages via https://github.com/Alir3z4/stop-words
3. **Report bugs** - Open issues on GitHub
4. **Submit PRs** - Fix bugs or add features

**Repository:** https://github.com/Alir3z4/python-stop-words


License
-------

This project is licensed under the BSD 3-Clause License. See ``LICENSE`` file for details.


Changelog
---------

See `ChangeLog.rst <https://github.com/Alir3z4/python-stop-words/blob/master/ChangeLog.rst>`_ for version history.


Support
-------

* **Issues:** https://github.com/Alir3z4/python-stop-words/issues
* **PyPI:** https://pypi.org/project/stop-words/


Credits
-------

* Maintained by `Alireza Savand <https://github.com/Alir3z4>`_
* Stop word lists compiled from various open sources
* Contributors: See `GitHub contributors <https://github.com/Alir3z4/python-stop-words/graphs/contributors>`_


Related Projects
----------------
* `Stop Words <https://github.com/Alir3z4/stop-words>`_ - List of common stop words in various languages.
* `NLTK <https://www.nltk.org/>`_ - Natural Language Toolkit with extensive NLP features
* `spaCy <https://spacy.io/>`_ - Industrial-strength NLP library
* `TextBlob <https://textblob.readthedocs.io/>`_ - Simplified text processing


Indices and Tables
------------------

* `Available Languages`_
* `Quick Start`_
* `Advanced Usage`_
* `API Reference`_