dirty_cat

March 12, 2025 · View on GitHub

dirty_cat has migrated to skrub <https://github.com/skrub-data/skrub>__ . This repository will no longer be maintained.

Use skrub, it has all the features of dirty-cat and more.

.. image:: https://dirty-cat.github.io/stable/_static/dirty_cat.svg :align: center :alt: dirty_cat logo

|

Do not use dirty_cat, but rather the skrub package

dirty_cat <https://dirty-cat.github.io/>_ was a Python library to facilitate machine-learning on dirty categorical variables.

Its functionalities are merged in the skrub <https://skrub-data.org>_

|

Dirty categories

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables <https://hal.inria.fr/hal-01806175>_ [1]_ and Encoding high-cardinality string categorical variables <https://hal.inria.fr/hal-02171256v4>_ [2]_.

What can dirty_cat do?

dirty_cat provides tools (TableVectorizer, fuzzy_join...) and encoders (GapEncoder, MinHashEncoder...) for morphological similarities, for which we usually identify three common cases: similarities, typos and variations

The first example notebook <https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html>_ goes in-depth on how to identify and deal with dirty data using the dirty_cat library.

What dirty_cat does not


`Semantic similarities <https://en.wikipedia.org/wiki/Semantic_similarity>`_
are currently not supported.
For example, the similarity between *car* and *automobile* is outside the reach
of the methods implemented here.

This kind of problem is tackled by
`Natural Language Processing <https://en.wikipedia.org/wiki/Natural_language_processing>`_
methods.

`dirty_cat` can still help with handling typos and variations in this kind of setting.

Installation
------------

Please do not use dirty-cat anymore, but rather skrub, which has the same
features, replaces dirty-cat and can be easily installed via `pip`::

    pip install skrub

Dependencies
~~~~~~~~~~~~

Dependencies and minimal versions are listed in the `setup <https://github.com/dirty-cat/dirty_cat/blob/main/setup.cfg#L26>`_ file.

Related projects
----------------

`skrub <https://skrub-data.org>`_

Contributing
------------

If you want to encourage development of these functionality, the best
thing to do is to *spread the word* around `skrub <https://skrub-data.org>`_

And please contribute to `skrub <https://github.com/skrub-data/skrub>`_

Additional resources
--------------------

* `Introductory video (YouTube) <https://youtu.be/_GNaaeEI2tg>`_
* `Overview poster for EuroSciPy 2022 (Google Drive) <https://drive.google.com/file/d/1TtmJ3VjASy6rGlKe0txKacM-DdvJdIvB/view?usp=sharing>`_

References
----------

.. [1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.
.. [2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.