December 27, 2025

Python 3.12+

Installation

git clone https://github.com/ChenghaoMou/text-dedup
cd text-dedup
uv sync

Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that you can use as-is or modify to suit your needs:

  • MinHash + MinHashLSH for near-duplicate detection
  • 64- or 128-bit SimHash
  • Suffix Array substring exact deduplication
  • Bloom Filter exact deduplication

All algorithms use a config-based approach with TOML files for easy customization.

Quick Start

All deduplication scripts read from a config.toml file in the project root.

1. Configure your settings

Edit config.toml with your input data and algorithm settings:

MinHash Near Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "minhash"
text_column = "text"
seed = 42
batch_size = 10000
num_perm = 240
threshold = 0.7
false_positive_weight = 0.5
false_negative_weight = 0.5
hash_bits = 64
ngram_size = 5
check_false_positive = true

[output]
output_dir = "output"
clean_cache = false
save_clusters = true

[debug]
enable_profiling = false
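
To give a feel for how num_perm, threshold, false_positive_weight, and false_negative_weight interact, here is a minimal sketch of one common way to pick LSH bands and rows (the datasketch library, for example, uses this kind of search). It minimizes a weighted sum of the false-positive and false-negative areas under the LSH candidate-probability curve. Whether this project uses exactly this procedure is an assumption; the function names are hypothetical.

```python
def integrate(func, lower, upper, steps=200):
    """Approximate the integral of func over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


def candidate_probability(similarity, bands, rows):
    """Chance that two documents with this Jaccard similarity share at least one band."""
    return 1.0 - (1.0 - similarity**rows) ** bands


def choose_bands_rows(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Pick (bands, rows) with bands * rows <= num_perm that minimizes the weighted error."""
    best_error, best_params = float("inf"), (1, 1)
    for bands in range(1, num_perm + 1):
        for rows in range(1, num_perm // bands + 1):
            # False positives: pairs below the threshold that still become candidates.
            fp = integrate(lambda s: candidate_probability(s, bands, rows), 0.0, threshold)
            # False negatives: pairs above the threshold that are missed.
            fn = integrate(lambda s: 1.0 - candidate_probability(s, bands, rows), threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best_error, best_params = error, (bands, rows)
    return best_params


# Values taken from the MinHash config above.
bands, rows = choose_bands_rows(threshold=0.7, num_perm=240)
print(f"bands={bands}, rows={rows}, permutations used={bands * rows}")
```

Raising false_negative_weight pushes the search toward more bands with fewer rows (higher recall), while raising false_positive_weight does the opposite.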

SimHash Near Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "simhash"
text_column = "text"
hash_bits = 64
ngram_size = 3
bit_diff = 3

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
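
The roles of hash_bits, ngram_size, and bit_diff can be sketched as follows: each document is reduced to a fixed-width fingerprint built from its n-grams, and two documents count as near-duplicates when their fingerprints differ in at most bit_diff bits. This is a simplified illustration, not the project's implementation; the use of word n-grams and of BLAKE2b as the underlying hash are assumptions made for the example.

```python
import hashlib


def fingerprint(text, ngram_size=3, hash_bits=64):
    """Compute a SimHash fingerprint from word n-grams (assumed tokenization)."""
    tokens = text.lower().split()
    grams = [
        " ".join(tokens[i : i + ngram_size])
        for i in range(max(1, len(tokens) - ngram_size + 1))
    ]
    counts = [0] * hash_bits
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=hash_bits // 8).digest()
        value = int.from_bytes(digest, "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(hash_bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << bit for bit in range(hash_bits) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, bit_diff=3):
    return hamming_distance(fingerprint(text_a), fingerprint(text_b)) <= bit_diff


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
print(hamming_distance(fingerprint(doc_a), fingerprint(doc_b)))
```

A larger bit_diff tolerates more differing bits and therefore merges more loosely related documents.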

Bloom Filter Exact Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "bloom_filter"
text_column = "text"
error_rate = 1e-5
expected_elements = 100000

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
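
To show what error_rate and expected_elements control, the sketch below applies the standard Bloom filter sizing formulas (bits m = -n ln p / (ln 2)^2, hash count k = (m / n) ln 2) and uses the result in a tiny filter for exact deduplication. The BloomFilter class, its add_if_new method, and the SHA-256 double-hashing scheme are hypothetical and chosen for illustration; the project may rely on a different implementation.

```python
import hashlib
import math


def bloom_parameters(expected_elements, error_rate):
    """Return (number of bits, number of hash functions) from the standard formulas."""
    num_bits = math.ceil(-expected_elements * math.log(error_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / expected_elements * math.log(2)))
    return num_bits, num_hashes


class BloomFilter:
    def __init__(self, expected_elements, error_rate):
        self.num_bits, self.num_hashes = bloom_parameters(expected_elements, error_rate)
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item):
        """Record item; return True if it was (probably) not seen before."""
        positions = self._positions(item)
        seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return not seen


# Values taken from the Bloom filter config above.
bloom = BloomFilter(expected_elements=100_000, error_rate=1e-5)
documents = ["hello world", "another document", "hello world"]
unique = [doc for doc in documents if bloom.add_if_new(doc)]
print(bloom.num_bits, bloom.num_hashes, unique)
```

If the dataset holds many more rows than expected_elements, the real false-positive rate rises above error_rate, so distinct texts may be dropped as duplicates.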

Suffix Array Substring Exact Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "suffix_array"
text_column = "text"
google_repo_path = "third_party/deduplicate-text-datasets"
merge_strategy = "longest"
length_threshold = 100
cache_dir = ".cache"

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
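
The config above points google_repo_path at an external suffix-array implementation, so the heavy lifting happens outside Python. As a toy illustration of the underlying idea only, the sketch below builds a naive suffix array and reports byte ranges covered by substrings that occur at least twice and are at least length_threshold bytes long. The function names are hypothetical, the quadratic construction is unsuitable for real datasets, and the span merging shown here is not meant to document what merge_strategy = "longest" does.

```python
def build_suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: start offsets sorted by the suffix they begin."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_length(data: bytes, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(data) and j + n < len(data) and data[i + n] == data[j + n]:
        n += 1
    return n


def duplicate_spans(data: bytes, length_threshold: int) -> list[tuple[int, int]]:
    """Return merged (start, end) ranges covered by repeated substrings."""
    suffix_array = build_suffix_array(data)
    spans = []
    # Repeated substrings show up as long common prefixes of adjacent suffixes.
    for a, b in zip(suffix_array, suffix_array[1:]):
        lcp = common_prefix_length(data, a, b)
        if lcp >= length_threshold:
            spans.append((a, a + lcp))
            spans.append((b, b + lcp))
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching ranges are combined into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


sample = b"abcdefgh XYZ abcdefgh"
print(duplicate_spans(sample, length_threshold=8))
```

With the config's length_threshold = 100, only repeats of at least 100 bytes would be reported; the example uses 8 so that a short string can demonstrate the effect.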

2. Run the deduplication

# MinHash
python -m text_dedup.minhash

# SimHash
python -m text_dedup.simhash

# Bloom Filter
python -m text_dedup.bloom_filter

# Suffix Array
python -m text_dedup.suffix_array

Benchmarks

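pinecone/core-2020-05-10-deduplication

| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MinHash | 0.9587 | 0.9416 | 0.9450 | 0.9611 | 0.9518 | 0.9277 | 11.09s |
| SimHash | 0.9038 | 0.7323 | 0.7993 | 0.9318 | 0.8515 | 0.8375 | 626.11s |
| Exact Title Matching [1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |

NEWS-COPY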

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

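| Model/Algorithm | ARI | Time |
| --- | --- | --- |
| MinHash | 0.7293 | 3.01s |
| SimHash | 0.6463 | 140.03s |
| n-gram [3] | 0.440 | - |
| SimHash [2] | 0.695 | - |
| MinHash [3] | 0.737 | - |
| MinHash [2] | 0.783 | - |
| Multilingual USE [2] | 0.730 | - |
| Multilingual E5-Base [2] | 0.742 | - |
| S-BERT [3] | 0.700 | - |
| RETSim Partial-Dup [2] | 0.831 | - |
| RETSim Near-Dup [2] | 0.704 | - |
| Re-ranking [3] | 0.937 | - |
| Bi-encoder [3] | 0.915 | - |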
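
For readers unfamiliar with the Adjusted Rand Index, the sketch below computes it from two cluster assignments using the standard pair-counting formula. It is only meant to clarify what the ARI column measures; the benchmark scripts may well use a library routine instead, and the function name here is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Pair-counting ARI: 1.0 for identical clusterings, about 0.0 for random ones."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs that are grouped together in both clusterings.
    together_in_both = sum(
        comb(count, 2) for count in Counter(zip(true_labels, predicted_labels)).values()
    )
    together_true = sum(comb(count, 2) for count in Counter(true_labels).values())
    together_pred = sum(comb(count, 2) for count in Counter(predicted_labels).values())
    expected = together_true * together_pred / total_pairs
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


# Cluster ids are arbitrary: only the grouping matters.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

Because ARI is corrected for chance, it can be negative when two clusterings agree less than random assignments would.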

Running Benchmarks

You can reproduce the benchmark results using the provided benchmark suite.
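
When comparing reproduced numbers with the CORE table, it can help to recall how the per-class metrics are defined. The sketch below derives precision and recall for both classes, macro F1, and accuracy from boolean labels (True meaning "duplicate"). It assumes macro F1 is the unweighted mean of the two per-class F1 scores; the function name is hypothetical and the benchmark suite's own evaluation code may differ in details.

```python
def duplicate_metrics(true_labels, predicted_labels):
    """Compute per-class precision/recall, macro F1, and accuracy for boolean labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    def f1(precision, recall):
        return ratio(2 * precision * recall, precision + recall)

    precision_dup, recall_dup = ratio(tp, tp + fp), ratio(tp, tp + fn)
    precision_non, recall_non = ratio(tn, tn + fn), ratio(tn, tn + fp)
    return {
        "precision_duplicates": precision_dup,
        "recall_duplicates": recall_dup,
        "precision_non_duplicates": precision_non,
        "recall_non_duplicates": recall_non,
        "macro_f1": (f1(precision_dup, recall_dup) + f1(precision_non, recall_non)) / 2,
        "accuracy": ratio(tp + tn, len(pairs)),
    }


truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
metrics = duplicate_metrics(truth, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```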

Quick Start with Just

# Run all benchmarks (both datasets, all algorithms)
just benchmark-all

# Run only CORE dataset benchmarks
just benchmark-core

# Run only NEWS-COPY dataset benchmarks
just benchmark-news

# Run specific algorithm on specific dataset
just benchmark-core-minhash
just benchmark-core-simhash
just benchmark-news-minhash
just benchmark-news-simhash

Configuration Files

Benchmark configuration files are located in configs/:

  • benchmark_core_minhash.toml - MinHash on CORE dataset
  • benchmark_core_simhash.toml - SimHash on CORE dataset
  • benchmark_news_minhash.toml - MinHash on NEWS-COPY dataset
  • benchmark_news_simhash.toml - SimHash on NEWS-COPY dataset

To customize benchmark parameters, edit the config files and adjust hyperparameters like num_perm, threshold, ngram_size, or bit_diff.
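
Before changing ngram_size or threshold, it may help to see how they interact. MinHash estimates the Jaccard similarity between the n-gram sets of two documents, so the sketch below computes that similarity exactly for a pair of short texts at two n-gram sizes. Word-level n-grams and the helper names are assumptions made for the example.

```python
def word_ngrams(text, ngram_size):
    """Return the set of word n-grams in text (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    if len(tokens) < ngram_size:
        return {" ".join(tokens)} if tokens else set()
    return {
        " ".join(tokens[i : i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    }


def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

for ngram_size in (1, 3):
    similarity = jaccard(word_ngrams(doc_a, ngram_size), word_ngrams(doc_b, ngram_size))
    print(f"ngram_size={ngram_size}: jaccard={similarity:.3f}")
```

A single changed word touches up to ngram_size n-grams, so the same pair scores lower as ngram_size grows. A threshold tuned for one n-gram size therefore usually needs revisiting when the other changes.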

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

Acknowledgements

This repository is inspired by several existing projects and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Footnotes

  1. Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

  2. RETSim: Resilient and Efficient Text Similarity

  3. Noise-Robust De-Duplication at Scale