Zap Product Deduplication

April 7, 2026 · View on GitHub

A pipeline for merging duplicate product listings that carry inconsistent names (Hebrew, English, abbreviated, ALL-CAPS) and surfacing the lowest price per unique product.


Pipeline Architecture

Input CSV  (id, name, price, category [, group_id])


┌─────────────────────────────────────────────────────┐
│  Stage 0: Hebrew → English Normalization             │
│  • gpt-4o-mini translates Hebrew names once          │
│  • Results cached in cache/normalizations.json       │
│  • Pure-English names pass through unchanged         │
└──────────────────────┬──────────────────────────────┘
                       │ normalized names

┌─────────────────────────────────────────────────────┐
│  Stage 1: Embeddings + FAISS ANN        │
│  • text-embedding-3-small → 1536-dim vectors         │
│  • Cached in cache/embeddings.json                   │
│  • FAISS IndexFlatIP on L2-normalized vectors        │
│    → cosine similarity via inner product             │
└──────────────────────┬──────────────────────────────┘
                       │ candidate pairs

┌─────────────────────────────────────────────────────┐
│  Stage 2: Category-Aware Union-Find Clustering       │
│  • Only products in the same category are compared   │
│  • Union-Find merges pairs above threshold (0.72)    │
└──────────────────────┬──────────────────────────────┘
                       │ raw clusters

┌─────────────────────────────────────────────────────┐
│  Stage 3: LLM Cluster Refinement                     │
│  • Full cluster in one GPT-4o-mini call (not per-pair)    │
│  • Verifies membership, splits false positives       │
│  • Generates canonical English product name          │
│  • Hard rule: 256GB ≠ 512GB, 55" ≠ 65", 9KG ≠ 11KG  │
└──────────────────────┬──────────────────────────────┘
                       │ verified clusters + canonical names

┌─────────────────────────────────────────────────────┐
│  Stage 4: Merge & Min Price                          │
│  • Picks lowest-priced listing per group             │
│  → output/deduplicated_products.csv                  │
└──────────────────────┬──────────────────────────────┘
                       │ (only when group_id present)

┌─────────────────────────────────────────────────────┐
│  Evaluation                                          │
│  → output/evaluation.png                             │
│  → output/evaluation.json                            │
└─────────────────────────────────────────────────────┘

Quick Start

pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
python src/pipeline.py

The pipeline reads from data/products_benchmark.csv and writes to output/ (created automatically).
On repeat runs, both caches are loaded — zero API calls unless new names appear.


File Structure

├── data/
│   └── products_benchmark.csv   # 432 products | 14 categories | group_id as GT
├── src/
│   └── pipeline.py              # full pipeline (single file)
├── cache/
│   ├── embeddings.json          # embedding cache (auto-created)
│   └── normalizations.json      # Hebrew→English translation cache (auto-created)
├── output/                      # auto-created
│   ├── deduplicated_products.csv
│   ├── evaluation.png
│   └── evaluation.json
├── requirements.txt
└── README.md

Input Format

id,name,price,category[,group_id]
1,Samsung Galaxy S24 Ultra 256GB,4999,smartphones,SM-001
2,סמסונג גלקסי S24 אולטרה 256GB,4799,smartphones,SM-001
3,Samsung Galaxy S24 Ultra 512GB,5299,smartphones,SM-002

group_id is optional — when present the pipeline runs evaluation automatically at the end.

Key invariant: every variant within a group_id must refer to the exact same product including storage/capacity/screen size. Different specs = different group_id.


Output Format

CSV (output/deduplicated_products.csv) — flat table, one row per unique product:

ColumnDescription
canonical_nameNormalized English product name
categoryProduct category
min_priceLowest price found by the pipeline
gt_min_priceGround-truth minimum price (only when group_id is present)
best_deal_idID of the listing with the lowest price
best_deal_nameOriginal name of the best-deal listing
duplicate_countNumber of listings merged into this product
all_variantsAll listings, pipe-separated with prices

Evaluation Results

The pipeline is benchmarked on 432 product listings across 14 categories (129 unique products, each with 3–4 naming variants in Hebrew, English, abbreviated, and ALL-CAPS forms).

Evaluation Report

Reading the Chart

Left — Metrics bar chart
Nine metrics grouped by method. Color bands show which group each metric belongs to:

  • Blue — Pair-based (Precision / Recall / F1): The industry standard for deduplication. The unit of measurement is a pair of products. Unaffected by the large number of true-negative pairs.
  • Green — Cluster-level (Purity / Coverage): Purity > 1.0 is a known artifact of the formula when products are perfectly split; Coverage measures how many true duplicates were successfully grouped.
  • Orange — B-Cubed (Precision / Recall / F1): Computed per entity rather than per pair, so large clusters don't dominate the score.
  • Purple — Adjusted Rand Index: A single global partition-similarity score, corrected for chance. 1.0 = perfect, 0.0 = random.

Middle — Pair Breakdown
Raw counts of True Positives (correct merges), False Positives (wrong merges), and False Negatives (missed duplicates). The dominant error type is FN — the pipeline occasionally over-splits conservative clusters.

Right — Price Quality

  • Price Accuracy: fraction of deduplicated groups where the pipeline found the true minimum price.
  • Avg Overcharge: mean relative overpayment when the minimum was missed (kept near 0 by the LLM stage).

Current Numbers

MetricScore
Pair Precision1.000
Pair Recall0.914
Pair F10.955
Cluster Purity1.002
Cluster Coverage0.954
B-Cubed F10.968
Adjusted Rand Index0.953
Price Accuracy95.0%
Avg Overcharge0.4%

Key Parameters

ConstantDefaultEffect
SIMILARITY_THRESHOLD0.72Cosine similarity cutoff for Union-Find. Lower = wider candidate net; LLM handles more false positives.
TOP_K10FAISS neighbors checked per product. Raise for denser catalogs.

Scalability Notes

  • llm_refine makes one API call per cluster. Use the OpenAI Batch API at scale.
  • Both caches (embeddings.json, normalizations.json) make incremental runs free — only new/changed names hit the API.