PEARL (Learning High-Quality and General-Purpose Phrase Representations)

February 29, 2024 · View on GitHub

| :book: paper | 🤗 PEARL-small | 🤗 PEARL-base | 🤗 PEARL-Benchmark | :floppy_disk: data |

PEARL is a framework for learning phrase-level representations.
If you need to compute semantic similarity between strings, the PEARL models may be a helpful tool.
They offer powerful embeddings for tasks such as string matching, entity retrieval, entity clustering, and fuzzy join.

| Model | Size | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 | 40.3 |
| Sentence-BERT | 110M | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 | 50.1 |
| Phrase-BERT | 110M | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 | 54.5 |
| E5-small | 34M | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 | 57.0 |
| E5-base | 110M | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 | 61.1 |
| PEARL-small | 34M | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 | 62.5 |
| PEARL-base | 110M | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 | 64.8 |

Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters stored as float16. Inference speed is measured in ms per 512 samples. The FastText model here is crawl-300d-2M-subword.bin.

| Model | Avg Score | Estimated Memory | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
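The estimated memory figures follow directly from the parameter counts: with float16 storage, each parameter occupies 2 bytes. A minimal sketch of the arithmetic:

```python
def estimated_memory_mb(num_params: int) -> float:
    """Approximate model size in MB when weights are stored as float16 (2 bytes each)."""
    return num_params * 2 / 1e6

# 34M parameters (PEARL-small) and 110M parameters (PEARL-base):
print(estimated_memory_mb(34_000_000))   # 68.0 MB
print(estimated_memory_mb(110_000_000))  # 220.0 MB
```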

Usage

Check out our models on Hugging Face: 🤗 PEARL-small 🤗 PEARL-base

```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]]
```
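Under the hood, `util.cos_sim` scores each candidate by the cosine similarity between embedding vectors. A minimal pure-Python sketch of that ranking step, using toy 3-dimensional vectors in place of real model embeddings (the vectors below are illustrative, not actual PEARL outputs):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for model.encode(...) output.
query = [0.8, 0.5, 0.1]
docs = {
    "NYTimes":       [0.7, 0.6, 0.2],
    "New York Post": [0.4, 0.8, 0.4],
    "New York":      [0.3, 0.9, 0.5],
}

# Rank candidates by similarity to the query, best match first.
ranked = sorted(docs, key=lambda name: cos_sim(query, docs[name]), reverse=True)
print(ranked)  # ['NYTimes', 'New York Post', 'New York']
```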

Evaluation

We evaluate phrase embeddings on a benchmark covering 9 datasets across 5 tasks: 🤗 PEARL-Benchmark

| | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|
| Task | Paraphrase Classification | Paraphrase Classification | Phrase Similarity | Phrase Similarity | Entity Retrieval | Entity Retrieval | Entity Clustering | Entity Clustering | Fuzzy Join |
| Samples | 23.4k | 15.5k | 2.2k | 3.4k | 10k | 10k | 5.0k | 9.7k | 50 subsets |
| Averaged Length | 2.5 | 2.0 | 1.2 | 1.7 | 3.3 | 4.1 | 1.5 | 1.4 | 3.8 |
| Metric | Acc | Acc | Acc | Pearson | Top-1 Acc | Top-1 Acc | NMI | NMI | Acc |
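As a reference point for the metrics above, BIRD is scored with the Pearson correlation between predicted similarities and human ratings. A minimal pure-Python implementation of that metric (an illustrative sketch, not the benchmark's own evaluation code):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Model cosine similarities vs. human ratings for a few hypothetical phrase pairs.
predicted = [0.91, 0.35, 0.62, 0.10]
gold      = [0.88, 0.40, 0.55, 0.05]
print(round(pearson(predicted, gold), 3))
```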

Use our script to evaluate your model on the PEARL benchmark:

```shell
python eval.py -batch_size 8
```

Training

Download all needed training files: :inbox_tray: Download Training Files
There are five files in total:

  • freq_phrase.txt has more than 3M phrases
  • phrase_with_etype.txt has the entity label for the Phrase Type Classification
  • token_aug.jsonl has token-level augmentations
  • phrase_aug.jsonl has phrase-level augmentations
  • hard_negative.txt has pre-defined hard negatives

Put the downloaded files into source/train_data.

```shell
python main.py -help
```

After completing the data preparation and environment setup, you can train the model via main.py:

```shell
python main.py -target_model intfloat/e5-small-v2 -dim 384
```

Citation

If you find our paper and code useful, please give us a citation :blush:

```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```