PEARL (Learning High-Quality and General-Purpose Phrase Representations)

February 29, 2024 · View on GitHub

| :book: paper | 🤗 PEARL-small | 🤗 PEARL-base | 🤗 PEARL-Benchmark | :floppy_disk: data |

PEARL is a framework for learning phrase-level representations.
If you need to compute semantic similarity between strings, the PEARL models may be a helpful tool.
They offer powerful embeddings for tasks such as string matching, entity retrieval, entity clustering, and fuzzy join.

| Model | Size | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 | 40.3 |
| Sentence-BERT | 110M | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 | 50.1 |
| Phrase-BERT | 110M | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 | 54.5 |
| E5-small | 34M | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 | 57.0 |
| E5-base | 110M | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 | 61.1 |
| PEARL-small | 34M | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 | 62.5 |
| PEARL-base | 110M | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 | 64.8 |

Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters stored as float16. Inference speed is measured in ms per 512 samples. The FastText model here is crawl-300d-2M-subword.bin.

| Model | Avg Score | Estimated Memory | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
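The estimated memory figures follow directly from the parameter counts: with float16 storage, each parameter occupies 2 bytes. A minimal sketch of the arithmetic:

```python
def estimated_memory_mb(num_params: int) -> float:
    """Approximate model size in MB when weights are stored as float16 (2 bytes each)."""
    return num_params * 2 / 1e6

# 34M parameters (PEARL-small) and 110M parameters (PEARL-base):
print(estimated_memory_mb(34_000_000))   # 68.0 MB
print(estimated_memory_mb(110_000_000))  # 220.0 MB
```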

Usage

Check out our models on Hugging Face: 🤗 PEARL-small 🤗 PEARL-base

```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]]
```
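Under the hood, `util.cos_sim` scores each candidate by the cosine similarity between embedding vectors. A minimal pure-Python sketch of that ranking step, using toy 3-dimensional vectors in place of real model embeddings (the vectors below are illustrative, not actual PEARL outputs):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for model.encode(...) output.
query = [0.8, 0.5, 0.1]
docs = {
    "NYTimes":       [0.7, 0.6, 0.2],
    "New York Post": [0.4, 0.8, 0.4],
    "New York":      [0.3, 0.9, 0.5],
}

# Rank candidates by similarity to the query, best match first.
ranked = sorted(docs, key=lambda name: cos_sim(query, docs[name]), reverse=True)
print(ranked)  # ['NYTimes', 'New York Post', 'New York']
```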

Evaluation

We evaluate phrase embeddings on a benchmark covering 9 datasets across 5 tasks: 🤗 PEARL-Benchmark

| | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|
| Task | Paraphrase Classification | Paraphrase Classification | Phrase Similarity | Phrase Similarity | Entity Retrieval | Entity Retrieval | Entity Clustering | Entity Clustering | Fuzzy Join |
| Samples | 23.4k | 15.5k | 2.2k | 3.4k | 10k | 10k | 5.0k | 9.7k | 50 subsets |
| Averaged Length | 2.5 | 2.0 | 1.2 | 1.7 | 3.3 | 4.1 | 1.5 | 1.4 | 3.8 |
| Metric | Acc | Acc | Acc | Pearson | Top-1 Acc | Top-1 Acc | NMI | NMI | Acc |
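As a reference point for the metrics above, BIRD is scored with the Pearson correlation between predicted similarities and human ratings. A minimal pure-Python implementation of that metric (an illustrative sketch, not the benchmark's own evaluation code):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Model cosine similarities vs. human ratings for a few hypothetical phrase pairs.
predicted = [0.91, 0.35, 0.62, 0.10]
gold      = [0.88, 0.40, 0.55, 0.05]
print(round(pearson(predicted, gold), 3))
```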

Use our script to evaluate your model on the PEARL benchmark:

```shell
python eval.py -batch_size 8
```

Training

Download all needed training files: :inbox_tray: Download Training Files
There are five files in total:

  • freq_phrase.txt has more than 3M phrases
  • phrase_with_etype.txt has the entity label for the Phrase Type Classification
  • token_aug.jsonl has token-level augmentations
  • phrase_aug.jsonl has phrase-level augmentations
  • hard_negative.txt has pre-defined hard negatives

Put the downloaded files into source/train_data.

```shell
python main.py -help
```

After completing the data preparation and environment setup, you can train the model via main.py:

```shell
python main.py -target_model intfloat/e5-small-v2 -dim 384
```

Citation

If you find our paper and code useful, please give us a citation :blush:

```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```