Bert Pretrained Token Embeddings

January 22, 2019

BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) yields pretrained token (i.e., subword) embeddings. Let's extract them and save them in the word2vec format so that they can be used for downstream tasks.

Requirements

  • pytorch_pretrained_bert
  • NumPy
  • tqdm

Extraction

  • Check extract.py.
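As a rough illustration of what such an extraction could look like (extract.py itself is not reproduced here, so this is a sketch rather than the repository's actual script): the code below assumes the token embeddings are the model's input wordpiece embedding matrix, read via `model.embeddings.word_embeddings.weight` in pytorch_pretrained_bert, and writes them in the word2vec text format (a `vocab_size dim` header followed by one `token v1 ... vN` line per token). The function names `write_word2vec_text` and `extract_bert_token_embeddings` are hypothetical.

```python
import io


def write_word2vec_text(tokens, vectors, out):
    """Write tokens and their vectors to `out` in the word2vec text format."""
    vectors = [list(v) for v in vectors]
    if len(tokens) != len(vectors):
        raise ValueError("tokens and vectors must have the same length")
    dim = len(vectors[0]) if vectors else 0
    out.write(f"{len(tokens)} {dim}\n")  # header line: vocab size and dimension
    for token, vec in zip(tokens, vectors):
        out.write(token + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def extract_bert_token_embeddings(model_name, path):
    """Hypothetical helper: dump a BERT model's token embeddings to `path`."""
    # Imported lazily so the writer above works without these packages.
    from pytorch_pretrained_bert import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    # Assumption: the input embedding matrix, shape (vocab_size, hidden_dim).
    weights = model.embeddings.word_embeddings.weight.detach().numpy()
    # tokenizer.vocab maps token -> id; sort by id so tokens line up with rows.
    tokens = [tok for tok, _ in sorted(tokenizer.vocab.items(), key=lambda kv: kv[1])]
    with open(path, "w", encoding="utf-8") as fout:
        write_word2vec_text(tokens, weights, fout)
```

A call such as `extract_bert_token_embeddings("bert-base-uncased", "bert-base-uncased.txt")` would download the model on first use; the output file name is only an example.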

Bert (Pretrained) Token Embeddings in word2vec format

Models                          # Vocab  # Dim  Notes
bert-base-uncased               30,522   768
bert-large-uncased              30,522   1024
bert-base-cased                 28,996   768
bert-large-cased                28,996   1024
bert-base-multilingual-cased    119,547  768    Recommended
bert-base-multilingual-uncased  30,522   768    Not recommended
bert-base-chinese               21,128   768

Example

  • Check example.ipynb to see how to load (sub-)word vectors with gensim and plot them in 2D space using t-SNE.

  • Tokens related to "look"

  • Tokens related to "##go"
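To make the idea of related tokens concrete, here is a minimal, dependency-free sketch. It assumes that "related" means nearest neighbours by cosine similarity, which is what gensim's `most_similar` computes; the notebook's exact steps are not shown here, and the helper names (`load_word2vec_text`, `cosine`, `related_tokens`) are hypothetical.

```python
import math


def load_word2vec_text(lines):
    """Parse word2vec text format: a 'count dim' header, then 'token v1 ... vN' rows."""
    rows = iter(lines)
    _count, dim = map(int, next(rows).split())
    vectors = {}
    for row in rows:
        parts = row.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_tokens(vectors, token, topn=10):
    """Return the `topn` tokens most similar to `token`, as (token, score) pairs."""
    query = vectors[token]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != token]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# With gensim installed, the equivalent query would be roughly:
#   from gensim.models import KeyedVectors
#   kv = KeyedVectors.load_word2vec_format("bert-base-uncased.txt")  # hypothetical file name
#   kv.most_similar("look", topn=10)
```

The pure-Python loop is fine for illustration but slow on a full vocabulary; for real use, the gensim call in the trailing comment is the more practical route.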