Pretraining Language Models with Text-Attributed Heterogeneous Graphs
October 13, 2023 ยท View on GitHub
Data Preparation
Please download the datasets from DatasetsForTHLM , and put it into ./Data
Model Pretraining
Example of training THLM on Patents dataset:
python main.py --dataset_name Patents
Get node embeddings
Obtain node embeddings for Patents, GoodReads and OAG_Venue in ./Downstream/preprocess_data
Example of obtaining node embeddings for Patents:
python Patent_features.py
Model Evaluation
-
Link Prediction for OAG_Venue:
./Downstream/Link-Train-OAG -
Link Prediction for Patents/GoodReads:
./Downstream/Link-Train-Patent -
Node Classification for OAG_Venue:
./Downstream/train-OAG -
Node Classification for Patents/GoodReads:
./Downstream/train-Patent
Pre-trained Language Models
We also provide the pre-trained language models on these three datasets at HuggingFace.
Environments:
- PyTorch 2.0.0
- transformers 4.23.1
- dgl 0.9.1
- tqdm
- numpy