Welcome to the scripts used to extract text features! 😎

March 19, 2024

Quick Start 🚀

1. Download the datasets from CSTAG.

cd ../data/
sudo apt-get update && sudo apt-get install git-lfs && git clone https://huggingface.co/datasets/Sherirto/CSTAG
cd CSTAG && ls 

You should now see the Arxiv, Children, CitationV8, Computers, Fitness, Goodreads, History, and Photo datasets under the "data/CSTAG" folder.

2. Extract features for the datasets you care about with a pretrained language model (PLM) from Hugging Face. 👋

# Ensure that you are in data/CSTAG/
cd ../../FeatureExtractor/

Extract features with LM4Feature.py:

CUDA_VISIBLE_DEVICES=0 python LM4Feature.py --csv_file 'data/CSTAG/Arxiv/Arxiv.csv' --model_name 'bert-base-uncased' --name 'Arxiv' --path 'data/CSTAG/Arxiv/Feature/' --max_length 512 --batch_size 1000 --cls

If you have multiple GPUs, you can simply execute this code in parallel.

CUDA_VISIBLE_DEVICES=0,1,2,3 python LM4Feature.py --csv_file 'data/CSTAG/Arxiv/Arxiv.csv' --model_name 'bert-base-uncased' --name 'Arxiv' --path 'data/CSTAG/Arxiv/Feature/' --max_length 512 --batch_size 1000 --cls
cd ../data/CSTAG/Arxiv/Feature/ && ls

If you follow the example above, you should see a feature file named "Arxiv_bert_base_uncased_512_cls.npy". Here, 'Arxiv' comes from --name 'Arxiv', '512' comes from --max_length, and 'cls' is the default text representation.
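The naming pattern described above can be sketched as a small helper. This is an illustration only, not code from LM4Feature.py: the function name `feature_filename` is hypothetical, and the rule that hyphens in the model name become underscores is an assumption inferred from the single example 'bert-base-uncased' -> 'bert_base_uncased'. Model names containing '/' are not covered.

```python
def feature_filename(name: str, model_name: str, max_length: int, pooling: str = "cls") -> str:
    """Build a feature file name following the pattern shown in this README.

    Assumption: hyphens in the model name are replaced by underscores,
    inferred only from 'bert-base-uncased' -> 'bert_base_uncased'.
    """
    safe_model = model_name.replace("-", "_")
    return f"{name}_{safe_model}_{max_length}_{pooling}.npy"


# Reproduce the two file names mentioned in this README.
cls_name = feature_filename("Arxiv", "bert-base-uncased", 512, "cls")
mean_name = feature_filename("Arxiv", "bert-base-uncased", 512, "mean")
print(cls_name)
print(mean_name)
```

A helper like this can be handy for checking whether a feature file already exists before launching a long extraction run.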

3. Other ways of representing text. 🤗

The [CLS] token of the last hidden layer is commonly used as the global representation of a sentence or document. Alternatively, mean pooling (the --mean option) can be used to obtain the textual representation.

# Ensure that you are in FeatureExtractor/
CUDA_VISIBLE_DEVICES=0 python LM4Feature.py --csv_file 'data/CSTAG/Arxiv/Arxiv.csv' --model_name 'bert-base-uncased' --name 'Arxiv' --path 'data/CSTAG/Arxiv/Feature/' --max_length 512 --batch_size 500 --mean
cd ../data/CSTAG/Arxiv/Feature/ && ls

You should then see the feature file named "Arxiv_bert_base_uncased_512_mean.npy".

On the link prediction task, we find that mean pooling may lead to better results. For some generative LLMs, such as LlamaV2 and Mixture, mean pooling is also a more reasonable way to obtain the textual representation.
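To make the difference between the two representations concrete, here is a minimal NumPy sketch. It is not the code used in LM4Feature.py: the function names are hypothetical, and it assumes the model output is an array of shape (batch, sequence length, hidden size) accompanied by a 0/1 attention mask, as is typical for Hugging Face encoders.

```python
import numpy as np


def cls_pooling(last_hidden_state: np.ndarray) -> np.ndarray:
    """Take the first token's vector ([CLS] for BERT-style models) per sequence."""
    return last_hidden_state[:, 0, :]


def mean_pooling(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of each sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array(
    [
        [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]],  # third position is padding
        [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
    ]
)
mask = np.array([[1, 1, 0], [1, 1, 1]])

cls_vectors = cls_pooling(hidden)
mean_vectors = mean_pooling(hidden, mask)
print(cls_vectors)
print(mean_vectors)
```

The mask matters for mean pooling: without it, padded positions would be averaged in and would distort the representation of short texts.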

4. You can directly use the feature files we provide. 🔥

Balancing performance against file size, we provide node representations for each dataset, encoded with RoBERTa-Base.

We will also provide textual representations obtained from large models such as LlamaV2 13B and Mixture 7B for research.
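A feature file, whether provided or extracted with the commands above, can be loaded with NumPy. The sketch below assumes each .npy file holds a 2-D array with one row per node; that layout is an assumption, not something this README states. The helper name `load_features` and the stand-in file "Demo_features.npy" are hypothetical and exist only so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_features(path, expected_nodes=None):
    """Load a .npy feature matrix and run basic sanity checks.

    Assumption: the file stores a 2-D array of shape (num_nodes, feature_dim).
    """
    feats = np.load(path)
    if feats.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {feats.shape}")
    if expected_nodes is not None and feats.shape[0] != expected_nodes:
        raise ValueError(f"expected {expected_nodes} rows, got {feats.shape[0]}")
    return feats


# Demo with a small stand-in file; replace demo_path with a real feature file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Demo_features.npy")
    np.save(demo_path, np.zeros((4, 8), dtype=np.float32))
    demo = load_features(demo_path, expected_nodes=4)

print(demo.shape, demo.dtype)
```

Checking the row count against the number of nodes in the dataset is a cheap way to catch a mismatch between a feature file and the graph it is meant to accompany.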