Progressive Spatio-temporal Perception for Audio-Visual Question Answering (ACMMM'23) [arXiv]

August 11, 2023

This repository contains the PyTorch code for PSTP-Net.

Requirements

python 3.6+
pytorch 1.6.0
tensorboardX
ffmpeg
numpy
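
Before extracting features or training, it can help to confirm that the requirements listed above are available. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the import names (`torch`, `tensorboardX`, `numpy`) are assumed to correspond to the packages in the list. It only checks presence, not the exact PyTorch version.

```python
import importlib.util
import shutil
import sys

# Import names assumed to match the requirements list above.
REQUIRED_MODULES = ["torch", "tensorboardX", "numpy"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 6)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Compare the running interpreter against the minimum version.
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    # find_spec returns None when a top-level package cannot be located.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append("missing Python package: " + name)
    # ffmpeg is an external binary, so look for it on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per missing requirement and nothing when all checks pass.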

Usage

Clone this repo

git clone https://github.com/GeWu-Lab/PSTP-Net.git

Download data

MUSIC-AVQA: https://gewu-lab.github.io/MUSIC-AVQA/

AVQA: http://mn.cs.tsinghua.edu.cn/avqa/

Feature extraction

cd feat_script/extract_clip_feat
python extract_patch-level_feat.py

Training

python main_train.py \
--temp_select True --segs 12 --top_k 2 \
--spat_select True --top_m 25 \
--a_guided_attn True \
--global_local True \
--batch-size 64 --epochs 30 --lr 1e-4 --gpu 0 \
--checkpoint PSTP_Net \
--model_save_dir models_pstp
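
The command above passes boolean options as the literal string `True`. With Python's `argparse`, an option declared with `type=bool` treats any non-empty string as true, so `--temp_select False` would still enable the option. How `main_train.py` declares these flags is not shown here, so check the script before switching a module off. The sketch below shows one common way to parse such flags explicitly. The helper `str2bool`, the function `build_parser`, and the defaults are hypothetical and only mirror a few flag names from the command.

```python
import argparse


def str2bool(value):
    """Convert common true/false spellings to a bool for argparse."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    # Hypothetical parser mirroring a few flags from the training command;
    # the real main_train.py may define them differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("--temp_select", type=str2bool, default=False)
    parser.add_argument("--segs", type=int, default=12)   # placeholder default
    parser.add_argument("--top_k", type=int, default=2)   # placeholder default
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--temp_select", "False", "--segs", "12"])
    print(args.temp_select, args.segs, args.top_k)
```

With this helper, `--temp_select False` parses to `False`, whereas the built-in `bool("False")` evaluates to `True`.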

Testing
```
python main_test.py
```

Citation

If you find this work useful, please consider citing it.

coming soon!

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.