GCG - ACM MM'24

July 22, 2024

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [ACM MM'24]. This is the official implementation of the [Paper], accepted at ACM MM'24.

Install

  1. Clone this repository and navigate to the GCG folder
git clone https://github.com/WHB139426/GCG.git
cd GCG
mkdir experiments
mkdir files
  2. Install the required packages
conda create -n gcg python=3.9.16
conda activate gcg
pip install -r requirements.txt

Pretrained Weights of InstructBLIP

You can prepare the pretrained weights of InstructBLIP-T5-XL according to [InstructBLIP].

Because we have changed the structure of the model code, we recommend downloading the pretrained weights of EVA-CLIP and QFormer directly from [🤗HF]. The pretrained weights should be organized as follows:

├── GCG
│   └── experiments
│       ├── eva_vit_g.pth
│       ├── qformer_t5.pth
│       ├── query_tokens_t5.pth
│       ├── llm_proj_t5.pth
│       ├── eva_vit_post_layernorm.pth
│       ├── eva_clip_text_model.pth
│       ├── eva_clip_last_vision_head.pth
│       ├── eva_clip_last_vision_norm.pth
│       └── eva_clip_last_vision_block.pth
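Before training, it can help to confirm that every weight file from the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_weights` is hypothetical, and the default path assumes the script is run from inside the GCG folder.

```python
from pathlib import Path

# File names copied from the layout listed above.
EXPECTED_WEIGHTS = [
    "eva_vit_g.pth",
    "qformer_t5.pth",
    "query_tokens_t5.pth",
    "llm_proj_t5.pth",
    "eva_vit_post_layernorm.pth",
    "eva_clip_text_model.pth",
    "eva_clip_last_vision_head.pth",
    "eva_clip_last_vision_norm.pth",
    "eva_clip_last_vision_block.pth",
]


def missing_weights(experiments_dir="experiments"):
    """Return the expected weight files that are absent from experiments_dir.

    Hypothetical helper; the default path assumes the current directory is GCG.
    """
    root = Path(experiments_dir)
    return [name for name in EXPECTED_WEIGHTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    print("all weights found" if not missing else f"missing: {missing}")
```

An empty list means all nine files were found; otherwise the list names the files that still need to be downloaded.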

Datasets

Download the NExT-QA videos from https://github.com/doc-doc/NExT-QA?tab=readme-ov-file or directly via the link [videos]. The downloaded videos should be placed in the folder nextqa/videos.

We provide the annotation files in [🤗HF]. Organize the data as follows:

├── nextqa
│   ├── annotations_mc
│   ├── frames_32
│   ├── videos
│   ├── vision_features
│   └── map_vid_vidorID.json
├── GCG
│   ├── datasets
│   ├── models
│   └── ...
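The nextqa sub-folders from the layout above can be created ahead of time so that the downloads and the extraction steps have a place to write. This is a small sketch under assumptions: the helper `prepare_nextqa_layout` is hypothetical, and the repository's own scripts may already create these folders themselves.

```python
from pathlib import Path

# Sub-folders of nextqa/ named in the layout above.
NEXTQA_SUBDIRS = ["annotations_mc", "frames_32", "videos", "vision_features"]


def prepare_nextqa_layout(root):
    """Create the nextqa sub-folders if needed.

    Returns True when map_vid_vidorID.json is already present under root.
    Hypothetical helper, not part of the repository.
    """
    root = Path(root)
    for name in NEXTQA_SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return (root / "map_vid_vidorID.json").is_file()
```

Since the layout shows nextqa next to GCG, a call such as `prepare_nextqa_layout("../nextqa")` from inside the GCG folder would match it, assuming that relative placement holds on your machine.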

Then, extract 32 frames per video into the nextqa/frames_32 folder with the Python script:

python utils/extract_frames.py
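For intuition, one common way to pick a fixed number of frames from a video is to take the midpoint of equal-length segments. The sketch below illustrates that idea only; it is an assumption, and utils/extract_frames.py may sample frames differently. The function name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(total_frames, num_samples=32):
    """Pick num_samples frame indices spread evenly over total_frames frames.

    Illustrative only; the repository's extraction script may use another rule.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Split the video into num_samples equal segments and take each midpoint.
    step = total_frames / num_samples
    return [min(total_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


if __name__ == "__main__":
    print(uniform_frame_indices(320)[:4])
```

For a video shorter than 32 frames, this rule repeats some indices so that the output still has 32 entries.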

After that, extract the video features in advance into the nextqa/vision_features folder with the Python script:

python utils/extract_features.py

Training

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1111 finetune_ans.py
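The command above lists eight GPUs and starts eight processes, so the device list and `--nproc_per_node` move together. The sketch below builds the same command for an arbitrary GPU list. It is an illustration under assumptions: `build_launch_command` is a hypothetical helper, and the source does not state whether finetune_ans.py behaves identically with fewer GPUs.

```python
def build_launch_command(gpu_ids, script="finetune_ans.py", master_port=1111):
    """Return (env, argv) that launch one process per listed GPU.

    Hypothetical helper mirroring the training command shown above.
    """
    gpu_ids = list(gpu_ids)
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    # Expose only the chosen GPUs to the launched processes.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",
        f"--master_port={master_port}",
        script,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch_command(range(8))
    print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

To actually start training from Python, the result can be passed to `subprocess.run(argv, env={**os.environ, **env})`.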