Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

November 17, 2022

Introduction

We propose Word2Pix, a one-stage visual grounding network built on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word in the query sentence is given an equal opportunity to attend to visual pixels through multiple stacked transformer decoder layers, so the network is able to focus on grounding-critical words rather than on words that dominate the sentence representation. Refer to our paper for more details.
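To make the word-to-pixel idea concrete, the following is a minimal, dependency-free sketch of one cross-attention step in which every word feature acts as a query over all pixel features. It is an illustration of the general mechanism only, not the code in this repository: it uses a single head, omits the learned query/key/value projections, and the function names and toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_to_pixel_attention(word_feats, pixel_feats):
    """Scaled dot-product cross-attention (hypothetical helper).

    word_feats:  list of word vectors (queries), one per query word
    pixel_feats: list of pixel vectors (keys and values), one per location
    Returns (attended, weights): one attended vector and one attention
    distribution over pixels for each word.
    """
    dim = len(word_feats[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for word in word_feats:
        # Similarity of this word to every pixel.
        scores = [scale * sum(w * p for w, p in zip(word, pixel))
                  for pixel in pixel_feats]
        attn = softmax(scores)
        # Weighted sum of pixel features for this word.
        out = [sum(a * pixel[d] for a, pixel in zip(attn, pixel_feats))
               for d in range(dim)]
        attended.append(out)
        weights.append(attn)
    return attended, weights


# Toy example: 2 words, 3 pixels, feature dimension 2.
words = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
attended, weights = word_to_pixel_attention(words, pixels)
```

Because each word keeps its own attention distribution over pixels, no word is collapsed into a single sentence vector before it interacts with the image.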

Prerequisites

Python 3.8
PyTorch 1.8.0
Torchvision 0.9.0

Installation and Data preparation

Clone the repository

git clone https://github.com/azurerain7/Word2Pix

Download the MSCOCO images (the RefCOCO/+/g images are a subset) and the RefCOCO/+/g annotations

Download and extract the COCO 2017 train and val images from http://cocodataset.org. The default directory layout should be as follows:

./cocopth/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Follow REFER to download the RefCOCO/+/g annotations, then follow training step 1 in MAttNet to extract the relevant grounding information. To skip this step, we provide preprocessed data for RefCOCO/+/g, which can be downloaded through this link. The default directory layout should look like this:

./prepro/
  refcoco_unc/  
    sentid2bert_feat/   # cached text feature folder
    data.json           # annotation json file
    data.pkl            # non-bert text feature(optional)
  refcoco+_unc/  
    sentid2bert_feat/   
    data.json           
    data.pkl            
  refcocog_umd/  
    sentid2bert_feat/   
    data.json           
    data.pkl
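Before going further, it can help to confirm that both directory trees match the layouts listed above. The sketch below is a hypothetical helper that is not part of the repository; it only reports which expected paths are missing. It treats data.pkl as optional, as noted in the layout, and assumes sentid2bert_feat/ is already present, which may not hold until the text features have been cached.

```python
from pathlib import Path

# Names taken from the directory layouts listed in this README.
COCO_LAYOUT = ["annotations", "train2017", "val2017"]
PREPRO_SPLITS = ["refcoco_unc", "refcoco+_unc", "refcocog_umd"]
# data.pkl is marked optional above, so it is not checked here.
PREPRO_REQUIRED = ["sentid2bert_feat", "data.json"]


def missing_paths(coco_root="./cocopth", prepro_root="./prepro"):
    """Return the expected paths that do not exist (hypothetical helper)."""
    expected = [Path(coco_root) / name for name in COCO_LAYOUT]
    for split in PREPRO_SPLITS:
        expected += [Path(prepro_root) / split / name
                     for name in PREPRO_REQUIRED]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    for path in missing_paths():
        print("missing:", path)
```

Running it from the repository root prints one line per missing path and prints nothing when the layout is complete.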

Run the script cache_text_feat.py for each dataset to cache the pretrained BERT features for all text queries (they will be saved under ./prepro/dataset_split/sentid2bert_feat/).

CUDA_VISIBLE_DEVICES=0 python3.8 cache_text_feat.py --dataset_split=refcoco_unc
CUDA_VISIBLE_DEVICES=0 python3.8 cache_text_feat.py --dataset_split=refcoco+_unc
CUDA_VISIBLE_DEVICES=0 python3.8 cache_text_feat.py --dataset_split=refcocog_umd
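The three commands above differ only in the split name, so they can also be driven from a short loop. The following is a convenience sketch, not a script shipped with the repository; the function names are hypothetical, and it defaults to a dry run that only prints the commands.

```python
import os
import subprocess

# The three dataset splits used in the commands above.
SPLITS = ["refcoco_unc", "refcoco+_unc", "refcocog_umd"]


def build_cache_commands(python="python3.8"):
    """Build one cache_text_feat.py invocation per split."""
    return [[python, "cache_text_feat.py", f"--dataset_split={split}"]
            for split in SPLITS]


def run_all(gpu="0", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for cmd in build_cache_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first split that fails.
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Passing the command as a list avoids shell quoting issues with the "+" in refcoco+_unc.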

Performance

We provide the pretrained model weights on GDRIVE; put them in the ./ckpt folder.

Dataset	Split	ResNet-101
RefCOCO	val	81.12
RefCOCO	testA	84.39
RefCOCO	testB	78.12
RefCOCO+	val	69.46
RefCOCO+	testA	76.81
RefCOCO+	testB	61.57
RefCOCOg	val-umd	70.81
RefCOCOg	test-umd	71.34
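This README does not state which metric the table above reports. Assuming it is the accuracy conventionally used for referring-expression grounding (a prediction counts as correct when its box overlaps the ground-truth box with IoU above 0.5), the computation can be sketched as follows. The function names and the (x1, y1, x2, y2) box format are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth
    exceeds the threshold (assumed metric, see the text above)."""
    hits = sum(box_iou(p, g) > threshold
               for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```

For example, a perfect box and a box shifted by half its width against identical ground truths give one hit out of two, since the shifted box has an IoU of 1/3.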

Training and Evaluation

Training: check all related arguments and flags in the script.

CUDA_VISIBLE_DEVICES=0,1,2,3 python3.8 -m torch.distributed.launch --nproc_per_node=4 --use_env train_w2p.py
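With --use_env, torch.distributed.launch passes each worker its local rank through the LOCAL_RANK environment variable instead of a --local_rank command-line argument, alongside RANK and WORLD_SIZE. The sketch below shows how a training script could read these values. It is an assumption about how such a script is typically written, not an excerpt from train_w2p.py, and the function name is hypothetical.

```python
import os


def read_distributed_env(environ=None):
    """Read the process layout exported by the launcher.

    Falls back to a single-process configuration when the script is
    started directly, without the launcher.
    """
    env = os.environ if environ is None else environ
    if "RANK" in env and "WORLD_SIZE" in env:
        return {
            "distributed": True,
            "rank": int(env["RANK"]),              # global process index
            "world_size": int(env["WORLD_SIZE"]),  # total process count
            "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this node
        }
    return {"distributed": False, "rank": 0, "world_size": 1, "local_rank": 0}
```

With the command above (--nproc_per_node=4 on one node), the four workers would see local ranks 0 through 3 and a world size of 4.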

Testing: check all related arguments and flags in the script.

CUDA_VISIBLE_DEVICES=0 python3.8 test_w2p.py

Citation

@ARTICLE{9806393, 
  author={Zhao, Heng and Zhou, Joey Tianyi and Ong, Yew-Soon},  
  journal={IEEE Transactions on Neural Networks and Learning Systems},   
  title={Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding},   
  year={2022},  
  volume={},  
  number={},  
  pages={1-11},  
  doi={10.1109/TNNLS.2022.3183827}
  }

Credits

Our code is built on DETR, and parts of the code are from MAttNet.