Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding
November 17, 2022
Introduction
We propose Word2Pix, a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word in the query sentence is given an equal opportunity to attend to visual pixels through multiple stacked transformer decoder layers, so the network is able to focus on grounding-critical words rather than on words that dominate the sentence representation. Refer to our paper for more details.
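To make the word-to-pixel idea concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention in which every word vector acts as its own query over the pixel features. It is an illustration only, not the implementation in this repository: the function names `softmax` and `word_to_pixel_attention` are hypothetical, there are no learned projections, and the real model stacks several transformer decoder layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_to_pixel_attention(word_feats, pixel_feats):
    """Single-head scaled dot-product cross-attention (illustrative only).

    word_feats:  list of N word vectors (queries), each of length d
    pixel_feats: list of M pixel vectors (used as keys and values), each of length d
    Returns (attended, weights): N vectors of length d and an N x M weight matrix.
    """
    d = len(word_feats[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for q in word_feats:
        # One score per pixel: scaled dot product between this word and the pixel.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in pixel_feats]
        w = softmax(scores)
        # Weighted sum of pixel features for this word.
        out = [sum(w[j] * pixel_feats[j][c] for j in range(len(pixel_feats)))
               for c in range(d)]
        weights.append(w)
        attended.append(out)
    return attended, weights


# Toy example: two words, three pixels, feature size 2.
words = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attended, weights = word_to_pixel_attention(words, pixels)
```

Because each word keeps its own row of attention weights, no single sentence-level vector decides which pixels matter; that is the property the paragraph above describes.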
Prerequisites
- Python 3.8
- PyTorch 1.8.0
- Torchvision 0.9.0
Installation and Data preparation
- Clone the repository
git clone https://github.com/azurerain7/Word2Pix
- Download the MSCOCO images (the RefCOCO/+/g images are subsets) and the RefCOCO/+/g annotations
- Download and extract the COCO 2017 train and val images from http://cocodataset.org. The default directory layout should be as follows:
./cocopth/
annotations/ # annotation json files
train2017/ # train images
val2017/ # val images
- Follow REFER to download the RefCOCO/+/g annotations, then follow training step 1 in MAttNet to extract the relevant grounding information. To skip this step, we provide preprocessed RefCOCO/+/g data for download through this link. The default directory layout should look like this:
./prepro/
refcoco_unc/
sentid2bert_feat/ # cached text feature folder
data.json # annotation json file
data.pkl # non-bert text feature(optional)
refcoco+_unc/
sentid2bert_feat/
data.json
data.pkl
refcocog_umd/
sentid2bert_feat/
data.json
data.pkl
- Run the script cache_text_feat.py for each dataset to cache the pretrained BERT features for all text queries (they will be saved under ./prepro/dataset_split/sentid2bert_feat/).
CUDA_VISIBLE_DEVICES=0 python3.8 cache_text_feat.py --dataset_split=refcoco_unc
CUDA_VISIBLE_DEVICES=0 python3.8 cache_text_feat.py --dataset_split=refcoco+_unc
CUDA_VISIBLE_DEVICES=0 python3.8 cache_text_feat.py --dataset_split=refcocog_umd
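Before training, it can help to confirm that the directory layout described above is in place. The following is a small sketch that does so; the helper `missing_paths` is hypothetical (it is not part of this repository), and it assumes the default `./cocopth` and `./prepro` roots shown above. The optional `data.pkl` files are not checked.

```python
from pathlib import Path

# Expected layout, taken from the directory trees above.
COCO_SUBDIRS = ["annotations", "train2017", "val2017"]
DATASET_SPLITS = ["refcoco_unc", "refcoco+_unc", "refcocog_umd"]


def missing_paths(coco_root="./cocopth", prepro_root="./prepro"):
    """Return the expected files and folders that do not exist yet."""
    coco_root, prepro_root = Path(coco_root), Path(prepro_root)
    expected = [coco_root / name for name in COCO_SUBDIRS]
    for split in DATASET_SPLITS:
        expected.append(prepro_root / split / "data.json")
        expected.append(prepro_root / split / "sentid2bert_feat")
    return [p for p in expected if not p.exists()]


if __name__ == "__main__":
    for path in missing_paths():
        print(f"missing: {path}")
```

Run from the repository root, the script prints one line per missing path and nothing when the layout is complete.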
Performance
We provide the pretrained model weights on GDRIVE; put them in the ./ckpt folder.
| Dataset | Split | ResNet-101 |
|---|---|---|
| RefCOCO | val | 81.12 |
| RefCOCO | testA | 84.39 |
| RefCOCO | testB | 78.12 |
| RefCOCO+ | val | 69.46 |
| RefCOCO+ | testA | 76.81 |
| RefCOCO+ | testB | 61.57 |
| RefCOCOg | val-umd | 70.81 |
| RefCOCOg | test-umd | 71.34 |
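This README does not state the metric behind these numbers. Visual grounding results on RefCOCO/+/g are commonly reported as the percentage of queries whose predicted box has IoU greater than 0.5 with the ground-truth box, so the sketch below assumes that convention. The function names and the `[x1, y1, x2, y2]` box format are assumptions for illustration, not taken from test_w2p.py.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2] (assumed format)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth exceeds threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)


# Toy example: the first prediction matches exactly, the second overlaps too little.
preds = [[0, 0, 2, 2], [0, 0, 2, 2]]
gts = [[0, 0, 2, 2], [1, 1, 3, 3]]
accuracy = grounding_accuracy(preds, gts)
```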
Training and Evaluation
- Training: check all related arguments and flags in the script.
CUDA_VISIBLE_DEVICES=0,1,2,3 python3.8 -m torch.distributed.launch --nproc_per_node=4 --use_env train_w2p.py
- Testing: check all related arguments and flags in the script.
CUDA_VISIBLE_DEVICES=0 python3.8 test_w2p.py
Citation
@ARTICLE{9806393,
author={Zhao, Heng and Zhou, Joey Tianyi and Ong, Yew-Soon},
journal={IEEE Transactions on Neural Networks and Learning Systems},
title={Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding},
year={2022},
volume={},
number={},
pages={1-11},
doi={10.1109/TNNLS.2022.3183827}
}
Credits
Our code is built on DETR, and parts of the code are from MAttNet.