Visual Grounding with Transformers

May 27, 2022

Overview

This repository includes the PyTorch implementation and trained models of VGTR (Visual Grounding with TRansformers).

[arXiv]

In this paper, we propose a transformer-based approach to visual grounding. Unlike existing proposal-and-rank frameworks, which rely heavily on pretrained object detectors, or proposal-free frameworks, which upgrade an off-the-shelf one-stage detector by fusing textual embeddings, our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detector or word embedding model. Termed VGTR (Visual Grounding with TRansformers), our approach is designed to learn semantic-discriminative visual features under the guidance of the textual description without harming their localization ability. This information flow gives VGTR a strong capability for capturing context-level semantics of both the vision and language modalities, enabling it to aggregate the accurate visual clues implied by the description and locate the object instance of interest. Experiments show that our method outperforms state-of-the-art proposal-free approaches by a considerable margin on four benchmarks.

Prerequisites

  • python 3.6
  • pytorch>=1.6.0
  • torchvision
  • CUDA>=9.0
  • others (opencv-python etc.)
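
The requirements above can be checked before training with a short script. This is a minimal sketch, not part of the repository: the helper names `parse_version` and `check_environment` are hypothetical, the Python check treats 3.6 as a minimum (the list only names 3.6), and `cv2` is assumed to be the import name provided by opencv-python.

```python
import importlib
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.6.0+cu101' into a tuple like (1, 6, 0)."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def check_environment():
    """Map each requirement to True/False (None when it cannot be determined)."""
    report = {"python>=3.6": sys.version_info[:2] >= (3, 6)}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        report["pytorch>=1.6.0"] = False
        report["CUDA>=9.0"] = None
    else:
        report["pytorch>=1.6.0"] = parse_version(torch.__version__) >= (1, 6, 0)
        cuda = torch.version.cuda  # None for CPU-only builds of PyTorch
        report["CUDA>=9.0"] = cuda is not None and parse_version(cuda) >= (9, 0)
    for module in ("torchvision", "cv2"):
        try:
            importlib.import_module(module)
            report[module] = True
        except ImportError:
            report[module] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit installed on the machine.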

Preparation

  1. Clone this repository.

  2. Data preparation.

    Download Flickr30K Entities from Flickr30k Entities (bryanplummer.com) and Flickr30K.

    Download the MSCOCO images from MSCOCO.

    Download the processed indexes from Gdrive (processed by zyang-ur).

  3. Download the backbone weights. We use ResNet-50/101 as the basic visual encoder. The weights are pretrained on MSCOCO and can be downloaded here (BaiduDrive):

    ResNet-50 (code: ru8v); ResNet-101 (code: 0hgu).

  4. Organize all files like this:

.
├── main.py
├── store
│   ├── data
│   │   ├── flickr
│   │   │   ├── corpus.pth
│   │   │   └── flickr_train.pth
│   │   ├── gref
│   │   └── gref_umd
│   ├── ln_data
│   │   ├── Flickr30k
│   │   │   └── flickr30k-images
│   │   └── other
│   │       └── images
│   ├── pretrained
│   │   └── flickr_R50.pth.tar
│   └── pth
│       └── resnet50_detr.pth
└── work
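
Before training, it can help to confirm that the files are where the tree expects them. The following is a minimal sketch: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the nesting (`data`, `ln_data`, `pretrained`, and `pth` all under `store`) is read from the tree above. The checkpoint and backbone weight files are left out because they depend on which model you downloaded.

```python
from pathlib import Path

# Paths taken from the directory tree, relative to the repository root.
EXPECTED_PATHS = [
    "main.py",
    "store/data/flickr/corpus.pth",
    "store/data/flickr/flickr_train.pth",
    "store/data/gref",
    "store/data/gref_umd",
    "store/ln_data/Flickr30k/flickr30k-images",
    "store/ln_data/other/images",
    "store/pretrained",
    "store/pth",
    "work",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:")
        for path in missing:
            print("  " + path)
    else:
        print("Layout looks complete.")
```

Run it from the repository root; an empty result means every listed path was found.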

Model Zoo

| Dataset | Backbone | Accuracy | Pretrained Model (BaiduDrive) |
|---|---|---|---|
| Flickr30K Entities | ResNet-50 | 74.17 | flickr_R50.pth.tar (code: rpdr) |
| Flickr30K Entities | ResNet-101 | 75.32 | flickr_R101.pth.tar (code: 1igb) |
| RefCOCO | ResNet-50 | 78.70 82.09 73.31 | refcoco_R50.pth.tar (code: xjs8) |
| RefCOCO | ResNet-101 | 79.30 82.16 74.38 | refcoco_R101.pth.tar (code: bv0z) |
| RefCOCO+ | ResNet-50 | 63.57 69.65 55.33 | refcoco+_R50.pth.tar (code: 521n) |
| RefCOCO+ | ResNet-101 | 64.40 70.85 55.84 | refcoco+_R101.pth.tar (code: vzld) |
| RefCOCOg | ResNet-50 | 62.88 | refcocog_R50.pth.tar (code: wb3x) |
| RefCOCOg | ResNet-101 | 64.05 | refcocog_R101.pth.tar (code: 5ok2) |
| RefCOCOg-umd | ResNet-50 | 65.62 65.30 | umd_R50.pth.tar (code: 9lzr) |
| RefCOCOg-umd | ResNet-101 | 66.83 67.28 | umd_R101.pth.tar (code: zen0) |
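
This README does not define how accuracy is computed. As an assumption, the sketch below follows the convention common in visual grounding work: a prediction counts as correct when its intersection over union (IoU) with the ground-truth box reaches 0.5. The function names, the `(x1, y1, x2, y2)` box format, and the choice of `>=` rather than `>` at the threshold are illustrative and are not taken from the repository.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth is at least `threshold`."""
    if not predicted:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(predicted)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(grounding_accuracy(preds, truth))  # one exact match, one miss
```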

Train

# Set $dataset to refcoco, refcoco+, or another supported dataset, and $backbone to resnet50 or resnet101.
python main.py \
   --gpu $gpu_id \
   --dataset $dataset \
   --batch_size $bs \
   --savename $exp_name \
   --backbone $backbone \
   --cnn_path $resnet_coco_weight_path
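
When launching several runs, it may be convenient to assemble the training command in Python. This is a minimal sketch that only uses the flags shown above. `build_train_command` is a hypothetical helper, and the GPU id, batch size, and experiment name in the example are placeholder values, not recommended settings.

```python
import shlex

BACKBONES = ("resnet50", "resnet101")  # choices listed in the training command


def build_train_command(gpu_id, dataset, batch_size, savename, backbone, cnn_path):
    """Return the argument list for one training run of main.py."""
    if backbone not in BACKBONES:
        raise ValueError(f"backbone must be one of {BACKBONES}, got {backbone!r}")
    return [
        "python", "main.py",
        "--gpu", str(gpu_id),
        "--dataset", dataset,
        "--batch_size", str(batch_size),
        "--savename", savename,
        "--backbone", backbone,
        "--cnn_path", cnn_path,
    ]


if __name__ == "__main__":
    cmd = build_train_command(
        gpu_id=0,                     # placeholder value
        dataset="refcoco",
        batch_size=32,                # placeholder value
        savename="refcoco_R50_run1",  # hypothetical experiment name
        backbone="resnet50",
        cnn_path="./store/pth/resnet50_detr.pth",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with dataset names such as `refcoco+`.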

Inference

Download the pretrained models and put them into the folder ./store/pretrained/.

# Set $dataset to refcoco, refcoco+, or another supported dataset.
python main.py \
   --test \
   --gpu $gpu_id \
   --dataset $dataset \
   --batch_size $bs \
   --pretrain $pretrained_weight_path
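
The checkpoint names in the Model Zoo follow a `<prefix>_<R50|R101>.pth.tar` pattern, so the value for `--pretrain` can be derived rather than typed by hand. The sketch below is illustrative: `checkpoint_path` and `build_test_command` are hypothetical helpers, and the prefixes are the file-name prefixes from the Model Zoo, which are not necessarily the values that `--dataset` accepts.

```python
from pathlib import Path

PRETRAINED_DIR = Path("./store/pretrained")

# File-name prefixes and backbone tags as they appear in the Model Zoo.
PREFIXES = ("flickr", "refcoco", "refcoco+", "refcocog", "umd")
BACKBONE_TAGS = {"resnet50": "R50", "resnet101": "R101"}


def checkpoint_path(prefix, backbone, root=PRETRAINED_DIR):
    """Return the expected location of a Model Zoo checkpoint."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown checkpoint prefix: {prefix!r}")
    if backbone not in BACKBONE_TAGS:
        raise ValueError(f"unknown backbone: {backbone!r}")
    return Path(root) / f"{prefix}_{BACKBONE_TAGS[backbone]}.pth.tar"


def build_test_command(gpu_id, dataset, batch_size, pretrain):
    """Return the argument list for one evaluation run of main.py."""
    return [
        "python", "main.py",
        "--test",
        "--gpu", str(gpu_id),
        "--dataset", dataset,
        "--batch_size", str(batch_size),
        "--pretrain", str(pretrain),
    ]


if __name__ == "__main__":
    weights = checkpoint_path("refcoco", "resnet50")
    if not weights.exists():
        print(f"Checkpoint not found: {weights}")
    # GPU id and batch size below are placeholder values.
    print(build_test_command(0, "refcoco", 32, weights))
```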

Acknowledgements

Parts of the code are from:

  1. facebookresearch/detr
  2. zyang-ur/onestage_grounding
  3. andfoy/refer
  4. jadore801120/attention-is-all-you-need-pytorch

Citation

@article{du2021visual,
  title={Visual grounding with transformers},
  author={Du, Ye and Fu, Zehua and Liu, Qingjie and Wang, Yunhong},
  journal={arXiv preprint arXiv:2105.04281},
  year={2021}
}

@inproceedings{du2022visual,
  title={Visual grounding with transformers},
  author={Du, Ye and Fu, Zehua and Liu, Qingjie and Wang, Yunhong},
  booktitle={Proceedings of the International Conference on Multimedia and Expo},
  year={2022}
}