ResVG
November 19, 2024
This repository is the official PyTorch implementation of the paper ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding.
In this paper, we propose the Relation and Semantic-sensitive Visual Grounding (ResVG) model, which improves the understanding of relations and semantics when an image contains multiple instances of the same category. First, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from the text queries. This is achieved by using text-to-image generation models to produce images that represent the semantic attributes of the target objects described in the queries. Second, we address the lack of training samples with multiple distractors by introducing a relation-sensitive data augmentation method, which generates additional training data by synthesizing images that contain multiple objects of the same category, together with pseudo queries based on their spatial relationships. ResVG significantly improves the ability to comprehend both object semantics and spatial relations, leading to better performance on visual grounding tasks, particularly in scenarios with multiple-instance distractions.
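To make the idea of pseudo queries based on spatial relationships more concrete, here is a minimal sketch. It is not the augmentation code from this repository: the `Box` class, the `pseudo_queries` function, and the query templates are all hypothetical, and it only assumes that same-category objects can be ranked by the horizontal center of their boxes.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned box in pixel coordinates (x1, y1, x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def cx(self) -> float:
        # Horizontal center, used to rank instances from left to right.
        return (self.x1 + self.x2) / 2.0


def pseudo_queries(category: str, boxes: list[Box]) -> list[tuple[str, Box]]:
    """Pair each same-category box with a query naming its left-to-right rank."""
    if len(boxes) < 2:
        return []  # a single instance has no distractor to disambiguate from
    ordered = sorted(boxes, key=lambda b: b.cx)
    pairs = []
    for rank, box in enumerate(ordered):
        if rank == 0:
            phrase = f"the leftmost {category}"
        elif rank == len(ordered) - 1:
            phrase = f"the rightmost {category}"
        else:
            phrase = f"the {category} number {rank + 1} from the left"
        pairs.append((phrase, box))
    return pairs


if __name__ == "__main__":
    dogs = [Box(200, 0, 300, 100), Box(0, 0, 100, 100), Box(100, 0, 200, 100)]
    for phrase, box in pseudo_queries("dog", dogs):
        print(phrase, box)
```

Each returned pair is a (query, target box) training sample in which the other boxes act as distractors, which is the kind of multiple-instance case the paragraph above describes.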
Our paper was accepted by ACM-MM 2024.
[Paper] [Project Page] [Video]

Contents
Usage
Dependencies
- Python 3.9.10
- PyTorch 1.9.0 + cu111 + cp39
- Check requirements.txt for other dependencies.
Data Preparation
Download the images by following the TransVG instructions and place them in the ./ln_data folder.
The training samples can be downloaded from data. Once extracted, the ./data/ folder will have the following structure:
|-- data
    |-- flickr
    |-- gref
    |-- gref_umd
    |-- referit
    |-- unc
    |-- unc+
Pretrained Checkpoints
1. Download the DETR checkpoints from detr_checkpoints, move the archive to the pretrained_checkpoints directory, and extract it there.
mkdir -p pretrained_checkpoints
mv detr_checkpoints.tar.gz ./pretrained_checkpoints/
tar -zxvf ./pretrained_checkpoints/detr_checkpoints.tar.gz -C ./pretrained_checkpoints/
Training and Evaluation
- Training on RefCOCO.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --config configs/ResVG_R50_unc.py --test_split val
- Evaluation on RefCOCO.
python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py --config configs/ResVG_R50_unc.py --checkpoint ResVG_R50_unc.pth --batch_size_test 32 --test_split testA
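The evaluation command above covers a single split. A sketch like the following could generate the same command for several splits. The flags are copied from the command above; the `eval_command` helper is hypothetical, and the assumption that `test.py` accepts `val` and `testB` in the same way as `testA` comes from the split names in the results table, not from the script itself.

```python
import shlex


def eval_command(split, config="configs/ResVG_R50_unc.py",
                 checkpoint="ResVG_R50_unc.pth", nproc=4, batch_size=32):
    """Build the evaluation command shown above for one test split."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}", "--use_env", "test.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--batch_size_test", str(batch_size),
        "--test_split", split,
    ]


if __name__ == "__main__":
    # Assumed RefCOCO split names; adjust for other datasets.
    for split in ("val", "testA", "testB"):
        print(shlex.join(eval_command(split)))
```

The script only prints the commands, so they can be reviewed before being run in a shell or passed to `subprocess.run`.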
Results
| RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg g-val | RefCOCOg u-val | RefCOCOg u-test | ReferItGame test | Flickr30K test |
|---|---|---|---|---|---|---|---|---|---|---|
| 85.51 | 88.76 | 79.93 | 73.95 | 79.53 | 64.88 | 73.13 | 75.77 | 74.53 | 72.35 | 79.52 |
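This README does not state which metric the table reports. Visual grounding results are commonly given as the percentage of predictions whose box overlaps the ground truth with an IoU of at least 0.5. Assuming that convention applies here, the computation could be sketched as follows. The function names and the 0.5 threshold are assumptions and are not taken from this repository.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    if not gts:
        raise ValueError("at least one ground-truth box is required")
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(accuracy_at_iou(preds, gts))
```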