README.md

April 28, 2026

InstanceVG: Improving Generalized Visual Grounding with Instance-aware Joint Learning

Ming Dai1, Wenxuan Cheng1, Jiang-Jiang Liu2, Lingfeng Yang4, Zhenhua Feng3, Wankou Yang1*, Jingdong Wang2

1Southeast University, 2Baidu VIS, 3Jiangnan University, 4Nanjing University of Science and Technology


📢 News

  • [2025.10.11] Code, models, and datasets are now released! 🎉

🧩 Abstract

Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical paradigm by accommodating multi-target and non-target scenarios. While GREC focuses on coarse-level bounding box localization, GRES aims for fine-grained pixel-level segmentation.

Existing approaches typically treat these tasks independently, ignoring the potential benefits of joint learning and cross-granularity consistency. Moreover, most treat GRES as plain semantic segmentation and lack instance-aware reasoning between boxes and masks.

We propose InstanceVG, a multi-task generalized visual grounding framework that unifies GREC and GRES via instance-aware joint learning. InstanceVG introduces instance queries with prior reference points to ensure consistent prediction of points, boxes, and masks across granularities.

To our knowledge, InstanceVG is the first framework to jointly tackle both GREC and GRES while integrating instance-aware consistency learning. Extensive experiments on 10 datasets across 4 tasks demonstrate that InstanceVG achieves state-of-the-art performance, substantially surpassing existing methods across various evaluation metrics.


🏗️ Framework Overview


⚙️ Installation

Environment requirements

CUDA == 11.8
torch == 2.0.0
torchvision == 0.15.1
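
To confirm that the active environment matches these pins, the installed versions can be compared against them. The sketch below is illustrative only: `REQUIRED`, `find_mismatches`, and `installed_versions` are hypothetical names, not part of the InstanceVG codebase. It assumes that local build suffixes such as `+cu118` can be ignored when comparing.

```python
# Hypothetical helper (not part of InstanceVG): compare installed versions
# against the pins listed above.
REQUIRED = {"cuda": "11.8", "torch": "2.0.0", "torchvision": "0.15.1"}


def find_mismatches(installed, required=REQUIRED):
    """Return one message per package whose version differs from the pin."""
    problems = []
    for name, wanted in required.items():
        found = installed.get(name)
        # Assumption: a local build tag such as "2.0.0+cu118" counts as "2.0.0".
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems


def installed_versions():
    """Collect versions from the live environment (requires PyTorch)."""
    import torch
    import torchvision

    return {
        "cuda": torch.version.cuda,  # None on CPU-only builds
        "torch": torch.__version__,
        "torchvision": torchvision.__version__,
    }
```

Calling `find_mismatches(installed_versions())` in the target environment returns an empty list when all three pins are met.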

1. Install dependencies

pip install -r requirements.txt

InstanceVG depends on components from detrex and detectron2.

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/IDEA-Research/detrex.git
cd detrex
git submodule init && git submodule update
pip install -e .

Finally, install InstanceVG in editable mode:

pip install -e .

🧮 Data Preparation

Prepare the MS-COCO dataset, then download the referring and foreground annotations from HF-Data.

Expected directory structure:

data/
└── seqtr_type/
    ├── annotations/
    │   ├── mixed-seg/
    │   │   └── instances_nogoogle_withid.json
    │   ├── grefs/instance.json
    │   ├── ref-zom/instance.json
    │   └── rrefcoco/instance.json
    └── images/
        └── mscoco/
            └── train2014/
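
Before launching a run, it can help to check that the layout above is actually in place. The following is a minimal sketch, not a script shipped with the repository. `EXPECTED` and `missing_entries` are hypothetical names, and the default root `data/seqtr_type` is taken from the tree above.

```python
from pathlib import Path

# Relative paths copied from the expected directory structure above.
EXPECTED = [
    "annotations/mixed-seg/instances_nogoogle_withid.json",
    "annotations/grefs/instance.json",
    "annotations/ref-zom/instance.json",
    "annotations/rrefcoco/instance.json",
    "images/mscoco/train2014",
]


def missing_entries(root="data/seqtr_type", expected=EXPECTED):
    """Return the expected files or directories that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entries():
        print(f"missing: {rel}")
```

The check only tests for existence. It does not validate the contents of the annotation files or the images.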

🧠 Pretrained Weights

InstanceVG uses BEiT-3 as both the backbone and multi-modal fusion module.

Download pretrained weights and tokenizer from BEiT-3's official repository.

mkdir pretrain_weights

Place the following files:

pretrain_weights/
├── beit3_base_patch16_224.zip
├── beit3_large_patch16_224.zip
└── beit3.spm

🚀 Demo

Example 1: GRES task

python tools/demo.py \
  --img "asserts/imgs/Figure_1.jpg" \
  --expression "three skateboard guys" \
  --config "configs/gres/InstanceVG-grefcoco.py" \
  --checkpoint /PATH/TO/InstanceVG-grefcoco.pth

Example 2: RIS task

python tools/demo.py \
  --img "asserts/imgs/Figure_2.jpg" \
  --expression "full half fruit" \
  --config "configs/refcoco/InstanceVG-refcoco.py" \
  --checkpoint /PATH/TO/InstanceVG-refcoco.pth

For additional options (e.g., thresholds, alternate checkpoints), see tools/demo.py.
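
To try several expressions on the same image, the demo call can be assembled programmatically. This is a sketch under stated assumptions: `demo_command` and `demo_commands` are hypothetical helpers, the second expression below is made up for illustration, and only the four flags shown in the examples above are used.

```python
import shlex
import sys


def demo_command(img, expression, config, checkpoint):
    """Build the argv list for one tools/demo.py call with the four flags shown above."""
    return [
        sys.executable, "tools/demo.py",
        "--img", img,
        "--expression", expression,
        "--config", config,
        "--checkpoint", checkpoint,
    ]


def demo_commands(img, expressions, config, checkpoint):
    """One command per expression, all sharing the same image, config, and checkpoint."""
    return [demo_command(img, e, config, checkpoint) for e in expressions]


if __name__ == "__main__":
    commands = demo_commands(
        "asserts/imgs/Figure_1.jpg",
        ["three skateboard guys", "the guy on the left"],  # second one is illustrative
        "configs/gres/InstanceVG-grefcoco.py",
        "/PATH/TO/InstanceVG-grefcoco.pth",
    )
    for cmd in commands:
        print(shlex.join(cmd))  # inspect first; run with subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with expressions that contain spaces or quotes.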


🧩 Training

To train InstanceVG from scratch:

bash tools/dist_train.sh [PATH_TO_CONFIG] [NUM_GPUS]

📊 Evaluation

To reproduce reported results:

bash tools/dist_test.sh [PATH_TO_CONFIG] [NUM_GPUS] \
  --load-from [PATH_TO_CHECKPOINT_FILE]

🏆 Model Zoo

All pretrained checkpoints are available on Model.

| Task / Train Set | Config | Checkpoint |
| --- | --- | --- |
| RefCOCO/+/g (Base) | configs/refcoco/InstanceVG-B-refcoco.py | InstanceVG-B-refcoco.pth |
| RefCOCO/+/g (Large) | configs/refcoco/InstanceVG-L-refcoco.py | InstanceVG-L-refcoco.pth |
| gRefCOCO | configs/gres/InstanceVG-grefcoco.py | InstanceVG-grefcoco.pth |
| Ref-ZOM | configs/refzom/InstanceVG-refzom.py | InstanceVG-refzom.pth |
| RRefCOCO | configs/rrefcoco/InstanceVG-rrefcoco.py | InstanceVG-rrefcoco.pth |

Example reproduction:

bash tools/dist_test.sh configs/refcoco/InstanceVG-B-refcoco.py 1 \
  --load-from work_dir/refcoco/InstanceVG-B-refcoco.pth
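
To evaluate every released checkpoint in one pass, the same call can be generated for each config and checkpoint pair listed in the Model Zoo. The sketch below is not part of the repository: `MODEL_ZOO` and `eval_commands` are hypothetical names, and it assumes all downloaded `.pth` files sit in one flat directory.

```python
from pathlib import Path

# Config paths and checkpoint file names as listed in the Model Zoo.
MODEL_ZOO = {
    "configs/refcoco/InstanceVG-B-refcoco.py": "InstanceVG-B-refcoco.pth",
    "configs/refcoco/InstanceVG-L-refcoco.py": "InstanceVG-L-refcoco.pth",
    "configs/gres/InstanceVG-grefcoco.py": "InstanceVG-grefcoco.pth",
    "configs/refzom/InstanceVG-refzom.py": "InstanceVG-refzom.pth",
    "configs/rrefcoco/InstanceVG-rrefcoco.py": "InstanceVG-rrefcoco.pth",
}


def eval_commands(checkpoint_dir, num_gpus=1, zoo=MODEL_ZOO):
    """Build one tools/dist_test.sh invocation per Model Zoo entry."""
    commands = []
    for config, ckpt_name in zoo.items():
        # Assumption: every checkpoint was downloaded into `checkpoint_dir`.
        ckpt = (Path(checkpoint_dir) / ckpt_name).as_posix()
        commands.append(
            ["bash", "tools/dist_test.sh", config, str(num_gpus), "--load-from", ckpt]
        )
    return commands


if __name__ == "__main__":
    for cmd in eval_commands("work_dir/checkpoints"):
        print(" ".join(cmd))  # run each with subprocess.run(cmd, check=True)
```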

📚 Citation

If you find our work useful, please cite:

@ARTICLE{instancevg,
  author={Dai, Ming and Cheng, Wenxuan and Liu, Jiang-Jiang and Yang, Lingfeng and Feng, Zhenhua and Yang, Wankou and Wang, Jingdong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Improving Generalized Visual Grounding with Instance-aware Joint Learning},
  year={2025},
  doi={10.1109/TPAMI.2025.3607387}
}

@article{dai2024simvg,
  title={SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-Modal Fusion},
  author={Dai, Ming and Yang, Lingfeng and Xu, Yihao and Feng, Zhenhua and Yang, Wankou},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={121670--121698},
  year={2024}
}

@inproceedings{dai2025multi,
  title={Multi-Task Visual Grounding with Coarse-to-Fine Consistency Constraints},
  author={Dai, Ming and Li, Jian and Zhuang, Jiedong and Zhang, Xian and Yang, Wankou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={3},
  pages={2618--2626},
  year={2025}
}

⭐ Acknowledgements

Our implementation builds upon

We thank these excellent open-source projects for their contributions to the community.