April 28, 2026
InstanceVG: Improving Generalized Visual Grounding with Instance-aware Joint Learning
Ming Dai¹, Wenxuan Cheng¹, Jiang-Jiang Liu², Lingfeng Yang⁴, Zhenhua Feng³, Wankou Yang¹*, Jingdong Wang²
¹Southeast University · ²Baidu VIS · ³Jiangnan University · ⁴Nanjing University of Science and Technology
News
- [2025.10.11] Code, models, and datasets are now released!
Abstract
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical paradigm by accommodating multi-target and non-target scenarios. While GREC focuses on coarse-level bounding box localization, GRES aims for fine-grained pixel-level segmentation.
Existing approaches typically treat these tasks independently, ignoring the potential benefits of joint learning and cross-granularity consistency. Moreover, most treat GRES as mere semantic segmentation, lacking instance-aware reasoning between boxes and masks.
We propose InstanceVG, a multi-task generalized visual grounding framework that unifies GREC and GRES via instance-aware joint learning. InstanceVG introduces instance queries with prior reference points to ensure consistent prediction of points, boxes, and masks across granularities.
To our knowledge, InstanceVG is the first framework to jointly tackle both GREC and GRES while integrating instance-aware consistency learning. Extensive experiments on 10 datasets across 4 tasks demonstrate that InstanceVG achieves state-of-the-art performance, substantially surpassing existing methods across various evaluation metrics.
Framework Overview
Installation
Environment requirements
CUDA == 11.8
torch == 2.0.0
torchvision == 0.15.1
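The pinned versions above can be checked before installing the remaining dependencies. The following is a minimal sketch, not part of the InstanceVG codebase: the names `REQUIRED`, `installed_version`, and `check_environment` are hypothetical, and the check covers only the two Python packages (it does not inspect the CUDA toolkit).

```python
from importlib import metadata

# Pins copied from the environment requirements listed above.
REQUIRED = {"torch": "2.0.0", "torchvision": "0.15.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(required=REQUIRED, lookup=installed_version):
    """Compare installed versions against the pins and return a list of problems."""
    problems = []
    for package, wanted in required.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package} is not installed (need {wanted})")
        # Wheels often carry a local suffix such as "2.0.0+cu118"; ignore it.
        elif found.split("+")[0] != wanted:
            problems.append(f"{package} {found} found, but {wanted} is required")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the pinned versions"]:
        print(line)
```

The `lookup` parameter exists only so the comparison can be exercised without the packages installed.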
1. Install dependencies
pip install -r requirements.txt
InstanceVG depends on components from detrex and detectron2.
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/IDEA-Research/detrex.git
cd detrex
git submodule init && git submodule update
pip install -e .
Finally, return to the InstanceVG repository root and install InstanceVG in editable mode:
pip install -e .
Data Preparation
Prepare the MS-COCO dataset, then download the referring and foreground annotations from HF-Data.
Expected directory structure:
data/
└── seqtr_type/
    ├── annotations/
    │   ├── mixed-seg/
    │   │   └── instances_nogoogle_withid.json
    │   ├── grefs/instance.json
    │   ├── ref-zom/instance.json
    │   └── rrefcoco/instance.json
    └── images/
        └── mscoco/
            └── train2014/
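A quick way to confirm the layout before training is to test each expected path. This is a sketch under one assumption: `annotations/` and `images/` both sit under `data/seqtr_type/`, which is how the tree above is read here. `EXPECTED_PATHS` and `missing_paths` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Relative paths taken from the expected directory structure above
# (nesting under seqtr_type/ is an assumption about the tree).
EXPECTED_PATHS = [
    "seqtr_type/annotations/mixed-seg/instances_nogoogle_withid.json",
    "seqtr_type/annotations/grefs/instance.json",
    "seqtr_type/annotations/ref-zom/instance.json",
    "seqtr_type/annotations/rrefcoco/instance.json",
    "seqtr_type/images/mscoco/train2014",
]


def missing_paths(data_root="data", expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing entries under data/:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Data layout looks complete.")
```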
Pretrained Weights
InstanceVG uses BEiT-3 as both the backbone and multi-modal fusion module.
Download the pretrained weights and tokenizer from BEiT-3's official repository.
mkdir pretrain_weights
Place the following files:
pretrain_weights/
├── beit3_base_patch16_224.zip
├── beit3_large_patch16_224.zip
└── beit3.spm
Demo
Example 1: GRES task
python tools/demo.py \
--img "asserts/imgs/Figure_1.jpg" \
--expression "three skateboard guys" \
--config "configs/gres/InstanceVG-grefcoco.py" \
--checkpoint /PATH/TO/InstanceVG-grefcoco.pth
Example 2: RIS task
python tools/demo.py \
--img "asserts/imgs/Figure_2.jpg" \
--expression "full half fruit" \
--config "configs/refcoco/InstanceVG-refcoco.py" \
--checkpoint /PATH/TO/InstanceVG-refcoco.pth
For additional options (e.g., thresholds, alternate checkpoints), see tools/demo.py.
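To run several expressions in one go, the demo invocation can be wrapped in a small script. The sketch below only assembles the four flags shown in the examples above (`--img`, `--expression`, `--config`, `--checkpoint`). The helpers `build_demo_command` and `run_demo` are hypothetical, and the checkpoint paths are the same placeholders used above.

```python
import shlex
import subprocess
import sys


def build_demo_command(img, expression, config, checkpoint):
    """Assemble the argument list for tools/demo.py using the flags shown above."""
    return [
        sys.executable, "tools/demo.py",
        "--img", img,
        "--expression", expression,
        "--config", config,
        "--checkpoint", checkpoint,
    ]


def run_demo(examples, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for example in examples:
        command = build_demo_command(**example)
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default: the commands are printed, not executed.
    run_demo([
        {
            "img": "asserts/imgs/Figure_1.jpg",
            "expression": "three skateboard guys",
            "config": "configs/gres/InstanceVG-grefcoco.py",
            "checkpoint": "/PATH/TO/InstanceVG-grefcoco.pth",
        },
        {
            "img": "asserts/imgs/Figure_2.jpg",
            "expression": "full half fruit",
            "config": "configs/refcoco/InstanceVG-refcoco.py",
            "checkpoint": "/PATH/TO/InstanceVG-refcoco.pth",
        },
    ])
```

Passing the arguments as a list avoids shell quoting problems with multi-word expressions.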
Training
To train InstanceVG from scratch:
bash tools/dist_train.sh [PATH_TO_CONFIG] [NUM_GPUS]
Evaluation
To reproduce the reported results:
bash tools/dist_test.sh [PATH_TO_CONFIG] [NUM_GPUS] \
--load-from [PATH_TO_CHECKPOINT_FILE]
Model Zoo
All pretrained checkpoints are available on Model.
| Task / Train Set | Config | Checkpoint |
|---|---|---|
| RefCOCO/+/g (Base) | configs/refcoco/InstanceVG-B-refcoco.py | InstanceVG-B-refcoco.pth |
| RefCOCO/+/g (Large) | configs/refcoco/InstanceVG-L-refcoco.py | InstanceVG-L-refcoco.pth |
| gRefCOCO | configs/gres/InstanceVG-grefcoco.py | InstanceVG-grefcoco.pth |
| Ref-ZOM | configs/refzom/InstanceVG-refzom.py | InstanceVG-refzom.pth |
| RRefCOCO | configs/rrefcoco/InstanceVG-rrefcoco.py | InstanceVG-rrefcoco.pth |
Example reproduction:
bash tools/dist_test.sh configs/refcoco/InstanceVG-B-refcoco.py 1 \
--load-from work_dir/refcoco/InstanceVG-B-refcoco.pth
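The same pattern extends to every row of the table. As a rough sketch, the config and checkpoint names can be kept in a dictionary and turned into `tools/dist_test.sh` commands. `MODEL_ZOO` and `eval_command` are hypothetical names, and the checkpoint directory is whatever location the downloaded files were saved to.

```python
import shlex

# (config, checkpoint file) pairs copied from the Model Zoo table above.
MODEL_ZOO = {
    "RefCOCO/+/g (Base)": ("configs/refcoco/InstanceVG-B-refcoco.py", "InstanceVG-B-refcoco.pth"),
    "RefCOCO/+/g (Large)": ("configs/refcoco/InstanceVG-L-refcoco.py", "InstanceVG-L-refcoco.pth"),
    "gRefCOCO": ("configs/gres/InstanceVG-grefcoco.py", "InstanceVG-grefcoco.pth"),
    "Ref-ZOM": ("configs/refzom/InstanceVG-refzom.py", "InstanceVG-refzom.pth"),
    "RRefCOCO": ("configs/rrefcoco/InstanceVG-rrefcoco.py", "InstanceVG-rrefcoco.pth"),
}


def eval_command(name, checkpoint_dir, num_gpus=1):
    """Build the dist_test.sh command line for one Model Zoo entry."""
    config, checkpoint = MODEL_ZOO[name]
    return shlex.join([
        "bash", "tools/dist_test.sh", config, str(num_gpus),
        "--load-from", f"{checkpoint_dir}/{checkpoint}",
    ])


if __name__ == "__main__":
    # "work_dir" is an assumed download location; adjust it to your setup.
    for name in MODEL_ZOO:
        print(eval_command(name, "work_dir"))
```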
Citation
If you find our work useful, please cite:
@ARTICLE{instancevg,
author={Dai, Ming and Cheng, Wenxuan and Liu, Jiang-Jiang and Yang, Lingfeng and Feng, Zhenhua and Yang, Wankou and Wang, Jingdong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Improving Generalized Visual Grounding with Instance-aware Joint Learning},
year={2025},
doi={10.1109/TPAMI.2025.3607387}
}
@article{dai2024simvg,
title={SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-Modal Fusion},
author={Dai, Ming and Yang, Lingfeng and Xu, Yihao and Feng, Zhenhua and Yang, Wankou},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={121670--121698},
year={2024}
}
@inproceedings{dai2025multi,
title={Multi-Task Visual Grounding with Coarse-to-Fine Consistency Constraints},
author={Dai, Ming and Li, Jian and Zhuang, Jiedong and Zhang, Xian and Yang, Wankou},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={3},
pages={2618--2626},
year={2025}
}
Acknowledgements
Our implementation builds upon
We thank these excellent open-source projects for their contributions to the community.