April 15, 2026
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding
State-of-the-Art Performance
VATEX achieves state-of-the-art mIoU on the RefCOCO, RefCOCO+, and G-Ref referring image segmentation benchmarks, improving on previous methods without requiring any external training data.
Requirements & Setup
System Requirements
- CUDA 12.8
- Python 3.9
- PyTorch 2.7
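A quick way to confirm that an environment matches these versions is sketched below. It is not part of the repository: `REQUIRED`, `major_minor`, `check_versions`, and `detect_versions` are illustrative names, and the comparison looks only at the major.minor prefix of each version string.

```python
import sys

# Versions listed in the requirements above (major.minor only).
REQUIRED = {"python": "3.9", "torch": "2.7", "cuda": "12.8"}


def major_minor(version):
    """Reduce a version string such as '2.7.0+cu128' to '2.7'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def check_versions(found):
    """Return {name: (required, found)} for every mismatching component."""
    mismatches = {}
    for name, required in REQUIRED.items():
        actual = found.get(name)
        if actual is None or major_minor(actual) != required:
            mismatches[name] = (required, actual)
    return mismatches


def detect_versions():
    """Collect installed versions; torch is optional so this runs without it."""
    found = {"python": "%d.%d" % sys.version_info[:2]}
    try:
        import torch
        found["torch"] = torch.__version__
        found["cuda"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        pass
    return found


if __name__ == "__main__":
    for name, (required, actual) in check_versions(detect_versions()).items():
        print(f"{name}: expected {required}, found {actual}")
```

An empty result from `check_versions` means all three components match. Other versions may still work, but the reproduced results in this README were obtained with the versions listed above.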
Installation
For detailed setup instructions, refer to installation.md.
Data and Checkpoints
You can download our dataset as well as the reproduced training checkpoint from this Hugging Face link. After downloading, simply extract the data into the project directory to get started.
Getting Started
- Download the pretrained ImageNet models:
  - Swin-B
  - Swin-L
  - Video-Swin-B
- Place the models in the weights folder.
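Before launching training, it can help to verify that the backbone checkpoints are where the configs expect them. The sketch below is illustrative only: the file names in `EXPECTED_WEIGHTS` are hypothetical placeholders, so replace them with the names of the files you actually downloaded.

```python
from pathlib import Path

# Hypothetical file names: use the names of the checkpoints you downloaded.
EXPECTED_WEIGHTS = ["swin_base.pkl", "swin_large.pkl", "video_swin_base.pkl"]


def missing_weights(weights_dir="weights", expected=EXPECTED_WEIGHTS):
    """Return the expected checkpoint files that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing from weights/:", ", ".join(missing))
    else:
        print("All expected backbone checkpoints found.")
```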
Training
To train VATEX using train_net_video.py, first set up the corresponding datasets as described in data.md, then execute:
python train_net_video.py --config-file <config-path> --num-gpus <num-gpus> OUTPUT_DIR <output-dir>
Here, OUTPUT_DIR is the directory in which the weights and logs are stored. For example, to train VATEX with a Swin-B backbone on 2 GPUs:
python train_net_video.py --config-file configs/refcoco/swin/swin_base.yaml --num-gpus 2 OUTPUT_DIR results/swin_base
To resume training, simply add the flag --resume.
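When training is launched from a script (for example, to sweep over configs), the command above can be assembled programmatically. The helper below is a sketch, not part of the repository: `build_train_command` is an illustrative name, and it uses only the flags shown in this section.

```python
import shlex


def build_train_command(config_file, num_gpus, output_dir, resume=False):
    """Assemble the training command shown above as an argument list."""
    cmd = [
        "python", "train_net_video.py",
        "--config-file", config_file,
        "--num-gpus", str(num_gpus),
    ]
    if resume:
        cmd.append("--resume")
    # KEY VALUE config overrides come last, after all flags.
    cmd += ["OUTPUT_DIR", output_dir]
    return cmd


cmd = build_train_command(
    "configs/refcoco/swin/swin_base.yaml", 2, "results/swin_base", resume=True
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting issues.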
Evaluation
To evaluate a trained model, use the following command:
python train_net_video.py --config-file configs/refcoco/swin/swin_base.yaml \
    --num-gpus 1 --eval-only \
    MODEL.WEIGHTS <path_to_weights> \
    DATASETS.TEST '("refcoco_val",)' \
    OUTPUT_DIR <output_dir>
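To evaluate several splits in turn, the same command can be generated once per split. The sketch below makes two assumptions that are not confirmed by this README: the split names other than `refcoco_val` are guessed from the naming pattern, and the checkpoint path is a hypothetical placeholder. `format_test_datasets` and `build_eval_command` are illustrative helper names.

```python
import shlex


def format_test_datasets(names):
    """Render dataset names as a tuple literal, e.g. ("refcoco_val",)."""
    return "(" + "".join(f'"{name}",' for name in names) + ")"


def build_eval_command(config_file, weights, dataset, output_dir):
    """Assemble the single-GPU evaluation command shown above."""
    return [
        "python", "train_net_video.py",
        "--config-file", config_file,
        "--num-gpus", "1", "--eval-only",
        "MODEL.WEIGHTS", weights,
        "DATASETS.TEST", format_test_datasets([dataset]),
        "OUTPUT_DIR", output_dir,
    ]


# "refcoco_val" comes from the command above; the other two names are assumed
# to follow the same pattern and may differ in the actual dataset registry.
SPLITS = ["refcoco_val", "refcoco_testA", "refcoco_testB"]

for split in SPLITS:
    cmd = build_eval_command(
        "configs/refcoco/swin/swin_base.yaml",
        "results/swin_base/model_final.pth",  # hypothetical checkpoint path
        split,
        f"results/eval/{split}",
    )
    print(shlex.join(cmd))
```

Writing each split to its own output directory keeps the evaluation logs from overwriting one another.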
Reproduced Results
The following table presents the reproduced mIoU scores obtained using the released weights. These scores are slightly lower than those reported in the original paper, primarily due to differences in the software environment (PyTorch 2.7 / CUDA 12.8 versus the environment used during the initial training). Additionally, note that the numbers reported in the paper were selected from multiple runs, using the best-performing checkpoints.
| Dataset | Split | Paper (mIoU) | Reproduced (mIoU) |
|---|---|---|---|
| RefCOCO | val | 78.16 | 77.08 |
| RefCOCO | testA | 79.64 | 78.59 |
| RefCOCO | testB | 75.64 | 74.65 |
| RefCOCO+ | val | 70.02 | 69.47 |
| RefCOCO+ | testA | 74.41 | 73.82 |
| RefCOCO+ | testB | 62.52 | 61.50 |
| G-Ref | val | 69.73 | 69.54 |
| G-Ref | test | 70.58 | 70.17 |
All results use a Swin-B visual backbone and a CLIP text encoder.
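The size of the reproduction gap can be read off the table with a few lines of arithmetic. The snippet below simply copies the table values and subtracts; `RESULTS` and `gaps` are illustrative names.

```python
# (dataset, split, paper mIoU, reproduced mIoU), copied from the table above.
RESULTS = [
    ("RefCOCO", "val", 78.16, 77.08),
    ("RefCOCO", "testA", 79.64, 78.59),
    ("RefCOCO", "testB", 75.64, 74.65),
    ("RefCOCO+", "val", 70.02, 69.47),
    ("RefCOCO+", "testA", 74.41, 73.82),
    ("RefCOCO+", "testB", 62.52, 61.50),
    ("G-Ref", "val", 69.73, 69.54),
    ("G-Ref", "test", 70.58, 70.17),
]


def gaps(results):
    """Return (dataset, split, paper - reproduced), rounded to two decimals."""
    return [(d, s, round(paper - repro, 2)) for d, s, paper, repro in results]


rows = gaps(RESULTS)
for dataset, split, gap in rows:
    print(f"{dataset:9s} {split:6s} {gap:.2f}")

mean_gap = sum(g for _, _, g in rows) / len(rows)
print(f"mean gap: {mean_gap:.3f}")
```

The gap ranges from 0.19 (G-Ref val) to 1.08 (RefCOCO val) mIoU points, averaging roughly 0.7 points across the eight splits.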
Main Results
As shown in the table, VATEX improves mIoU over previous state-of-the-art methods on all benchmarks. Notably, it surpasses the recent methods CGFormer and VG-LAW by +1.23% and +3.11% on RefCOCO, +1.46% and +3.31% on RefCOCO+, and +2.16% and +4.37% on G-Ref, respectively (validation splits). The more complex the expressions, the greater the gains achieved by VATEX. Even compared with LISA, a large pre-trained vision-language model, VATEX consistently achieves 3-5% better performance across all datasets.
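For readers unfamiliar with the metric, the sketch below shows how mIoU is commonly computed in referring segmentation: the IoU of each predicted mask is computed separately and then averaged over all samples. This README does not define the metric, so treat this as an assumption about the convention rather than the repository's evaluation code. The function names and the handling of empty masks are likewise illustrative.

```python
def iou(pred, target):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU 1.0 (an assumption).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """mIoU: average the per-sample IoU over all (prediction, target) pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # IoU = 1
]
print(mean_iou(pairs))  # 0.75
```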
Citing VATEX
If you find VATEX useful for your research, please cite the following paper:
@inproceedings{nguyen2025visionaware,
  title={Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding},
  author={Nguyen, Truong and others},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025},
  url={https://openaccess.thecvf.com/content/WACV2025/html/Nguyen-Truong_Vision-Aware_Text_Features_in_Referring_Image_Segmentation_From_Object_Understanding_WACV_2025_paper.html}
}