QueryMatch
April 9, 2025
This is the official implementation of "QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding". In this paper, we propose QueryMatch, a novel query-based one-stage framework for weakly supervised visual grounding. Unlike previous work, QueryMatch represents candidate objects with a set of query features, which inherently establish accurate one-to-one associations with visual objects. QueryMatch thus reformulates weakly supervised visual grounding as a query-text matching problem, which can be optimized via query-based contrastive learning. Building on QueryMatch, we further propose Active Query Selection (AQS), a strategy for effective weakly supervised learning. In particular, AQS aims to enhance the effectiveness of query-based contrastive learning by actively selecting high-quality query features.
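To make the query-text matching idea concrete, here is a minimal, illustrative sketch in pure Python. It is not the implementation in this repository. It assumes each image yields a list of query feature vectors and each expression a single text feature vector. The function names (`cosine`, `select_query`, `contrastive_loss`), the max-similarity selection rule, and the temperature value are all assumptions made for illustration. In particular, `select_query` is a simple stand-in and does not reproduce AQS.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_query(queries, text):
    """Index of the query most similar to the text (illustrative stand-in, not AQS)."""
    sims = [cosine(q, text) for q in queries]
    return max(range(len(sims)), key=sims.__getitem__)


def contrastive_loss(batch_queries, batch_texts, temperature=0.1):
    """InfoNCE-style loss over a batch of (image queries, text) pairs.

    The text at index i is treated as the positive for image i and as a
    negative for every other image in the batch.
    """
    losses = []
    for i, text in enumerate(batch_texts):
        # Score of this text against image j: similarity to image j's best query.
        logits = []
        for queries in batch_queries:
            best = select_query(queries, text)
            logits.append(cosine(queries[best], text) / temperature)
        # Numerically stable negative log-softmax of the positive pair.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)
```

With two toy images and two expressions, `contrastive_loss([[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]], [[0.9, 0.1], [-0.1, -0.9]])` is close to zero because each text aligns with a query from its own image, while swapping the two texts produces a much larger loss.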
Changes
- 2025/04: Optimized memory usage in the AQS implementation, improved some code, and released the trained QueryMatch weights.
- 2024/07: Our paper was accepted for ACM MM 2024.
- 2024/04: The repository was initially created.
Installation
- Clone this repo
git clone https://github.com/TensorThinker/QueryMatch.git
cd QueryMatch
- Create a conda virtual environment and activate it
conda create -n querymatch python=3.8 -y
conda activate querymatch
- Install PyTorch following the official installation instructions
# CUDA 11.7
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
- Install Detectron2 following the official installation instructions
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
- Install apex following the official installation guide
# Run from the root of a cloned apex source directory
pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
- Compile the DCN layer:
cd utils_querymatch/DCN
./make.sh
cd mask2former
pip install -r requirements.txt
cd ./modeling/pixel_decoder/ops
sh make.sh
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz
pip install albumentations
pip install Pillow==9.5.0
pip install tensorboardX
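After finishing the steps above, a quick sanity check of the environment can save a failed training run. The script below is a hedged sketch rather than part of this repository: it only compares installed package versions against the versions pinned in the commands above. The distribution names it looks up (for example `apex` and `detectron2`) are assumptions about how those packages register themselves, and `check_environment` is a hypothetical helper name.

```python
from importlib import metadata

# Versions pinned in the installation commands above.
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "torchaudio": "2.0.2",
    "Pillow": "9.5.0",
}
# Packages installed without a pinned version; only presence is checked.
# The distribution names are assumptions.
UNPINNED = ["detectron2", "apex", "albumentations", "tensorboardX"]


def installed_version(name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pinned=PINNED, unpinned=UNPINNED):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, wanted in pinned.items():
        found = installed_version(name)
        if found is None:
            problems.append(f"{name} is not installed (expected {wanted})")
        elif found.split("+")[0] != wanted:
            # Drop local build tags such as "+cu117" before comparing.
            problems.append(f"{name} {found} found, expected {wanted}")
    for name in unpinned:
        if installed_version(name) is None:
            problems.append(f"{name} is not installed")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["All checked packages look fine."]:
        print(line)
```

The check is deliberately limited to package metadata, so it runs without a GPU and without importing any of the heavy libraries.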
Data Preparation
- Download the images and generate the annotations according to SimREC.
- Download the pretrained weights of Mask2Former from OneDrive.
- The project structure should look like the following:
| -- QueryMatch
    | -- data
        | -- anns
            | -- refcoco.json
            | -- refcoco+.json
            | -- refcocog.json
        | -- images
            | -- train2014
                | -- COCO_train2014_000000000072.jpg
                | -- ...
    | -- config_querymatch
    | -- configs
    | -- datasets
    | -- datasets_querymatch
    | -- DCNv2_latest
    | -- detectron2
    | -- mask2former
    | -- models_querymatch
    | -- ...
- NOTE: Our Mask2Former is trained on COCO's training images, excluding those that appear in the validation and test splits of RefCOCO, RefCOCO+, and RefCOCOg.
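Before training, it can help to confirm that the data landed where the project structure above expects it. The following is a small sketch, not a script shipped with this repository. It assumes the `data` folder sits directly under the `QueryMatch` root, as the listing suggests, and `missing_paths` is a hypothetical helper name.

```python
from pathlib import Path

# Paths taken from the project structure above, relative to the QueryMatch root.
EXPECTED = [
    "data/anns/refcoco.json",
    "data/anns/refcoco+.json",
    "data/anns/refcocog.json",
    "data/images/train2014",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected files or folders that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

Run it from the repository root; an empty result means the annotation files and the image folder are all in place.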
QueryMatch
Training
python train_querymatch.py --config ./config_querymatch/[DATASET_NAME].yaml --config-file ./configs/coco/instance-segmentation/swin/maskformer2_swin_base_384_bs16_50ep.yaml --eval-only MODEL.WEIGHTS [PATH_TO_MASK2FORMER_WEIGHT]
Evaluation
python test_querymatch.py --config ./config_querymatch/[DATASET_NAME].yaml --eval-weights [PATH_TO_CHECKPOINT_FILE] --config-file ./configs/coco/instance-segmentation/swin/maskformer2_swin_base_384_bs16_50ep.yaml --eval-only MODEL.WEIGHTS [PATH_TO_MASK2FORMER_WEIGHT]
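The training and evaluation commands share most of their arguments, so when running several datasets it may be convenient to assemble them programmatically. The helper below is only a sketch built from the two commands above. `build_command` is a hypothetical name, and the dataset name and weight paths in the usage example are placeholders, not files provided by this repository.

```python
import shlex

# Mask2Former config path, copied from the commands above.
MASK2FORMER_CFG = (
    "./configs/coco/instance-segmentation/swin/"
    "maskformer2_swin_base_384_bs16_50ep.yaml"
)


def build_command(dataset_name, mask2former_weights, checkpoint=None):
    """Build the argument list for training, or for evaluation if a checkpoint is given."""
    script = "train_querymatch.py" if checkpoint is None else "test_querymatch.py"
    cmd = ["python", script, "--config", f"./config_querymatch/{dataset_name}.yaml"]
    if checkpoint is not None:
        # Evaluation additionally needs the trained QueryMatch checkpoint.
        cmd += ["--eval-weights", checkpoint]
    cmd += [
        "--config-file", MASK2FORMER_CFG,
        "--eval-only", "MODEL.WEIGHTS", mask2former_weights,
    ]
    return cmd


if __name__ == "__main__":
    # Placeholder dataset name and weight path, for illustration only.
    print(shlex.join(build_command("refcoco", "path/to/mask2former.pth")))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.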
Model Zoo
QueryMatch on three RES benchmark datasets
| Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g |
|---|---|---|---|---|---|---|---|
| QueryMatch | 59.10 | 59.08 | 58.82 | 39.87 | 41.44 | 37.22 | 43.06 |
| Weights | RefCOCO_QueryMatch | | | RefCOCO+_QueryMatch | | | RefCOCOg_QueryMatch |

| Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g |
|---|---|---|---|---|---|---|---|
| QueryMatch | 66.02 | 66.00 | 65.48 | 44.76 | 46.72 | 41.50 | 48.47 |
| Weights | RefCOCO_QueryMatch | | | RefCOCO+_QueryMatch | | | RefCOCOg_QueryMatch |
Notes
Our Experimental Environment
- GPU: RTX 4090 (24 GB)
- CPU: 32 vCPU Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
- CUDA 11.7
- torch 2.0.1
Compatibility Note
This project is compatible with multiple CUDA versions, including but not limited to CUDA 11.3. While overall performance trends remain consistent across various hardware environments, please note that specific numerical results may vary slightly.
Citation
@inproceedings{chen2024querymatch,
title={QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding},
author={Chen, Shengxin and Luo, Gen and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guannan and Ji, Rongrong},
booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
pages={4177--4186},
year={2024}
}
Acknowledgement
Thanks a lot for the nicely organized code from the following repos