SOV-STG-VLA
August 18, 2025
Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor (paper)
Requirements
conda create -n sov-stg-vla python=3.10 -y ; conda activate sov-stg-vla
pip install uv
uv pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
uv pip install -r requirements.txt
uv pip install "numpy<2"
uv pip install timm==0.9.16
uv pip install transformers==4.39.1
uv pip install fairscale==0.4.13
uv pip install omegaconf==2.3.0
uv pip install wandb
uv pip install git+https://github.com/openai/CLIP.git
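Before compiling the CUDA operators, it may help to confirm that the active environment holds the CUDA build of PyTorch pinned above. The following is a minimal sketch, not part of the repository: the function name `check_torch_env` and the `PYTHON` override variable are assumptions introduced here for illustration.

```shell
# Interpreter to probe; override with PYTHON=/path/to/python if needed.
# (PYTHON and check_torch_env are hypothetical names, not part of this repo.)
PYTHON=${PYTHON:-python}

check_torch_env() {
    "$PYTHON" - <<'EOF'
import importlib.util
import sys

# Exit with a readable message instead of a traceback if torch is absent.
if importlib.util.find_spec("torch") is None:
    sys.exit("torch is not installed in this environment")

import torch

print("torch version: ", torch.__version__)
print("CUDA build:    ", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
EOF
}

check_torch_env || echo "environment check failed"
```

If the pinned install succeeded, the reported version should correspond to the `1.13.1+cu116` wheel, and CUDA should be reported as available on a machine with a working GPU driver.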
- Compiling CUDA operators
cd ./models/ops/deformable_transformer_attention/
sh ./make.sh
# test
python test.py
Dataset Preparation
HICO-DET
Please follow the HICO-DET dataset preparation of GGNet. See the README.md of QAHOI.
After preparation, the data/hico_det folder should look as follows:
data
├── hico_det
|   ├── images
|   |   ├── test2015
|   |   └── train2015
|   └── annotations
|       ├── anno_list.json
|       ├── corre_hico.npy
|       ├── file_name_to_obj_cat.json
|       ├── hoi_id_to_num.json
|       ├── hoi_list_new.json
|       ├── test_hico.json
|       └── trainval_hico.json
V-COCO
Please follow the installation of V-COCO.
For evaluation, please put vcoco_test.ids and vcoco_test.json into the data/v-coco/data folder.
After preparation, the data/v-coco folder should look as follows:
data
├── v-coco
|   ├── prior.pickle
|   ├── images
|   |   ├── train2014
|   |   └── val2014
|   ├── data
|   |   ├── instances_vcoco_all_2014.json
|   |   ├── vcoco_test.ids
|   |   └── vcoco_test.json
|   └── annotations
|       ├── corre_vcoco.npy
|       ├── test_vcoco.json
|       └── trainval_vcoco.json
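The two layouts above can be checked before training or evaluation with a short script. This is a sketch only: `check_layout` is a hypothetical helper, not something the repository provides, and it tests only that each path exists, not that the files are valid.

```shell
# check_layout ROOT PATH...
# Print every expected path that is missing under ROOT and
# return non-zero if at least one is missing.
# (check_layout is a hypothetical helper, not part of this repo.)
check_layout() {
    root=$1
    shift
    missing=0
    for rel in "$@"; do
        if [ ! -e "$root/$rel" ]; then
            echo "missing: $root/$rel"
            missing=1
        fi
    done
    return "$missing"
}

# Paths taken from the HICO-DET tree above.
check_layout data/hico_det \
    images/test2015 images/train2015 \
    annotations/anno_list.json annotations/corre_hico.npy \
    annotations/file_name_to_obj_cat.json annotations/hoi_id_to_num.json \
    annotations/hoi_list_new.json annotations/test_hico.json \
    annotations/trainval_hico.json \
    || echo "HICO-DET layout is incomplete"

# Paths taken from the V-COCO tree above.
check_layout data/v-coco \
    prior.pickle images/train2014 images/val2014 \
    data/instances_vcoco_all_2014.json data/vcoco_test.ids \
    data/vcoco_test.json annotations/corre_vcoco.npy \
    annotations/test_vcoco.json annotations/trainval_vcoco.json \
    || echo "V-COCO layout is incomplete"
```

Run it from the repository root; it prints nothing when every listed path is present.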
Evaluation
HICO-DET
| Model | Full (def) | Rare (def) | Non-Rare (def) | Full (ko) | Rare (ko) | Non-Rare (ko) | ckpt |
|---|---|---|---|---|---|---|---|
| SOV-STG-VLA-S | 41.16 | 39.48 | 41.67 | 43.81 | 42.63 | 44.17 | checkpoint |
| SOV-STG-VLA-Swin-L | 45.64 | 44.35 | 46.03 | 48.22 | 47.12 | 48.55 | checkpoint |
V-COCO
| Model | AP (S1) | AP (S2) | ckpt |
|---|---|---|---|
| SOV-STG-VLA-S | 63.8 | 65.7 | checkpoint |
Download the model into the params folder.
Evaluate the model by running the corresponding command below.
# SOV-STG-VLA-S (HICO-DET)
sh configs/sov-stg-s-vla-blip2_hoi_eval.sh
# SOV-STG-VLA-Swin-L (HICO-DET)
sh configs/sov-stg-swin-l-vla-blip2_hoi_eval.sh
# SOV-STG-VLA-S (V-COCO)
sh configs/vcoco_sov-stg-s-vla-blip2_hoi_eval.sh
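To run several of the evaluation scripts above in one pass and keep their output for later comparison, a small wrapper along these lines may be convenient. It is a sketch under stated assumptions: `run_evals` and the `logs` directory are hypothetical names introduced here, and the wrapper treats a non-zero exit status as a failure.

```shell
# run_evals LOG_DIR SCRIPT...
# Run each script with sh and store its combined stdout/stderr in
# LOG_DIR/<script name>.log. Missing scripts are skipped with a notice.
# (run_evals and the logs directory are hypothetical, not part of this repo.)
run_evals() {
    log_dir=$1
    shift
    mkdir -p "$log_dir"
    for script in "$@"; do
        if [ ! -f "$script" ]; then
            echo "skip: $script not found"
            continue
        fi
        name=$(basename "$script" .sh)
        echo "running: $script"
        sh "$script" > "$log_dir/$name.log" 2>&1 \
            || echo "failed: $script (see $log_dir/$name.log)"
    done
}

# Script paths are the three evaluation configs listed above.
run_evals logs \
    configs/sov-stg-s-vla-blip2_hoi_eval.sh \
    configs/sov-stg-swin-l-vla-blip2_hoi_eval.sh \
    configs/vcoco_sov-stg-s-vla-blip2_hoi_eval.sh
```

Each log file can then be inspected for the reported scores and compared against the tables above.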
Training
HICO-DET
- Training SOV-STG-VLA with Swin-Large.
Download our pre-trained DN-Deformable-DETR Swin-Large model from Google Drive to the params folder.
sh configs/sov-stg-swin-l-vla-blip2_hoi.sh
- Training SOV-STG-VLA-S
Download our converted DN-Deformable-DETR R50 model from Google Drive to the params folder.
sh configs/sov-stg-s-vla-blip2_hoi.sh
V-COCO
- Training SOV-STG-VLA-S on V-COCO.
Download our converted DN-Deformable-DETR R50 model from Google Drive to the params folder.
sh configs/vcoco_sov-stg-s-vla-blip2_hoi.sh
References
@inproceedings{chen2025focusing,
title={Focusing on what to Decode and what to Train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor},
author={Chen, Junwen and Wang, Yingcheng and Yanai, Keiji},
booktitle={2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
pages={9416--9425},
year={2025},
organization={IEEE}
}
@article{chen2023focusing,
title={Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor},
author={Chen, Junwen and Wang, Yingcheng and Yanai, Keiji},
journal={arXiv preprint arXiv:2307.02291},
year={2023}
}