SOV-STG-VLA

August 18, 2025

Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor (paper)

Requirements

conda create -n sov-stg-vla python=3.10 -y ; conda activate sov-stg-vla
pip install uv

uv pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

uv pip install -r requirements.txt

uv pip install "numpy<2"
uv pip install timm==0.9.16
uv pip install transformers==4.39.1
uv pip install fairscale==0.4.13
uv pip install omegaconf==2.3.0
uv pip install wandb
uv pip install git+https://github.com/openai/CLIP.git
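After installation it can help to confirm that the pinned versions were actually resolved. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the pins are copied from the commands above. Packages without an exact pin (`numpy<2`, `wandb`, CLIP) are left out.

```python
from importlib import metadata

# Pins copied from the install commands above; local build tags such as
# "+cu116" are ignored when comparing.
PINS = {
    "torch": "1.13.1",
    "torchvision": "0.14.1",
    "torchaudio": "0.13.1",
    "timm": "0.9.16",
    "transformers": "4.39.1",
    "fairscale": "0.4.13",
    "omegaconf": "2.3.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) tuple."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Strip a local version suffix like "+cu116" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

Run inside the activated `sov-stg-vla` environment, the script prints nothing when every pinned package matches.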

Compiling CUDA operators

cd ./models/ops/deformable_transformer_attention/
# Build the CUDA operators
sh ./make.sh
# Run the operator test
python test.py

Dataset Preparation

HICO-DET

Please follow the HICO-DET dataset preparation of GGNet. See the README.md of QAHOI.

After preparation, the data/hico_det folder should look as follows:

data
├── hico_det
|   ├── images
|   |   ├── test2015
|   |   └── train2015
|   └── annotations
|       ├── anno_list.json
|       ├── corre_hico.npy
|       ├── file_name_to_obj_cat.json
|       ├── hoi_id_to_num.json
|       ├── hoi_list_new.json
|       ├── test_hico.json
|       └── trainval_hico.json
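The layout above can be checked before training or evaluation. This is a small sketch under the assumption that it is run from the repository root; `missing_entries` is a hypothetical helper, and the expected paths are taken directly from the tree shown above.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
HICO_DET_LAYOUT = [
    "images/test2015",
    "images/train2015",
    "annotations/anno_list.json",
    "annotations/corre_hico.npy",
    "annotations/file_name_to_obj_cat.json",
    "annotations/hoi_id_to_num.json",
    "annotations/hoi_list_new.json",
    "annotations/test_hico.json",
    "annotations/trainval_hico.json",
]


def missing_entries(root, expected=HICO_DET_LAYOUT):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data/hico_det")
    print("layout OK" if not missing else "missing: " + ", ".join(missing))
```

The check only looks at names, not at file contents, so it catches misplaced folders but not corrupted annotations.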

V-COCO

Please follow the installation of V-COCO.

For evaluation, please put vcoco_test.ids and vcoco_test.json into the data/v-coco/data folder.
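Placing the two evaluation files can be scripted. The sketch below assumes the files already exist in some source directory produced by the V-COCO setup; `place_eval_files` and its `src_dir` argument are hypothetical, and only the two file names and the destination folder come from this README.

```python
import shutil
from pathlib import Path

# File names and destination folder as stated in the text above.
EVAL_FILES = ("vcoco_test.ids", "vcoco_test.json")


def place_eval_files(src_dir, dst_dir="data/v-coco/data"):
    """Copy the V-COCO evaluation files from src_dir into dst_dir.

    Returns the destination paths. Raises FileNotFoundError if a source
    file is missing, before anything is copied.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    missing = [name for name in EVAL_FILES if not (src_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src_dir}: {', '.join(missing)}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src_dir / name, dst_dir / name)) for name in EVAL_FILES]
```

Checking for both files before copying avoids leaving the destination folder half populated.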

After preparation, the data/v-coco folder should look as follows:

data
├── v-coco
|   ├── prior.pickle
|   ├── images
|   |   ├── train2014
|   |   └── val2014
|   ├── data
|   |   ├── instances_vcoco_all_2014.json
|   |   ├── vcoco_test.ids
|   |   └── vcoco_test.json
|   └── annotations
|       ├── corre_vcoco.npy
|       ├── test_vcoco.json
|       └── trainval_vcoco.json

Evaluation

HICO-DET

Model	Full (Default)	Rare (Default)	Non-Rare (Default)	Full (Known Object)	Rare (Known Object)	Non-Rare (Known Object)	Checkpoint
SOV-STG-VLA-S	41.16	39.48	41.67	43.81	42.63	44.17	checkpoint
SOV-STG-VLA-Swin-L	45.64	44.35	46.03	48.22	47.12	48.55	checkpoint

V-COCO

Model	AP (Scenario 1)	AP (Scenario 2)	Checkpoint
SOV-STG-VLA-S	63.8	65.7	checkpoint

Download the checkpoint into the params folder, then evaluate the model by running the corresponding command below.

# SOV-STG-VLA-S (HICO-DET)
sh configs/sov-stg-s-vla-blip2_hoi_eval.sh

# SOV-STG-VLA-Swin-L (HICO-DET)
sh configs/sov-stg-swin-l-vla-blip2_hoi_eval.sh

# SOV-STG-VLA-S (V-COCO)
sh configs/vcoco_sov-stg-s-vla-blip2_hoi_eval.sh
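When several evaluations are run in sequence, the commands above can be selected programmatically. This is an illustrative sketch only: the script paths are copied from the commands above, while `EVAL_SCRIPTS`, `eval_command`, `run_eval`, and the dictionary keys are names chosen for this example and do not exist in the repository.

```python
import subprocess

# Script paths copied from the evaluation commands above; the dictionary
# keys are labels chosen for this sketch.
EVAL_SCRIPTS = {
    ("SOV-STG-VLA-S", "HICO-DET"): "configs/sov-stg-s-vla-blip2_hoi_eval.sh",
    ("SOV-STG-VLA-Swin-L", "HICO-DET"): "configs/sov-stg-swin-l-vla-blip2_hoi_eval.sh",
    ("SOV-STG-VLA-S", "V-COCO"): "configs/vcoco_sov-stg-s-vla-blip2_hoi_eval.sh",
}


def eval_command(model, dataset):
    """Return the argv list for evaluating model on dataset."""
    try:
        script = EVAL_SCRIPTS[(model, dataset)]
    except KeyError:
        raise ValueError(f"no evaluation script for {model} on {dataset}") from None
    return ["sh", script]


def run_eval(model, dataset):
    """Run the evaluation script; intended to be called from the repository root."""
    return subprocess.run(eval_command(model, dataset), check=True)
```

For example, `run_eval("SOV-STG-VLA-S", "V-COCO")` launches the same script as the last command above, and an unlisted combination such as Swin-L on V-COCO raises `ValueError` instead of running anything.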

Training

HICO-DET

Train SOV-STG-VLA with the Swin-Large backbone.

Download our pre-trained DN-Deformable-DETR Swin-Large model from Google Drive into the params folder.

sh configs/sov-stg-swin-l-vla-blip2_hoi.sh

Train SOV-STG-VLA-S.

Download our converted DN-Deformable-DETR R50 model from Google Drive into the params folder.

sh configs/sov-stg-s-vla-blip2_hoi.sh

V-COCO

Train SOV-STG-VLA-S on V-COCO.

Download our converted DN-Deformable-DETR R50 model from Google Drive into the params folder.

sh configs/vcoco_sov-stg-s-vla-blip2_hoi.sh
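Before launching a training script, it may be worth confirming that the downloaded weights actually landed in the params folder. The sketch below is an assumption-laden helper: this README does not name the weight files, so `list_weight_files` is hypothetical and simply lists files with the common PyTorch suffixes `.pth` and `.pt`, which are themselves a guess.

```python
from pathlib import Path


def list_weight_files(params_dir="params", suffixes=(".pth", ".pt")):
    """Return the sorted names of files in params_dir with a matching suffix.

    The suffixes are an assumption; adjust them to the downloaded files.
    """
    root = Path(params_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and p.suffix in suffixes
    )


if __name__ == "__main__":
    found = list_weight_files()
    print("\n".join(found) if found else "no weight files found in params/")
```

An empty result means either the download step was skipped or the files use a different extension.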

References

@inproceedings{chen2025focusing,
  title={Focusing on what to Decode and what to Train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor},
  author={Chen, Junwen and Wang, Yingcheng and Yanai, Keiji},
  booktitle={2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={9416--9425},
  year={2025},
  organization={IEEE}
}

@article{chen2023focusing,
  title={Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor},
  author={Chen, Junwen and Wang, Yingcheng and Yanai, Keiji},
  journal={arXiv preprint arXiv:2307.02291},
  year={2023}
}