Visual Semantic Loss (VSL)
December 25, 2023
Code for the ICME 2023 paper “Image-text Retrieval via Preserving Main Semantics of Vision” [pdf].
We propose Visual Semantic Loss (VSL), a semantic alignment strategy for image-text retrieval, and verify its effectiveness on top of the two models proposed in SGRAF (SGR and SAF).
Introduction
The framework of VSL:
Experimental results:
| Dataset | Method | Image to Text R@1 | Image to Text R@5 | Image to Text R@10 | Text to Image R@1 | Text to Image R@5 | Text to Image R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSCOCO1K | SGR+VSL | 78.5 | 96.2 | 98.6 | 63.0 | 89.9 | 95.3 |
| MSCOCO1K | SAF+VSL | 78.3 | 96.0 | 98.6 | 63.0 | 89.9 | 95.3 |
| MSCOCO1K | SGRAF+VSL | 80.1 | 96.5 | 98.8 | 64.8 | 90.7 | 95.9 |
| MSCOCO5K | SGR+VSL | 57.7 | 84.3 | 91.0 | 41.4 | 70.5 | 80.8 |
| MSCOCO5K | SAF+VSL | 56.2 | 84.4 | 91.3 | 41.4 | 70.4 | 81.0 |
| MSCOCO5K | SGRAF+VSL | 60.2 | 86.6 | 92.5 | 43.3 | 72.2 | 82.5 |
| Flickr30K | SGR+VSL | 75.7 | 93.5 | 96.5 | 56.5 | 80.9 | 85.9 |
| Flickr30K | SAF+VSL | 75.9 | 93.9 | 97.5 | 57.9 | 82.7 | 88.9 |
| Flickr30K | SGRAF+VSL | 79.5 | 95.3 | 97.9 | 60.2 | 84.3 | 89.4 |
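The R@K columns in the results table report Recall at K: the percentage of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch of the image-to-text direction follows, assuming a precomputed image-by-caption similarity matrix and a layout in which each image owns a contiguous block of captions. The function name `recall_at_k` and the toy scores are illustrative and are not taken from this repository.

```python
def recall_at_k(sims, ks=(1, 5, 10), captions_per_image=5):
    """Image-to-text Recall@K in percent.

    sims[i][j] is the similarity between image i and caption j.
    Assumption: captions i*cpi .. (i+1)*cpi - 1 are the ground truth of image i.
    """
    ranks = []
    for i, row in enumerate(sims):
        # Caption indices sorted from most to least similar.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        gt = set(range(i * captions_per_image, (i + 1) * captions_per_image))
        # 0-based rank of the best-placed ground-truth caption.
        ranks.append(min(pos for pos, j in enumerate(order) if j in gt))
    return {k: 100.0 * sum(r < k for r in ranks) / len(ranks) for k in ks}


# Toy example: 2 images with 1 caption each.
toy = [[0.9, 0.1],
       [0.8, 0.2]]
scores = recall_at_k(toy, ks=(1, 2), captions_per_image=1)
```

In the toy matrix the first image ranks its own caption first and the second image ranks it second, so only one of the two queries counts as a hit at K=1.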
Requirements
- Python 3.7
- PyTorch==1.10.1
- CUDA==11.1
- NumPy==1.21.6
- TensorBoard
- h5py==3.1.0
- Punkt Sentence Tokenizer:
import nltk
nltk.download()
> d punkt
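Before training, it may help to confirm that the installed packages match the pinned versions listed above. The sketch below is one possible check, not part of this repository. It assumes the import names `torch`, `numpy`, and `h5py`, and it treats a build suffix such as `+cu111` as a match by comparing only the version prefix.

```python
import importlib

# Pinned versions copied from the requirements list.
# Keys are import names (assumption: PyTorch imports as `torch`, NumPy as `numpy`).
PINS = {"torch": "1.10.1", "numpy": "1.21.6", "h5py": "3.1.0"}


def check_versions(pins):
    """Return {module: status}; status is 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        found = getattr(module, "__version__", "unknown")
        # Prefix comparison, so '1.10.1+cu111' still counts as '1.10.1'.
        report[name] = "ok" if found.startswith(wanted) else "found " + found
    return report
```

Calling `check_versions(PINS)` and printing the result gives a quick overview of which packages are missing or differ from the pinned versions.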
Download data and vocab
We follow SCAN and SGRAF to obtain image features and vocabularies, which can be downloaded by using:
wget https://iudata.blob.core.windows.net/scan/data.zip
wget https://iudata.blob.core.windows.net/scan/vocab.zip
Another download link is provided by SGRAF.
https://drive.google.com/drive/u/0/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC
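For environments without wget, the same two archives can be fetched and unpacked from Python. This is a minimal sketch using only the standard library. The helper names `fetch` and `extract` are hypothetical, and the layout inside the archives has not been verified here, so check that the extracted folders match the `data_path` and `vocab_path` you configure.

```python
import os
import urllib.request
import zipfile

# URLs copied from the wget commands above.
ARCHIVES = [
    "https://iudata.blob.core.windows.net/scan/data.zip",
    "https://iudata.blob.core.windows.net/scan/vocab.zip",
]


def fetch(url, dest_dir="."):
    """Download url into dest_dir unless the file already exists; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


def extract(zip_path, dest_dir="."):
    """Unpack a zip archive into dest_dir and return the member names it contained."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (left commented out because it triggers the actual downloads):
# for url in ARCHIVES:
#     extract(fetch(url))
```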
Pre-trained models and evaluation
Put the pretrained models into "./checkpoint".
1. The evaluation for pre-trained SGR+VSL and SAF+VSL models.
Modify the model_path, data_path, vocab_path in the eval_single.py file. Then run eval_single.py:
For example:
evalrank(model_path="./checkpoint/SGR+VSL_COCO.pth.tar", data_path='./data', split="testall", fold5=True)
(For SGR+VSL and SAF+VSL) python eval_single.py
Note that fold5=True is only for evaluation on MSCOCO1K (average over 5 folds), while fold5=False is for MSCOCO5K and Flickr30K. Pretrained models and log files can be downloaded from:
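As a rough illustration of the fold5 note above: the MSCOCO1K numbers are averages over five 1K-image subsets of the 5K test set, while MSCOCO5K scores the full set in one pass. The sketch below shows only that averaging step, assuming contiguous folds of equal size. `fold_ranges` and `average_folds` are hypothetical helpers, not functions from eval_single.py, and the per-fold recalls are made up.

```python
def fold_ranges(n_images, n_folds=5):
    """Split range(n_images) into n_folds contiguous, equally sized (start, stop) pairs."""
    size = n_images // n_folds
    return [(f * size, (f + 1) * size) for f in range(n_folds)]


def average_folds(per_fold_metrics):
    """Average a list of metric tuples, e.g. (R@1, R@5, R@10), column by column."""
    n = len(per_fold_metrics)
    return tuple(sum(column) / n for column in zip(*per_fold_metrics))


# 5K test images evaluated as five 1K folds.
ranges = fold_ranges(5000)
# Made-up recalls for two folds, only to show the arithmetic.
mean_recall = average_folds([(70.0, 90.0, 95.0), (80.0, 94.0, 99.0)])
```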
2. The evaluation for pre-trained SGRAF+VSL model.
Modify the sgr_model_path, saf_model_path, data_path, vocab_path in the eval_overall.py file. Then run eval_overall.py:
For example:
evalrank(sgr_model_path="./checkpoint/SGR+VSL_COCO.pth.tar", saf_model_path="./checkpoint/SAF+VSL_COCO.pth.tar", data_path='./data', split="testall", fold5=True)
(For SGRAF+VSL) python eval_overall.py
Note that fold5=True is only for evaluation on MSCOCO1K (average over 5 folds), while fold5=False is for MSCOCO5K and Flickr30K. Pretrained models and log files can be downloaded from:
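SGRAF+VSL combines the two single models at evaluation time, which is why eval_overall.py takes both checkpoints. A minimal sketch of one plausible fusion rule is given below, assuming the ensemble averages the SGR and SAF similarity matrices element-wise before ranking. That rule is an assumption and has not been checked against eval_overall.py. The helper name and the toy scores are illustrative.

```python
def average_similarities(sims_sgr, sims_saf):
    """Element-wise mean of two equally shaped similarity matrices (lists of rows)."""
    if len(sims_sgr) != len(sims_saf):
        raise ValueError("matrices must have the same number of rows")
    return [
        [(a + b) / 2.0 for a, b in zip(row_sgr, row_saf)]
        for row_sgr, row_saf in zip(sims_sgr, sims_saf)
    ]


# Toy 2x2 scores standing in for the outputs of the two models.
sgr = [[0.5, 0.25], [0.0, 1.0]]
saf = [[1.0, 0.75], [0.5, 0.0]]
fused = average_similarities(sgr, saf)
```

The fused matrix would then be ranked exactly like a single model's similarity matrix.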
Training new models
Modify the data_path, vocab_path, model_name, logger_name in the opts.py file. Then run train.py:
For MSCOCO:
(For SGR+VSL) python train.py --data_name coco_precomp --batch_size 128 --num_epochs 25 --lr_update 10 --learning_rate 0.0003 --module_name SGR
(For SAF+VSL) python train.py --data_name coco_precomp --batch_size 128 --num_epochs 25 --lr_update 10 --learning_rate 0.0003 --module_name SAF
For Flickr30K:
(For SGR+VSL) python train.py --data_name f30k_precomp --batch_size 128 --num_epochs 40 --lr_update 25 --learning_rate 0.0003 --module_name SGR
(For SAF+VSL) python train.py --data_name f30k_precomp --batch_size 128 --num_epochs 30 --lr_update 15 --learning_rate 0.0003 --module_name SAF
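The --lr_update flag works together with --learning_rate to define a step schedule. The sketch below assumes the SCAN-style rule in which the learning rate is multiplied by 0.1 once every lr_update epochs. The decay factor is an assumption and was not read from train.py, and `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(epoch, base_lr=0.0003, lr_update=10, decay=0.1):
    """Step schedule: multiply base_lr by `decay` once every `lr_update` epochs."""
    return base_lr * (decay ** (epoch // lr_update))


# MSCOCO settings from the commands above: 25 epochs, lr_update 10.
schedule = [learning_rate_at(epoch) for epoch in range(25)]
```

Under this assumption, the MSCOCO runs would drop the rate at epochs 10 and 20, and the Flickr30K SGR run (--num_epochs 40 --lr_update 25) would drop it once, at epoch 25.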
Ablation
1. Ablation study for Data Diversity
Change --batch_size to 32, 64, and 128. The results on MSCOCO1K are shown below.

2. Ablation study for Semantic similarity within the visual and textual modality.
Modify lines 501-530 of model.py. The results on MSCOCO5K are shown below.

Reference
If Visual Semantic Loss (VSL) is useful to you, please cite the following paper.
The paper has been published at ICME 2023, so please cite the official version. :)
@inproceedings{10219570,
author={Zhang, Xu and Niu, Xinzheng and Fournier-Viger, Philippe and Dai, Xudong},
booktitle={2023 IEEE International Conference on Multimedia and Expo (ICME)},
title={Image-text Retrieval via Preserving Main Semantics of Vision},
year={2023},
pages={1967-1972},
doi={10.1109/ICME55011.2023.00337}
}