Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
September 13, 2023
Introduction
This is the source code of CHAN, accepted at CVPR 2023. It is built on top of VSEinf in PyTorch.
Requirements and Installation
We recommend the following dependencies.
- Python 3.7
- PyTorch 1.11.0
- Transformers (4.18.0)
- The specific required environment can be found here
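
To compare an existing environment against the versions listed above, a small check along the following lines may help. This is only a sketch: the `RECOMMENDED` table and the `report_versions` helper are hypothetical names that are not part of this repository, and it assumes the packages are installed under the distribution names `torch` and `transformers`.

```python
import sys

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Python 3.7 does not ship importlib.metadata; this needs the
    # importlib_metadata backport to be installed.
    from importlib_metadata import PackageNotFoundError, version

# Versions recommended in this README (hypothetical helper table).
RECOMMENDED = {"torch": "1.11.0", "transformers": "4.18.0"}


def report_versions(recommended=RECOMMENDED):
    """Return {package: (installed_or_None, recommended)} without importing the packages."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])), "(recommended: 3.7)")
    for name, (installed, wanted) in report_versions().items():
        print(f"{name}: installed={installed}, recommended={wanted}")
```

The helper only reports what is installed; it does not decide whether a different version is compatible.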
Results
Results on COCO 5K Test Split
| Model | Visual Backbone | Text Backbone | Image-to-Text R1 | Image-to-Text R5 | Text-to-Image R1 | Text-to-Image R5 |
|---|---|---|---|---|---|---|
| CHAN | BUTD region | GRU-base | 60.2 | 85.9 | 41.7 | 71.5 |
| CHAN | BUTD region | BERT-base | 59.8 | 87.2 | 44.9 | 74.5 |
Results on Flickr30K Test Split
| Model | Visual Backbone | Text Backbone | Image-to-Text R1 | Image-to-Text R5 | Text-to-Image R1 | Text-to-Image R5 |
|---|---|---|---|---|---|---|
| CHAN | BUTD region | GRU-base | 79.7 | 94.5 | 60.2 | 85.3 |
| CHAN | BUTD region | BERT-base | 80.6 | 96.1 | 63.9 | 87.5 |
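
R1 and R5 in the tables above denote Recall@1 and Recall@5, i.e. the percentage of queries whose correct match appears among the top 1 or top 5 retrieved items. The sketch below illustrates the metric in plain Python under a simplifying assumption: each query has exactly one correct candidate. It is not the evaluation code of this repository, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(similarities, ground_truth, k):
    """Percentage of queries whose ground-truth item is among the top-k candidates.

    similarities[i][j] is the score between query i and candidate j;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarities, ground_truth):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarities)
```

In the actual benchmarks each image is paired with several captions, so a real evaluation has to handle multiple correct candidates per query; the sketch ignores that detail.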
Checkpoints
We release our checkpoints on Google Drive and Baidu Yun.
Data
We organize all data used in the experiments in the same manner as VSEinf:
data
├── coco
│ ├── precomp # pre-computed BUTD region features for COCO, provided by SCAN
│ │ ├── train_ids.txt
│ │ ├── train_caps.txt
│ │ ├── ......
│ │
│ ├── images # raw coco images
│ │ ├── train2014
│ │ └── val2014
│ │
│ └── id_mapping.json # mapping from coco-id to image's file name
│
│
├── f30k
│ ├── precomp # pre-computed BUTD region features for Flickr30K, provided by SCAN
│ │ ├── train_ids.txt
│ │ ├── train_caps.txt
│ │ ├── ......
│ │
│ ├── flickr30k-images # raw flickr30k images
│ │ ├── xxx.jpg
│ │ └── ...
│ └── id_mapping.json # mapping from f30k index to image's file name
│
│
└── vocab # vocab files provided by SCAN (only used when the text backbone is BiGRU)
The download links for the original COCO/F30K images, the precomputed BUTD features, and the corresponding vocabularies are available from the official repo of SCAN. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images.
(Update: the download link for the pre-computed features in SCAN's repo appears to be down; this Dropbox link provides a copy of these files. Please download them and organize the data according to the file structure above.)
The id_mapping.json files map each image index (i.e., the COCO id for COCO images) to the corresponding filename. We generated these mappings to eliminate the need for the pycocotools package.
Please download all necessary data files and organize them as described above. The path to the data directory is passed as an argument to the training script (see Training).
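
As an illustration of how such a mapping can be used, the sketch below loads an id_mapping.json file and resolves an image id to a path on disk. It assumes the file is a flat JSON object of the form `{image id: file name}` with file names relative to the image directory; that layout is an assumption, and `load_id_mapping` and `resolve_image_path` are hypothetical helpers rather than functions from this repository.

```python
import json
import os


def load_id_mapping(path):
    """Load an id_mapping.json file into a dict (assumed: {image id: file name})."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def resolve_image_path(mapping, image_id, image_root):
    """Join the image root with the file name stored for image_id.

    JSON object keys are always strings, so the id is converted before lookup.
    """
    return os.path.join(image_root, mapping[str(image_id)])
```

For COCO, one would call `load_id_mapping("data/coco/id_mapping.json")` and pass `"data/coco/images"` as `image_root`, assuming the directory layout shown above.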
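
Before launching training, it can be useful to confirm that the data directory matches the tree above. The following sketch lists expected entries that are missing; `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the list covers only the top-level entries named in the tree, not the individual files inside the precomp folders.

```python
import os

# Paths relative to the data directory, taken from the tree above.
EXPECTED_PATHS = [
    "coco/precomp",
    "coco/images/train2014",
    "coco/images/val2014",
    "coco/id_mapping.json",
    "f30k/precomp",
    "f30k/flickr30k-images",
    "f30k/id_mapping.json",
    "vocab",
]


def missing_paths(data_root, expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under data_root."""
    return [
        p for p in expected
        if not os.path.exists(os.path.join(data_root, *p.split("/")))
    ]
```

Calling `missing_paths("data")` returns an empty list when every listed entry is present. If only one dataset is used, the entries for the other one can be dropped from the list.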
Training
sh scripts/train.sh
Evaluation
sh scripts/eval.sh
Reference
If you found this code useful, please cite the following paper:
@inproceedings{pan2023chan,
title={Fine-grained Image-text Matching by Cross-modal Hard Aligning Network},
author={Pan, Zhengxin and Wu, Fangyu and Zhang, Bailing},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2023}
}