Bidirectional Retrieval Made Simple

June 21, 2018

Code for our CVPR'18 paper Bidirectional Retrieval Made Simple. Since the original code from our work cannot be publicly shared, we adapted the code from VSE++ to provide a public version.

Overview:

  1. Summary
  2. Results
  3. Getting started
  4. Train new models
  5. Evaluate models
  6. Citation
  7. License

Summary

Code for training and evaluating our novel CHAIN-VSE models for efficient multimodal retrieval (image annotation and caption retrieval). In short, CHAIN-VSE applies convolutional layers directly to character-level inputs, fully replacing RNNs and word embeddings. Despite being lighter and conceptually much simpler, these models achieve state-of-the-art results on MS COCO and on several text classification datasets.
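The character-level pipeline above can be illustrated with a minimal NumPy sketch: one-hot characters, a single 1-D convolution, ReLU, and global max pooling into a fixed-size text vector. The alphabet, filter width, and sizes here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Assumed character set for this sketch (unknown characters map to all-zeros).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def one_hot(text):
    """Map a string to a (seq_len, |alphabet|) one-hot matrix."""
    x = np.zeros((len(text), len(ALPHABET)))
    for i, ch in enumerate(text.lower()):
        j = ALPHABET.find(ch)
        if j >= 0:
            x[i, j] = 1.0
    return x

def char_conv_encode(text, filters, width=3):
    """Slide `filters` (n_filters, width * |alphabet|) over the character
    sequence, apply ReLU, and global-max-pool into one fixed-size vector."""
    x = one_hot(text)
    n_windows = len(text) - width + 1
    windows = np.stack([x[i:i + width].ravel() for i in range(n_windows)])
    feats = np.maximum(windows @ filters.T, 0.0)   # convolution + ReLU
    return feats.max(axis=0)                       # global max pooling

rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 3 * len(ALPHABET)))
vec = char_conv_encode("a man riding a horse", filters)
print(vec.shape)   # (8,)
```

Because the encoder never looks up a word table, its parameter count is independent of vocabulary size, which is what makes the approach cheap in multi-language settings.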


Highlights

  • Independent from word-embeddings and RNNs
  • Naturally suited to multi-language scenarios, with no increase in memory requirements from a larger vocabulary
  • Much more robust to input noise
  • Fewer parameters
  • Simple, yet effective

Bidirectional Retrieval Results

Results achieved with this repository on the COCO 1k test set using pre-computed features (note that the network is not fine-tuned in this experiment):

Method                        Features            Image annotation    Caption retrieval
                                                  R@1      R@10       R@1      R@10
RFF-net [baseline, ICCV'17]   ResNet152           56.40    91.50      43.90    88.60
chain-v1 (p=1, d=1024)        resnet152_precomp   57.80    95.60      44.18    90.66
chain-v1 (p=1, d=2048)        resnet152_precomp   59.90    94.80      45.08    90.54
chain-v1 (p=1, d=8192)        resnet152_precomp   61.20    95.80      46.60    90.92

Getting Started

To get started, you will need to set up your environment and download the required data.

Dependencies

We recommend using Anaconda to manage the required packages. You will also need the NLTK punkt tokenizer:

import nltk
nltk.download('punkt')

Download data

Pre-computed features:

wget http://lsa.pucrs.br/jonatas/seam-data/irv2_precomp.tar.gz
wget http://lsa.pucrs.br/jonatas/seam-data/resnet152_precomp.tar.gz
wget http://lsa.pucrs.br/jonatas/seam-data/vocab.tar.gz
  • The directory containing the extracted *_precomp data is referred to as $DATA_PATH
  • Extract vocab.tar.gz to the ./vocab directory (required for the baselines only).
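As a concrete sketch of the expected layout, the commands below extract an archive into $DATA_PATH. A dummy archive with a placeholder file stands in for a real *_precomp.tar.gz download so the sketch runs end to end; with the real archives, skip the first three lines.

```shell
# Create a dummy archive standing in for a downloaded *_precomp.tar.gz.
mkdir -p resnet152_precomp
touch resnet152_precomp/placeholder.npy    # stand-in for real feature files
tar -czf resnet152_precomp.tar.gz resnet152_precomp

# Extract into $DATA_PATH, the directory later passed to train.py.
export DATA_PATH="$PWD/data"
mkdir -p "$DATA_PATH"
tar -xzf resnet152_precomp.tar.gz -C "$DATA_PATH"
ls "$DATA_PATH"/resnet152_precomp
```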

Training new models

To train CHAIN-VSE (p=1, d=2048) using resnet152_precomp features, run train.py:

python train.py \
--data_path "$DATA_PATH" \
--data_name resnet152_precomp \
--logger_name runs/chain-v1/resnet152_precomp/  \
--text_encoder chain-v1 \
--embed_size 2048 \
--vocab_path char

Evaluate pre-trained models

from vocab import Vocabulary
import evaluation
evaluation.evalrank("$RUN_PATH/model_best.pth.tar", data_path="$DATA_PATH", split="test")

To evaluate on the COCO 1k test set with 5-fold cross-validation, pass fold5=True to evalrank, using a model trained with --data_name coco.
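The fold5 protocol splits the 5k test similarity matrix into five disjoint 1k folds, scores each, and averages. The sketch below illustrates that averaging with a toy recall@k on smaller folds; for simplicity it assumes one matching caption per image (the real COCO protocol has five captions per image), so it is not the repository's exact metric.

```python
import numpy as np

def recall_at_k(sims, k):
    """Fraction of queries whose ground truth (the diagonal entry)
    ranks within the top-k most similar items."""
    order = np.argsort(-sims, axis=1)
    ranks = (order == np.arange(len(sims))[:, None]).argmax(axis=1)
    return float((ranks < k).mean())

def fold5_recall_at_k(sims, k=10, fold_size=100):
    """Average recall@k over five disjoint folds of the test set."""
    scores = [recall_at_k(sims[i:i + fold_size, i:i + fold_size], k)
              for i in range(0, 5 * fold_size, fold_size)]
    return float(np.mean(scores))

# Toy similarity matrix where the correct match clearly wins.
rng = np.random.default_rng(0)
sims = rng.standard_normal((500, 500))
sims[np.arange(500), np.arange(500)] += 10.0
score = fold5_recall_at_k(sims, k=10, fold_size=100)
print(score)
```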

Citation

If you find this code or paper useful, please cite the following papers:

@InProceedings{wehrmann2018cvpr,
  author = {Wehrmann, Jônatas and Barros, Rodrigo C.},
  title = {Bidirectional Retrieval Made Simple},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2018}
}

@article{faghri2017vse++,
  title={VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
  author={Faghri, Fartash and Fleet, David J and Kiros, Jamie Ryan and Fidler, Sanja},
  journal={arXiv preprint arXiv:1707.05612},
  year={2017}
}

License

Apache License 2.0