X-modaler

October 13, 2021

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval). This codebase unifies comprehensive high-quality modules in state-of-the-art vision-language techniques, which are organized in a standardized and user-friendly fashion.

The original paper can be found here.

Installation

See installation instructions.

Requirements

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.8 and a torchvision version that matches the PyTorch installation. Install them together from pytorch.org to ensure they match.
  • fvcore
  • pytorch_transformers
  • jsonlines
  • pycocotools
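The requirements above can be checked with a short script before installing X-modaler. The following is a minimal sketch, not part of the codebase: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and it assumes each package is importable under the same name as in the list.

```python
import importlib.util
import re
import sys

# Minimum versions taken from the requirements list.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 8)

# Assumption: each dependency is importable under its listed package name.
REQUIRED_MODULES = ["torchvision", "fvcore", "pytorch_transformers", "jsonlines", "pycocotools"]


def parse_version(version):
    """Return the leading numeric part of a version string, e.g. '1.10.0+cu113' -> (1, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum):
    """Compare versions numerically, so that '1.10' counts as newer than '1.8'."""
    return parse_version(version) >= tuple(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not (sys.platform.startswith("linux") or sys.platform == "darwin"):
        problems.append("unsupported platform: %s" % sys.platform)
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.6 is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("torch is not installed")
    else:
        import torch
        if not meets_minimum(torch.__version__, MIN_TORCH):
            problems.append("PyTorch >= 1.8 is required, found %s" % torch.__version__)
    for name in REQUIRED_MODULES:
        if importlib.util.find_spec(name) is None:
            problems.append("%s is not installed" % name)
    return problems
```

Calling `check_environment()` and printing the returned list shows what is still missing. The sketch does not verify that the torchvision build matches the PyTorch build.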

Getting Started

See Getting Started with X-modaler

Training & Evaluation in Command Line

We provide a script, "train_net.py", that can train all the configs provided in X-modaler. You may want to use it as a reference when writing your own training script.

To train a model (e.g., UpDown) with "train_net.py", first set up the corresponding datasets following the datasets instructions, then run:

# Teacher forcing
python train_net.py --num-gpus 4 \
    --config-file configs/image_caption/updown.yaml

# Reinforcement learning
python train_net.py --num-gpus 4 \
    --config-file configs/image_caption/updown_rl.yaml
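The two commands above can also be launched from Python, for example to run both stages in sequence. This is a minimal sketch: `build_train_command`, `run_stages`, and the `CONFIGS` mapping are hypothetical helpers, and only the `--num-gpus` and `--config-file` flags and the two config paths come from the commands above. It does not handle passing a checkpoint from one stage to the next.

```python
import shlex
import subprocess
import sys

# Config files for the two UpDown training stages shown above.
CONFIGS = {
    "teacher_forcing": "configs/image_caption/updown.yaml",
    "reinforcement_learning": "configs/image_caption/updown_rl.yaml",
}


def build_train_command(config_file, num_gpus=4):
    """Build the argument list for one train_net.py invocation."""
    return [
        sys.executable, "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config_file,
    ]


def run_stages(stages=("teacher_forcing", "reinforcement_learning"), num_gpus=4, dry_run=True):
    """Print each command and, unless dry_run is set, execute the stages in order."""
    commands = [build_train_command(CONFIGS[stage], num_gpus) for stage in stages]
    for command in commands:
        print(" ".join(shlex.quote(part) for part in command))
        if not dry_run:
            # check=True stops the sequence if a stage exits with an error.
            subprocess.run(command, check=True)
    return commands
```

With the default `dry_run=True` the function only prints the commands. Call `run_stages(dry_run=False)` from the repository root to start training.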

Model Zoo and Baselines

A large set of baseline results and trained models are available here.

| Task | Name | Paper | Venue |
|---|---|---|---|
| Image Captioning | Attention | Show, attend and tell: Neural image caption generation with visual attention | ICML 2015 |
| Image Captioning | LSTM-A3 | Boosting image captioning with attributes | ICCV 2017 |
| Image Captioning | Up-Down | Bottom-up and top-down attention for image captioning and visual question answering | CVPR 2018 |
| Image Captioning | GCN-LSTM | Exploring visual relationship for image captioning | ECCV 2018 |
| Image Captioning | Transformer | Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning | ACL 2018 |
| Image Captioning | Meshed-Memory | Meshed-Memory Transformer for Image Captioning | CVPR 2020 |
| Image Captioning | X-LAN | X-Linear Attention Networks for Image Captioning | CVPR 2020 |
| Video Captioning | MP-LSTM | Translating Videos to Natural Language Using Deep Recurrent Neural Networks | NAACL HLT 2015 |
| Video Captioning | TA | Describing Videos by Exploiting Temporal Structure | ICCV 2015 |
| Video Captioning | Transformer | Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning | ACL 2018 |
| Video Captioning | TDConvED | Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning | AAAI 2019 |
| Vision-Language Pretraining | Uniter | UNITER: UNiversal Image-TExt Representation Learning | ECCV 2020 |
| Vision-Language Pretraining | TDEN | Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network | AAAI 2021 |

Image Captioning on MSCOCO (Cross-Entropy Loss)

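| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| LSTM-A3 | GoogleDrive | 75.3 | 59.0 | 45.4 | 35.0 | 26.7 | 55.6 | 107.7 | 19.7 |
| Attention | GoogleDrive | 76.4 | 60.6 | 46.9 | 36.1 | 27.6 | 56.6 | 113.0 | 20.4 |
| Up-Down | GoogleDrive | 76.3 | 60.3 | 46.6 | 36.0 | 27.6 | 56.6 | 113.1 | 20.7 |
| GCN-LSTM | GoogleDrive | 76.8 | 61.1 | 47.6 | 36.9 | 28.2 | 57.2 | 116.3 | 21.2 |
| Transformer | GoogleDrive | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
| Meshed-Memory | GoogleDrive | 76.3 | 60.2 | 46.4 | 35.6 | 28.1 | 56.5 | 116.0 | 21.2 |
| X-LAN | GoogleDrive | 77.5 | 61.9 | 48.3 | 37.5 | 28.6 | 57.6 | 120.7 | 21.9 |
| TDEN | GoogleDrive | 75.5 | 59.4 | 45.7 | 34.9 | 28.7 | 56.7 | 116.3 | 22.0 |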

Image Captioning on MSCOCO (CIDEr Score Optimization)

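| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| LSTM-A3 | GoogleDrive | 77.9 | 61.5 | 46.7 | 35.0 | 27.1 | 56.3 | 117.0 | 20.5 |
| Attention | GoogleDrive | 79.4 | 63.5 | 48.9 | 37.1 | 27.9 | 57.6 | 123.1 | 21.3 |
| Up-Down | GoogleDrive | 80.1 | 64.3 | 49.7 | 37.7 | 28.0 | 58.0 | 124.7 | 21.5 |
| GCN-LSTM | GoogleDrive | 80.2 | 64.7 | 50.3 | 38.5 | 28.5 | 58.4 | 127.2 | 22.1 |
| Transformer | GoogleDrive | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
| Meshed-Memory | GoogleDrive | 80.7 | 65.5 | 51.4 | 39.6 | 29.2 | 58.9 | 131.1 | 22.9 |
| X-LAN | GoogleDrive | 80.4 | 65.2 | 51.0 | 39.2 | 29.4 | 59.0 | 131.0 | 23.2 |
| TDEN | GoogleDrive | 81.3 | 66.3 | 52.0 | 40.1 | 29.6 | 59.8 | 132.6 | 23.4 |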
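To see how much each model gains from CIDEr score optimization, the CIDEr-D columns of the two MSCOCO tables above can be compared directly. The snippet below is an illustrative sketch that copies those two columns by hand; the `CIDER_D` dictionary and `cider_gain` helper are hypothetical names, not part of X-modaler.

```python
# CIDEr-D on MSCOCO, copied from the two tables above:
# (cross-entropy loss, CIDEr score optimization)
CIDER_D = {
    "LSTM-A3": (107.7, 117.0),
    "Attention": (113.0, 123.1),
    "Up-Down": (113.1, 124.7),
    "GCN-LSTM": (116.3, 127.2),
    "Transformer": (116.6, 130.0),
    "Meshed-Memory": (116.0, 131.1),
    "X-LAN": (120.7, 131.0),
    "TDEN": (116.3, 132.6),
}


def cider_gain(name):
    """Absolute CIDEr-D improvement after CIDEr score optimization, rounded to one decimal."""
    cross_entropy, optimized = CIDER_D[name]
    return round(optimized - cross_entropy, 1)


gains = {name: cider_gain(name) for name in CIDER_D}
best = max(gains, key=gains.get)

# Print models from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print("%-14s +%.1f" % (name, gain))
```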

Video Captioning on MSVD

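| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| MP-LSTM | GoogleDrive | 77.0 | 65.6 | 56.9 | 48.1 | 32.4 | 68.1 | 73.1 | 4.8 |
| TA | GoogleDrive | 80.4 | 68.9 | 60.1 | 51.0 | 33.5 | 70.0 | 77.2 | 4.9 |
| Transformer | GoogleDrive | 79.0 | 67.6 | 58.5 | 49.4 | 33.3 | 68.7 | 80.3 | 4.9 |
| TDConvED | GoogleDrive | 81.6 | 70.4 | 61.3 | 51.7 | 34.1 | 70.4 | 77.8 | 5.0 |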

Video Captioning on MSR-VTT

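| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| MP-LSTM | GoogleDrive | 73.6 | 60.8 | 49.0 | 38.6 | 26.0 | 58.3 | 41.1 | 5.6 |
| TA | GoogleDrive | 74.3 | 61.8 | 50.3 | 39.9 | 26.4 | 59.4 | 42.9 | 5.8 |
| Transformer | GoogleDrive | 75.4 | 62.3 | 50.0 | 39.2 | 26.5 | 58.7 | 44.0 | 5.9 |
| TDConvED | GoogleDrive | 76.4 | 62.3 | 49.9 | 38.9 | 26.3 | 59.0 | 40.7 | 5.7 |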

Visual Question Answering

| Name | Model | Overall | Yes/No | Number | Other |
|---|---|---|---|---|---|
| Uniter | GoogleDrive | 70.1 | 86.8 | 53.7 | 59.6 |
| TDEN | GoogleDrive | 71.9 | 88.3 | 54.3 | 62.0 |

Caption-based image retrieval on Flickr30k

| Name | Model | R1 | R5 | R10 |
|---|---|---|---|---|
| Uniter | GoogleDrive | 61.6 | 87.7 | 92.8 |
| TDEN | GoogleDrive | 62.0 | 86.6 | 92.4 |

Visual commonsense reasoning

| Name | Model | Q -> A | QA -> R | Q -> AR |
|---|---|---|---|---|
| Uniter | GoogleDrive | 73.0 | 75.3 | 55.4 |
| TDEN | GoogleDrive | 75.0 | 76.5 | 57.7 |

License

X-modaler is released under the Apache License, Version 2.0.

Citing X-modaler

If you use X-modaler in your research, please use the following BibTeX entry.

@inproceedings{Xmodaler2021,
  author =       {Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei},
  title =        {X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics},
  booktitle =    {Proceedings of the 29th ACM International Conference on Multimedia},
  year =         {2021}
}