Models

February 7, 2022

Pretrained Models

We distribute models pretrained on Conceptual Captions. These include ViLBERT, LXMERT and VL-BERT pretrained as originally presented in their papers, as well as ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER pretrained in our controlled setup (the rows marked CTRL below). For the controlled setup, we distribute the weights that led to higher average downstream performance when fine-tuned once.

Model              VQAv2  RefCOCO+  NLVR2  Flickr30k IR  Flickr30k TR
ViLBERT            66.68  70.49     74.26  58.90         75.50
LXMERT             67.98  -         71.58  -             -
VL-BERT            67.44  71.00     -      -             -
ViLBERT (CTRL)     68.97  70.53     72.24  60.34         78.80
LXMERT (CTRL)      67.52  70.49     71.09  58.62         74.90
VL-BERT (CTRL)     68.23  71.23     73.22  57.62         70.90
VisualBERT (CTRL)  69.03  70.02     72.70  61.48         75.20
UNITER (CTRL)      68.67  71.45     73.73  60.54         76.40

Checkpoints by Random Seed

All the models pretrained with 10 random seeds in our controlled setup can be downloaded from here.

Conversions of Original Models into VOLTA

Model              Source
LXMERT (Original)  airsplay/lxmert

Multilingual Models

Model    XVNLI  xGQA   MaRVL  xFlickr&CO IR  xFlickr&CO TR  WIT IR  WIT TR
mUNITER  53.69  9.97   53.72  8.06           8.86           9.16    10.48
xUNITER  58.48  21.72  54.59  14.04          13.51          8.72    9.81
UC2      62.05  29.35  57.28  20.31          17.89          7.83    9.09
M3P      58.25  28.17  56.00  12.91          11.90          8.12    9.98

Model Definition

Models are defined in configuration files (see config/ for some examples). Rather than specifying whole Transformer layers, we specify the attention and feed-forward sub-layers for each modality, which makes it quick to extend previously proposed architectures. In particular, the following sub-layers are defined:

  • tt_attn_sublayers: text-text attention sub-layers
  • tv_attn_sublayers: text-vision attention sub-layers (text used as query, vision as context)
  • vt_attn_sublayers: vision-text attention sub-layers (vision used as query, text as context)
  • vv_attn_sublayers: vision-vision attention sub-layers
  • t_ff_sublayers: feed-forward sub-layers for the text modality
  • v_ff_sublayers: feed-forward sub-layers for the vision modality
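
As a rough illustration of how these six lists fit together, the sketch below builds a hypothetical configuration for a small two-stream model and checks it for consistency. The key names come from the list above. The index values, the helper functions `check_sublayers` and `describe`, and the rule that an index is either an attention or a feed-forward sub-layer (never both) are assumptions made for this example, not taken from the files in config/.

```python
# Hypothetical configuration: the key names come from the list above,
# but every index value is invented for illustration.
config = {
    "tt_attn_sublayers": [0, 4],   # text attends to text
    "tv_attn_sublayers": [2],      # text is the query, vision the context
    "vt_attn_sublayers": [2],      # vision is the query, text the context
    "vv_attn_sublayers": [0, 4],   # vision attends to vision
    "t_ff_sublayers": [1, 3, 5],   # feed-forward, text stream
    "v_ff_sublayers": [1, 3, 5],   # feed-forward, vision stream
}

ATTN_KEYS = ("tt_attn_sublayers", "tv_attn_sublayers",
             "vt_attn_sublayers", "vv_attn_sublayers")
FF_KEYS = ("t_ff_sublayers", "v_ff_sublayers")


def check_sublayers(cfg):
    """Return all sub-layer indices in ascending order.

    Raises ValueError if an index is listed as both attention and
    feed-forward (an assumed constraint, see the text above).
    """
    attn = set().union(*(cfg[key] for key in ATTN_KEYS))
    ff = set().union(*(cfg[key] for key in FF_KEYS))
    clash = attn & ff
    if clash:
        raise ValueError(
            f"indices used as both attention and feed-forward: {sorted(clash)}"
        )
    return sorted(attn | ff)


def describe(cfg, index):
    """Name the sub-layer lists that contain the given index."""
    return [key for key in ATTN_KEYS + FF_KEYS if index in cfg[key]]


print(check_sublayers(config))
print(describe(config, 2))
```

In this invented layout, index 2 acts as a cross-modal step (only `tv_attn_sublayers` and `vt_attn_sublayers` contain it), while indices 0 and 4 are within-modality attention. Listing the same index in all four attention lists would instead describe attention over both modalities at that step.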

In addition, the following parameters control parameter sharing across modalities:

  • shared_sublayers: sub-layers that share parameters between modalities
  • single_ln_sublayers: sub-layers in which text and vision tensors are concatenated and fed into a single layer normalization (LN) layer
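
To make the single-LN idea concrete, here is a minimal pure-Python sketch of one possible reading: text and vision token vectors are concatenated along the sequence axis, normalized by one layer normalization with a single set of parameters, and split back into the two streams. The function names, the choice of concatenation axis, and the requirement that both modalities share the same hidden size are assumptions for illustration and may differ from the actual VOLTA code.

```python
import math


def layer_norm(vec, gain, bias, eps=1e-12):
    """Normalize one token vector over its hidden dimension."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * scale + b for x, g, b in zip(vec, gain, bias)]


def single_ln(text_tokens, vision_tokens, gain, bias):
    """Apply one shared LN to the concatenation of both token sequences.

    Assumption: concatenation happens along the sequence axis, so both
    modalities must have the same hidden size.
    """
    joint = text_tokens + vision_tokens            # concatenate sequences
    normed = [layer_norm(tok, gain, bias) for tok in joint]
    n_text = len(text_tokens)
    return normed[:n_text], normed[n_text:]        # split back per modality


text = [[1.0, 3.0]]                  # one text token, hidden size 2
vision = [[2.0, 6.0], [0.0, 4.0]]    # two vision tokens, hidden size 2
t_out, v_out = single_ln(text, vision, gain=[1.0, 1.0], bias=[0.0, 0.0])
print(len(t_out), len(v_out))
```

Under this reading, the practical effect is that text and vision use the same gain and bias, instead of each modality having its own normalization parameters.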

Finally, bert_layer2attn_sublayer and bert_layer2ff_sublayer map the layers of a text-only BERT model to VOLTA sub-layers, so that BERT weights can be loaded into them.
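
The sketch below shows one way such a mapping could be applied when deciding where a BERT parameter should go. The parameter names follow the Hugging Face BERT naming scheme (`bert.encoder.layer.<i>.attention...`, `...intermediate...`, `...output...`). The shape of the two mappings (BERT layer index to sub-layer index), their values, and the helper `target_sublayer` are assumptions for illustration, not the loader used in VOLTA.

```python
# Assumed mapping shape: BERT layer index -> VOLTA sub-layer index.
# The values are invented for a three-layer example.
bert_layer2attn_sublayer = {0: 0, 1: 2, 2: 4}
bert_layer2ff_sublayer = {0: 1, 1: 3, 2: 5}


def target_sublayer(bert_key, attn_map, ff_map):
    """Return (sub-layer index, kind) for a BERT encoder parameter name.

    Returns None for parameters outside the encoder layers, such as the
    embeddings.
    """
    parts = bert_key.split(".")
    if "layer" not in parts:
        return None
    pos = parts.index("layer")
    layer = int(parts[pos + 1])
    block = parts[pos + 2]
    if block == "attention":
        # Self-attention and its output projection
        return attn_map[layer], "attn"
    # "intermediate" and "output" form the feed-forward part of a BERT layer
    return ff_map[layer], "ff"


for name in (
    "bert.encoder.layer.1.attention.self.query.weight",
    "bert.encoder.layer.2.output.dense.bias",
    "bert.embeddings.word_embeddings.weight",
):
    print(name, "->", target_sublayer(name, bert_layer2attn_sublayer,
                                      bert_layer2ff_sublayer))
```

Splitting each BERT layer into an attention part and a feed-forward part is what lets a single text-only layer initialize two separately indexed sub-layers.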

The following figure shows how these sub-layers are used to construct ViLBERT: