Models

January 31, 2022

Pretrained Models

We distribute models pretrained on Conceptual Captions. These include ViLBERT, LXMERT and VL-BERT pretrained as originally presented in their papers, as well as the weights for ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER pretrained in our controlled setup (marked CTRL in the table below). For the latter, we distribute the weights that lead to the higher average downstream performance when fine-tuned once.

Model               VQAv2   RefCOCO+   NLVR2   Flickr30k IR   Flickr30k TR
ViLBERT             66.68   70.49      74.26   58.90          75.50
LXMERT              67.98   -          71.58   -              -
VL-BERT             67.44   71.00      -       -              -
ViLBERT (CTRL)      68.97   70.53      72.24   60.34          78.80
LXMERT (CTRL)       67.52   70.49      71.09   58.62          74.90
VL-BERT (CTRL)      68.23   71.23      73.22   57.62          70.90
VisualBERT (CTRL)   69.03   70.02      72.70   61.48          75.20
UNITER (CTRL)       68.67   71.45      73.73   60.54          76.40

Model Definition

Models are defined in configuration files (see config/ for some examples). Rather than specifying whole Transformer layers, we specify attention and feed-forward sub-layers for each modality, which makes it quick to extend the proposed architectures. In particular, the following sub-layers are defined:

  • tt_attn_sublayers: text-text attention sub-layers
  • tv_attn_sublayers: text-vision attention sub-layers (text used as query, vision as context)
  • vt_attn_sublayers: vision-text attention sub-layers (vision used as query, text as context)
  • vv_attn_sublayers: vision-vision attention sub-layers
  • t_ff_sublayers: feed-forward sub-layers for the text modality
  • v_ff_sublayers: feed-forward sub-layers for the vision modality
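As a rough illustration of how these six keys fit together, the sketch below groups the configured operations by sub-layer index. It assumes that each key holds a list of integer sub-layer indices. The toy_config values, the LABELS table and the sublayer_plan helper are hypothetical and are not taken from any file in config/ or from the library itself.

```python
from collections import defaultdict

# Hypothetical toy configuration. The indices are illustrative only and are
# not copied from any file in config/.
toy_config = {
    "tt_attn_sublayers": [0, 2],
    "vv_attn_sublayers": [0, 2],
    "tv_attn_sublayers": [4],
    "vt_attn_sublayers": [4],
    "t_ff_sublayers": [1, 3, 5],
    "v_ff_sublayers": [1, 3, 5],
}

# Human-readable label for each configuration key.
LABELS = {
    "tt_attn_sublayers": "text-text attention",
    "tv_attn_sublayers": "text-vision attention (query=text)",
    "vt_attn_sublayers": "vision-text attention (query=vision)",
    "vv_attn_sublayers": "vision-vision attention",
    "t_ff_sublayers": "text feed-forward",
    "v_ff_sublayers": "vision feed-forward",
}


def sublayer_plan(config):
    """Group the configured operations by sub-layer index (hypothetical helper)."""
    plan = defaultdict(list)
    for key, label in LABELS.items():
        for index in config.get(key, []):
            plan[index].append(label)
    return dict(sorted(plan.items()))


for index, ops in sublayer_plan(toy_config).items():
    print(index, ops)
```

In this toy layout, sub-layers 0 and 2 run within-modality attention, sub-layer 4 runs cross-modal attention in both directions, and the odd sub-layers hold the feed-forward blocks. Changing the architecture then amounts to editing the index lists.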

In addition, the following parameters control parameter sharing across modalities:

  • shared_sublayers: sub-layers that share parameters between modalities
  • single_ln_sublayers: sub-layers in which text and vision tensors are concatenated and fed into a single layer normalization (LN) layer
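To make the single-LN idea concrete, here is a minimal pure-Python sketch. It assumes that LN normalizes each token over its hidden dimension, that both modalities share the same hidden size, and that the tensors are concatenated along the sequence axis. The layer_norm and single_ln functions are hypothetical names written for this example and are not the library's implementation.

```python
import math


def layer_norm(token, gain, bias, eps=1e-12):
    """Normalize one token vector over its hidden dimension."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    scale = math.sqrt(var + eps)
    return [g * (x - mean) / scale + b for x, g, b in zip(token, gain, bias)]


def single_ln(text_tokens, vision_tokens, gain, bias):
    """Concatenate both modalities, apply one LN, then split them back.

    Assumption: the same gain and bias are used for every token, so text
    and vision share the LN parameters.
    """
    joint = text_tokens + vision_tokens  # concatenation along the sequence axis
    normed = [layer_norm(token, gain, bias) for token in joint]
    n_text = len(text_tokens)
    return normed[:n_text], normed[n_text:]


# Toy inputs: two text tokens and one vision token, hidden size 3.
text = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
vision = [[2.0, 4.0, 6.0]]
gain, bias = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

text_out, vision_out = single_ln(text, vision, gain, bias)
print(text_out)
print(vision_out)
```

Under these assumptions, the only difference from keeping one LN per modality is that a single set of gain and bias parameters is applied to both the text and the vision tokens.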

Finally, bert_layer2attn_sublayer and bert_layer2ff_sublayer specify how text-only BERT layers are loaded into VOLTA sub-layers.

The following figure shows how these sub-layers are used to construct ViLBERT: