Models

February 7, 2022 · View on GitHub

Pretrained Models

We distribute models pretrained on Conceptual Captions. We share ViLBERT, LXMERT and VL-BERT pretrained as originally presented in their papers, as well as the weights for ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER pretrained in our controlled setup. For the latter, we distribute the weights that lead to higher average downstream performance when fine-tuned once.

Model	VQAv2	RefCOCO+	NLVR2	Flickr30k IR	Flickr30k TR
ViLBERT	66.68	70.49	74.26	58.90	75.50
LXMERT	67.98		71.58
VL-BERT	67.44	71.00
ViLBERT (CTRL)	68.97	70.53	72.24	60.34	78.80
LXMERT (CTRL)	67.52	70.49	71.09	58.62	74.90
VL-BERT (CTRL)	68.23	71.23	73.22	57.62	70.90
VisualBERT (CTRL)	69.03	70.02	72.70	61.48	75.20
UNITER (CTRL)	68.67	71.45	73.73	60.54	76.40

Checkpoints by Random Seed

All the models pretrained with 10 random seeds in our controlled setup can be downloaded from here.

Conversions of Original Models into VOLTA

Model	Source
LXMERT (Original)	airsplay/lxmert

Multilingual Models

Model	XVNLI	xGQA	MaRVL	xFlickr&CO IR	xFlickr&CO TR	WIT IR	WIT TR
mUNITER	53.69	9.97	53.72	8.06	8.86	9.16	10.48
xUNITER	58.48	21.72	54.59	14.04	13.51	8.72	9.81
UC2	62.05	29.35	57.28	20.31	17.89	7.83	9.09
M3P	58.25	28.17	56.00	12.91	11.90	8.12	9.98

Models Definition

Models are defined in configuration files (see config/ for some examples). Rather than using Transformer layers, we specify attention and feed-forward sub-layers for each modality, which allows to quickly extend proposed architectures. In particular, the following sub-layers are defined:

tt_attn_sublayers: text-text attention sub-layers
tv_attn_sublayers: text-vision attention sub-layers (text used as query, vision as context)
vt_attn_sublayers: vision-text attention sub-layers (vision used as query, text as context)
vv_attn_sublayers: vision-vision attention sub-layers
t_ff_sublayers: feed-forward sub-layers for the text modality
v_ff_sublayers: feed-forward sub-layers for the vision modality

In addition, the following parameters allow to tune parameter sharing across modalities:

shared_sublayers: sub-layers that share parameters between modalities
single_ln_sublayers: sub-layers in which text and vision tensors are concatenated and fed into a single LN layer

Finally, bert_layer2attn_sublayer and bert_layer2ff_sublayer are used to load text-only BERT layers into VOLTA ones.

The following figure shows how these sub-layers are used to construct ViLBERT: