ablations.md

December 22, 2020

Ablation studies on LSMDC

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts (CE) design. Each ablation is conducted on the LSMDC dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 0.5 (0.5) | 2.4 (0.9) | 4.9 (1.7) | 195.8 (22.1) | 150.63k | config, model, log |
| CE - MW,P,CG | t2v | 9.7 (0.3) | 24.0 (0.9) | 33.3 (0.5) | 29.5 (2.5) | 159.78M | config, model, log |
| CE - P,CG | t2v | 11.2 (0.7) | 26.1 (0.9) | 35.3 (0.4) | 25.7 (2.1) | 159.78M | config, model, log |
| CE - CG | t2v | 10.6 (0.4) | 26.9 (0.5) | 35.3 (0.5) | 27.2 (1.8) | 114.48M | config, model, log |
| CE | t2v | 11.2 (0.4) | 26.9 (1.1) | 34.8 (2.0) | 25.3 (3.1) | 116.86M | config, model, log |
| Concat | v2t | 0.6 (0.4) | 3.1 (0.7) | 5.7 (1.3) | 184.7 (20.7) | 150.63k | config, model, log |
| CE - MW,P,CG | v2t | 11.1 (1.0) | 24.5 (0.9) | 32.5 (0.8) | 32.2 (1.9) | 159.78M | config, model, log |
| CE - P,CG | v2t | 11.5 (0.2) | 25.3 (1.0) | 34.3 (0.3) | 28.0 (3.0) | 159.78M | config, model, log |
| CE - CG | v2t | 10.8 (0.2) | 26.1 (0.6) | 34.0 (0.8) | 28.8 (0.3) | 114.48M | config, model, log |
| CE | v2t | 11.7 (0.5) | 25.8 (1.5) | 34.4 (1.7) | 28.0 (2.6) | 116.86M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
CE - CG: The CE model without Collaborative Gating.
CE: The full CE model.
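
To make the difference between the first two kinds of design concrete, the following is a minimal NumPy sketch, not the repository's implementation. It contrasts a Concat-style score (one cosine similarity between the concatenated experts and a single text embedding) with an MoE-style score (a weighted sum of per-expert cosine similarities). The function names, the feature sizes, and the fixed weights are all hypothetical; in a trained model the weights would be predicted from the text, and using uniform weights here is only an assumed stand-in for removing MoE weights.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale a vector to (almost exactly) unit length."""
    return x / (np.linalg.norm(x) + eps)

def concat_similarity(video_experts, text_embedding):
    """Concat-style score: join all experts, then take one cosine similarity.

    The text embedding must have the same size as the concatenated experts.
    """
    video = np.concatenate([video_experts[name] for name in sorted(video_experts)])
    return float(l2_normalize(video) @ l2_normalize(text_embedding))

def moe_similarity(video_experts, text_experts, weights):
    """MoE-style score: weighted sum of per-expert cosine similarities."""
    score = 0.0
    for name, weight in weights.items():
        cosine = l2_normalize(video_experts[name]) @ l2_normalize(text_experts[name])
        score += weight * float(cosine)
    return score

rng = np.random.default_rng(0)
dims = {"scene": 4, "flow": 6}  # hypothetical per-expert feature sizes
video_experts = {name: rng.normal(size=d) for name, d in dims.items()}
text_experts = {name: rng.normal(size=d) for name, d in dims.items()}
text_embedding = rng.normal(size=sum(dims.values()))
weights = {"scene": 0.3, "flow": 0.7}  # hypothetical mixture weights summing to 1

print(concat_similarity(video_experts, text_embedding))
print(moe_similarity(video_experts, text_experts, weights))
```

Because the weights sum to one and each term is a cosine similarity, the MoE-style score stays in [-1, 1], and it reaches 1 only when every expert agrees with its text counterpart.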

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed; these additional metrics can be found in the linked logs.
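
For readers unfamiliar with the column headers, R@K (recall at rank K, in percent, higher is better) and MdR (median rank, lower is better) are standard retrieval metrics. The sketch below shows one way to compute them. It assumes a square similarity matrix in which query i matches candidate i, which is a simplification (datasets with several captions per video need extra bookkeeping), and it is not the evaluation code used to produce the tables. The function name `retrieval_metrics` is hypothetical.

```python
import numpy as np

def retrieval_metrics(sims, ks=(1, 5, 10)):
    """Compute R@K and MdR from sims[i, j] = similarity of query i to candidate j.

    Assumes the correct candidate for query i sits at index i.
    """
    sims = np.asarray(sims, dtype=float)
    correct = np.diag(sims)[:, None]
    # Rank of the correct candidate: 1 + number of candidates scoring strictly higher.
    ranks = 1 + (sims > correct).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    return metrics

# Toy text-to-video example: rows are text queries, columns are videos.
sims = np.array([
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.8, 0.5, 0.3],  # correct video ranked 2nd
    [0.1, 0.7, 0.6],  # correct video ranked 2nd
])
print(retrieval_metrics(sims, ks=(1, 2)))
```

Under the same assumption, the v2t direction can be scored by passing the transposed matrix `sims.T`.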

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. Since not all experts are available in every video, we pair each expert with scene features, to give an approximation of their usefulness.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 4.2 (0.4) | 12.6 (0.4) | 18.6 (0.6) | 85.5 (0.5) | 12.83M | config, model, log |
| Scene + Inst. | t2v | 8.0 (1.9) | 19.4 (1.5) | 27.1 (1.1) | 45.2 (3.3) | 27.87M | config, model, log |
| Scene + r2p1d | t2v | 8.0 (0.3) | 20.5 (0.6) | 29.1 (1.0) | 39.8 (2.1) | 26.69M | config, model, log |
| Scene + RGB | t2v | 5.9 (0.5) | 16.3 (0.4) | 22.4 (0.3) | 59.3 (2.5) | 27.87M | config, model, log |
| Scene + Flow | t2v | 6.4 (0.9) | 18.2 (0.9) | 26.2 (1.0) | 48.5 (0.9) | 27.09M | config, model, log |
| Scene + Audio | t2v | 6.3 (0.1) | 15.9 (0.6) | 24.4 (0.5) | 50.7 (1.5) | 26.6M | config, model, log |
| Scene + OCR | t2v | 3.8 (0.8) | 11.9 (0.7) | 17.5 (1.3) | 88.8 (9.0) | 33.23M | config, model, log |
| Scene + Speech | t2v | 3.8 (0.4) | 12.3 (0.5) | 18.1 (0.6) | 83.8 (6.5) | 27.46M | config, model, log |
| Scene + Face | t2v | 4.4 (0.4) | 12.6 (0.2) | 19.2 (0.6) | 83.8 (6.0) | 26.4M | config, model, log |
| Scene | v2t | 4.6 (0.4) | 12.6 (1.1) | 18.0 (0.9) | 91.8 (2.0) | 12.83M | config, model, log |
| Scene + Inst. | v2t | 7.6 (0.9) | 19.7 (0.1) | 27.3 (0.9) | 47.5 (1.3) | 27.87M | config, model, log |
| Scene + r2p1d | v2t | 7.9 (0.2) | 20.7 (0.8) | 28.0 (0.7) | 42.7 (4.3) | 26.69M | config, model, log |
| Scene + RGB | v2t | 6.0 (0.2) | 16.4 (0.6) | 22.8 (0.8) | 64.0 (4.8) | 27.87M | config, model, log |
| Scene + Flow | v2t | 6.6 (1.2) | 17.4 (0.4) | 24.8 (0.9) | 50.2 (1.4) | 27.09M | config, model, log |
| Scene + Audio | v2t | 6.3 (0.5) | 17.3 (0.5) | 23.8 (0.8) | 56.3 (3.5) | 26.6M | config, model, log |
| Scene + OCR | v2t | 4.8 (0.6) | 12.4 (0.7) | 17.6 (0.4) | 96.7 (16.4) | 33.23M | config, model, log |
| Scene + Speech | v2t | 3.9 (0.2) | 11.9 (0.1) | 17.8 (0.3) | 84.8 (4.4) | 27.46M | config, model, log |
| Scene + Face | v2t | 4.9 (0.3) | 13.2 (0.4) | 19.1 (1.7) | 93.2 (7.5) | 26.4M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 4.2 (0.4) | 12.6 (0.4) | 18.6 (0.6) | 85.5 (0.5) | 12.83M | config, model, log |
| Prev. + Speech | t2v | 3.8 (0.4) | 12.3 (0.5) | 18.1 (0.6) | 83.8 (6.5) | 27.46M | config, model, log |
| Prev. + Audio | t2v | 6.0 (0.5) | 17.2 (0.6) | 24.4 (0.7) | 50.5 (1.3) | 38.86M | config, model, log |
| Prev. + Flow | t2v | 7.6 (0.7) | 20.1 (0.6) | 28.1 (0.7) | 37.7 (2.8) | 50.75M | config, model, log |
| Prev. + RGB | t2v | 8.9 (0.9) | 21.8 (0.2) | 29.9 (0.7) | 34.5 (1.3) | 63.43M | config, model, log |
| Prev. + Inst. | t2v | 10.4 (1.3) | 25.6 (1.0) | 34.2 (1.0) | 28.3 (1.5) | 76.11M | config, model, log |
| Prev. + R2P1D | t2v | 11.3 (0.4) | 27.4 (0.9) | 36.3 (1.1) | 25.7 (1.2) | 87.61M | config, model, log |
| Prev. + OCR | t2v | 11.5 (0.5) | 26.2 (0.6) | 35.8 (0.5) | 25.7 (1.8) | 105.65M | config, model, log |
| Prev. + Face | t2v | 11.3 (0.3) | 26.7 (1.5) | 35.1 (1.6) | 25.3 (3.1) | 116.86M | config, model, log |
| Scene | v2t | 4.6 (0.4) | 12.6 (1.1) | 18.0 (0.9) | 91.8 (2.0) | 12.83M | config, model, log |
| Prev. + Speech | v2t | 3.9 (0.2) | 11.9 (0.1) | 17.8 (0.3) | 84.8 (4.4) | 27.46M | config, model, log |
| Prev. + Audio | v2t | 7.2 (0.3) | 18.2 (0.3) | 24.7 (1.2) | 57.0 (2.2) | 38.86M | config, model, log |
| Prev. + Flow | v2t | 7.9 (0.6) | 20.5 (0.4) | 28.6 (0.5) | 41.3 (2.5) | 50.75M | config, model, log |
| Prev. + RGB | v2t | 9.3 (0.6) | 21.9 (1.2) | 30.1 (0.8) | 34.8 (1.3) | 63.43M | config, model, log |
| Prev. + Inst. | v2t | 11.1 (0.8) | 25.1 (1.6) | 33.8 (1.0) | 30.0 (1.3) | 76.11M | config, model, log |
| Prev. + R2P1D | v2t | 11.8 (0.2) | 26.6 (0.9) | 35.8 (0.4) | 27.7 (2.1) | 87.61M | config, model, log |
| Prev. + OCR | v2t | 11.2 (0.3) | 26.0 (1.3) | 33.8 (0.9) | 27.4 (1.2) | 105.65M | config, model, log |
| Prev. + Face | v2t | 11.6 (0.5) | 25.8 (1.4) | 34.4 (1.7) | 27.7 (2.5) | 116.86M | config, model, log |
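
One way to read the cumulative table is to look at the change in a metric each time an expert is added. The short sketch below does this for the LSMDC t2v R@1 values listed above. It is only an illustrative reading aid: each gain depends on the order in which experts were added, and several gains are no larger than the bracketed figures reported next to the values.

```python
# LSMDC t2v R@1 values copied from the cumulative table, in the order experts are added.
rows = [
    ("Scene", 4.2),
    ("Speech", 3.8),
    ("Audio", 6.0),
    ("Flow", 7.6),
    ("RGB", 8.9),
    ("Inst.", 10.4),
    ("R2P1D", 11.3),
    ("OCR", 11.5),
    ("Face", 11.3),
]

# Change in R@1 relative to the previous row, rounded to the table's precision.
gains = [
    (name, round(r1 - prev_r1, 1))
    for (_, prev_r1), (name, r1) in zip(rows, rows[1:])
]

for name, gain in gains:
    print(f"+ {name:<7} {gain:+.1f}")
```

On these numbers the largest single step is the addition of Audio (+2.2), while Speech and Face slightly reduce R@1 (-0.4 and -0.2).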

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 11.1 (0.6) | 26.6 (0.6) | 35.8 (0.1) | 26.2 (2.3) | 55.27M | config, model, log |
| 512 | t2v | 11.2 (0.4) | 26.2 (0.8) | 35.8 (0.6) | 26.3 (2.1) | 75.08M | config, model, log |
| 640 | t2v | 11.7 (0.6) | 26.9 (1.4) | 36.0 (1.3) | 25.3 (0.6) | 95.61M | config, model, log |
| 768 | t2v | 11.2 (0.4) | 26.9 (1.1) | 34.8 (2.0) | 25.3 (3.1) | 116.86M | config, model, log |
| 1024 | t2v | 11.3 (0.3) | 27.2 (0.9) | 36.1 (1.3) | 24.7 (1.5) | 161.52M | config, model, log |
| 384 | v2t | 11.3 (0.7) | 26.0 (1.1) | 34.5 (1.4) | 28.8 (1.6) | 55.27M | config, model, log |
| 512 | v2t | 11.0 (0.5) | 26.2 (0.2) | 35.0 (1.5) | 29.7 (1.2) | 75.08M | config, model, log |
| 640 | v2t | 12.0 (0.6) | 26.1 (1.5) | 34.1 (1.1) | 28.2 (1.0) | 95.61M | config, model, log |
| 768 | v2t | 11.7 (0.5) | 25.8 (1.5) | 34.4 (1.7) | 28.0 (2.6) | 116.86M | config, model, log |
| 1024 | v2t | 11.4 (0.9) | 26.6 (0.7) | 35.3 (1.1) | 27.5 (3.5) | 161.52M | config, model, log |

Ablation studies on DiDeMo

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the DiDeMo dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 1.6 (0.1) | 7.9 (0.6) | 14.2 (0.7) | 65.5 (2.3) | 374.62k | config, model, log |
| CE - MW,P,CG | t2v | 14.8 (0.3) | 40.4 (1.7) | 54.7 (0.8) | 8.7 (0.6) | 107.26M | config, model, log |
| CE - P,CG | t2v | 15.4 (0.8) | 42.0 (0.2) | 55.2 (0.8) | 8.3 (0.6) | 107.26M | config, model, log |
| CE - CG | t2v | 14.6 (1.0) | 40.1 (0.2) | 54.2 (0.2) | 8.7 (0.6) | 76.91M | config, model, log |
| CE | t2v | 16.1 (1.4) | 41.1 (0.4) | 54.4 (0.8) | 8.3 (0.6) | 79.29M | config, model, log |
| Concat | v2t | 2.4 (0.5) | 8.5 (0.6) | 14.0 (1.0) | 72.5 (4.4) | 374.62k | config, model, log |
| CE - MW,P,CG | v2t | 15.9 (0.4) | 41.1 (0.5) | 54.9 (0.7) | 8.3 (0.6) | 107.26M | config, model, log |
| CE - P,CG | v2t | 16.2 (0.3) | 41.4 (0.2) | 54.4 (0.9) | 8.3 (0.6) | 107.26M | config, model, log |
| CE - CG | v2t | 15.7 (0.9) | 39.6 (0.8) | 53.7 (0.9) | 9.0 (0.0) | 76.91M | config, model, log |
| CE | v2t | 15.6 (1.3) | 40.9 (0.4) | 55.2 (0.5) | 8.2 (0.3) | 79.29M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
CE - CG: The CE model without Collaborative Gating.
CE: The full CE model.

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed; these additional metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. As before, each expert is paired with scene features.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 6.7 (1.3) | 20.8 (0.7) | 32.7 (0.9) | 25.0 (1.0) | 7.62M | config, model, log |
| Scene + Inst. | t2v | 12.1 (0.6) | 31.7 (0.2) | 46.0 (1.3) | 13.0 (1.0) | 17.47M | config, model, log |
| Scene + r2p1d | t2v | 13.0 (0.8) | 36.8 (1.0) | 52.0 (0.3) | 10.0 (0.0) | 16.29M | config, model, log |
| Scene + RGB | t2v | 8.7 (1.6) | 26.4 (1.0) | 39.7 (0.4) | 16.8 (0.8) | 17.47M | config, model, log |
| Scene + Flow | t2v | 10.1 (0.7) | 29.4 (1.1) | 42.9 (0.8) | 14.5 (0.9) | 16.68M | config, model, log |
| Scene + Audio | t2v | 6.3 (0.3) | 23.0 (0.8) | 33.6 (0.7) | 22.7 (1.3) | 17.47M | config, model, log |
| Scene + OCR | t2v | 6.1 (0.6) | 20.5 (0.7) | 32.3 (1.1) | 25.7 (1.2) | 18.9M | config, model, log |
| Scene + Speech | t2v | 6.1 (0.5) | 21.0 (1.0) | 30.7 (0.4) | 27.2 (0.3) | 28.6M | config, model, log |
| Scene + Face | t2v | 6.4 (0.1) | 20.8 (1.2) | 31.9 (0.7) | 24.8 (1.8) | 16.29M | config, model, log |
| Scene | v2t | 6.6 (0.7) | 21.3 (0.4) | 33.0 (0.4) | 25.2 (2.5) | 7.62M | config, model, log |
| Scene + Inst. | v2t | 12.5 (1.3) | 33.0 (0.5) | 46.1 (0.6) | 13.3 (1.2) | 17.47M | config, model, log |
| Scene + r2p1d | v2t | 13.5 (0.6) | 36.2 (0.2) | 51.3 (1.3) | 10.3 (0.6) | 16.29M | config, model, log |
| Scene + RGB | v2t | 8.5 (1.0) | 26.8 (1.4) | 38.8 (0.7) | 16.7 (0.6) | 17.47M | config, model, log |
| Scene + Flow | v2t | 11.3 (0.2) | 29.8 (0.6) | 42.1 (1.2) | 15.3 (1.5) | 16.68M | config, model, log |
| Scene + Audio | v2t | 6.6 (0.4) | 23.0 (2.3) | 33.6 (1.0) | 22.2 (2.0) | 17.47M | config, model, log |
| Scene + OCR | v2t | 6.6 (0.3) | 20.9 (1.2) | 32.1 (0.7) | 26.2 (1.8) | 18.9M | config, model, log |
| Scene + Speech | v2t | 6.8 (0.8) | 21.1 (0.9) | 31.4 (1.2) | 27.3 (0.6) | 28.6M | config, model, log |
| Scene + Face | v2t | 6.8 (0.2) | 20.7 (0.3) | 32.1 (1.5) | 25.7 (1.5) | 16.29M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 6.7 (1.3) | 20.8 (0.7) | 32.7 (0.9) | 25.0 (1.0) | 7.62M | config, model, log |
| Prev. + Speech | t2v | 6.1 (0.5) | 21.0 (1.0) | 30.7 (0.4) | 27.2 (0.3) | 28.6M | config, model, log |
| Prev. + Audio | t2v | 6.8 (0.1) | 23.1 (0.3) | 33.6 (0.5) | 21.8 (0.8) | 36.09M | config, model, log |
| Prev. + Flow | t2v | 10.9 (1.1) | 31.0 (1.1) | 44.1 (1.1) | 14.0 (1.0) | 42.79M | config, model, log |
| Prev. + RGB | t2v | 12.0 (1.1) | 33.0 (0.1) | 45.9 (0.4) | 12.7 (0.6) | 50.27M | config, model, log |
| Prev. + Inst. | t2v | 13.7 (2.1) | 34.6 (1.9) | 49.4 (1.0) | 11.0 (1.0) | 57.76M | config, model, log |
| Prev. + R2P1D | t2v | 15.5 (0.7) | 40.1 (0.2) | 54.9 (1.5) | 8.3 (0.6) | 64.06M | config, model, log |
| Prev. + OCR | t2v | 15.6 (0.5) | 39.9 (1.1) | 54.2 (1.1) | 9.0 (0.0) | 72.98M | config, model, log |
| Prev. + Face | t2v | 16.1 (1.5) | 41.2 (0.5) | 54.5 (0.9) | 8.3 (0.6) | 79.29M | config, model, log |
| Scene | v2t | 6.6 (0.7) | 21.3 (0.4) | 33.0 (0.4) | 25.2 (2.5) | 7.62M | config, model, log |
| Prev. + Speech | v2t | 6.8 (0.8) | 21.1 (0.9) | 31.4 (1.2) | 27.3 (0.6) | 28.6M | config, model, log |
| Prev. + Audio | v2t | 7.1 (0.3) | 22.8 (0.5) | 33.9 (0.2) | 22.2 (0.8) | 36.09M | config, model, log |
| Prev. + Flow | v2t | 11.4 (0.5) | 30.8 (0.9) | 44.0 (1.0) | 14.0 (1.0) | 42.79M | config, model, log |
| Prev. + RGB | v2t | 11.9 (0.2) | 32.6 (1.8) | 45.7 (1.6) | 12.7 (0.6) | 50.27M | config, model, log |
| Prev. + Inst. | v2t | 12.9 (0.6) | 36.1 (1.1) | 49.2 (1.3) | 11.0 (1.0) | 57.76M | config, model, log |
| Prev. + R2P1D | v2t | 15.6 (0.5) | 40.3 (0.5) | 54.3 (0.9) | 8.7 (0.6) | 64.06M | config, model, log |
| Prev. + OCR | v2t | 16.1 (1.5) | 38.6 (1.0) | 53.4 (1.0) | 9.0 (1.0) | 72.98M | config, model, log |
| Prev. + Face | v2t | 16.2 (1.6) | 41.1 (0.7) | 55.1 (0.3) | 8.2 (0.3) | 79.29M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 15.2 (0.8) | 40.2 (1.5) | 54.6 (1.0) | 8.7 (0.6) | 36.45M | config, model, log |
| 512 | t2v | 15.3 (0.8) | 39.9 (0.9) | 53.7 (0.9) | 9.0 (0.0) | 50.01M | config, model, log |
| 640 | t2v | 16.2 (0.6) | 40.2 (0.3) | 54.3 (0.2) | 8.7 (0.6) | 64.29M | config, model, log |
| 768 | t2v | 16.1 (1.4) | 41.1 (0.4) | 54.4 (0.8) | 8.3 (0.6) | 79.29M | config, model, log |
| 1024 | t2v | 15.6 (0.8) | 39.3 (1.2) | 53.8 (0.4) | 8.7 (0.6) | 111.44M | config, model, log |
| 384 | v2t | 15.0 (0.2) | 40.2 (1.6) | 53.6 (2.5) | 9.0 (1.0) | 36.45M | config, model, log |
| 512 | v2t | 15.6 (2.7) | 39.2 (1.0) | 52.4 (0.5) | 9.3 (0.6) | 50.01M | config, model, log |
| 640 | v2t | 15.8 (0.8) | 40.1 (0.1) | 54.4 (0.7) | 9.0 (0.0) | 64.29M | config, model, log |
| 768 | v2t | 15.6 (1.3) | 40.9 (0.4) | 55.2 (0.5) | 8.2 (0.3) | 79.29M | config, model, log |
| 1024 | v2t | 15.6 (1.1) | 39.9 (0.6) | 54.1 (1.0) | 8.7 (0.6) | 111.44M | config, model, log |

Ablation studies on MSVD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSVD dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 3.5 (0.1) | 13.9 (0.2) | 24.1 (0.3) | 32.7 (0.6) | 314.8k | config, model, log |
| CE - MW,P,CG | t2v | 16.8 (0.1) | 44.8 (0.5) | 59.6 (0.6) | 7.0 (0.0) | 131.37M | config, model, log |
| CE - P,CG | t2v | 18.9 (1.0) | 48.1 (1.0) | 63.2 (0.7) | 6.0 (0.0) | 131.37M | config, model, log |
| CE - CG | t2v | 19.6 (0.5) | 49.4 (0.8) | 64.2 (1.0) | 5.7 (0.6) | 81.67M | config, model, log |
| CE | t2v | 19.8 (0.3) | 49.0 (0.3) | 63.8 (0.1) | 6.0 (0.0) | 84.04M | config, model, log |
| Concat | v2t | 4.0 (0.6) | 14.9 (0.8) | 22.4 (0.8) | 42.5 (0.9) | 314.8k | config, model, log |
| CE - MW,P,CG | v2t | 21.7 (0.3) | 47.6 (0.1) | 58.2 (0.4) | 6.4 (0.5) | 131.37M | config, model, log |
| CE - P,CG | v2t | 22.9 (2.5) | 48.6 (1.2) | 58.3 (1.3) | 6.2 (0.4) | 131.37M | config, model, log |
| CE - CG | v2t | 23.4 (2.5) | 49.0 (1.7) | 59.9 (1.4) | 5.8 (0.7) | 81.67M | config, model, log |
| CE | v2t | 23.9 (1.4) | 50.2 (0.8) | 59.6 (1.2) | 5.6 (0.5) | 84.04M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
CE - CG: The CE model without Collaborative Gating.
CE: The full CE model.

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed; these additional metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. As before, each expert is paired with scene features.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.0 (0.2) | 21.8 (0.4) | 32.9 (0.1) | 23.0 (0.0) | 9.99M | config, model, log |
| Scene + Inst. | t2v | 16.0 (0.3) | 41.8 (0.5) | 57.3 (0.1) | 8.0 (0.0) | 22.2M | config, model, log |
| Scene + r2p1d | t2v | 16.8 (0.0) | 43.7 (0.3) | 58.4 (0.1) | 7.0 (0.0) | 21.02M | config, model, log |
| Scene + RGB | t2v | 10.7 (0.3) | 31.5 (0.2) | 45.9 (0.4) | 12.7 (0.6) | 22.2M | config, model, log |
| Scene + Flow | t2v | 11.7 (0.3) | 34.2 (0.3) | 48.1 (0.2) | 11.3 (0.6) | 21.41M | config, model, log |
| Scene + OCR | t2v | 7.1 (0.2) | 22.3 (0.0) | 33.8 (0.2) | 22.7 (0.6) | 37.95M | config, model, log |
| Scene + Face | t2v | 6.9 (0.0) | 22.4 (0.4) | 33.9 (0.5) | 22.7 (0.6) | 21.02M | config, model, log |
| Scene | v2t | 8.3 (0.4) | 20.8 (1.1) | 29.0 (0.4) | 50.5 (3.6) | 9.99M | config, model, log |
| Scene + Inst. | v2t | 17.6 (0.9) | 40.8 (0.3) | 52.1 (0.3) | 9.2 (0.3) | 22.2M | config, model, log |
| Scene + r2p1d | v2t | 20.9 (0.5) | 43.7 (2.0) | 53.5 (1.3) | 8.2 (0.8) | 21.02M | config, model, log |
| Scene + RGB | v2t | 11.0 (0.8) | 28.5 (0.2) | 37.2 (0.5) | 25.2 (0.7) | 22.2M | config, model, log |
| Scene + Flow | v2t | 14.4 (1.3) | 34.4 (0.6) | 43.8 (1.3) | 15.8 (1.3) | 21.41M | config, model, log |
| Scene + OCR | v2t | 8.9 (1.4) | 22.0 (2.2) | 29.1 (1.0) | 50.2 (4.9) | 37.95M | config, model, log |
| Scene + Face | v2t | 8.4 (0.9) | 20.8 (0.7) | 29.3 (0.7) | 49.2 (8.1) | 21.02M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.0 (0.2) | 21.8 (0.4) | 32.9 (0.1) | 23.0 (0.0) | 9.99M | config, model, log |
| Prev. + Flow | t2v | 11.7 (0.3) | 34.2 (0.3) | 48.1 (0.2) | 11.3 (0.6) | 21.41M | config, model, log |
| Prev. + RGB | t2v | 12.8 (0.6) | 37.4 (0.4) | 52.2 (0.1) | 10.0 (0.0) | 31.26M | config, model, log |
| Prev. + Inst. | t2v | 16.0 (0.3) | 42.1 (0.7) | 57.4 (0.3) | 7.7 (0.6) | 41.11M | config, model, log |
| Prev. + R2P1D | t2v | 19.3 (0.3) | 48.5 (0.4) | 63.3 (0.7) | 6.0 (0.0) | 49.78M | config, model, log |
| Prev. + OCR | t2v | 18.6 (0.9) | 47.4 (1.3) | 62.6 (1.1) | 6.0 (0.0) | 75.38M | config, model, log |
| Prev. + Face | t2v | 19.8 (0.3) | 49.0 (0.3) | 63.8 (0.1) | 6.0 (0.0) | 84.04M | config, model, log |
| Scene | v2t | 8.3 (0.4) | 20.8 (1.1) | 29.0 (0.4) | 50.5 (3.6) | 9.99M | config, model, log |
| Prev. + Flow | v2t | 14.4 (1.3) | 34.4 (0.6) | 43.8 (1.3) | 15.8 (1.3) | 21.41M | config, model, log |
| Prev. + RGB | v2t | 14.5 (2.6) | 34.9 (1.3) | 44.7 (1.4) | 15.2 (0.7) | 31.26M | config, model, log |
| Prev. + Inst. | v2t | 17.6 (0.8) | 41.1 (1.9) | 52.1 (2.2) | 9.2 (1.4) | 41.11M | config, model, log |
| Prev. + R2P1D | v2t | 23.1 (0.7) | 48.2 (0.7) | 58.5 (0.3) | 6.3 (0.4) | 49.78M | config, model, log |
| Prev. + OCR | v2t | 22.6 (2.1) | 46.7 (1.9) | 57.0 (2.9) | 7.0 (1.0) | 75.38M | config, model, log |
| Prev. + Face | v2t | 23.9 (1.4) | 50.2 (0.8) | 59.6 (1.2) | 5.6 (0.5) | 84.04M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 19.3 (0.4) | 48.1 (0.7) | 63.3 (0.4) | 6.0 (0.0) | 39.43M | config, model, log |
| 512 | t2v | 19.4 (0.4) | 48.9 (0.4) | 63.7 (0.2) | 6.0 (0.0) | 53.71M | config, model, log |
| 640 | t2v | 19.8 (0.8) | 49.5 (0.6) | 64.2 (0.8) | 6.0 (0.0) | 68.58M | config, model, log |
| 768 | t2v | 19.8 (0.3) | 49.0 (0.3) | 63.8 (0.1) | 6.0 (0.0) | 84.04M | config, model, log |
| 1024 | t2v | 18.9 (0.8) | 47.7 (1.6) | 62.9 (1.4) | 6.3 (0.6) | 116.73M | config, model, log |
| 384 | v2t | 21.8 (0.3) | 48.8 (1.4) | 59.7 (1.9) | 6.2 (0.7) | 39.43M | config, model, log |
| 512 | v2t | 23.5 (0.8) | 48.7 (0.2) | 59.0 (0.8) | 6.0 (0.0) | 53.71M | config, model, log |
| 640 | v2t | 23.8 (2.8) | 48.3 (1.8) | 60.1 (2.3) | 6.3 (0.6) | 68.58M | config, model, log |
| 768 | v2t | 23.9 (1.4) | 50.2 (0.8) | 59.6 (1.2) | 5.6 (0.5) | 84.04M | config, model, log |
| 1024 | v2t | 21.2 (2.7) | 46.5 (1.9) | 57.0 (1.6) | 7.0 (1.0) | 116.73M | config, model, log |

Ablation studies on ActivityNet

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the ActivityNet dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 1.2 (0.1) | 5.2 (0.3) | 9.4 (0.4) | 120.0 (6.2) | 417.8k | config, model, log |
| CE - MW,P,CG | t2v | 17.4 (0.4) | 45.8 (0.6) | 61.4 (0.6) | 6.3 (0.6) | 330.42M | config, model, log |
| CE - P,CG | t2v | 18.3 (0.7) | 47.2 (0.6) | 63.2 (0.5) | 6.0 (0.0) | 330.42M | config, model, log |
| CE - CG | t2v | 17.6 (0.4) | 46.8 (0.5) | 62.9 (0.4) | 6.0 (0.0) | 258.3M | config, model, log |
| CE | t2v | 18.2 (0.3) | 47.7 (0.6) | 63.9 (0.5) | 6.0 (0.0) | 260.68M | config, model, log |
| Concat | v2t | 1.3 (0.1) | 5.3 (0.6) | 9.7 (0.6) | 141.7 (2.9) | 417.8k | config, model, log |
| CE - MW,P,CG | v2t | 17.6 (0.2) | 45.9 (0.5) | 61.5 (0.7) | 6.7 (0.6) | 330.42M | config, model, log |
| CE - P,CG | v2t | 17.2 (0.1) | 46.3 (0.4) | 62.5 (0.5) | 6.0 (0.0) | 330.42M | config, model, log |
| CE - CG | v2t | 17.3 (0.2) | 46.4 (0.3) | 62.5 (0.5) | 6.0 (0.0) | 258.3M | config, model, log |
| CE | v2t | 17.7 (0.6) | 46.6 (0.7) | 62.8 (0.4) | 6.0 (0.0) | 260.68M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
CE - CG: The CE model without Collaborative Gating.
CE: The full CE model.

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed; these additional metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. As before, each expert is paired with scene features.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.3 (0.4) | 21.5 (0.7) | 32.4 (0.2) | 24.7 (0.6) | 25.97M | config, model, log |
| Scene + Inst. | t2v | 14.4 (0.3) | 38.9 (0.3) | 53.3 (0.2) | 9.0 (0.0) | 54.13M | config, model, log |
| Scene + r2p1d | t2v | 16.0 (0.5) | 43.8 (0.4) | 60.6 (0.3) | 7.0 (0.0) | 52.95M | config, model, log |
| Scene + RGB | t2v | 10.2 (0.7) | 29.7 (0.4) | 43.1 (0.7) | 14.3 (0.6) | 54.13M | config, model, log |
| Scene + Flow | t2v | 13.8 (0.3) | 37.9 (0.1) | 53.1 (0.3) | 9.0 (0.0) | 53.35M | config, model, log |
| Scene + Audio | t2v | 8.0 (0.2) | 24.1 (0.7) | 35.7 (0.3) | 21.3 (0.6) | 53.15M | config, model, log |
| Scene + OCR | t2v | 7.4 (0.3) | 23.0 (0.3) | 33.8 (0.3) | 23.3 (0.6) | 62.73M | config, model, log |
| Scene + Speech | t2v | 7.4 (0.1) | 23.1 (0.4) | 34.4 (0.6) | 22.3 (0.6) | 75.66M | config, model, log |
| Scene + Face | t2v | 7.9 (0.3) | 24.7 (0.8) | 36.2 (0.7) | 21.0 (0.0) | 52.95M | config, model, log |
| Scene | v2t | 6.4 (0.2) | 20.4 (0.3) | 31.4 (0.1) | 25.3 (0.6) | 25.97M | config, model, log |
| Scene + Inst. | v2t | 12.4 (0.2) | 35.9 (0.1) | 50.5 (0.3) | 10.0 (0.0) | 54.13M | config, model, log |
| Scene + r2p1d | v2t | 13.7 (0.2) | 40.5 (0.2) | 57.1 (0.3) | 8.0 (0.0) | 52.95M | config, model, log |
| Scene + RGB | v2t | 9.3 (0.2) | 28.2 (0.4) | 41.3 (0.5) | 15.7 (0.6) | 54.13M | config, model, log |
| Scene + Flow | v2t | 12.4 (0.1) | 36.3 (0.4) | 52.1 (0.3) | 10.0 (0.0) | 53.35M | config, model, log |
| Scene + Audio | v2t | 7.3 (0.2) | 23.2 (0.3) | 34.1 (0.3) | 22.0 (0.0) | 53.15M | config, model, log |
| Scene + OCR | v2t | 6.4 (0.1) | 20.5 (0.8) | 31.2 (0.5) | 26.7 (0.6) | 62.73M | config, model, log |
| Scene + Speech | v2t | 6.4 (0.2) | 20.9 (0.1) | 32.4 (0.4) | 24.7 (0.6) | 75.66M | config, model, log |
| Scene + Face | v2t | 7.2 (0.3) | 23.1 (0.3) | 34.4 (1.0) | 21.7 (0.6) | 52.95M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.3 (0.4) | 21.5 (0.7) | 32.4 (0.2) | 24.7 (0.6) | 25.97M | config, model, log |
| Prev. + Speech | t2v | 7.4 (0.1) | 23.1 (0.4) | 34.4 (0.6) | 22.3 (0.6) | 75.66M | config, model, log |
| Prev. + Audio | t2v | 7.7 (0.3) | 23.7 (0.1) | 35.3 (0.3) | 21.3 (1.2) | 100.47M | config, model, log |
| Prev. + Flow | t2v | 14.4 (0.1) | 39.1 (0.5) | 54.5 (0.1) | 9.0 (0.0) | 125.48M | config, model, log |
| Prev. + RGB | t2v | 14.9 (0.6) | 40.7 (0.2) | 55.9 (0.2) | 8.0 (0.0) | 151.27M | config, model, log |
| Prev. + Inst. | t2v | 15.8 (0.6) | 43.2 (0.3) | 58.3 (0.4) | 7.0 (0.0) | 177.07M | config, model, log |
| Prev. + R2P1D | t2v | 18.2 (0.5) | 46.4 (0.5) | 62.5 (1.1) | 6.0 (0.0) | 201.68M | config, model, log |
| Prev. + OCR | t2v | 18.0 (0.6) | 46.9 (0.3) | 63.2 (0.2) | 6.0 (0.0) | 236.06M | config, model, log |
| Prev. + Face | t2v | 18.4 (0.2) | 47.9 (0.5) | 63.6 (0.5) | 6.0 (0.0) | 260.68M | config, model, log |
| Scene | v2t | 6.4 (0.2) | 20.4 (0.3) | 31.4 (0.1) | 25.3 (0.6) | 25.97M | config, model, log |
| Prev. + Speech | v2t | 6.4 (0.2) | 20.9 (0.1) | 32.4 (0.4) | 24.7 (0.6) | 75.66M | config, model, log |
| Prev. + Audio | v2t | 7.2 (0.2) | 22.7 (0.3) | 34.3 (0.2) | 22.3 (0.6) | 100.47M | config, model, log |
| Prev. + Flow | v2t | 13.8 (0.3) | 37.9 (0.3) | 53.4 (0.2) | 9.0 (0.0) | 125.48M | config, model, log |
| Prev. + RGB | v2t | 14.6 (0.6) | 40.0 (0.3) | 55.4 (0.4) | 8.3 (0.6) | 151.27M | config, model, log |
| Prev. + Inst. | v2t | 15.4 (0.4) | 42.1 (0.2) | 58.1 (0.7) | 7.7 (0.6) | 177.07M | config, model, log |
| Prev. + R2P1D | v2t | 17.4 (0.4) | 45.7 (0.4) | 62.0 (0.8) | 6.3 (0.6) | 201.68M | config, model, log |
| Prev. + OCR | v2t | 17.0 (0.2) | 45.8 (0.2) | 61.8 (0.2) | 6.3 (0.6) | 236.06M | config, model, log |
| Prev. + Face | v2t | 17.7 (0.7) | 46.5 (0.5) | 62.7 (0.4) | 6.3 (0.6) | 260.68M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 17.5 (0.2) | 46.9 (0.0) | 63.2 (0.2) | 6.0 (0.0) | 127.3M | config, model, log |
| 512 | t2v | 18.0 (0.3) | 47.4 (0.8) | 63.1 (0.4) | 6.0 (0.0) | 171.04M | config, model, log |
| 640 | t2v | 18.2 (0.6) | 47.8 (1.0) | 63.4 (0.9) | 6.0 (0.0) | 215.5M | config, model, log |
| 768 | t2v | 18.2 (0.3) | 47.7 (0.6) | 63.9 (0.5) | 6.0 (0.0) | 260.68M | config, model, log |
| 1024 | t2v | 18.3 (0.1) | 48.2 (0.8) | 63.3 (0.5) | 6.0 (0.0) | 353.2M | config, model, log |
| 384 | v2t | 16.8 (0.3) | 45.3 (0.1) | 61.9 (0.4) | 6.7 (0.6) | 127.3M | config, model, log |
| 512 | v2t | 17.2 (0.1) | 45.9 (0.7) | 62.0 (0.7) | 6.7 (0.6) | 171.04M | config, model, log |
| 640 | v2t | 17.5 (0.5) | 46.1 (0.5) | 62.4 (0.3) | 6.3 (0.6) | 215.5M | config, model, log |
| 768 | v2t | 17.7 (0.6) | 46.6 (0.7) | 62.8 (0.4) | 6.0 (0.0) | 260.68M | config, model, log |
| 1024 | v2t | 17.7 (0.1) | 47.0 (0.5) | 63.4 (0.4) | 6.0 (0.0) | 353.2M | config, model, log |

Ablation studies on QuerYD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. As before, each expert is paired with scene features.

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 8.7 (0.4) | 26.3 (1.1) | 37.1 (0.7) | 68.5 (2.2) | 22.2 (1.6) | 52.3 (3.0) | 20.4 (0.1) | 7.51M | config, model, log |
| Scene + Inst. | t2v | 11.7 (1.4) | 31.6 (0.9) | 43.4 (1.3) | 74.5 (0.9) | 14.0 (1.0) | 41.1 (2.1) | 25.2 (0.8) | 17.25M | config, model, log |
| Scene + r2p1d | t2v | 11.7 (2.1) | 32.1 (3.0) | 45.3 (3.3) | 74.6 (0.4) | 13.7 (1.9) | 42.9 (2.2) | 25.7 (2.4) | 16.07M | config, model, log |
| Scene + Audio | t2v | 7.6 (2.7) | 27.4 (1.4) | 40.4 (0.9) | 69.1 (0.9) | 17.0 (1.7) | 49.0 (1.9) | 20.2 (2.3) | 17.25M | config, model, log |
| Scene | v2t | 9.1 (0.8) | 25.4 (0.9) | 35.3 (1.5) | 68.2 (2.2) | 23.2 (0.3) | 52.6 (2.6) | 20.1 (0.5) | 7.51M | config, model, log |
| Scene + Inst. | v2t | 11.9 (0.5) | 31.0 (3.6) | 43.5 (2.7) | 74.8 (1.8) | 14.5 (0.9) | 40.8 (2.1) | 25.2 (1.1) | 17.25M | config, model, log |
| Scene + r2p1d | v2t | 12.7 (1.4) | 30.9 (2.8) | 44.0 (1.8) | 74.3 (1.2) | 14.3 (1.2) | 42.8 (1.7) | 25.8 (1.7) | 16.07M | config, model, log |
| Scene + Audio | v2t | 10.1 (1.2) | 25.7 (1.5) | 37.5 (1.2) | 69.8 (1.6) | 20.0 (1.3) | 48.9 (2.0) | 21.3 (1.1) | 17.25M | config, model, log |
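
The Geom column is not defined in the text. The values are consistent with the geometric mean of R@1, R@5 and R@10, which the sketch below checks for the first row; this reading is an assumption, not something the source states. Because the reported figures are presumably aggregated over several runs, recomputing Geom from the rounded means may differ from the table by about 0.1 on some rows.

```python
def geometric_mean(values):
    """Return the geometric mean of a sequence of positive numbers."""
    product = 1.0
    for value in values:
        product *= value
    return product ** (1.0 / len(values))

# R@1, R@5, R@10 for the "Scene" t2v row of the QuerYD table.
scene_t2v = (8.7, 26.3, 37.1)
print(round(geometric_mean(scene_t2v), 1))  # 20.4, matching the Geom column
```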

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 8.7 (0.4) | 26.3 (1.1) | 37.1 (0.7) | 68.5 (2.2) | 22.2 (1.6) | 52.3 (3.0) | 20.4 (0.1) | 7.51M | config, model, log |
| Prev. + Audio | t2v | 7.6 (2.7) | 27.4 (1.4) | 40.4 (0.9) | 69.1 (0.9) | 17.0 (1.7) | 49.0 (1.9) | 20.2 (2.3) | 17.25M | config, model, log |
| Prev. + Inst. | t2v | 12.7 (1.7) | 34.8 (1.7) | 47.0 (1.3) | 78.0 (1.0) | 12.3 (0.6) | 37.6 (2.1) | 27.5 (1.5) | 24.63M | config, model, log |
| Prev. + R2P1D | t2v | 14.3 (0.3) | 37.5 (1.3) | 48.6 (0.8) | 78.8 (0.3) | 11.3 (0.6) | 35.2 (1.8) | 29.7 (0.3) | 30.82M | config, model, log |
| Scene | v2t | 9.1 (0.8) | 25.4 (0.9) | 35.3 (1.5) | 68.2 (2.2) | 23.2 (0.3) | 52.6 (2.6) | 20.1 (0.5) | 7.51M | config, model, log |
| Prev. + Audio | v2t | 10.1 (1.2) | 25.7 (1.5) | 37.5 (1.2) | 69.8 (1.6) | 20.0 (1.3) | 48.9 (2.0) | 21.3 (1.1) | 17.25M | config, model, log |
| Prev. + Inst. | v2t | 12.8 (1.3) | 33.5 (2.8) | 46.6 (1.0) | 76.7 (1.7) | 11.8 (0.8) | 37.6 (1.9) | 27.1 (0.6) | 24.63M | config, model, log |
| Prev. + R2P1D | v2t | 14.0 (0.3) | 35.4 (2.9) | 47.2 (2.8) | 78.7 (2.4) | 12.3 (1.5) | 35.8 (2.4) | 28.6 (1.2) | 30.82M | config, model, log |