ablations.md

December 22, 2020

Ablation studies on LSMDC

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the LSMDC dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 0.5(0.5) | 2.4(0.9) | 4.9(1.7) | 195.8(22.1) | 150.63k | config, model, log |
| CE - MW,P,CG | t2v | 9.7(0.3) | 24.0(0.9) | 33.3(0.5) | 29.5(2.5) | 159.78M | config, model, log |
| CE - P,CG | t2v | 11.2(0.7) | 26.1(0.9) | 35.3(0.4) | 25.7(2.1) | 159.78M | config, model, log |
| CE - CG | t2v | 10.6(0.4) | 26.9(0.5) | 35.3(0.5) | 27.2(1.8) | 114.48M | config, model, log |
| CE | t2v | 11.2(0.4) | 26.9(1.1) | 34.8(2.0) | 25.3(3.1) | 116.86M | config, model, log |
| Concat | v2t | 0.6(0.4) | 3.1(0.7) | 5.7(1.3) | 184.7(20.7) | 150.63k | config, model, log |
| CE - MW,P,CG | v2t | 11.1(1.0) | 24.5(0.9) | 32.5(0.8) | 32.2(1.9) | 159.78M | config, model, log |
| CE - P,CG | v2t | 11.5(0.2) | 25.3(1.0) | 34.3(0.3) | 28.0(3.0) | 159.78M | config, model, log |
| CE - CG | v2t | 10.8(0.2) | 26.1(0.6) | 34.0(0.8) | 28.8(0.3) | 114.48M | config, model, log |
| CE | v2t | 11.7(0.5) | 25.8(1.5) | 34.4(1.7) | 28.0(2.6) | 116.86M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

  • Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.
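
To make the Concat baseline concrete, the following is a minimal sketch of the idea described above: pool each expert over time, concatenate the pooled vectors, and compare the result directly with a text embedding of the same dimensionality. It is not the repository's implementation. Plain mean pooling stands in for the VLAD layers, and the function names, the alphabetical expert order, and the toy features are all assumptions made for illustration.

```python
import math


def mean_pool(frames):
    """Aggregate a variable-length list of frame features into one vector.

    Assumption: simple mean pooling replaces the learned VLAD aggregation.
    """
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def concat_video_embedding(experts):
    """Pool each expert over time, then concatenate in a fixed (sorted) order."""
    pooled = []
    for name in sorted(experts):
        pooled.extend(mean_pool(experts[name]))
    return l2_normalize(pooled)


def similarity(video_vec, text_vec):
    """Cosine similarity; the two embeddings must share a dimensionality."""
    if len(video_vec) != len(text_vec):
        raise ValueError("video and text embeddings must have the same size")
    text_unit = l2_normalize(text_vec)
    return sum(v * t for v, t in zip(video_vec, text_unit))


# Hypothetical toy features: two experts with different numbers of time steps.
video = {
    "scene": [[1.0, 0.0], [1.0, 0.0]],
    "flow": [[0.0, 2.0], [0.0, 4.0], [0.0, 3.0]],
}
text = [0.0, 3.0, 1.0, 0.0]  # same size as the concatenated video embedding

emb = concat_video_embedding(video)
score = similarity(emb, text)
```

The dimension check mirrors the constraint mentioned in the Concat description: because nothing is projected, the concatenated video embedding has to match the text embedding size by construction.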

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.
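
For readers unfamiliar with the column names, R@K is recall at rank K (higher is better) and MdR is the median rank of the correct match (lower is better); t2v and v2t denote text-to-video and video-to-text retrieval. The sketch below shows how such metrics are commonly computed from a similarity matrix. It assumes exactly one correct candidate per query, located on the diagonal, and the function names and toy matrix are illustrative rather than taken from the repository.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the similarity of query i to candidate j, and the correct
    candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is one plus the number of candidates scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct candidate over all queries."""
    return median(ranks)


# Toy text-to-video similarity matrix (rows: text queries, columns: videos).
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.5, 0.1, 0.6],
    [0.2, 0.3, 0.7, 0.1],
    [0.9, 0.8, 0.7, 0.1],
]

ranks = ranks_from_similarity(sim)
r_at_1 = recall_at_k(ranks, 1)
mdr = median_rank(ranks)
```

Under the same one-to-one assumption, the video-to-text direction would apply the same functions to the transposed matrix.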

Importance of Different Experts: The next ablation investigates how much each expert contributes to the final embedding. Since not all experts are available in every video, we pair each expert with scene features to approximate its usefulness.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 4.2(0.4) | 12.6(0.4) | 18.6(0.6) | 85.5(0.5) | 12.83M | config, model, log |
| Scene + Inst. | t2v | 8.0(1.9) | 19.4(1.5) | 27.1(1.1) | 45.2(3.3) | 27.87M | config, model, log |
| Scene + r2p1d | t2v | 8.0(0.3) | 20.5(0.6) | 29.1(1.0) | 39.8(2.1) | 26.69M | config, model, log |
| Scene + RGB | t2v | 5.9(0.5) | 16.3(0.4) | 22.4(0.3) | 59.3(2.5) | 27.87M | config, model, log |
| Scene + Flow | t2v | 6.4(0.9) | 18.2(0.9) | 26.2(1.0) | 48.5(0.9) | 27.09M | config, model, log |
| Scene + Audio | t2v | 6.3(0.1) | 15.9(0.6) | 24.4(0.5) | 50.7(1.5) | 26.6M | config, model, log |
| Scene + OCR | t2v | 3.8(0.8) | 11.9(0.7) | 17.5(1.3) | 88.8(9.0) | 33.23M | config, model, log |
| Scene + Speech | t2v | 3.8(0.4) | 12.3(0.5) | 18.1(0.6) | 83.8(6.5) | 27.46M | config, model, log |
| Scene + Face | t2v | 4.4(0.4) | 12.6(0.2) | 19.2(0.6) | 83.8(6.0) | 26.4M | config, model, log |
| Scene | v2t | 4.6(0.4) | 12.6(1.1) | 18.0(0.9) | 91.8(2.0) | 12.83M | config, model, log |
| Scene + Inst. | v2t | 7.6(0.9) | 19.7(0.1) | 27.3(0.9) | 47.5(1.3) | 27.87M | config, model, log |
| Scene + r2p1d | v2t | 7.9(0.2) | 20.7(0.8) | 28.0(0.7) | 42.7(4.3) | 26.69M | config, model, log |
| Scene + RGB | v2t | 6.0(0.2) | 16.4(0.6) | 22.8(0.8) | 64.0(4.8) | 27.87M | config, model, log |
| Scene + Flow | v2t | 6.6(1.2) | 17.4(0.4) | 24.8(0.9) | 50.2(1.4) | 27.09M | config, model, log |
| Scene + Audio | v2t | 6.3(0.5) | 17.3(0.5) | 23.8(0.8) | 56.3(3.5) | 26.6M | config, model, log |
| Scene + OCR | v2t | 4.8(0.6) | 12.4(0.7) | 17.6(0.4) | 96.7(16.4) | 33.23M | config, model, log |
| Scene + Speech | v2t | 3.9(0.2) | 11.9(0.1) | 17.8(0.3) | 84.8(4.4) | 27.46M | config, model, log |
| Scene + Face | v2t | 4.9(0.3) | 13.2(0.4) | 19.1(1.7) | 93.2(7.5) | 26.4M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 4.2(0.4) | 12.6(0.4) | 18.6(0.6) | 85.5(0.5) | 12.83M | config, model, log |
| Prev. + Speech | t2v | 3.8(0.4) | 12.3(0.5) | 18.1(0.6) | 83.8(6.5) | 27.46M | config, model, log |
| Prev. + Audio | t2v | 6.0(0.5) | 17.2(0.6) | 24.4(0.7) | 50.5(1.3) | 38.86M | config, model, log |
| Prev. + Flow | t2v | 7.6(0.7) | 20.1(0.6) | 28.1(0.7) | 37.7(2.8) | 50.75M | config, model, log |
| Prev. + RGB | t2v | 8.9(0.9) | 21.8(0.2) | 29.9(0.7) | 34.5(1.3) | 63.43M | config, model, log |
| Prev. + Inst. | t2v | 10.4(1.3) | 25.6(1.0) | 34.2(1.0) | 28.3(1.5) | 76.11M | config, model, log |
| Prev. + R2P1D | t2v | 11.3(0.4) | 27.4(0.9) | 36.3(1.1) | 25.7(1.2) | 87.61M | config, model, log |
| Prev. + OCR | t2v | 11.5(0.5) | 26.2(0.6) | 35.8(0.5) | 25.7(1.8) | 105.65M | config, model, log |
| Prev. + Face | t2v | 11.3(0.3) | 26.7(1.5) | 35.1(1.6) | 25.3(3.1) | 116.86M | config, model, log |
| Scene | v2t | 4.6(0.4) | 12.6(1.1) | 18.0(0.9) | 91.8(2.0) | 12.83M | config, model, log |
| Prev. + Speech | v2t | 3.9(0.2) | 11.9(0.1) | 17.8(0.3) | 84.8(4.4) | 27.46M | config, model, log |
| Prev. + Audio | v2t | 7.2(0.3) | 18.2(0.3) | 24.7(1.2) | 57.0(2.2) | 38.86M | config, model, log |
| Prev. + Flow | v2t | 7.9(0.6) | 20.5(0.4) | 28.6(0.5) | 41.3(2.5) | 50.75M | config, model, log |
| Prev. + RGB | v2t | 9.3(0.6) | 21.9(1.2) | 30.1(0.8) | 34.8(1.3) | 63.43M | config, model, log |
| Prev. + Inst. | v2t | 11.1(0.8) | 25.1(1.6) | 33.8(1.0) | 30.0(1.3) | 76.11M | config, model, log |
| Prev. + R2P1D | v2t | 11.8(0.2) | 26.6(0.9) | 35.8(0.4) | 27.7(2.1) | 87.61M | config, model, log |
| Prev. + OCR | v2t | 11.2(0.3) | 26.0(1.3) | 33.8(0.9) | 27.4(1.2) | 105.65M | config, model, log |
| Prev. + Face | v2t | 11.6(0.5) | 25.8(1.4) | 34.4(1.7) | 27.7(2.5) | 116.86M | config, model, log |
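
The two expert ablations above differ only in which expert subsets are trained: the first pairs each expert with scene features, while the second adds experts one at a time. A small sketch of how the two sets of configurations could be enumerated is shown below. The lowercase expert labels are hypothetical names chosen to mirror the table rows (in the order used by the cumulative table), not the configuration keys used by the code base.

```python
def paired_subsets(base, experts):
    """Base expert alone, followed by the base paired with each other expert."""
    return [[base]] + [[base, expert] for expert in experts]


def cumulative_subsets(base, experts):
    """Base expert alone, then each "Prev. + X" row built on the previous one."""
    subsets = [[base]]
    for expert in experts:
        subsets.append(subsets[-1] + [expert])
    return subsets


# Hypothetical labels, ordered as in the cumulative LSMDC table above.
LSMDC_ORDER = ["speech", "audio", "flow", "rgb", "inst", "r2p1d", "ocr", "face"]

paired = paired_subsets("scene", LSMDC_ORDER)
cumulative = cumulative_subsets("scene", LSMDC_ORDER)
```

The first cumulative step and the corresponding paired run use the same subset, which is why the "Prev. + Speech" and "Scene + Speech" rows report identical numbers.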

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 11.1(0.6) | 26.6(0.6) | 35.8(0.1) | 26.2(2.3) | 55.27M | config, model, log |
| 512 | t2v | 11.2(0.4) | 26.2(0.8) | 35.8(0.6) | 26.3(2.1) | 75.08M | config, model, log |
| 640 | t2v | 11.7(0.6) | 26.9(1.4) | 36.0(1.3) | 25.3(0.6) | 95.61M | config, model, log |
| 768 | t2v | 11.2(0.4) | 26.9(1.1) | 34.8(2.0) | 25.3(3.1) | 116.86M | config, model, log |
| 1024 | t2v | 11.3(0.3) | 27.2(0.9) | 36.1(1.3) | 24.7(1.5) | 161.52M | config, model, log |
| 384 | v2t | 11.3(0.7) | 26.0(1.1) | 34.5(1.4) | 28.8(1.6) | 55.27M | config, model, log |
| 512 | v2t | 11.0(0.5) | 26.2(0.2) | 35.0(1.5) | 29.7(1.2) | 75.08M | config, model, log |
| 640 | v2t | 12.0(0.6) | 26.1(1.5) | 34.1(1.1) | 28.2(1.0) | 95.61M | config, model, log |
| 768 | v2t | 11.7(0.5) | 25.8(1.5) | 34.4(1.7) | 28.0(2.6) | 116.86M | config, model, log |
| 1024 | v2t | 11.4(0.9) | 26.6(0.7) | 35.3(1.1) | 27.5(3.5) | 161.52M | config, model, log |

Ablation studies on DIDEMO

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the DIDEMO dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 1.6(0.1) | 7.9(0.6) | 14.2(0.7) | 65.5(2.3) | 374.62k | config, model, log |
| CE - MW,P,CG | t2v | 14.8(0.3) | 40.4(1.7) | 54.7(0.8) | 8.7(0.6) | 107.26M | config, model, log |
| CE - P,CG | t2v | 15.4(0.8) | 42.0(0.2) | 55.2(0.8) | 8.3(0.6) | 107.26M | config, model, log |
| CE - CG | t2v | 14.6(1.0) | 40.1(0.2) | 54.2(0.2) | 8.7(0.6) | 76.91M | config, model, log |
| CE | t2v | 16.1(1.4) | 41.1(0.4) | 54.4(0.8) | 8.3(0.6) | 79.29M | config, model, log |
| Concat | v2t | 2.4(0.5) | 8.5(0.6) | 14.0(1.0) | 72.5(4.4) | 374.62k | config, model, log |
| CE - MW,P,CG | v2t | 15.9(0.4) | 41.1(0.5) | 54.9(0.7) | 8.3(0.6) | 107.26M | config, model, log |
| CE - P,CG | v2t | 16.2(0.3) | 41.4(0.2) | 54.4(0.9) | 8.3(0.6) | 107.26M | config, model, log |
| CE - CG | v2t | 15.7(0.9) | 39.6(0.8) | 53.7(0.9) | 9.0(0.0) | 76.91M | config, model, log |
| CE | v2t | 15.6(1.3) | 40.9(0.4) | 55.2(0.5) | 8.2(0.3) | 79.29M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

  • Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates how much each expert contributes to the final embedding. Since not all experts are available in every video, we pair each expert with scene features to approximate its usefulness.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 6.7(1.3) | 20.8(0.7) | 32.7(0.9) | 25.0(1.0) | 7.62M | config, model, log |
| Scene + Inst. | t2v | 12.1(0.6) | 31.7(0.2) | 46.0(1.3) | 13.0(1.0) | 17.47M | config, model, log |
| Scene + r2p1d | t2v | 13.0(0.8) | 36.8(1.0) | 52.0(0.3) | 10.0(0.0) | 16.29M | config, model, log |
| Scene + RGB | t2v | 8.7(1.6) | 26.4(1.0) | 39.7(0.4) | 16.8(0.8) | 17.47M | config, model, log |
| Scene + Flow | t2v | 10.1(0.7) | 29.4(1.1) | 42.9(0.8) | 14.5(0.9) | 16.68M | config, model, log |
| Scene + Audio | t2v | 6.3(0.3) | 23.0(0.8) | 33.6(0.7) | 22.7(1.3) | 17.47M | config, model, log |
| Scene + OCR | t2v | 6.1(0.6) | 20.5(0.7) | 32.3(1.1) | 25.7(1.2) | 18.9M | config, model, log |
| Scene + Speech | t2v | 6.1(0.5) | 21.0(1.0) | 30.7(0.4) | 27.2(0.3) | 28.6M | config, model, log |
| Scene + Face | t2v | 6.4(0.1) | 20.8(1.2) | 31.9(0.7) | 24.8(1.8) | 16.29M | config, model, log |
| Scene | v2t | 6.6(0.7) | 21.3(0.4) | 33.0(0.4) | 25.2(2.5) | 7.62M | config, model, log |
| Scene + Inst. | v2t | 12.5(1.3) | 33.0(0.5) | 46.1(0.6) | 13.3(1.2) | 17.47M | config, model, log |
| Scene + r2p1d | v2t | 13.5(0.6) | 36.2(0.2) | 51.3(1.3) | 10.3(0.6) | 16.29M | config, model, log |
| Scene + RGB | v2t | 8.5(1.0) | 26.8(1.4) | 38.8(0.7) | 16.7(0.6) | 17.47M | config, model, log |
| Scene + Flow | v2t | 11.3(0.2) | 29.8(0.6) | 42.1(1.2) | 15.3(1.5) | 16.68M | config, model, log |
| Scene + Audio | v2t | 6.6(0.4) | 23.0(2.3) | 33.6(1.0) | 22.2(2.0) | 17.47M | config, model, log |
| Scene + OCR | v2t | 6.6(0.3) | 20.9(1.2) | 32.1(0.7) | 26.2(1.8) | 18.9M | config, model, log |
| Scene + Speech | v2t | 6.8(0.8) | 21.1(0.9) | 31.4(1.2) | 27.3(0.6) | 28.6M | config, model, log |
| Scene + Face | v2t | 6.8(0.2) | 20.7(0.3) | 32.1(1.5) | 25.7(1.5) | 16.29M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 6.7(1.3) | 20.8(0.7) | 32.7(0.9) | 25.0(1.0) | 7.62M | config, model, log |
| Prev. + Speech | t2v | 6.1(0.5) | 21.0(1.0) | 30.7(0.4) | 27.2(0.3) | 28.6M | config, model, log |
| Prev. + Audio | t2v | 6.8(0.1) | 23.1(0.3) | 33.6(0.5) | 21.8(0.8) | 36.09M | config, model, log |
| Prev. + Flow | t2v | 10.9(1.1) | 31.0(1.1) | 44.1(1.1) | 14.0(1.0) | 42.79M | config, model, log |
| Prev. + RGB | t2v | 12.0(1.1) | 33.0(0.1) | 45.9(0.4) | 12.7(0.6) | 50.27M | config, model, log |
| Prev. + Inst. | t2v | 13.7(2.1) | 34.6(1.9) | 49.4(1.0) | 11.0(1.0) | 57.76M | config, model, log |
| Prev. + R2P1D | t2v | 15.5(0.7) | 40.1(0.2) | 54.9(1.5) | 8.3(0.6) | 64.06M | config, model, log |
| Prev. + OCR | t2v | 15.6(0.5) | 39.9(1.1) | 54.2(1.1) | 9.0(0.0) | 72.98M | config, model, log |
| Prev. + Face | t2v | 16.1(1.5) | 41.2(0.5) | 54.5(0.9) | 8.3(0.6) | 79.29M | config, model, log |
| Scene | v2t | 6.6(0.7) | 21.3(0.4) | 33.0(0.4) | 25.2(2.5) | 7.62M | config, model, log |
| Prev. + Speech | v2t | 6.8(0.8) | 21.1(0.9) | 31.4(1.2) | 27.3(0.6) | 28.6M | config, model, log |
| Prev. + Audio | v2t | 7.1(0.3) | 22.8(0.5) | 33.9(0.2) | 22.2(0.8) | 36.09M | config, model, log |
| Prev. + Flow | v2t | 11.4(0.5) | 30.8(0.9) | 44.0(1.0) | 14.0(1.0) | 42.79M | config, model, log |
| Prev. + RGB | v2t | 11.9(0.2) | 32.6(1.8) | 45.7(1.6) | 12.7(0.6) | 50.27M | config, model, log |
| Prev. + Inst. | v2t | 12.9(0.6) | 36.1(1.1) | 49.2(1.3) | 11.0(1.0) | 57.76M | config, model, log |
| Prev. + R2P1D | v2t | 15.6(0.5) | 40.3(0.5) | 54.3(0.9) | 8.7(0.6) | 64.06M | config, model, log |
| Prev. + OCR | v2t | 16.1(1.5) | 38.6(1.0) | 53.4(1.0) | 9.0(1.0) | 72.98M | config, model, log |
| Prev. + Face | v2t | 16.2(1.6) | 41.1(0.7) | 55.1(0.3) | 8.2(0.3) | 79.29M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 15.2(0.8) | 40.2(1.5) | 54.6(1.0) | 8.7(0.6) | 36.45M | config, model, log |
| 512 | t2v | 15.3(0.8) | 39.9(0.9) | 53.7(0.9) | 9.0(0.0) | 50.01M | config, model, log |
| 640 | t2v | 16.2(0.6) | 40.2(0.3) | 54.3(0.2) | 8.7(0.6) | 64.29M | config, model, log |
| 768 | t2v | 16.1(1.4) | 41.1(0.4) | 54.4(0.8) | 8.3(0.6) | 79.29M | config, model, log |
| 1024 | t2v | 15.6(0.8) | 39.3(1.2) | 53.8(0.4) | 8.7(0.6) | 111.44M | config, model, log |
| 384 | v2t | 15.0(0.2) | 40.2(1.6) | 53.6(2.5) | 9.0(1.0) | 36.45M | config, model, log |
| 512 | v2t | 15.6(2.7) | 39.2(1.0) | 52.4(0.5) | 9.3(0.6) | 50.01M | config, model, log |
| 640 | v2t | 15.8(0.8) | 40.1(0.1) | 54.4(0.7) | 9.0(0.0) | 64.29M | config, model, log |
| 768 | v2t | 15.6(1.3) | 40.9(0.4) | 55.2(0.5) | 8.2(0.3) | 79.29M | config, model, log |
| 1024 | v2t | 15.6(1.1) | 39.9(0.6) | 54.1(1.0) | 8.7(0.6) | 111.44M | config, model, log |

Ablation studies on MSVD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSVD dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 3.5(0.1) | 13.9(0.2) | 24.1(0.3) | 32.7(0.6) | 314.8k | config, model, log |
| CE - MW,P,CG | t2v | 16.8(0.1) | 44.8(0.5) | 59.6(0.6) | 7.0(0.0) | 131.37M | config, model, log |
| CE - P,CG | t2v | 18.9(1.0) | 48.1(1.0) | 63.2(0.7) | 6.0(0.0) | 131.37M | config, model, log |
| CE - CG | t2v | 19.6(0.5) | 49.4(0.8) | 64.2(1.0) | 5.7(0.6) | 81.67M | config, model, log |
| CE | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 6.0(0.0) | 84.04M | config, model, log |
| Concat | v2t | 4.0(0.6) | 14.9(0.8) | 22.4(0.8) | 42.5(0.9) | 314.8k | config, model, log |
| CE - MW,P,CG | v2t | 21.7(0.3) | 47.6(0.1) | 58.2(0.4) | 6.4(0.5) | 131.37M | config, model, log |
| CE - P,CG | v2t | 22.9(2.5) | 48.6(1.2) | 58.3(1.3) | 6.2(0.4) | 131.37M | config, model, log |
| CE - CG | v2t | 23.4(2.5) | 49.0(1.7) | 59.9(1.4) | 5.8(0.7) | 81.67M | config, model, log |
| CE | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 5.6(0.5) | 84.04M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

  • Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates how much each expert contributes to the final embedding. Since not all experts are available in every video, we pair each expert with scene features to approximate its usefulness.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.0(0.2) | 21.8(0.4) | 32.9(0.1) | 23.0(0.0) | 9.99M | config, model, log |
| Scene + Inst. | t2v | 16.0(0.3) | 41.8(0.5) | 57.3(0.1) | 8.0(0.0) | 22.2M | config, model, log |
| Scene + r2p1d | t2v | 16.8(0.0) | 43.7(0.3) | 58.4(0.1) | 7.0(0.0) | 21.02M | config, model, log |
| Scene + RGB | t2v | 10.7(0.3) | 31.5(0.2) | 45.9(0.4) | 12.7(0.6) | 22.2M | config, model, log |
| Scene + Flow | t2v | 11.7(0.3) | 34.2(0.3) | 48.1(0.2) | 11.3(0.6) | 21.41M | config, model, log |
| Scene + OCR | t2v | 7.1(0.2) | 22.3(0.0) | 33.8(0.2) | 22.7(0.6) | 37.95M | config, model, log |
| Scene + Face | t2v | 6.9(0.0) | 22.4(0.4) | 33.9(0.5) | 22.7(0.6) | 21.02M | config, model, log |
| Scene | v2t | 8.3(0.4) | 20.8(1.1) | 29.0(0.4) | 50.5(3.6) | 9.99M | config, model, log |
| Scene + Inst. | v2t | 17.6(0.9) | 40.8(0.3) | 52.1(0.3) | 9.2(0.3) | 22.2M | config, model, log |
| Scene + r2p1d | v2t | 20.9(0.5) | 43.7(2.0) | 53.5(1.3) | 8.2(0.8) | 21.02M | config, model, log |
| Scene + RGB | v2t | 11.0(0.8) | 28.5(0.2) | 37.2(0.5) | 25.2(0.7) | 22.2M | config, model, log |
| Scene + Flow | v2t | 14.4(1.3) | 34.4(0.6) | 43.8(1.3) | 15.8(1.3) | 21.41M | config, model, log |
| Scene + OCR | v2t | 8.9(1.4) | 22.0(2.2) | 29.1(1.0) | 50.2(4.9) | 37.95M | config, model, log |
| Scene + Face | v2t | 8.4(0.9) | 20.8(0.7) | 29.3(0.7) | 49.2(8.1) | 21.02M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.0(0.2) | 21.8(0.4) | 32.9(0.1) | 23.0(0.0) | 9.99M | config, model, log |
| Prev. + Flow | t2v | 11.7(0.3) | 34.2(0.3) | 48.1(0.2) | 11.3(0.6) | 21.41M | config, model, log |
| Prev. + RGB | t2v | 12.8(0.6) | 37.4(0.4) | 52.2(0.1) | 10.0(0.0) | 31.26M | config, model, log |
| Prev. + Inst. | t2v | 16.0(0.3) | 42.1(0.7) | 57.4(0.3) | 7.7(0.6) | 41.11M | config, model, log |
| Prev. + R2P1D | t2v | 19.3(0.3) | 48.5(0.4) | 63.3(0.7) | 6.0(0.0) | 49.78M | config, model, log |
| Prev. + OCR | t2v | 18.6(0.9) | 47.4(1.3) | 62.6(1.1) | 6.0(0.0) | 75.38M | config, model, log |
| Prev. + Face | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 6.0(0.0) | 84.04M | config, model, log |
| Scene | v2t | 8.3(0.4) | 20.8(1.1) | 29.0(0.4) | 50.5(3.6) | 9.99M | config, model, log |
| Prev. + Flow | v2t | 14.4(1.3) | 34.4(0.6) | 43.8(1.3) | 15.8(1.3) | 21.41M | config, model, log |
| Prev. + RGB | v2t | 14.5(2.6) | 34.9(1.3) | 44.7(1.4) | 15.2(0.7) | 31.26M | config, model, log |
| Prev. + Inst. | v2t | 17.6(0.8) | 41.1(1.9) | 52.1(2.2) | 9.2(1.4) | 41.11M | config, model, log |
| Prev. + R2P1D | v2t | 23.1(0.7) | 48.2(0.7) | 58.5(0.3) | 6.3(0.4) | 49.78M | config, model, log |
| Prev. + OCR | v2t | 22.6(2.1) | 46.7(1.9) | 57.0(2.9) | 7.0(1.0) | 75.38M | config, model, log |
| Prev. + Face | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 5.6(0.5) | 84.04M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 19.3(0.4) | 48.1(0.7) | 63.3(0.4) | 6.0(0.0) | 39.43M | config, model, log |
| 512 | t2v | 19.4(0.4) | 48.9(0.4) | 63.7(0.2) | 6.0(0.0) | 53.71M | config, model, log |
| 640 | t2v | 19.8(0.8) | 49.5(0.6) | 64.2(0.8) | 6.0(0.0) | 68.58M | config, model, log |
| 768 | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 6.0(0.0) | 84.04M | config, model, log |
| 1024 | t2v | 18.9(0.8) | 47.7(1.6) | 62.9(1.4) | 6.3(0.6) | 116.73M | config, model, log |
| 384 | v2t | 21.8(0.3) | 48.8(1.4) | 59.7(1.9) | 6.2(0.7) | 39.43M | config, model, log |
| 512 | v2t | 23.5(0.8) | 48.7(0.2) | 59.0(0.8) | 6.0(0.0) | 53.71M | config, model, log |
| 640 | v2t | 23.8(2.8) | 48.3(1.8) | 60.1(2.3) | 6.3(0.6) | 68.58M | config, model, log |
| 768 | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 5.6(0.5) | 84.04M | config, model, log |
| 1024 | v2t | 21.2(2.7) | 46.5(1.9) | 57.0(1.6) | 7.0(1.0) | 116.73M | config, model, log |

Ablation studies on ACTIVITY-NET

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the ACTIVITY-NET dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

| Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Concat | t2v | 1.2(0.1) | 5.2(0.3) | 9.4(0.4) | 120.0(6.2) | 417.8k | config, model, log |
| CE - MW,P,CG | t2v | 17.4(0.4) | 45.8(0.6) | 61.4(0.6) | 6.3(0.6) | 330.42M | config, model, log |
| CE - P,CG | t2v | 18.3(0.7) | 47.2(0.6) | 63.2(0.5) | 6.0(0.0) | 330.42M | config, model, log |
| CE - CG | t2v | 17.6(0.4) | 46.8(0.5) | 62.9(0.4) | 6.0(0.0) | 258.3M | config, model, log |
| CE | t2v | 18.2(0.3) | 47.7(0.6) | 63.9(0.5) | 6.0(0.0) | 260.68M | config, model, log |
| Concat | v2t | 1.3(0.1) | 5.3(0.6) | 9.7(0.6) | 141.7(2.9) | 417.8k | config, model, log |
| CE - MW,P,CG | v2t | 17.6(0.2) | 45.9(0.5) | 61.5(0.7) | 6.7(0.6) | 330.42M | config, model, log |
| CE - P,CG | v2t | 17.2(0.1) | 46.3(0.4) | 62.5(0.5) | 6.0(0.0) | 330.42M | config, model, log |
| CE - CG | v2t | 17.3(0.2) | 46.4(0.3) | 62.5(0.5) | 6.0(0.0) | 258.3M | config, model, log |
| CE | v2t | 17.7(0.6) | 46.6(0.7) | 62.8(0.4) | 6.0(0.0) | 260.68M | config, model, log |

Each row adds an additional component to the model. The names refer to the following model designs:

  • Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates how much each expert contributes to the final embedding. Since not all experts are available in every video, we pair each expert with scene features to approximate its usefulness.

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.3(0.4) | 21.5(0.7) | 32.4(0.2) | 24.7(0.6) | 25.97M | config, model, log |
| Scene + Inst. | t2v | 14.4(0.3) | 38.9(0.3) | 53.3(0.2) | 9.0(0.0) | 54.13M | config, model, log |
| Scene + r2p1d | t2v | 16.0(0.5) | 43.8(0.4) | 60.6(0.3) | 7.0(0.0) | 52.95M | config, model, log |
| Scene + RGB | t2v | 10.2(0.7) | 29.7(0.4) | 43.1(0.7) | 14.3(0.6) | 54.13M | config, model, log |
| Scene + Flow | t2v | 13.8(0.3) | 37.9(0.1) | 53.1(0.3) | 9.0(0.0) | 53.35M | config, model, log |
| Scene + Audio | t2v | 8.0(0.2) | 24.1(0.7) | 35.7(0.3) | 21.3(0.6) | 53.15M | config, model, log |
| Scene + OCR | t2v | 7.4(0.3) | 23.0(0.3) | 33.8(0.3) | 23.3(0.6) | 62.73M | config, model, log |
| Scene + Speech | t2v | 7.4(0.1) | 23.1(0.4) | 34.4(0.6) | 22.3(0.6) | 75.66M | config, model, log |
| Scene + Face | t2v | 7.9(0.3) | 24.7(0.8) | 36.2(0.7) | 21.0(0.0) | 52.95M | config, model, log |
| Scene | v2t | 6.4(0.2) | 20.4(0.3) | 31.4(0.1) | 25.3(0.6) | 25.97M | config, model, log |
| Scene + Inst. | v2t | 12.4(0.2) | 35.9(0.1) | 50.5(0.3) | 10.0(0.0) | 54.13M | config, model, log |
| Scene + r2p1d | v2t | 13.7(0.2) | 40.5(0.2) | 57.1(0.3) | 8.0(0.0) | 52.95M | config, model, log |
| Scene + RGB | v2t | 9.3(0.2) | 28.2(0.4) | 41.3(0.5) | 15.7(0.6) | 54.13M | config, model, log |
| Scene + Flow | v2t | 12.4(0.1) | 36.3(0.4) | 52.1(0.3) | 10.0(0.0) | 53.35M | config, model, log |
| Scene + Audio | v2t | 7.3(0.2) | 23.2(0.3) | 34.1(0.3) | 22.0(0.0) | 53.15M | config, model, log |
| Scene + OCR | v2t | 6.4(0.1) | 20.5(0.8) | 31.2(0.5) | 26.7(0.6) | 62.73M | config, model, log |
| Scene + Speech | v2t | 6.4(0.2) | 20.9(0.1) | 32.4(0.4) | 24.7(0.6) | 75.66M | config, model, log |
| Scene + Face | v2t | 7.2(0.3) | 23.1(0.3) | 34.4(1.0) | 21.7(0.6) | 52.95M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 7.3(0.4) | 21.5(0.7) | 32.4(0.2) | 24.7(0.6) | 25.97M | config, model, log |
| Prev. + Speech | t2v | 7.4(0.1) | 23.1(0.4) | 34.4(0.6) | 22.3(0.6) | 75.66M | config, model, log |
| Prev. + Audio | t2v | 7.7(0.3) | 23.7(0.1) | 35.3(0.3) | 21.3(1.2) | 100.47M | config, model, log |
| Prev. + Flow | t2v | 14.4(0.1) | 39.1(0.5) | 54.5(0.1) | 9.0(0.0) | 125.48M | config, model, log |
| Prev. + RGB | t2v | 14.9(0.6) | 40.7(0.2) | 55.9(0.2) | 8.0(0.0) | 151.27M | config, model, log |
| Prev. + Inst. | t2v | 15.8(0.6) | 43.2(0.3) | 58.3(0.4) | 7.0(0.0) | 177.07M | config, model, log |
| Prev. + R2P1D | t2v | 18.2(0.5) | 46.4(0.5) | 62.5(1.1) | 6.0(0.0) | 201.68M | config, model, log |
| Prev. + OCR | t2v | 18.0(0.6) | 46.9(0.3) | 63.2(0.2) | 6.0(0.0) | 236.06M | config, model, log |
| Prev. + Face | t2v | 18.4(0.2) | 47.9(0.5) | 63.6(0.5) | 6.0(0.0) | 260.68M | config, model, log |
| Scene | v2t | 6.4(0.2) | 20.4(0.3) | 31.4(0.1) | 25.3(0.6) | 25.97M | config, model, log |
| Prev. + Speech | v2t | 6.4(0.2) | 20.9(0.1) | 32.4(0.4) | 24.7(0.6) | 75.66M | config, model, log |
| Prev. + Audio | v2t | 7.2(0.2) | 22.7(0.3) | 34.3(0.2) | 22.3(0.6) | 100.47M | config, model, log |
| Prev. + Flow | v2t | 13.8(0.3) | 37.9(0.3) | 53.4(0.2) | 9.0(0.0) | 125.48M | config, model, log |
| Prev. + RGB | v2t | 14.6(0.6) | 40.0(0.3) | 55.4(0.4) | 8.3(0.6) | 151.27M | config, model, log |
| Prev. + Inst. | v2t | 15.4(0.4) | 42.1(0.2) | 58.1(0.7) | 7.7(0.6) | 177.07M | config, model, log |
| Prev. + R2P1D | v2t | 17.4(0.4) | 45.7(0.4) | 62.0(0.8) | 6.3(0.6) | 201.68M | config, model, log |
| Prev. + OCR | v2t | 17.0(0.2) | 45.8(0.2) | 61.8(0.2) | 6.3(0.6) | 236.06M | config, model, log |
| Prev. + Face | v2t | 17.7(0.7) | 46.5(0.5) | 62.7(0.4) | 6.3(0.6) | 260.68M | config, model, log |

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

| Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 384 | t2v | 17.5(0.2) | 46.9(0.0) | 63.2(0.2) | 6.0(0.0) | 127.3M | config, model, log |
| 512 | t2v | 18.0(0.3) | 47.4(0.8) | 63.1(0.4) | 6.0(0.0) | 171.04M | config, model, log |
| 640 | t2v | 18.2(0.6) | 47.8(1.0) | 63.4(0.9) | 6.0(0.0) | 215.5M | config, model, log |
| 768 | t2v | 18.2(0.3) | 47.7(0.6) | 63.9(0.5) | 6.0(0.0) | 260.68M | config, model, log |
| 1024 | t2v | 18.3(0.1) | 48.2(0.8) | 63.3(0.5) | 6.0(0.0) | 353.2M | config, model, log |
| 384 | v2t | 16.8(0.3) | 45.3(0.1) | 61.9(0.4) | 6.7(0.6) | 127.3M | config, model, log |
| 512 | v2t | 17.2(0.1) | 45.9(0.7) | 62.0(0.7) | 6.7(0.6) | 171.04M | config, model, log |
| 640 | v2t | 17.5(0.5) | 46.1(0.5) | 62.4(0.3) | 6.3(0.6) | 215.5M | config, model, log |
| 768 | v2t | 17.7(0.6) | 46.6(0.7) | 62.8(0.4) | 6.0(0.0) | 260.68M | config, model, log |
| 1024 | v2t | 17.7(0.1) | 47.0(0.5) | 63.4(0.4) | 6.0(0.0) | 353.2M | config, model, log |

Ablation studies on QuerYD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 8.7(0.4) | 26.3(1.1) | 37.1(0.7) | 68.5(2.2) | 22.2(1.6) | 52.3(3.0) | 20.4(0.1) | 7.51M | config, model, log |
| Scene + Inst. | t2v | 11.7(1.4) | 31.6(0.9) | 43.4(1.3) | 74.5(0.9) | 14.0(1.0) | 41.1(2.1) | 25.2(0.8) | 17.25M | config, model, log |
| Scene + r2p1d | t2v | 11.7(2.1) | 32.1(3.0) | 45.3(3.3) | 74.6(0.4) | 13.7(1.9) | 42.9(2.2) | 25.7(2.4) | 16.07M | config, model, log |
| Scene + Audio | t2v | 7.6(2.7) | 27.4(1.4) | 40.4(0.9) | 69.1(0.9) | 17.0(1.7) | 49.0(1.9) | 20.2(2.3) | 17.25M | config, model, log |
| Scene | v2t | 9.1(0.8) | 25.4(0.9) | 35.3(1.5) | 68.2(2.2) | 23.2(0.3) | 52.6(2.6) | 20.1(0.5) | 7.51M | config, model, log |
| Scene + Inst. | v2t | 11.9(0.5) | 31.0(3.6) | 43.5(2.7) | 74.8(1.8) | 14.5(0.9) | 40.8(2.1) | 25.2(1.1) | 17.25M | config, model, log |
| Scene + r2p1d | v2t | 12.7(1.4) | 30.9(2.8) | 44.0(1.8) | 74.3(1.2) | 14.3(1.2) | 42.8(1.7) | 25.8(1.7) | 16.07M | config, model, log |
| Scene + Audio | v2t | 10.1(1.2) | 25.7(1.5) | 37.5(1.2) | 69.8(1.6) | 20.0(1.3) | 48.9(2.0) | 21.3(1.1) | 17.25M | config, model, log |

We can also study their cumulative effect:

| Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | t2v | 8.7(0.4) | 26.3(1.1) | 37.1(0.7) | 68.5(2.2) | 22.2(1.6) | 52.3(3.0) | 20.4(0.1) | 7.51M | config, model, log |
| Prev. + Audio | t2v | 7.6(2.7) | 27.4(1.4) | 40.4(0.9) | 69.1(0.9) | 17.0(1.7) | 49.0(1.9) | 20.2(2.3) | 17.25M | config, model, log |
| Prev. + Inst. | t2v | 12.7(1.7) | 34.8(1.7) | 47.0(1.3) | 78.0(1.0) | 12.3(0.6) | 37.6(2.1) | 27.5(1.5) | 24.63M | config, model, log |
| Prev. + R2P1D | t2v | 14.3(0.3) | 37.5(1.3) | 48.6(0.8) | 78.8(0.3) | 11.3(0.6) | 35.2(1.8) | 29.7(0.3) | 30.82M | config, model, log |
| Scene | v2t | 9.1(0.8) | 25.4(0.9) | 35.3(1.5) | 68.2(2.2) | 23.2(0.3) | 52.6(2.6) | 20.1(0.5) | 7.51M | config, model, log |
| Prev. + Audio | v2t | 10.1(1.2) | 25.7(1.5) | 37.5(1.2) | 69.8(1.6) | 20.0(1.3) | 48.9(2.0) | 21.3(1.1) | 17.25M | config, model, log |
| Prev. + Inst. | v2t | 12.8(1.3) | 33.5(2.8) | 46.6(1.0) | 76.7(1.7) | 11.8(0.8) | 37.6(1.9) | 27.1(0.6) | 24.63M | config, model, log |
| Prev. + R2P1D | v2t | 14.0(0.3) | 35.4(2.9) | 47.2(2.8) | 78.7(2.4) | 12.3(1.5) | 35.8(2.4) | 28.6(1.2) | 30.82M | config, model, log |
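
The QuerYD tables add MnR (mean rank) and a Geom column. The Geom values are consistent with the geometric mean of R@1, R@5 and R@10; this reading is an assumption, checked here only against the first row of the table. A minimal sketch, with a hypothetical function name:

```python
def geometric_mean_recall(r1, r5, r10):
    """Geometric mean of the three recall values (cube root of their product)."""
    return (r1 * r5 * r10) ** (1.0 / 3.0)


# Scene-only t2v row on QuerYD: R@1 = 8.7, R@5 = 26.3, R@10 = 37.1.
geom = geometric_mean_recall(8.7, 26.3, 37.1)
print(round(geom, 1))  # 20.4, matching the Geom column for that row
```

Because the reported numbers appear to be aggregated over several runs, recomputing Geom from the rounded table entries will only approximate the reported value and can differ in the last digit for other rows.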