InternVideo2s2 models are initialized from the InternVideo2s1 checkpoints and further pretrained on multimodal datasets.
InternVideo2clip models are initialized from the InternVideo2s2 checkpoints.

| Model | Setting | Weights | Pretraining Script |
|---|---|---|---|
| InternVideo2s2-1B | IV-25.5M | :hugs: HF link | script |
| InternVideo2clip-1B | IV-25.5M | :hugs: HF link | script |
| InternVideo2s2-6B | IV-400M | :hugs: HF link | script |
| InternVideo2clip-6B | IV-400M | :hugs: HF link | script |
| InternVideo2s2-S14 | IV-25.5M Distillation | :hugs: HF link | - |
| InternVideo2s2-B14 | IV-25.5M Distillation | :hugs: HF link | - |
| InternVideo2s2-L14 | IV-25.5M Distillation | :hugs: HF link | - |
| InternVideo2clip-S14 | IV-25.5M Distillation | :hugs: HF link | script |
| InternVideo2clip-B14 | IV-25.5M Distillation | :hugs: HF link | script |
| InternVideo2clip-L14 | IV-25.5M Distillation | :hugs: HF link | script |

| Model | Dataset | T2V | V2T | Evaluation Script |
|---|---|---|---|---|
| InternVideo2s2-1B | MSRVTT | 51.9 | 50.9 | script |
| | LSMDC | 32.0 | 27.3 | script |
| | DiDeMo | 57.0 | 54.3 | script |
| | MSVD | 58.1 | 83.3 | script |
| | ANet | 60.4 | 54.8 | script |
| | VATEX | 70.4 | 85.4 | script |
| InternVideo2s2-6B | MSRVTT | 55.9 | 53.7 | TBD |
| | LSMDC | 33.8 | 30.1 | TBD |
| | DiDeMo | 57.9 | 57.1 | TBD |
| | MSVD | 59.3 | 83.1 | TBD |
| | ANet | 63.2 | 56.5 | TBD |
| | VATEX | 71.5 | 85.3 | TBD |

| Model | Dataset | T2V | V2T | Evaluation Script |
|---|---|---|---|---|
| InternVideo2clip-1B | MSRVTT | 50.0 | 48.4 | script |
| | LSMDC | 26.4 | 23.1 | script |
| | DiDeMo | 47.8 | 46.4 | script |
| | ANet | 49.4 | 46.2 | script |
| | VATEX_en | 63.5 | 81.2 | script |
| | VATEX_ch | 54.9 | 76.4 | script |
| InternVideo2clip-6B | MSRVTT | 50.9 | 50.6 | script |
| | LSMDC | 29.4 | 26.3 | script |
| | DiDeMo | 50.5 | 46.8 | script |
| | ANet | 50.2 | 47.5 | script |
| | VATEX_en | 64.1 | 82.6 | script |
| | VATEX_ch | 54.6 | 76.9 | script |
| InternVideo2clip-S14 | MSRVTT | 35.6 | 35.9 | script |
| | LSMDC | 14.7 | 12.8 | script |
| | DiDeMo | 33.7 | 35.5 | script |
| | ANet | 34.5 | 23.6 | script |
| | VATEX_en | 49.9 | 69.1 | script |
| | VATEX_ch | 1.9 | 7.6 | script |
| InternVideo2clip-B14 | MSRVTT | 40.3 | 48.5 | script |
| | LSMDC | 18.7 | 16.5 | script |
| | DiDeMo | 40.3 | 39.1 | script |
| | ANet | 41.5 | 38.8 | script |
| | VATEX_en | 56.8 | 74.5 | script |
| | VATEX_ch | 1.8 | 8.8 | script |
| InternVideo2clip-L14 | MSRVTT | 42.1 | 44.1 | script |
| | LSMDC | 21.4 | 18.9 | script |
| | DiDeMo | 42.8 | 43.2 | script |
| | ANet | 43.6 | 40.7 | script |
| | VATEX_en | 59.6 | 75.5 | script |
| | VATEX_ch | 1.6 | 9.8 | script |
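
For readers unfamiliar with the T2V and V2T columns, the sketch below shows one common way such retrieval scores are computed from a similarity matrix. It assumes the columns report recall@1 as a percentage and that each caption matches exactly one video at the same index. Neither assumption is stated in this README, and datasets with several captions per video (MSVD, for example) need a more careful mapping. The function names `recall_at_k` and `transpose` and the toy matrix are illustrative only and are not taken from the evaluation scripts.

```python
from typing import List, Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of queries whose true match ranks within the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher
        # than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap queries and candidates (text-to-video becomes video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy similarity matrix: rows are captions, columns are videos.
t2v_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]

t2v_r1 = recall_at_k(t2v_sim, k=1)             # captions retrieve videos
v2t_r1 = recall_at_k(transpose(t2v_sim), k=1)  # videos retrieve captions
print(f"T2V R@1 = {t2v_r1:.1f}, V2T R@1 = {v2t_r1:.1f}")
```

The two directions can differ on the same matrix, which is why the tables list T2V and V2T separately.
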

| Model | Dataset | mAP | Script |
|---|---|---|---|
| InternVideo2clip-1B | Charades | 32.9 | script |
| InternVideo2clip-6B | Charades | 34.6 | script |
| InternVideo2clip-S14 | Charades | 21.7 | script |
| InternVideo2clip-B14 | Charades | 26.1 | script |
| InternVideo2clip-L14 | Charades | 30.1 | script |
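
As a rough illustration of the mAP column, the sketch below computes mean average precision for a multi-label task such as Charades. It assumes the usual definition (per-class average precision over videos ranked by score, averaged over classes that have at least one positive). The linked script may differ in details such as tie handling or class filtering. All names and the toy data are hypothetical.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one class: mean precision at the rank of each positive item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """mAP in percent; matrices are indexed as [video][class].

    Assumption: classes with no positive video are skipped.
    """
    aps: List[float] = []
    for c in range(len(score_matrix[0])):
        labels = [row[c] for row in label_matrix]
        if not any(labels):
            continue
        scores = [row[c] for row in score_matrix]
        aps.append(average_precision(scores, labels))
    return 100.0 * sum(aps) / len(aps)


# Toy example: 4 videos, 2 action classes.
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]

toy_map = mean_average_precision(toy_scores, toy_labels)
print(f"mAP = {toy_map:.1f}")
```
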