Model Zoo

February 27, 2025

Pretraining

For $\text{InternVideo2}_{s2}$, we load the $\text{InternVideo2}_{s1}$ models and further pretrain them on multimodal datasets.

For $\text{InternVideo2}_{clip}$, we load the $\text{InternVideo2}_{s2}$ models.

| Model | Setting | Checkpoint | Pretraining Script |
| --- | --- | --- | --- |
| $\text{InternVideo2}_{s2}$-1B | IV-25.5M | :hugs: HF link | script |
| $\text{InternVideo2}_{clip}$-1B | IV-25.5M | :hugs: HF link | script |
| $\text{InternVideo2}_{s2}$-6B | IV-400M | :hugs: HF link | script |
| $\text{InternVideo2}_{clip}$-6B | IV-400M | :hugs: HF link | script |
| $\text{InternVideo2}_{s2}$-S14 | IV-25.5M Distillation | :hugs: HF link | - |
| $\text{InternVideo2}_{s2}$-B14 | IV-25.5M Distillation | :hugs: HF link | - |
| $\text{InternVideo2}_{s2}$-L14 | IV-25.5M Distillation | :hugs: HF link | - |
| $\text{InternVideo2}_{clip}$-S14 | IV-25.5M Distillation | :hugs: HF link | script |
| $\text{InternVideo2}_{clip}$-B14 | IV-25.5M Distillation | :hugs: HF link | script |
| $\text{InternVideo2}_{clip}$-L14 | IV-25.5M Distillation | :hugs: HF link | script |

Zero-shot Evaluation

Zero-Shot Video-Text Retrieval

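| Model | Dataset | T2V | V2T | Evaluation Script |
| --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{s2}$-1B | MSRVTT | 51.9 | 50.9 | script |
| | LSMDC | 32.0 | 27.3 | script |
| | DiDeMo | 57.0 | 54.3 | script |
| | MSVD | 58.1 | 83.3 | script |
| | ANet | 60.4 | 54.8 | script |
| | VATEX | 70.4 | 85.4 | script |
| $\text{InternVideo2}_{s2}$-6B | MSRVTT | 55.9 | 53.7 | TBD |
| | LSMDC | 33.8 | 30.1 | TBD |
| | DiDeMo | 57.9 | 57.1 | TBD |
| | MSVD | 59.3 | 83.1 | TBD |
| | ANet | 63.2 | 56.5 | TBD |
| | VATEX | 71.5 | 85.3 | TBD |

| Model | Dataset | T2V | V2T | Evaluation Script |
| --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{clip}$-1B | MSRVTT | 50.0 | 48.4 | script |
| | LSMDC | 26.4 | 23.1 | script |
| | DiDeMo | 47.8 | 46.4 | script |
| | ANet | 49.4 | 46.2 | script |
| | VATEX_en | 63.5 | 81.2 | script |
| | VATEX_ch | 54.9 | 76.4 | script |
| $\text{InternVideo2}_{clip}$-6B | MSRVTT | 50.9 | 50.6 | script |
| | LSMDC | 29.4 | 26.3 | script |
| | DiDeMo | 50.5 | 46.8 | script |
| | ANet | 50.2 | 47.5 | script |
| | VATEX_en | 64.1 | 82.6 | script |
| | VATEX_ch | 54.6 | 76.9 | script |
| $\text{InternVideo2}_{clip}$-S14 | MSRVTT | 35.6 | 35.9 | script |
| | LSMDC | 14.7 | 12.8 | script |
| | DiDeMo | 33.7 | 35.5 | script |
| | ANet | 34.5 | 23.6 | script |
| | VATEX_en | 49.9 | 69.1 | script |
| | VATEX_ch | 1.9 | 7.6 | script |
| $\text{InternVideo2}_{clip}$-B14 | MSRVTT | 40.3 | 48.5 | script |
| | LSMDC | 18.7 | 16.5 | script |
| | DiDeMo | 40.3 | 39.1 | script |
| | ANet | 41.5 | 38.8 | script |
| | VATEX_en | 56.8 | 74.5 | script |
| | VATEX_ch | 1.8 | 8.8 | script |
| $\text{InternVideo2}_{clip}$-L14 | MSRVTT | 42.1 | 44.1 | script |
| | LSMDC | 21.4 | 18.9 | script |
| | DiDeMo | 42.8 | 43.2 | script |
| | ANet | 43.6 | 40.7 | script |
| | VATEX_en | 59.6 | 75.5 | script |
| | VATEX_ch | 1.6 | 9.8 | script |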
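
For readers unfamiliar with the retrieval columns, here is a minimal sketch of how a T2V or V2T score could be computed from a similarity matrix. It assumes the reported numbers are Recall@1 percentages, which this page does not state explicitly, and that each query has exactly one matching item at the same index. Real benchmarks with several captions per video need a ground-truth mapping instead. The names `recall_at_k`, `transpose`, and `sim_t2v` are hypothetical and are not taken from the evaluation scripts.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the true match: candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap queries and candidates (text-to-video becomes video-to-text)."""
    return [list(col) for col in zip(*matrix)]


# Toy similarity matrix (made-up values): rows are captions, columns are videos.
sim_t2v = [
    [0.9, 0.1, 0.3],
    [0.6, 0.5, 0.2],
    [0.1, 0.2, 0.7],
]

t2v_r1 = recall_at_k(sim_t2v, k=1)             # captions retrieving videos
v2t_r1 = recall_at_k(transpose(sim_t2v), k=1)  # videos retrieving captions
print(f"T2V R@1 = {t2v_r1:.1f}, V2T R@1 = {v2t_r1:.1f}")
```

In this toy case the second caption scores higher against the first video than against its own, so T2V misses one of three queries while V2T gets all three right. This shows why the two directions can differ on the same model and dataset.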

Zero-Shot Action Recognition

| Model | Dataset | top-1 | AVG | Script |
| --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{clip}$-1B | K400 | 73.1 | 82.4 | script |
| | K600 | 72.8 | 81.8 | script |
| | K700 | 64.9 | 75.2 | script |
| | UCF101 | 88.8 | - | script |
| | HMDB51 | 53.9 | - | script |
| | MiT | 31.6 | - | script |
| | SSv2-MC | 61.5 | - | script |
| $\text{InternVideo2}_{clip}$-6B | K400 | 72.7 | 82.2 | script |
| | K600 | 71.7 | 81.2 | script |
| | K700 | 64.2 | 75.2 | script |
| | UCF101 | 89.5 | - | script |
| | HMDB51 | 56.7 | - | script |
| | MiT | 32.9 | - | script |
| | SSv2-MC | 63.5 | - | script |
| $\text{InternVideo2}_{clip}$-S14 | K400 | 62.1 | 73.6 | script |
| | K600 | 61.6 | 72.5 | script |
| | K700 | 51.4 | 63.4 | script |
| | UCF101 | 79.1 | - | script |
| | HMDB51 | 49.2 | - | script |
| | MiT | 24.1 | - | script |
| | SSv2-MC | 46.4 | - | script |
| $\text{InternVideo2}_{clip}$-B14 | K400 | 67.7 | 78.0 | script |
| | K600 | 66.8 | 77.0 | script |
| | K700 | 57.9 | 69.3 | script |
| | UCF101 | 83.4 | - | script |
| | HMDB51 | 52.5 | - | script |
| | MiT | 27.9 | - | script |
| | SSv2-MC | 55.9 | - | script |
| $\text{InternVideo2}_{clip}$-L14 | K400 | 70.7 | 80.5 | script |
| | K600 | 69.9 | 79.6 | script |
| | K700 | 61.9 | 72.9 | script |
| | UCF101 | 85.9 | - | script |
| | HMDB51 | 53.2 | - | script |
| | MiT | 30.6 | - | script |
| | SSv2-MC | 59.6 | - | script |

| Model | Dataset | mAP | Script |
| --- | --- | --- | --- |
| $\text{InternVideo2}_{clip}$-1B | Charades | 32.9 | script |
| $\text{InternVideo2}_{clip}$-6B | Charades | 34.6 | script |
| $\text{InternVideo2}_{clip}$-S14 | Charades | 21.7 | script |
| $\text{InternVideo2}_{clip}$-B14 | Charades | 26.1 | script |
| $\text{InternVideo2}_{clip}$-L14 | Charades | 30.1 | script |
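
The top-1 and AVG columns of the Kinetics rows in the action recognition results can be illustrated with a short sketch. It assumes AVG is the mean of top-1 and top-5 accuracy, a common convention for zero-shot Kinetics results that this page does not confirm. The function `topk_accuracy` and the toy scores are hypothetical and are not part of the repository's evaluation code.

```python
from typing import Sequence


def topk_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true class is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


# Toy class scores for 4 clips over 6 classes (made-up values).
scores = [
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],  # true class 0 is ranked 1st
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 2 is ranked 3rd
    [0.05, 0.10, 0.15, 0.20, 0.22, 0.28],  # true class 0 is ranked 6th
    [0.10, 0.40, 0.20, 0.10, 0.10, 0.10],  # true class 1 is ranked 1st
]
labels = [0, 2, 0, 1]

top1 = topk_accuracy(scores, labels, k=1)
top5 = topk_accuracy(scores, labels, k=5)
avg = (top1 + top5) / 2  # assumed definition of the AVG column
print(f"top-1 = {top1:.1f}, top-5 = {top5:.1f}, AVG = {avg:.1f}")
```

Under this assumption, a K400 row with top-1 of 73.1 and AVG of 82.4 would imply a top-5 accuracy of about 91.7.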