Model Zoo

April 29, 2025

Note

For all pretraining and finetuning, we adopt sparse/uniform sampling.
#Frame $=$ #input_frame $\times$ #crop $\times$ #clip
#input_frame is the number of frames fed to the model per inference.
#crop is the number of spatial crops (e.g., 3 for left/right/center).
#clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices).
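
As a rough illustration of the sampling note above, the sketch below picks frame indices for several temporal clips. It is an assumption-laden example, not the repository's data loader: the function name `sparse_sample_indices`, its parameters, and the rule of shifting each clip's start by a fraction of a segment are all hypothetical.

```python
def sparse_sample_indices(num_video_frames, num_input_frames, num_clips):
    """Return one list of frame indices per temporal clip (hypothetical helper)."""
    if min(num_video_frames, num_input_frames, num_clips) <= 0:
        raise ValueError("all arguments must be positive")

    # Split the video into num_input_frames equal-length segments.
    segment_len = num_video_frames / num_input_frames

    clips = []
    for clip_id in range(num_clips):
        # Assumption: each clip starts at a different offset inside a segment.
        offset = segment_len * clip_id / num_clips
        indices = [
            min(int(offset + segment_len * i), num_video_frames - 1)
            for i in range(num_input_frames)
        ]
        clips.append(indices)
    return clips


# A 64-frame video, 8 input frames per clip, 4 temporal clips.
clips = sparse_sample_indices(64, 8, 4)
print(clips[0])  # [0, 8, 16, 24, 32, 40, 48, 56]
print(clips[1])  # [2, 10, 18, 26, 34, 42, 50, 58]
```

Each clip covers the whole video with one frame per segment, and the clips differ only in their start index. With 3 spatial crops per clip, this setting corresponds to the 8x3x4 entries in the tables.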

Pretraining

TBD

Distillation

TBD

Finetuning

K710

TBD

K400

| Model | Setting | #Frame | Top-1 (%) | Checkpoint | Shell |
| --- | --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{s1}$-1B | K-Mash PT + K710 FT | 8x3x4 | 91.3 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{s1}$-1B | K-Mash PT + K710 FT | 16x3x4 | 91.6 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{s1}$-6B | K-Mash PT + K710 FT | 8x3x4 | 91.9 | TBD | TBD |
| $\text{InternVideo2}_{s1}$-6B | K-Mash PT + K710 FT | 16x3x4 | 92.1 | TBD | TBD |
| $\text{InternVideo2}_{dist}$-S/14 | K-Mash PT + K710 FT | 8x3x4 | 85.4 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{dist}$-B/14 | K-Mash PT + K710 FT | 8x3x4 | 88.4 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{dist}$-L/14 | K-Mash PT + K710 FT | 8x3x4 | 90.4 | :hugs: HF link | TBD |
| $\text{FluxViT}$-S/14 | K-Mash PT + K710 FT | 8x3x4 | 87.3 | Link | run.sh |
| $\text{FluxViT}$-B/14 | K-Mash PT + K710 FT | 8x3x4 | 89.3 | Link | run.sh |
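
To make the #Frame column concrete, here is a small sketch that expands an entry such as 8x3x4 into its three factors and the total number of frames processed per video at test time. The helper names `parse_frame_setting` and `total_frames` are hypothetical and are not part of the released code.

```python
def parse_frame_setting(setting):
    """Split '<input_frame>x<crop>x<clip>' into three integers (hypothetical helper)."""
    parts = setting.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<input_frame>x<crop>x<clip>', got {setting!r}")
    num_input_frames, num_crops, num_clips = (int(p) for p in parts)
    return num_input_frames, num_crops, num_clips


def total_frames(setting):
    """#Frame = #input_frame x #crop x #clip, as defined in the Note section."""
    num_input_frames, num_crops, num_clips = parse_frame_setting(setting)
    return num_input_frames * num_crops * num_clips


for setting in ("8x3x4", "16x3x4"):
    frames, crops, clips = parse_frame_setting(setting)
    # Each (crop, clip) pair is one view of the video.
    print(setting, "views:", crops * clips, "total frames:", total_frames(setting))
# 8x3x4 views: 12 total frames: 96
# 16x3x4 views: 12 total frames: 192
```

Both settings in the table use the same 12 views per video. The 16-frame rows double only the number of input frames per view.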

SthSth V2

TBD