Model Zoo

April 29, 2025

Note

  • All pretraining and finetuning use sparse/uniform sampling.
  • #Frame = #input_frame $\times$ #crop $\times$ #clip
  • #input_frame is the number of frames fed to the model per inference.
  • #crop is the number of spatial crops (e.g., 3 for left/right/center).
  • #clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices).
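
As a rough illustration of the note above, the sketch below expands a #Frame setting such as `8x3x4` into its total frame count and draws uniformly spaced frame indices for several temporal clips. The function names and the exact offset scheme are assumptions made for this example; they are not taken from the repository's data loader.

```python
def total_frames(setting: str) -> int:
    """Multiply out a '#input_frame x #crop x #clip' string such as '8x3x4'."""
    input_frames, crops, clips = (int(part) for part in setting.split("x"))
    return input_frames * crops * clips


def sparse_sample_indices(num_video_frames: int, num_input_frames: int, num_clips: int):
    """Return one list of frame indices per temporal clip.

    The video is split into num_input_frames equal segments and one frame is
    taken from each. Each clip uses a different offset inside the segment, so
    the clips start at different indices (hypothetical offset scheme).
    """
    seg_len = num_video_frames / num_input_frames
    clips = []
    for clip_id in range(num_clips):
        offset = seg_len * (clip_id + 0.5) / num_clips
        indices = [
            min(int(seg_len * i + offset), num_video_frames - 1)
            for i in range(num_input_frames)
        ]
        clips.append(indices)
    return clips


if __name__ == "__main__":
    print(total_frames("8x3x4"))
    for clip in sparse_sample_indices(num_video_frames=64, num_input_frames=8, num_clips=4):
        print(clip)
```

With this reading, an `8x3x4` entry means each video is evaluated on 4 clips of 8 frames, each under 3 spatial crops, and the predictions over those views are typically averaged.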

Pretraining

TBD

Distillation

TBD

Finetuning

K710

TBD

K400

| Model | Setting | #Frame | Top-1 | Checkpoint | Shell |
| --- | --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{s1}$-1B | K-Mash PT + K710 FT | 8x3x4 | 91.3 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{s1}$-1B | K-Mash PT + K710 FT | 16x3x4 | 91.6 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{s1}$-6B | K-Mash PT + K710 FT | 8x3x4 | 91.9 | TBD | TBD |
| $\text{InternVideo2}_{s1}$-6B | K-Mash PT + K710 FT | 16x3x4 | 92.1 | TBD | TBD |
| $\text{InternVideo2}_{dist}$-S/14 | K-Mash PT + K710 FT | 8x3x4 | 85.4 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{dist}$-B/14 | K-Mash PT + K710 FT | 8x3x4 | 88.4 | :hugs: HF link | TBD |
| $\text{InternVideo2}_{dist}$-L/14 | K-Mash PT + K710 FT | 8x3x4 | 90.4 | :hugs: HF link | TBD |
| $\text{FluxViT}$-S/14 | K-Mash PT + K710 FT | 8x3x4 | 87.3 | Link | run.sh |
| $\text{FluxViT}$-B/14 | K-Mash PT + K710 FT | 8x3x4 | 89.3 | Link | run.sh |

SthSth V2

TBD