Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

May 14, 2025 · View on GitHub

:fire: News

  • 2025.02.27 🌟 We now have an Online Demo.
  • 2025.02.27 🌟 VLMEvalKit of OpenCompass now supports Long-VITA.
  • 2025.02.17 🌟 We support training on DeepSpeed and inference on Transformers.
  • 2025.02.09 🌟 We support training and inference on Megatron.
  • 2025.02.05 🌟 We release the training code, training logs, deployment code, and model weights, with support for MindSpeed.
  • 2025.02.05 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.

Contents

✨ Highlights

  • Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens. It achieves state-of-the-art performance on Video-MME among models under 20B parameters.
  • Open Source. Long-VITA is trained exclusively on open-source data, a mix of 17M publicly available samples.
  • Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.

📈 Experimental Results

  • Comparison of image understanding.


  • Comparison of video understanding.



  • Effectiveness of Logits-Masked LM Head.


๐Ÿ Models

| Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Huggingface Weights |
|---|---|---|---|---|---|---|
| Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF |
| Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF |
| Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF |
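A quick sanity check on the numbers in the table: the ratio of training context length to training frames is the same for all three checkpoints, which suggests a fixed visual-token budget per frame (our inference from the table, not an official statement):

```python
# Context length and max training frames per checkpoint, taken from the table above.
configs = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}

for name, (context, frames) in configs.items():
    budget = context // frames  # tokens of context available per frame
    print(f"{name}: {budget} tokens/frame")  # 256 for every checkpoint
```

Each scaling step multiplies both the context window and the frame count by 8, so the per-frame budget of 256 tokens stays constant across the 16K, 128K, and 1M variants.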

โญ Training, Inference and Evaluation

We implement Long-VITA on three frameworks: MindSpeed, Megatron, and DeepSpeed.