Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
May 14, 2025
:fire: News
- 2025.02.27 🌟 We now have an Online Demo.
- 2025.02.27 🌟 VLMEvalKit of OpenCompass now supports Long-VITA.
- 2025.02.17 🌟 We support training on DeepSpeed and inference on Transformers.
- 2025.02.09 🌟 We support training and inference on Megatron.
- 2025.02.05 🌟 We release the training code, training logs, deployment code, and model weights, with support for MindSpeed.
- 2025.02.05 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.
Contents
✨ Highlights
- Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens. It achieves state-of-the-art performance on Video-MME among models under 20B parameters.
- Open Source. Long-VITA is trained exclusively on open-source data, a mix of 17M publicly available samples.
- Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.
📈 Experimental Results
- Comparison of image understanding.
- Comparison of video understanding.
- Effectiveness of Logits-Masked LM Head.
📦 Models
| Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Huggingface Weights |
|---|---|---|---|---|---|---|
| Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF |
| Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF |
| Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF |
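As a consistency check on the table above, every training configuration allots the same visual token budget per frame. A minimal sketch (context lengths and frame counts are taken from the table; the 256-token-per-frame figure is derived here, not stated in the source, and the model's actual tokenizer budget may differ):

```python
# Training context lengths and frame counts from the Long-VITA model table.
configs = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}

for name, (context, frames) in configs.items():
    # Implied upper bound on tokens available per video frame.
    print(f"{name}: {context // frames} tokens per frame")
```

All three configurations work out to 256 tokens per frame, which is why scaling the context 64x (16K to 1M) scales the frame count by the same factor (64 to 4,096).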
⭐ Training, Inference, and Evaluation
We implement Long-VITA on three frameworks: MindSpeed, Megatron, and DeepSpeed.