Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

May 14, 2025 · View on GitHub

:fire: News

  • 2025.02.27 🌟 We now have an Online Demo.
  • 2025.02.27 🌟 VLMEvalKit of OpenCompass now supports Long-VITA.
  • 2025.02.17 🌟 We support training on DeepSpeed and inference on Transformers.
  • 2025.02.09 🌟 We support training and inference on Megatron.
  • 2025.02.05 🌟 We release the training code, training logs, deployment code, and model weights, with support for MindSpeed.
  • 2025.02.05 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.

Contents

✨ Highlights

  • Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens. It achieves state-of-the-art performance on Video-MME among models under 20B parameters.
  • Open Source. Long-VITA is trained exclusively on open-source data, a mix of 17M publicly available samples.
  • Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.

📈 Experimental Results

  • Comparison of image understanding.


  • Comparison of video understanding.



  • Effectiveness of Logits-Masked LM Head.


๐Ÿ Models

| Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Huggingface Weights |
|---|---|---|---|---|---|---|
| Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF |
| Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF |
| Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF |
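A quick sanity check on the numbers in the table: the ratio of training context length to training frames is the same for all three checkpoints, which suggests a fixed visual-token budget per frame (our inference from the table, not an official statement):

```python
# Context length and max training frames per checkpoint, taken from the table above.
configs = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}

for name, (context, frames) in configs.items():
    budget = context // frames  # tokens of context available per frame
    print(f"{name}: {budget} tokens/frame")  # 256 for every checkpoint
```

Each scaling step multiplies both the context window and the frame count by 8, so the per-frame budget of 256 tokens stays constant across the 16K, 128K, and 1M variants.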

โญ Training, Inference and Evaluation

We implement Long-VITA on three frameworks: MindSpeed, Megatron, and DeepSpeed.