LinVT: Empower Your Image-level Large Language Model to Understand Videos

December 30, 2024

News

[2024/12/09] 🔥 We release our paper on arXiv. Please refer to the paper for more details.

Our method achieves leading rankings on multiple video-understanding benchmarks with only a 7B-size model (see the leaderboards below).

Leaderboards

VideoVista leaderboard

MLVU leaderboard

Model Architecture

📖 Abstract

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment, and representative information condensation from redundant video content. Guided by these principles, we propose the Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Blip-3, Molmo, Mipha, InternVL2, Qwen2-VL and Aquila, showcasing the high compatibility of LinVT. Extensive experiments illustrate the effectiveness of LinVT in multi-modal video understanding while preserving the original image-comprehension capabilities.
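
To make the two design principles concrete, here is a minimal PyTorch-style sketch of the core idea only, not the released implementation: a module that emits a small, fixed number of video tokens, each a softmax-weighted linear combination of the frame tokens produced by a frozen image encoder, so the condensed tokens remain in the embedding space the image-LLM was originally aligned to. All class, parameter, and shape choices below are illustrative assumptions.

import torch
import torch.nn as nn

class ToyLinearVideoTokenizer(nn.Module):
    """Condenses T*N frame tokens into K video tokens, each a convex
    (softmax-weighted) combination of the inputs, so the outputs stay
    in the image encoder's embedding space."""

    def __init__(self, dim: int, num_out_tokens: int = 64):
        super().__init__()
        # One learned scoring vector per output token: a purely linear map.
        self.score = nn.Linear(dim, num_out_tokens, bias=False)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, N, D) = batch, frames, tokens per frame, dim
        b, t, n, d = frame_tokens.shape
        x = frame_tokens.reshape(b, t * n, d)       # flatten space-time
        weights = self.score(x).softmax(dim=1)      # (B, T*N, K)
        return weights.transpose(1, 2) @ x          # (B, K, D), linear mix

# Example: 32 frames x 256 tokens/frame condensed to 64 video tokens.
tokens = torch.randn(1, 32, 256, 1024)
print(ToyLinearVideoTokenizer(dim=1024)(tokens).shape)  # torch.Size([1, 64, 1024])

Because every output token is a linear combination of the encoder's own tokens, the condensed sequence can be fed to the image-LLM without disturbing its pretrained visual-language alignment, while the fixed token budget removes redundancy across frames.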

Installation

Install the required packages:

conda create -n LinVT python=3.10.13
conda activate LinVT
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 -c pytorch -c conda-forge -y
pip install -r requirements.txt

Model weights

Coming soon.

Inference

sh evaluate.sh [model_weight] [task_name] --dynamic
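
For example (the checkpoint path and task name below are hypothetical placeholders; substitute the weights and benchmark supported by your setup):

sh evaluate.sh checkpoints/LinVT-Qwen2-VL-7B mlvu --dynamic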

Training

Coming soon.

Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation:

@article{gao2024linvt,
  title={LinVT: Empower Your Image-level Large Language Model to Understand Videos},
  author={Gao, Lishuai and Zhong, Yujie and Zeng, Yingsen and Tan, Haoxian and Li, Dengjie and Zhao, Zheng},
  journal={arXiv preprint arXiv:2412.05185},
  year={2024}
}