June 11, 2025
TinyLLaVA-Video
News
- [2025-04] Our new work TinyLLaVA-Video-R1 for video reasoning is released!
- [2025-01] Our arXiv paper TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler is released!
- [2024-12] Our TinyLLaVA-Video repository is released!
About
TinyLLaVA-Video is a framework for small-scale Large Multimodal Models (LMMs) for video understanding, built on TinyLLaVA_Factory.
- The models have at most 4B parameters and process video sequences in a simple manner, without complex architectures. Both fps sampling and uniform frame sampling are supported.
- We validate the effectiveness of this framework through experiments; the best model achieves performance comparable to certain existing 7B models on multiple video understanding benchmarks.
- It requires only one day of training on 8 A100-40G GPUs.
Installation and Requirements
- Clone this repository and navigate to the folder
git clone https://github.com/ZhangXJ199/TinyLLaVA-Video.git
cd TinyLLaVA-Video
- Create a conda environment, activate it, and install the packages
conda create -n tinyllava_video python=3.10 -y
conda activate tinyllava_video
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages
pip install flash-attn --no-build-isolation
Upgrade to the latest code base
git pull
pip install -e .
Get Started
1. Data Preparation
We combine partial data from two datasets: LLaVA-Video-178K and Valley.
| Stage | Source | #Sample |
|---|---|---|
| Pretrain | LLaVA-Video-178K + Valley | 397k |
| Finetune | LLaVA-Video-178K | 491k |
Pretrain Data
We use four subsets of LLaVA-Video-178K: 0_30_s_academic_v0_1, 30_60_s_academic_v0_1, 0_30_s_youtube_v0_1, and 30_60_s_youtube_v0_1, supplemented with filtered Video-LLaVA data. The organized pretraining annotations can be downloaded from here.
Finetune Data
We use four subsets of LLaVA-Video-178K: 0_30_s_academic_v0_1, 30_60_s_academic_v0_1, 0_30_s_youtube_v0_1, and 30_60_s_youtube_v0_1. The organized finetune annotations can be downloaded from here.
Organize Data
Organize the data files and annotation files as follows in `path/to/your/dataset`:
dataset
├── academic_source
├── liwei_youtube_videos
├── valley
├── text_files
│   ├── cleaned_video_caption.json
│   ├── cleaned_video_openqa.json
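Before launching training, it can help to confirm that this layout is in place. The following is a minimal sketch and not part of the repository: the helper name `missing_entries` is hypothetical, and the list of expected entries is simply copied from the tree above.

```python
from pathlib import Path

# Entries copied from the directory tree above, relative to the dataset root.
EXPECTED_ENTRIES = [
    "academic_source",
    "liwei_youtube_videos",
    "valley",
    "text_files/cleaned_video_caption.json",
    "text_files/cleaned_video_openqa.json",
]


def missing_entries(dataset_root):
    """Return the expected entries that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # "path/to/your/dataset" is the placeholder used in this README.
    for entry in missing_entries("path/to/your/dataset"):
        print(f"missing: {entry}")
```

An empty result means every listed directory and annotation file was found; the check does not inspect the videos inside each directory.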
2. Train
You can refer to TinyLLaVA_Factory to modify components such as "llm," "vision_tower," and "train_recipe."
Here's an example for training an LMM using Qwen2.5.
- Replace data paths with yours in `scripts/train/qwen2/train_qwen2_base_video.sh`
- Replace `output_dir` with yours in `scripts/train/qwen2/pretrain_qwen2_video.sh`
- Replace `pretrained_model_path` and `output_dir` with yours in `scripts/train/qwen2/finetune_qwen2_video.sh`
- Adjust your GPU ids (localhost) and `per_device_train_batch_size` in `scripts/train/qwen2/pretrain_qwen2_video.sh` and `scripts/train/qwen2/finetune_qwen2_video.sh`
bash scripts/train/qwen2/train_qwen2_base_video.sh
Important hyperparameters used in pretraining and finetuning are provided below.
| Training Stage | Global Batch Size | Learning rate | conv_version |
|---|---|---|---|
| Pretraining | 128 | 1e-4 | pretrain |
| Finetuning | 64 | 2e-5 | qwen2_base |
Tips:
Global Batch Size = num of GPUs * `per_device_train_batch_size` * `gradient_accumulation_steps`. We recommend always keeping the global batch size and learning rate as above, except when LoRA-tuning your model.
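If your GPU count or per-device batch size differs, the relation above can be solved for `gradient_accumulation_steps`. The following is a minimal sketch: the helper name `grad_accum_steps` is hypothetical and not part of the repository, and the per-device batch size of 4 is an arbitrary illustration.

```python
def grad_accum_steps(global_batch_size, num_gpus, per_device_train_batch_size):
    """Solve global = num_gpus * per_device * accumulation for accumulation."""
    per_step = num_gpus * per_device_train_batch_size
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not a multiple of "
            f"{num_gpus} GPUs * {per_device_train_batch_size} samples per device"
        )
    return global_batch_size // per_step


# Global batch sizes come from the table above; 8 GPUs matches the setup
# mentioned in "About". The per-device batch size of 4 is an assumption.
print(grad_accum_steps(128, num_gpus=8, per_device_train_batch_size=4))  # pretraining -> 4
print(grad_accum_steps(64, num_gpus=8, per_device_train_batch_size=4))   # finetuning -> 2
```

The function raises an error rather than rounding, since a rounded value would silently change the global batch size.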
3. Evaluation
We currently provide evaluations on five benchmarks: Video-MME, MVBench, LongVideoBench, MLVU, and MMVU.
Video-MME
- Download Video-MME and put it under `path/to/your/dataset/eval/Video-MME`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, `conv-mode` and `duration` in `scripts/eval/videomme.sh`. There are three types of `duration` available for testing: `short`, `medium`, and `long`.
- Please use the following command for single-gpu inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/videomme.sh
MVBench
- Download MVBench and put it under `path/to/your/dataset/eval/MVBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR` and `conv-mode` in `scripts/eval/mvbench.sh`.
- Please use the following command for single-gpu inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mvbench.sh
LongVideoBench
- Download LongVideoBench and put it under `path/to/your/dataset/eval/LongVideoBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR` and `conv-mode` in `scripts/eval/lvbench.sh`.
- Please use the following command for single-gpu inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/lvbench.sh
MLVU
- Download MLVU and put it under `path/to/your/dataset/eval/MLVU`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR` and `conv-mode` in `scripts/eval/mlvu.sh`.
- Please use the following command for single-gpu inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mlvu.sh
MMVU
- Download MMVU and put it under `path/to/your/dataset/eval/MMVU`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR` and `conv-mode` in `scripts/eval/mmvu.sh`.
- Please use the following command for single-gpu inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mmvu.sh
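The five evaluation commands follow the same pattern, so they can be driven from one loop. The sketch below is not part of the repository: `build_eval_commands` and `run_all` are hypothetical helpers, the script paths are the ones listed above, and each script is assumed to have been edited beforehand as described in its section.

```python
import os
import subprocess

# Evaluation scripts named in the sections above.
EVAL_SCRIPTS = {
    "Video-MME": "scripts/eval/videomme.sh",
    "MVBench": "scripts/eval/mvbench.sh",
    "LongVideoBench": "scripts/eval/lvbench.sh",
    "MLVU": "scripts/eval/mlvu.sh",
    "MMVU": "scripts/eval/mmvu.sh",
}


def build_eval_commands(benchmarks=None):
    """Return one `bash <script>` command per requested benchmark."""
    names = list(EVAL_SCRIPTS) if benchmarks is None else list(benchmarks)
    return [["bash", EVAL_SCRIPTS[name]] for name in names]


def run_all(gpu_id=0, benchmarks=None):
    """Run the evaluation scripts one after another on a single GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    for command in build_eval_commands(benchmarks):
        # check=True stops the loop at the first failing script.
        subprocess.run(command, env=env, check=True)
```

Calling `run_all(gpu_id=0)` from the repository root would run all five scripts in sequence. Video-MME still needs its `duration` changed in the script between runs if you want all three splits.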
Model Zoo
Trained Models
Video-Level Group Resample
Naive Video-Level Resample
Here, 16 means that 16 frames are sampled, and 512 means that 512 tokens (queries) are used to represent the video sequence. In the script, set `num_frame = 16` and `max_frame = 16` for "16-512"; set `num_frame = -1` and `max_frame = 64` or `128` for "1fps-512".
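To illustrate the two settings, the sketch below selects frame indices under each mode. It is an illustration under stated assumptions rather than the repository's code: the function `sample_frame_indices` is hypothetical, and it assumes that `num_frame = -1` means one frame per second, subsampled uniformly once the count exceeds `max_frame`.

```python
def sample_frame_indices(total_frames, video_fps, num_frame, max_frame):
    """Pick frame indices: uniform sampling if num_frame > 0, else 1 fps sampling."""
    if total_frames <= 0:
        return []
    if num_frame > 0:
        # Uniform sampling: num_frame indices spread evenly over the whole video.
        candidates = list(range(total_frames))
        count = min(num_frame, max_frame, total_frames)
    else:
        # fps sampling (assumed behavior): one candidate frame per second,
        # capped at max_frame frames.
        step = max(int(round(video_fps)), 1)
        candidates = list(range(0, total_frames, step))
        count = min(len(candidates), max_frame)
    if count <= 1:
        return candidates[:count]
    # Evenly spaced picks from the first candidate to the last one.
    last = len(candidates) - 1
    return [candidates[(i * last) // (count - 1)] for i in range(count)]


# "16-512": 16 uniformly spaced frames from a 160-frame clip.
print(len(sample_frame_indices(160, 30.0, num_frame=16, max_frame=16)))    # 16
# "1fps-512": a 200-second clip at 30 fps has 200 candidates, capped at 64.
print(len(sample_frame_indices(6000, 30.0, num_frame=-1, max_frame=64)))   # 64
```

In both modes the number of sampled frames varies, while the 512 queries that represent the video stay fixed.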
Model Performance
Video-Level Group Resample
| VT (HF Path) | LLM (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU | MMVU |
|---|---|---|---|---|---|---|---|
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 16/512 | 47.0 | 45.5 | 42.4 | 52.5 | 34.3 |
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 1fps/512 | 47.7 | 47.0 | 42.0 | 52.6 | 36.0 |
Naive Video-Level Resample
| VT (HF Path) | LLM (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU | MMVU |
|---|---|---|---|---|---|---|---|
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 | 34.1 |
| google/siglip-so400m-patch14-384 | microsoft/phi-2 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 | 31.6 |
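For a quick read of the two tables, the sketch below computes the per-benchmark difference between the Group Resample and Naive Resample rows that share the same configuration (Qwen/Qwen2.5-3B, 16/512). The scores are copied from the tables above; the variable names are illustrative.

```python
BENCHMARKS = ["Video-MME", "MVBench", "LongVideoBench", "MLVU", "MMVU"]

# Qwen/Qwen2.5-3B with google/siglip-so400m-patch14-384, 16/512 setting.
group_resample = [47.0, 45.5, 42.4, 52.5, 34.3]
naive_resample = [44.7, 42.5, 37.6, 48.1, 34.1]

# Per-benchmark gain of Group Resample over Naive Resample, in points.
gains = {
    name: round(group - naive, 1)
    for name, group, naive in zip(BENCHMARKS, group_resample, naive_resample)
}
mean_gain = round(sum(gains.values()) / len(gains), 2)

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
print(f"mean: {mean_gain:+.2f}")
```

In this configuration the gap is largest on LongVideoBench (4.8 points) and smallest on MMVU (0.2 points), with a mean of 2.94 points.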
Quick Inference Scripts
- Please change `model_path`, `prompt`, `video_file` and `conv-mode` in `eval.py`.
- Please use the following command for single-gpu inference.
CUDA_VISIBLE_DEVICES=0 python eval.py
Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil:.
@article{zhang2025tinyllava,
title={TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding},
author={Zhang, Xingjian and Weng, Xi and Yue, Yihao and Fan, Zhaoxin and Wu, Wenjun and Huang, Lei},
journal={arXiv preprint arXiv:2501.15513},
year={2025}
}
@article{jia2024tinyllava,
title={TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models},
author={Jia, Junlong and Hu, Ying and Weng, Xi and Shi, Yiming and Li, Miao and Zhang, Xingjian and Zhou, Baichuan and Liu, Ziyu and Luo, Jie and Huang, Lei and Wu, Ji},
journal={arXiv preprint arXiv:2405.11788},
year={2024}
}
❤️ Community efforts
- This repository is based on the TinyLLaVA_Factory project.
- Our codebase is built upon the LLaVA project. Great work!