March 28, 2025
Sparrow: Efficient Video Fine-tuning Scheme for MLLMs
Paper | 🤗 Hugging Face
TL;DR: We propose a data augmentation method that synthesizes "video" samples from long-text QA data to enrich the instruction diversity of video training data, enabling more efficient training with comparable performance.
Highlights
Main findings: instruction diversity matters in video fine-tuning, and it can be improved efficiently.
- We observed limited instruction diversity in existing datasets developed for Video-LLMs, which leads to low learning efficiency (see our paper for more details and findings).
- Since text data is a rich and economical source, we leverage it in a format more consistent with video instruction data.
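As a rough illustration of how limited instruction diversity can be quantified, one can compare instructions via bag-of-words cosine similarity: high average pairwise similarity indicates low diversity. This is only a dependency-free sketch, not the paper's actual pipeline (which embeds instructions and projects them with t-SNE; see the Misc section below), and the helper names are our own:

```python
import math
from collections import Counter

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def avg_pairwise_similarity(instructions: list[str]) -> float:
    """Average pairwise similarity; higher means less diverse instructions."""
    bags = [Counter(s.lower().split()) for s in instructions]
    pairs = [(i, j) for i in range(len(bags)) for j in range(i + 1, len(bags))]
    return sum(cosine_sim(bags[i], bags[j]) for i, j in pairs) / len(pairs)

# Templated captioning-style instructions score as highly similar...
narrow = [
    "Describe the video in detail.",
    "Describe the content of the video in detail.",
    "Please describe the video in detail.",
]
# ...while mixed QA-style instructions score much lower.
diverse = [
    "Describe the video in detail.",
    "What is the man holding in his left hand?",
    "Summarize the argument made by the speaker.",
]
assert avg_pairwise_similarity(narrow) > avg_pairwise_similarity(diverse)
```

A learned embedding model captures semantic overlap far better than bag-of-words, but the intuition is the same: a dataset dominated by near-duplicate instruction templates gives the model little new signal per sample.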
Train less, achieve more: by mixing in our synthetic data, one can achieve comparable or better performance with far fewer training samples.
Boost long video understanding "for free": long video understanding improves without training on any long video data.
Quick Setup
- Create a conda virtual environment and install the required packages.

```shell
conda create -n sparrow python=3.9
conda activate sparrow
pip install -r requirements.txt
```
- Install Flash Attention 2 (for efficient training and inference).

```shell
pip install -U flash-attn --no-build-isolation
```
Training & Evaluation
Instructions for training and evaluation (including pre-trained weights) can be found in TRAIN.md and EVAL.md.
Misc
For those interested in the implementation details of our paper:
- How to translate text into images? Check text-to-images.
- How to visualize the distribution of instructions?
- Calculate embeddings and perform dimensionality reduction for instructions: calc_inst_embeddings.py.
- Draw plots: vis-tsne.ipynb.
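The core idea of the augmentation, turning long QA text into a sequence of "frames", amounts to two steps: paginate the text, then render each page as an image. Below is a dependency-free sketch of the pagination step; `paginate_text` and its page geometry are illustrative names and parameters, not the repo's actual API, and the rendering step (e.g. via Pillow) is only indicated in a comment:

```python
import textwrap

def paginate_text(text: str, chars_per_line: int = 40,
                  lines_per_page: int = 12) -> list[str]:
    """Split a long text into page-sized chunks, each destined to become
    one 'frame' of a synthetic video. Page geometry is a free parameter."""
    lines = textwrap.wrap(text, width=chars_per_line)
    return ["\n".join(lines[i:i + lines_per_page])
            for i in range(0, len(lines), lines_per_page)]

# Each page would then be rendered to an image (e.g. with Pillow's
# ImageDraw.text, omitted here to keep the sketch dependency-free),
# and the resulting frame sequence paired with the original QA pair
# to form a synthetic "video" training sample.
long_context = "word " * 500
frames = paginate_text(long_context)
print(len(frames))  # number of synthetic frames for this sample
```

Because the synthetic sample has the same (frames, question, answer) shape as real video instruction data, it can be mixed directly into the video fine-tuning set.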
Related Projects
- Video-MME: A comprehensive video benchmark and the main one used in our study.
- Awesome-MLLM: A project keeping track of new papers and the latest developments in the field of MLLMs.
Acknowledgements
- Excellent open-source MLLMs and codebases: MiniCPM-V, Idefics3, InternVL.
- Long text instruction data: LongAlpaca and LongQLoRA.
Citation
If you find our project useful, please consider citing our paper:
```bibtex
@article{yin2024sparrow,
  title={Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Shen, Yunhang and Ge, Chunjiang and Yang, Yan and Long, Zuwei and Dai, Yuhan and Luo, Yongdong and Cao, Haoyu and others},
  journal={arXiv preprint arXiv:2411.19951},
  year={2024}
}
```