Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference (ICCV 2025)
August 4, 2025
Notes
This paper increases the context perception capability of Qwen2 by splitting the long vision-token sequence into parallel short sequences and applying MoRef-Attention to obtain a unified activation. The project is built on LLaVA-NeXT and lmms-eval. The main changes related to Free-MoRef are in the following files:
- lmms-eval-main/lmms_eval/models/llava_vid.py
- LLaVA-NeXT-main/llava/model/llava_arch.py
- LLaVA-NeXT-main/llava/model/language_model/llava_qwen.py
- LLaVA-NeXT-main/llava/model/language_model/qwen2_local/modeling_qwen2.py
- LLaVA-NeXT-main/llava/model/language_model/qwen2_local/cache_utils.py
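To make the split-then-fuse idea in the note above concrete, here is a minimal pure-Python sketch. It is not the MoRef-Attention code from modeling_qwen2.py: the helper names (split_tokens, chunk_attention, merge_chunk_outputs) are hypothetical, and the fusion rule used here (weighting each chunk's output by its softmax mass) is an assumption chosen because it reproduces full attention exactly. The paper's actual fusion may differ.

```python
import math

def split_tokens(tokens, num_chunks):
    """Split a token list into num_chunks contiguous, near-equal pieces.
    Assumes 1 <= num_chunks <= len(tokens), so no chunk is empty."""
    base, extra = divmod(len(tokens), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks

def chunk_attention(query, keys, values):
    """Single-query softmax attention over one chunk.
    Returns the chunk output and the log-sum-exp of its scaled scores."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, peak + math.log(total)

def merge_chunk_outputs(outputs, lses):
    """Fuse per-chunk outputs, weighting each by its share of the
    total softmax mass (assumed fusion rule, see the lead-in)."""
    peak = max(lses)
    weights = [math.exp(lse - peak) for lse in lses]
    total = sum(weights)
    dim = len(outputs[0])
    return [sum(w * o[d] for w, o in zip(weights, outputs)) / total
            for d in range(dim)]

# Toy data: one text query attending to five "vision" tokens in two chunks.
query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

parts = [chunk_attention(query, k, v)
         for k, v in zip(split_tokens(keys, 2), split_tokens(values, 2))]
merged = merge_chunk_outputs([p[0] for p in parts], [p[1] for p in parts])
```

With this particular weighting, merged matches the output of attending to all five tokens at once, while each chunk only ever sees a short sequence.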
Models & Scripts
Installation
1. Clone this repository and navigate to the main folder:
git clone https://github.com/wkfdb/Free-MoRef
cd Free-MoRef
2. Install the inference package:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
cd LLaVA-NeXT-main # enter the LLaVA-NeXT-main folder
pip install -e ".[train]"
cd ../lmms-eval-main # go back up and enter the lmms-eval-main folder
pip install -e .
Data Preparation
Please download the datasets (e.g. Video-MME), LLaVA-Video-7B-Qwen2, and siglip-so400m-patch14-384.
- Modify the following files to set the paths of the downloaded datasets and models:
  - LLaVA-NeXT-main/run_eval.sh : set --model_args pretrained=path_to_your_llava-video-7b-qwen2
  - LLaVA-NeXT-main/llava/model/multimodal_encoder/siglip_encoder.py : in load_model(), set the path variable to your siglip-so400m-patch14-384 path.
  - lmms-eval-main/lmms_eval/tasks/videomme/videomme.yaml : set the dataset_path to your downloaded VideoMME path. Do the same for other datasets (e.g. MLVU, LongVideoBench).
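Before launching evaluation, it can help to confirm that the locations configured above actually exist. The snippet below is an optional convenience sketch and not part of the repository: missing_paths is a hypothetical helper, and the path strings are placeholders to replace with your own values.

```python
from pathlib import Path

def missing_paths(configured):
    """Return the labels whose configured path does not exist on disk."""
    return [label for label, path in configured.items()
            if not Path(path).exists()]

# Placeholder values (assumptions): use the paths you set in the files above.
configured = {
    "LLaVA-Video-7B-Qwen2": "path_to_your_llava-video-7b-qwen2",
    "siglip-so400m-patch14-384": "path_to_your_siglip-so400m-patch14-384",
    "Video-MME": "path_to_your_videomme",
}

for label in missing_paths(configured):
    print(f"Missing: {label} -> {configured[label]}")
```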
Run Evaluation
cd LLaVA-NeXT-main # from the repository root, enter the LLaVA-NeXT-main folder
bash run_eval.sh
Citation
If you find this work useful for your research and applications, please cite the related paper using this BibTeX:
@inproceedings{wang2025MoRef,
title={Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference},
author={Wang, Kuo and Zheng, Quanlong and Xie, Junlin and Zhang, Yanhao and Luo, Jinguo and Lu, Haonan and Lin, Liang and Zhou, Fan and Li, Guanbin},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}