Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference (ICCV 2025)

August 4, 2025

Notes

This paper increases the context perception capability of Qwen2 by splitting the long sequence of vision tokens into parallel short sequences and applying MoRef-Attention to obtain a unified activation. The project is built on LLaVA-NeXT and lmms-eval. The main changes related to Free-MoRef are in the following files:

lmms-eval-main/lmms_eval/models/llava_vid.py
LLaVA-NeXT-main/llava/model/llava_arch.py
LLaVA-NeXT-main/llava/model/language_model/llava_qwen.py
LLaVA-NeXT-main/llava/model/language_model/qwen2_local/modeling_qwen2.py
LLaVA-NeXT-main/llava/model/language_model/qwen2_local/cache_utils.py
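To make the splitting step concrete, here is a minimal sketch of dividing one long token sequence into several shorter ones. It is an illustration only, not the code from the files above: the function name `split_into_chunks`, the contiguous near-equal chunking, and the toy integer "tokens" are all assumptions, and the MoRef-Attention fusion step is not shown.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(tokens: Sequence[T], num_chunks: int) -> List[List[T]]:
    """Split `tokens` into `num_chunks` contiguous, near-equal parts.

    Hypothetical helper for illustration; the repository may split
    vision tokens differently.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    # The first `extra` chunks receive one additional token each.
    base, extra = divmod(len(tokens), num_chunks)
    chunks: List[List[T]] = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(tokens[start:start + size]))
        start += size
    return chunks


# Toy stand-in for a long sequence of vision tokens.
vision_tokens = list(range(10))
print(split_into_chunks(vision_tokens, 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each short sequence would then be processed in parallel; how the results are merged into a unified activation is defined by MoRef-Attention in the modified Qwen2 files.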

Models & Scripts

Installation

1. Clone this repository and navigate to the main folder:

git clone https://github.com/wkfdb/Free-MoRef
cd Free-MoRef

2. Install the inference package:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
cd LLaVA-NeXT-main # enter the LLaVA-NeXT-main folder
pip install -e ".[train]"
cd ../lmms-eval-main # go back up and enter the lmms-eval-main folder
pip install -e .
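After both editable installs finish, a quick check like the one below can confirm that the packages are importable from the `llava` environment. This is a sketch: the module names `llava` and `lmms_eval` are inferred from the folder layout listed in the Notes section, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Module names assumed from the repository layout (llava/, lmms_eval/).
for name in ("llava", "lmms_eval"):
    print(name, "OK" if is_installed(name) else "missing")
```

If either line reports `missing`, re-run the corresponding `pip install -e` command from inside that package's folder.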

Data Preparation

Please download the datasets (e.g. Video-MME), LLaVA-Video-7B-Qwen2, and siglip-so400m-patch14-384.

- Modify the following files to set the paths of the downloaded datasets and models:

LLaVA-NeXT-main/run_eval.sh : --model_args pretrained=path_to_your_llava-video-7b-qwen2
LLaVA-NeXT-main/llava/model/multimodal_encoder/siglip_encoder.py : in load_model(), set the path variable to your siglip-so400m-patch14-384 path.
lmms-eval-main/lmms_eval/tasks/videomme/videomme.yaml : Set the dataset_path to your downloaded VideoMME path. The same for other datasets (e.g. MLVU, LongVideoBench).
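Before launching the evaluation, it can help to verify that the locations written into the three files above actually exist. The snippet below is a minimal sketch; `find_missing` is a hypothetical helper and every path in `local_paths` is a placeholder to be replaced with your own download locations.

```python
from pathlib import Path
from typing import Dict, List


def find_missing(paths: Dict[str, str]) -> List[str]:
    """Return the labels whose paths do not exist on disk."""
    return [
        label
        for label, location in paths.items()
        if not Path(location).expanduser().exists()
    ]


# Placeholder paths (assumptions): substitute your real locations.
local_paths = {
    "LLaVA-Video-7B-Qwen2": "/path/to/LLaVA-Video-7B-Qwen2",
    "siglip-so400m-patch14-384": "/path/to/siglip-so400m-patch14-384",
    "Video-MME": "/path/to/Video-MME",
}

missing = find_missing(local_paths)
print("Missing:", missing if missing else "none")
```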

Run Evaluation

cd LLaVA-NeXT-main # enter the LLaVA-NeXT-main folder
bash run_eval.sh

Citation

If you find this work useful for your research and applications, please cite the related paper using this BibTeX:

@inproceedings{wang2025MoRef,
  title={Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference},
  author={Wang, Kuo and Zheng, Quanlong and Xie, Junlin and Zhang, Yanhao and Luo, Jinguo and Lu, Haonan and Lin, Liang and Zhou, Fan and Li, Guanbin},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}