Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference (ICCV 2025)

August 4, 2025

Notes

This paper increases the context perception capability of Qwen2 by splitting the long vision-token sequence into parallel short sequences and applying MoRef-Attention to obtain a unified activation. We develop the project based on LLaVA-NeXT and lmms-eval. The main changes related to Free-MoRef are in the following files:

  • lmms-eval-main/lmms_eval/models/llava_vid.py
  • LLaVA-NeXT-main/llava/model/llava_arch.py
  • LLaVA-NeXT-main/llava/model/language_model/llava_qwen.py
  • LLaVA-NeXT-main/llava/model/language_model/qwen2_local/modeling_qwen2.py
  • LLaVA-NeXT-main/llava/model/language_model/qwen2_local/cache_utils.py
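
To make the splitting step concrete, here is a minimal, framework-free sketch of how one long vision-token sequence could be cut into several short parallel sequences. It is an illustration only, not the code in the files listed above: the function names `split_into_references` and `build_parallel_sequences`, the contiguous chunking, and the choice to pair every chunk with the same text tokens are all assumptions, and the MoRef-Attention fusion itself is not shown.

```python
def split_into_references(vision_tokens, num_refs):
    """Split one long token sequence into `num_refs` contiguous chunks.

    Chunk sizes differ by at most one token and the original order is kept.
    (Hypothetical helper: contiguous chunking is an assumption.)
    """
    if num_refs < 1:
        raise ValueError("num_refs must be at least 1")
    base, extra = divmod(len(vision_tokens), num_refs)
    chunks, start = [], 0
    for i in range(num_refs):
        # The first `extra` chunks take one additional token each.
        size = base + (1 if i < extra else 0)
        chunks.append(vision_tokens[start:start + size])
        start += size
    return chunks


def build_parallel_sequences(vision_tokens, text_tokens, num_refs):
    """Pair every vision chunk with the same text tokens (assumed layout)."""
    chunks = split_into_references(vision_tokens, num_refs)
    return [chunk + text_tokens for chunk in chunks]


# Toy example: 10 placeholder vision tokens, 2 placeholder text tokens.
vision = [f"v{i}" for i in range(10)]
text = ["q0", "q1"]
sequences = build_parallel_sequences(vision, text, num_refs=3)
for seq in sequences:
    print(len(seq), seq)
```

Each resulting sequence is much shorter than the original, which is the property the parallel short sequences rely on. For how the per-sequence activations are actually unified, refer to `modeling_qwen2.py` in the list above.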

Models & Scripts

Installation

1. Clone this repository and navigate to the main folder:

git clone https://github.com/wkfdb/Free-MoRef
cd Free-MoRef

2. Install the inference package:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
cd LLaVA-NeXT-main # enter the LLaVA-NeXT-main folder
pip install -e ".[train]"
cd ../lmms-eval-main # go back up and enter the lmms-eval-main folder
pip install -e .

Data Preparation

Please download the datasets (e.g. Video-MME), LLaVA-Video-7B-Qwen2, and siglip-so400m-patch14-384.

- Modify the following files to set the paths of the downloaded datasets and models:

  • LLaVA-NeXT-main/run_eval.sh : --model_args pretrained=path_to_your_llava-video-7b-qwen2
  • LLaVA-NeXT-main/llava/model/multimodal_encoder/siglip_encoder.py : in load_model(), set the path variable to your siglip-so400m-patch14-384 path.
  • lmms-eval-main/lmms_eval/tasks/videomme/videomme.yaml : set dataset_path to your downloaded Video-MME path. Do the same for other datasets (e.g. MLVU, LongVideoBench).
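
Before launching the evaluation, it can help to confirm that the three locations you configured above actually exist. The following is an optional sketch and not part of the repository: `REQUIRED_PATHS` and `find_missing` are hypothetical names, and the paths are placeholders to replace with your own.

```python
from pathlib import Path

# Placeholder paths (assumptions): replace with your own local locations.
REQUIRED_PATHS = {
    "LLaVA-Video-7B-Qwen2 checkpoint": "/path/to/LLaVA-Video-7B-Qwen2",
    "siglip-so400m-patch14-384 weights": "/path/to/siglip-so400m-patch14-384",
    "Video-MME dataset": "/path/to/Video-MME",
}


def find_missing(paths):
    """Return the labels whose configured path does not exist on disk."""
    return [label for label, p in paths.items() if not Path(p).exists()]


missing = find_missing(REQUIRED_PATHS)
if missing:
    print("Missing paths for:", ", ".join(missing))
else:
    print("All configured paths exist.")
```

This only checks that the directories are present; it does not verify that the same paths were written into run_eval.sh, siglip_encoder.py, or videomme.yaml.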

Run Evaluation

cd LLaVA-NeXT-main # enter the LLaVA-NeXT-main folder (from the repository root)
bash run_eval.sh

Citation

If you find it useful for your research and applications, please cite the related paper using this BibTeX:

@inproceedings{wang2025MoRef,
  title={Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference},
  author={Wang, Kuo and Zheng, Quanlong and Xie, Junlin and Zhang, Yanhao and Luo, Jinguo and Lu, Haonan and Lin, Liang and Zhou, Fan and Li, Guanbin},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}