[ICCV 2025] ShortV
March 30, 2026
Code release for the ICCV 2025 conference paper "ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers".
Usage and License Notices: This project uses datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those licenses. This project imposes no additional constraints beyond those stipulated in the original licenses. Users are also reminded to ensure that their use of the datasets and checkpoints complies with all applicable laws and regulations.
Install
- Clone this repository and navigate to the ShortV folder
git clone https://github.com/icip-cas/ShortV.git
cd ShortV
- Install the package
conda create -n shortv python=3.10 -y
conda activate shortv
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for evaluation with lmms-eval
cd lmms-eval
pip install -e .
ShortV Inference and Evaluation
Replaced Layers
The ids of the replaced layers for each model are listed below.
| Model | Checkpoint | Replaced Layers |
|---|---|---|
| LLaVA-1.5-7B | liuhaotian/llava-v1.5-7b | 31,29,30,28,0,26,27,25,24,22,23,21,2,3,20,18,17,12,19 |
| LLaVA-1.5-13B | liuhaotian/llava-v1.5-13b | 39,32,28,36,27,37,29,30,1,38,25,31,2,26,23,34,0,33,35,22,24,21,20,17 |
| LLaVA-NeXT-7B | liuhaotian/llava-v1.6-vicuna-7b | 31,29,30,28,26,27,22,24,21,23,25,20,19,17,18,15,12,0,2 |
| LLaVA-NeXT-13B | liuhaotian/llava-v1.6-vicuna-13b | 39,32,29,36,27,30,37,23,25,31,26,2,28,22,33,35,34,24,38,21,20,18,1,17 |
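The table can be turned into a small lookup so that the layer list always matches the checkpoint in use. This is a minimal sketch: `replaced_layers_for` is a hypothetical helper and not part of the ShortV repository, and the lists are copied verbatim from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ShortV repository): print the
# replaced-layer list from the table above for a given checkpoint name.
replaced_layers_for() {
  case "$1" in
    liuhaotian/llava-v1.5-7b)
      echo "31,29,30,28,0,26,27,25,24,22,23,21,2,3,20,18,17,12,19" ;;
    liuhaotian/llava-v1.5-13b)
      echo "39,32,28,36,27,37,29,30,1,38,25,31,2,26,23,34,0,33,35,22,24,21,20,17" ;;
    liuhaotian/llava-v1.6-vicuna-7b)
      echo "31,29,30,28,26,27,22,24,21,23,25,20,19,17,18,15,12,0,2" ;;
    liuhaotian/llava-v1.6-vicuna-13b)
      echo "39,32,29,36,27,30,37,23,25,31,26,2,28,22,33,35,34,24,38,21,20,18,1,17" ;;
    *)
      # Unknown checkpoint: fail loudly instead of exporting an empty list.
      echo "no replaced-layer list known for: $1" >&2
      return 1 ;;
  esac
}

export MODEL_PATH="liuhaotian/llava-v1.5-13b"
REPLACED_LAYERS="$(replaced_layers_for "$MODEL_PATH")" || exit 1
export REPLACED_LAYERS
echo "$REPLACED_LAYERS"
```

Keeping the lists in one function avoids pairing a 7B layer list with a 13B checkpoint by accident.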
Chatbot Inference
Chat about images using ShortV.
export REPLACED_LAYERS="31,29,30,28,0,26,27,25,24,22,23,21,2,3,20,18,17,12,19"
python -m llava.serve.cli \
--model-path liuhaotian/llava-v1.5-7b \
--image-file "https://llava-vl.github.io/static/images/view.jpg"
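Because the layer list is passed as a plain comma-separated string in an environment variable, a typo is easy to miss. The sketch below checks the string before launching anything. `check_layers` is a hypothetical helper, not part of the ShortV repository, and the layer count of 32 assumes the 7B language model has 32 decoder layers (ids 0 to 31).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ShortV repository).
# Usage: check_layers "<comma-separated ids>" <num_layers>
# Prints the number of ids on success; returns 1 on any invalid input.
check_layers() {
  local ids="$1" num_layers="$2" id dups
  local -a list
  IFS=',' read -r -a list <<< "$ids"
  for id in "${list[@]}"; do
    # Each id must be a non-negative integer below the layer count.
    # 10# forces base 10 so an id such as "08" is not read as octal.
    if ! [[ "$id" =~ ^[0-9]+$ ]] || (( 10#$id >= num_layers )); then
      echo "invalid layer id: '$id'" >&2
      return 1
    fi
  done
  # uniq -d prints only values that occur more than once.
  dups=$(printf '%s\n' "${list[@]}" | sort -n | uniq -d)
  if [[ -n "$dups" ]]; then
    echo "duplicate layer ids:" $dups >&2
    return 1
  fi
  echo "${#list[@]}"
}

export REPLACED_LAYERS="31,29,30,28,0,26,27,25,24,22,23,21,2,3,20,18,17,12,19"
# Assumption: 32 decoder layers for the 7B model.
check_layers "$REPLACED_LAYERS" 32   # prints 19
```

For the 13B checkpoints, the same check would use a layer count of 40 under the same assumption about the underlying language model.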
Evaluation with LMMs-Eval
LMMs-Eval is an evaluation framework designed for consistent and efficient evaluation of large multimodal models (LMMs).
export MODEL_PATH="liuhaotian/llava-v1.5-7b"
export MODEL_NAME="llava_7b"
export CONV_MODE="v1"
export REPLACED_LAYERS="31,29,30,28,0,26,27,25,24,22,23,21,2,3,20,18,17,12,19"
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
--model llava \
--model_args pretrained=${MODEL_PATH},conv_template=${CONV_MODE} \
--tasks mmmu_val \
--batch_size 1 \
--log_samples_suffix ${MODEL_NAME} \
--output_path ./logs/
Evaluation with Scripts From LLaVA
See Evaluation.md.
Calculating LC Scores and Identifying Ineffective Layers
To identify which layers are ineffective, we calculate the visual layer contribution (LC) score of every MLLM layer.
cd lmms-eval
export MODEL_PATH="liuhaotian/llava-v1.5-7b"
export MODEL_NAME="llava_7b"
export CONV_MODE="v1"
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
--model llava \
--model_args pretrained=${MODEL_PATH},conv_template=${CONV_MODE} \
--tasks gqa,flickr30k_test \
--batch_size 1 \
--log_samples_suffix ${MODEL_NAME} \
--output_path ./logs/ \
--limit 20 \
--cal_lc
The run reports the visual LC score of each layer and the order in which layers are replaced.
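Turning per-layer scores into a `REPLACED_LAYERS` string amounts to ranking the layers and keeping the first N. The sketch below illustrates that step only. It assumes that a lower LC score marks a less effective layer, and it assumes a plain-text input of `layer_id score` lines, which is a made-up format and not necessarily what `--cal_lc` writes. `lowest_n_layers` is a hypothetical helper and the scores are invented for a toy 6-layer model.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ShortV repository): read
# "layer_id score" lines on stdin and print the ids of the N lowest-scoring
# layers, lowest first, as a comma-separated list.
lowest_n_layers() {
  local n="$1"
  # Sort by score (general numeric, so 1e-3 style values also work),
  # break ties by layer id, keep the first N ids, and join them with commas.
  sort -k2,2g -k1,1n | head -n "$n" | awk '{print $1}' | paste -sd, -
}

# Invented scores for illustration only; these are not measured values.
printf '%s\n' "0 0.40" "1 0.05" "2 0.90" "3 0.01" "4 0.30" "5 0.02" \
  | lowest_n_layers 3   # prints 3,5,1
```

The result has the same shape as the lists in the Replaced Layers table, so it could be exported as `REPLACED_LAYERS` once the real scores are in this format.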
Acknowledgements
This work builds upon LLaVA, lmms-eval, and VTW.
Citation
If you find ShortV useful for your research and applications, please cite using this BibTeX:
@inproceedings{yuan2025shortv,
title={{ShortV}: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers},
author={Yuan, Qianhao and Zhang, Qingyu and Liu, Yanjiang and Chen, Jiawei and Lu, Yaojie and Lin, Hongyu and Zheng, Jia and Han, Xianpei and Sun, Le},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={329--339},
year={2025}
}