[ICCV 2025] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

June 26, 2025

Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, and Limin Wang.

We present p-MoD, a series of efficient MLLMs featuring:

:scissors: Mixture-of-Depths mechanism, upgraded with tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
:roller_coaster: Progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer.
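To make the idea of a layer-wise decaying retention ratio concrete, here is a minimal Python sketch. It assumes a half-cosine decay from a start ratio to an end ratio; the function names, the cosine shape, and the values `1.0`, `0.1`, 32 layers, and 576 tokens are illustrative assumptions, not the schedule or settings used in this repository.

```python
import math


def retention_ratios(num_layers: int, start: float = 1.0, end: float = 0.1) -> list[float]:
    """Hypothetical cosine-shaped decay of the token retention ratio across layers."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [start]
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        # Half-cosine interpolation from `start` down to `end`.
        ratio = end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
        ratios.append(ratio)
    return ratios


def tokens_kept(num_tokens: int, ratios: list[float]) -> list[int]:
    """Number of tokens each layer would process, keeping at least one."""
    return [max(1, round(num_tokens * r)) for r in ratios]


if __name__ == "__main__":
    # Illustrative values only: 32 layers and 576 tokens are assumptions.
    ratios = retention_ratios(num_layers=32)
    print(tokens_kept(576, ratios))
```

Under this sketch, early layers process nearly all tokens while later layers process progressively fewer, which is where the compute and KV cache savings would come from.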

:fire: News

2025.06.26: Our paper has been accepted to ICCV 2025 :tada: :tada:!

:closed_book: Performance and Efficiency

p-MoD matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.

:hammer_and_wrench: Requirements and Installation

Clone this repository and navigate to the folder

git clone https://github.com/MCG-NJU/p-MoD.git
cd p-MoD

Install packages

conda create -n p-mod python=3.10 -y
conda activate p-mod
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e lmms-eval

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir

huggingface-cli login
wandb login
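After installation, a quick check that the key packages are importable can save a failed training launch. The sketch below is one way to do that; the import names `torch`, `flash_attn`, `lmms_eval`, and `wandb` are assumed to correspond to the packages installed above, and `find_missing` is a hypothetical helper, not part of this repository.

```python
import importlib.util


def find_missing(modules: list[str]) -> list[str]:
    """Return the top-level module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    missing = find_missing(["torch", "flash_attn", "lmms_eval", "wandb"])
    print("missing:", missing or "none")
```

Run it inside the `p-mod` conda environment; an empty result suggests the environment is ready.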

:tiger: Model Zoo

Model	LLM	Epoch	Pretrain Data	SFT Data
p-MoD-LLaVA-NeXT-7B	Vicuna-7B	1	558K	779K
p-MoD-LLaVA-v1.5-7B	Vicuna-7B	1	558K	665K

:bar_chart: Evaluation

We evaluate our model using lmms-eval. You can use our script ./scripts/lmms-eval/eval.sh, for example:

bash ./scripts/lmms-eval/eval.sh \
  --ckpt MCG-NJU/p-MoD-LLaVA-NeXT-7B \
  --eval_tasks ai2d,chartqa \
  --project_name pmod \
  --run_name pmod-llava-next-7b-ft

:rocket: Train

Pretraining

We use the pretrained MLP projector provided by LLaVA. Download it and put the model weights under ./checkpoints/llava-v1.5-7b-pretrain/llava-official-checkpoint.

p-MoD-LLaVA-NeXT

First, prepare the data with our Python script ./util_scripts/download_llava-next_data.py. It downloads the 779K LLaVA-NeXT data, saving the images under ./playground/data/llava_next_images/ and the data JSON to ./playground/data/llava_next_data.json.

Then you can start training using ./scripts/train/finetune_eval_7b_pmod_llava_next.sh.
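Before launching training, it can help to confirm that the checkpoint and data landed where the steps above put them. The following sketch checks the three paths named above relative to the repository root; `missing_paths` is a hypothetical helper, not a script shipped with this repository.

```python
from pathlib import Path


def missing_paths(root: str = ".") -> list[str]:
    """Return the expected checkpoint and data paths that do not exist under `root`."""
    base = Path(root)
    expected = [
        base / "checkpoints/llava-v1.5-7b-pretrain/llava-official-checkpoint",
        base / "playground/data/llava_next_images",
        base / "playground/data/llava_next_data.json",
    ]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing before training:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected paths are present.")
```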

p-MoD-LLaVA-1.5

First, prepare the instruction tuning data following LLaVA-1.5. Download the images from the constituent datasets and the dataset annotation JSON llava_v1_5_mix665k.json. Save the images and the JSON under ./playground/data.

Then fix some broken examples in the data JSON by running:

python util_scripts/clean_data_json.py \
  --original_json_path ./playground/data/llava_v1_5_mix665k.json \
  --cleaned_json_path ./playground/data/llava_v1_5_mix665k_cleaned.json
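To sanity-check the result of the cleaning step, you could compare the number of records in the two files. This sketch assumes each annotation file is a single top-level JSON list, as in the LLaVA-1.5 release; `count_examples` is a hypothetical helper, and whether the cleaning script changes the record count is not something this example assumes.

```python
import json


def count_examples(json_path: str) -> int:
    """Number of top-level records in a LLaVA-style annotation file (a JSON list)."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{json_path} does not contain a JSON list")
    return len(data)


if __name__ == "__main__":
    original = count_examples("./playground/data/llava_v1_5_mix665k.json")
    cleaned = count_examples("./playground/data/llava_v1_5_mix665k_cleaned.json")
    print(f"original: {original}, cleaned: {cleaned}")
```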

Start training with ./scripts/train/finetune_eval_7b_pmod_llava_1_5.sh.

:page_facing_up: Citation

If you find our work helpful for your research and applications, please cite our paper:

@article{zhang2024pmod,
  title={p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay},
  author={Zhang, Jun and Meng, Desen and Qi, Ji and Huang, Zhenpeng and Wu, Tao and Wang, Limin},
  journal={arXiv preprint arXiv:2412.04449},
  year={2024}
}

:dizzy: Acknowledgement

LLaVA and LLaVA-NeXT: The codebases we built upon.
lmms-eval: We use this amazing framework for evaluation.