[ICCV 2025] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

June 26, 2025

Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, and Limin Wang.

We present p-MoD, a series of efficient MLLMs which features:

  • :scissors: Mixture-of-Depths mechanism, upgraded with tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
  • :roller_coaster: Progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer.
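
To make the idea of progressive ratio decay concrete, here is a minimal sketch. It is not the repository's implementation. It assumes a cosine-shaped decay from a retention ratio of 1.0 at the first layer down to a floor at the last layer. The function names `retention_ratios` and `tokens_kept`, the `min_ratio` parameter, the cosine shape, and the numbers in the usage lines (32 layers, 576 tokens) are all illustrative assumptions.

```python
import math


def retention_ratios(num_layers: int, min_ratio: float = 0.1) -> list[float]:
    """Hypothetical schedule: the token retention ratio decays layer by layer.

    Assumption: cosine-shaped decay from 1.0 (first layer) to min_ratio (last layer).
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [1.0]
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        ratio = min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))
        ratios.append(ratio)
    return ratios


def tokens_kept(num_tokens: int, ratios: list[float]) -> list[int]:
    """Number of tokens each layer would process under the given ratios (at least 1)."""
    return [max(1, round(num_tokens * r)) for r in ratios]


# Illustrative values only: 32 layers and 576 tokens are assumptions, not repo settings.
ratios = retention_ratios(32, min_ratio=0.1)
for layer, (r, k) in enumerate(zip(ratios, tokens_kept(576, ratios))):
    print(f"layer {layer:2d}: ratio={r:.3f}, tokens kept={k}")
```

In this sketch the early layers keep nearly all tokens and the deepest layers keep the fewest, so the savings are concentrated in the later layers.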

:fire: News

  • 2025.06.26: Our paper has been accepted to ICCV 2025 :tada: :tada:!

:closed_book: Performance and Efficiency

p-MoD matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.

:hammer_and_wrench: Requirements and Installation

  1. Clone this repository and navigate to the folder
git clone https://github.com/MCG-NJU/p-MoD.git
cd p-MoD
  2. Install packages
conda create -n p-mod python=3.10 -y
conda activate p-mod
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e lmms-eval
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
  4. Log in to Hugging Face and wandb
huggingface-cli login
wandb login

:tiger: Model Zoo

| Model | LLM | Epoch | Pretrain Data | SFT Data |
|---|---|---|---|---|
| p-MoD-LLaVA-NeXT-7B | Vicuna-7B | 1 | 558K | 779K |
| p-MoD-LLaVA-v1.5-7B | Vicuna-7B | 1 | 558K | 665K |

:bar_chart: Evaluation

We evaluate our model using lmms-eval. You can use our script ./scripts/lmms-eval/eval.sh, for example:

bash ./scripts/lmms-eval/eval.sh \
  --ckpt MCG-NJU/p-MoD-LLaVA-NeXT-7B \
  --eval_tasks ai2d,chartqa \
  --project_name pmod \
  --run_name pmod-llava-next-7b-ft
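
If you want to launch evaluations from Python, for instance to sweep over several task lists, the command above can be assembled programmatically. The sketch below uses only the flags shown in the example (`--ckpt`, `--eval_tasks`, `--project_name`, `--run_name`). The helper name `build_eval_command` is hypothetical and not part of the repository.

```python
import shlex


def build_eval_command(ckpt, eval_tasks, project_name, run_name,
                       script="./scripts/lmms-eval/eval.sh"):
    """Hypothetical helper: build the argv list for the evaluation script.

    Only the flags shown in the README example are used.
    """
    return [
        "bash", script,
        "--ckpt", ckpt,
        "--eval_tasks", ",".join(eval_tasks),  # tasks are passed comma-separated
        "--project_name", project_name,
        "--run_name", run_name,
    ]


cmd = build_eval_command(
    ckpt="MCG-NJU/p-MoD-LLaVA-NeXT-7B",
    eval_tasks=["ai2d", "chartqa"],
    project_name="pmod",
    run_name="pmod-llava-next-7b-ft",
)
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems if a run name ever contains spaces.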

:rocket: Train

Pretraining

We use the pretrained MLP projector provided by LLaVA, which can be downloaded here. Then put the downloaded model weights under ./checkpoints/llava-v1.5-7b-pretrain/llava-official-checkpoint.

p-MoD-LLaVA-NeXT

First, run our Python script ./util_scripts/download_llava-next_data.py for data preparation. This script downloads the 779K LLaVA-NeXT data, saves the images under ./playground/data/llava_next_images/, and writes the data JSON to ./playground/data/llava_next_data.json.
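
Before starting a long training run, it can help to confirm that the download produced the expected layout. The following is a small sanity check written for illustration. It relies only on the two paths named above, and the function name `missing_llava_next_paths` is hypothetical.

```python
from pathlib import Path


def missing_llava_next_paths(root="."):
    """Return the expected LLaVA-NeXT data paths that are missing under `root`.

    The two paths are the ones named in this README; the check itself is an
    illustrative helper, not part of the repository.
    """
    root = Path(root)
    images_dir = root / "playground" / "data" / "llava_next_images"
    data_json = root / "playground" / "data" / "llava_next_data.json"

    missing = []
    if not images_dir.is_dir():
        missing.append(str(images_dir))
    if not data_json.is_file():
        missing.append(str(data_json))
    return missing


# Run from the repository root; an empty list means both paths exist.
print(missing_llava_next_paths("."))
```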

Then you can start training using ./scripts/train/finetune_eval_7b_pmod_llava_next.sh.

p-MoD-LLaVA-1.5

First, prepare instruction tuning data following LLaVA-1.5. Download the images from the constituent datasets, and the dataset annotation JSON llava_v1_5_mix665k.json. Save the images and the JSON under ./playground/data.

Then, fix some broken examples in the data JSON by running the script:

python util_scripts/clean_data_json.py \
--original_json_path ./playground/data/llava_v1_5_mix665k.json \
--cleaned_json_path ./playground/data/llava_v1_5_mix665k_cleaned.json
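
To see how much the cleaning step changed the data, you can compare the number of entries in the two files. This sketch assumes that both JSON files hold a top-level list of examples, as in the LLaVA data format. It does not reproduce what clean_data_json.py does, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def count_examples(json_path):
    """Return the number of top-level entries in a data JSON (assumed to be a list)."""
    with Path(json_path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{json_path} does not contain a top-level list")
    return len(data)


def cleaning_summary(original_json_path, cleaned_json_path):
    """Compare entry counts before and after cleaning."""
    original = count_examples(original_json_path)
    cleaned = count_examples(cleaned_json_path)
    return {"original": original, "cleaned": cleaned, "difference": original - cleaned}


# Example usage, after running the cleaning script:
#   print(cleaning_summary(
#       "./playground/data/llava_v1_5_mix665k.json",
#       "./playground/data/llava_v1_5_mix665k_cleaned.json",
#   ))
```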

Start training with ./scripts/train/finetune_eval_7b_pmod_llava_1_5.sh.

:page_facing_up: Citation

If you find our work helpful for your research and applications, please cite our paper:

@article{zhang2024pmod,
  title={p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay},
  author={Zhang, Jun and Meng, Desen and Qi, Ji and Huang, Zhenpeng and Wu, Tao and Wang, Limin},
  journal={arXiv preprint arXiv:2412.04449},
  year={2024}
}

:dizzy: Acknowledgement