[CVPR 2026 Findings] FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

March 20, 2026

[Paper] [🇨🇳 Chinese version]

FastMMoE is a training-free acceleration framework tailored for Mixture-of-Experts (MoE) based Multimodal Large Language Models (MLLMs). By leveraging Dynamic Expert Activation and Routing-Aware Token Pruning, FastMMoE significantly reduces inference latency and computational costs (FLOPs) while maintaining model performance.

This repository contains the official implementation of FastMMoE, supporting InternVL3.5 and DeepSeek-VL2.


📂 Repository Structure

The code is organized into three main directories, each serving a specific purpose:

├── DeepSeek-VL2       # Implementation for DeepSeek-VL2 (Performance Benchmark)
│   ├── DeepSeek_VL2_paper.pdf
│   ├── VLMEvalKit     # Modified VLMEvalKit for DeepSeek-VL2 evaluation
│   ├── deepseek_vl2   # Core model code
│   └── ...
├── Internvl3_5        # Implementation for InternVL3.5 (Performance Benchmark)
│   └── VLMEvalKit     # Modified VLMEvalKit for InternVL3.5 evaluation
├── lmms-eval          # Latency/Speed Benchmark (Supports Batch Size > 1)
└── README.md

  • Internvl3_5 & DeepSeek-VL2: These folders are used for Performance/Accuracy Benchmarking (e.g., MMMU, OCRBench) using VLMEvalKit.
  • lmms-eval: This folder is dedicated to Latency/Speed Testing (specifically for prefill latency), as the standard VLMEvalKit does not support batch_size > 1.
  • Note: lmms-eval should only be used for speed measurement; the accuracy metrics generated here may not be representative.

🛠️ Installation & Environment

Important: This repository requires three separate virtual environments corresponding to the three main folders to avoid dependency conflicts.

1. Environment for InternVL3.5

Used for performance evaluation of InternVL3.5.

conda create -n internvl3 python=3.10 -y
conda activate internvl3

cd Internvl3_5/VLMEvalKit
pip install -e .

pip install flash-attn --no-build-isolation

2. Environment for DeepSeek-VL2

Used for performance evaluation of DeepSeek-VL2. Note the two-step installation.

conda create -n deepseek_vl python=3.10 -y
conda activate deepseek_vl

cd DeepSeek-VL2
pip install -e .
cd VLMEvalKit
pip install -e .

pip install flash-attn --no-build-isolation

3. Environment for Latency Testing (lmms-eval)

Used for latency measurement.

conda create -n lmms_test python=3.10 -y
conda activate lmms_test

cd lmms-eval
pip install -e .

pip install flash-attn --no-build-isolation

βš™οΈ Configuration

Before running the evaluation scripts, you must configure the local paths to your model checkpoints.

For InternVL3.5: Modify Internvl3_5/VLMEvalKit/vlmeval/config.py:

"InternVL3_5-30B-A3B": partial(
    InternVLChat, 
    model_path="/path/to/your/models/opengvlab__internvl3_5-30b-a3b", # <--- Modify this
    version="V2.0"
),

For DeepSeek-VL2: Modify DeepSeek-VL2/VLMEvalKit/vlmeval/config.py:

deepseekvl2_series = {
    "deepseek_vl2_tiny": partial(
        DeepSeekVL2, model_path="/path/to/your/models/deepseek-vl2-tiny" # <--- Modify this
    ),
    # ... modify other sizes as needed
    "deepseek_vl2": partial(DeepSeekVL2, model_path="/path/to/your/models/deepseek-vl2"),
}
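Since both config files point at local checkpoint directories, a quick check that those directories exist can save a failed launch. The snippet below is a minimal sketch and not part of the repository: `find_missing_checkpoints` is a hypothetical helper, and the paths are the placeholders from the config examples above.

```python
from pathlib import Path


def find_missing_checkpoints(model_paths):
    """Return the entries whose path is not an existing directory."""
    return {
        name: path
        for name, path in model_paths.items()
        if not Path(path).is_dir()
    }


# Placeholder paths copied from the config examples; replace with your own.
model_paths = {
    "InternVL3_5-30B-A3B": "/path/to/your/models/opengvlab__internvl3_5-30b-a3b",
    "deepseek_vl2_tiny": "/path/to/your/models/deepseek-vl2-tiny",
}

for name, path in find_missing_checkpoints(model_paths).items():
    print(f"{name}: checkpoint directory not found at {path}")
```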


🚀 Usage

1. InternVL3.5 Performance Evaluation

Run the script located at Internvl3_5/VLMEvalKit/scripts/run_internvl.sh.

Key Hyperparameters:

  • REDUCTION_LAYER_IDX: The starting layer index for expert activation reduction.
  • VISION_EXPERT_REDUCE_FACTOR: The ratio of experts to keep (e.g., 0.5 means halving the active experts).
  • MERGE_RATIO: The merge ratio.
  • KEEP_TOKEN_RATIO: Target token retention ratio (e.g., 0.75, 0.5, 0.25).

# Example content of run_internvl.sh
export LMUData="your/data/path"
export REDUCTION_LAYER_IDX=2
export VISION_EXPERT_REDUCE_FACTOR=0.5
export TOKEN_MERGE_STRATEGY='hybrid'
export BASE_ALPHA=0.5
export MERGE_RATIO=0.025
export TOKEN_MERGE_METHOD='mlerp'
export MERGE_LAYER_LOCS="5,8,12"
export KEEP_TOKEN_RATIO="0.75" 

# Script logic automatically converts 0.75 -> "0.91,0.91,0.91" (per stage)

GPU=8
torchrun --nproc-per-node=$GPU run.py --model InternVL3_5-30B-A3B --data MMMU_DEV_VAL --verbose --work-dir './outputs/InternVL3_5-30B-A3B'
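The script comment notes that an overall KEEP_TOKEN_RATIO of 0.75 becomes "0.91,0.91,0.91" per stage. One conversion consistent with that example is to take the n-th root of the overall ratio, where n is the number of entries in MERGE_LAYER_LOCS, since 0.91 × 0.91 × 0.91 ≈ 0.75. The sketch below assumes this rule; `per_stage_keep_ratios` is a hypothetical name, and the script's actual logic may differ.

```python
def per_stage_keep_ratios(target_ratio: float, num_stages: int) -> str:
    """Spread an overall keep ratio evenly across pruning stages.

    Assumption: every stage keeps the same fraction r of the tokens it
    receives, so r ** num_stages == target_ratio.
    """
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    per_stage = target_ratio ** (1.0 / num_stages)
    # Format as the comma-separated string used by KEEP_TOKEN_RATIO.
    return ",".join(f"{per_stage:.2f}" for _ in range(num_stages))


merge_layer_locs = "5,8,12"  # same value as MERGE_LAYER_LOCS above
num_stages = len(merge_layer_locs.split(","))
print(per_stage_keep_ratios(0.75, num_stages))  # prints 0.91,0.91,0.91
```

Under the same assumption, overall targets of 0.5 and 0.25 with three stages give per-stage ratios of 0.79 and 0.63.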

2. DeepSeek-VL2 Performance Evaluation

Run the script located at DeepSeek-VL2/VLMEvalKit/scripts/run_deepseek.sh.

# Example content of run_deepseek.sh
export LMUData="your/data/path"
export REDUCTION_LAYER_IDX=2
export VISION_EXPERT_REDUCE_FACTOR=0.5
export MERGE_LAYER_LOCS="2,5,8"
export TOKEN_MERGE_STRATEGY="hybrid"
export TOKEN_MERGE_METHOD='mlerp'
export BASE_ALPHA=0.3
export MERGE_RATIO=0.05
export KEEP_TOKEN_RATIO="0.91,0.91,0.91"
export ROUTING_SIMILARITY_WINDOW_SIZE=3

GPU=8
torchrun --nproc-per-node=$GPU run.py --model deepseek_vl2 --data OCRBench ScienceQA_TEST --verbose --work-dir ./outputs/deepseek_vl2 --reuse --judge exact_matching

3. Latency Benchmarking (lmms-eval)

We use the mmmu_val task in lmms-eval for speed benchmarking because lmms-eval supports batched inference (batch_size > 1).

Data Configuration: Before running, you need to configure the dataset path in lmms-eval/lmms_eval/tasks/mmmu/mmmu_val.yaml:

# In lmms-eval/lmms_eval/tasks/mmmu/mmmu_val.yaml
dataset_path: /path/to/your/data/lmms_data/llms_MMMU  # <--- Modify this to your local path

Running the Benchmark: Execute the following command to start the speed test:

bash lmms-eval/examples/models/internvl3.sh mmmu_val

Note: The accuracy metrics output by this script are not representative. Use it strictly for prefill latency measurement.
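For readers who want to time a prefill step by hand, the general pattern is to run a few warm-up iterations and then take the median over repeated timed runs. The helper below is a generic sketch of that pattern, not the measurement code used by the script; `measure_latency_ms` and its parameters are hypothetical names.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=2, repeats=5, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    sync is an optional callable invoked before and after each timed run,
    so that asynchronous work is finished when the clock is read.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Dummy CPU workload standing in for a single prefill forward pass.
latency = measure_latency_ms(lambda: sum(range(100_000)))
print(f"median latency: {latency:.3f} ms")
```

On a GPU, `fn` would wrap one forward pass over the prompt and `sync` would be `torch.cuda.synchronize`, because CUDA kernels are launched asynchronously.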

🧩 Core Algorithm Implementation

The core logic for Dynamic Expert Activation and Routing-Aware Token Pruning is implemented in the following directories:

| Model Architecture | Implementation Path |
| --- | --- |
| InternVL3.5 | Internvl3_5/VLMEvalKit/vlmeval/vlm/internvl |
| DeepSeek-VL2 | DeepSeek-VL2/deepseek_vl2/models |
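To give a feel for what REDUCTION_LAYER_IDX and VISION_EXPERT_REDUCE_FACTOR control, the sketch below shows one simple reading of them: from the reduction layer onward, vision tokens are routed to a reduced number of top-scoring experts. This is an illustration under stated assumptions, not the code in the directories above. The function name `select_experts`, the floor rounding, the fixed (rather than per-token dynamic) reduction, and the restriction to vision tokens are all assumptions.

```python
import math


def select_experts(router_scores, top_k, layer_idx, is_vision_token,
                   reduction_layer_idx=2, vision_expert_reduce_factor=0.5):
    """Return the indices of the experts activated for one token.

    Assumption: vision tokens in layers at or after reduction_layer_idx
    keep only a fraction of the usual top_k experts (at least one).
    """
    k = top_k
    if is_vision_token and layer_idx >= reduction_layer_idx:
        k = max(1, math.floor(top_k * vision_expert_reduce_factor))
    # Rank experts by router score, highest first, and keep the first k.
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


scores = [0.05, 0.40, 0.10, 0.25, 0.20]  # made-up router scores for 5 experts
print(select_experts(scores, top_k=4, layer_idx=5, is_vision_token=False))  # [1, 3, 4, 2]
print(select_experts(scores, top_k=4, layer_idx=5, is_vision_token=True))   # [1, 3]
print(select_experts(scores, top_k=4, layer_idx=1, is_vision_token=True))   # [1, 3, 4, 2]
```

Halving the active experts for vision tokens halves the expert computation those tokens incur in the affected layers, which is where the FLOPs reduction in this reading would come from.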

πŸ™ Acknowledgments

We sincerely appreciate the open-source contributions from the projects this repository builds on, including VLMEvalKit, lmms-eval, InternVL3.5, and DeepSeek-VL2, which provided valuable codebases and baselines for our research.


📜 Citation

If you find FastMMoE useful for your research, please cite our paper:

@article{xia2025fastmmoe,
  title={FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning},
  author={Xia, Guoyang and Ding, Yifeng and Li, Fengfa and Ren, Lei and Chen, Wei and Feng, Fangxiang and Wang, Xiaojie},
  journal={arXiv preprint arXiv:2511.17885},
  year={2025}
}