[CVPR 2026 Findings] FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning
March 20, 2026
FastMMoE is a training-free acceleration framework tailored for Mixture-of-Experts (MoE) based Multimodal Large Language Models (MLLMs). By leveraging Dynamic Expert Activation and Routing-Aware Token Pruning, FastMMoE significantly reduces inference latency and computational costs (FLOPs) while maintaining model performance.
This repository contains the official implementation of FastMMoE, supporting InternVL3.5 and DeepSeek-VL2.
Repository Structure
The code is organized into three main directories, each serving a specific purpose:
├── DeepSeek-VL2            # Implementation for DeepSeek-VL2 (Performance Benchmark)
│   ├── DeepSeek_VL2_paper.pdf
│   ├── VLMEvalKit          # Modified VLMEvalKit for DeepSeek-VL2 evaluation
│   ├── deepseek_vl2        # Core model code
│   └── ...
├── Internvl3_5             # Implementation for InternVL3.5 (Performance Benchmark)
│   └── VLMEvalKit          # Modified VLMEvalKit for InternVL3.5 evaluation
├── lmms-eval               # Latency/Speed Benchmark (Supports Batch Size > 1)
└── README.md
- Internvl3_5 & DeepSeek-VL2: These folders are used for Performance/Accuracy Benchmarking (e.g., MMMU, OCRBench) using VLMEvalKit.
- lmms-eval: This folder is dedicated to Latency/Speed Testing (specifically prefill latency), because the standard VLMEvalKit does not support batch_size > 1.
- Note: lmms-eval should only be used for speed measurement; the accuracy metrics generated here may not be representative.
Installation & Environment
Important: This repository requires three separate virtual environments corresponding to the three main folders to avoid dependency conflicts.
1. Environment for InternVL3.5
Used for performance evaluation of InternVL3.5.
conda create -n internvl3 python=3.10 -y
conda activate internvl3
cd Internvl3_5/VLMEvalKit
pip install -e .
pip install flash-attn --no-build-isolation
2. Environment for DeepSeek-VL2
Used for performance evaluation of DeepSeek-VL2. Note the two-step installation.
conda create -n deepseek_vl python=3.10 -y
conda activate deepseek_vl
cd DeepSeek-VL2
pip install -e .
cd VLMEvalKit
pip install -e .
pip install flash-attn --no-build-isolation
3. Environment for Latency Testing (lmms-eval)
Used for latency measurement.
conda create -n lmms_test python=3.10 -y
conda activate lmms_test
cd lmms-eval
pip install -e .
pip install flash-attn --no-build-isolation
Configuration
Before running the evaluation scripts, you must configure the local paths to your model checkpoints.
For InternVL3.5:
Modify Internvl3_5/VLMEvalKit/vlmeval/config.py:
"InternVL3_5-30B-A3B": partial(
    InternVLChat,
    model_path="/path/to/your/models/opengvlab__internvl3_5-30b-a3b",  # <--- Modify this
    version="V2.0"
),
For DeepSeek-VL2:
Modify DeepSeek-VL2/VLMEvalKit/vlmeval/config.py:
deepseekvl2_series = {
    "deepseek_vl2_tiny": partial(
        DeepSeekVL2, model_path="/path/to/your/models/deepseek-vl2-tiny"  # <--- Modify this
    ),
    # ... modify other sizes as needed
    "deepseek_vl2": partial(DeepSeekVL2, model_path="/path/to/your/models/deepseek-vl2"),
}
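Both entries rely on functools.partial, which binds the checkpoint path without constructing the model until the entry is called. The sketch below illustrates that pattern with a stand-in class. DummyVLM, supported_models, and the key my_model are hypothetical names used only for illustration; they are not part of VLMEvalKit.

```python
from functools import partial


class DummyVLM:
    """Hypothetical stand-in for a model class such as InternVLChat or DeepSeekVL2."""

    def __init__(self, model_path, version=None):
        # A real model class would load weights from model_path here.
        self.model_path = model_path
        self.version = version


# Registry in the same shape as the config.py snippets above (hypothetical names).
supported_models = {
    "my_model": partial(
        DummyVLM,
        model_path="/path/to/your/models/my_model",  # the only value you edit
        version="V2.0",
    ),
}

# Nothing is constructed until the registry entry is called.
model = supported_models["my_model"]()
print(model.model_path)
```

In the Usage commands, the value passed to --model matches a dictionary key (InternVL3_5-30B-A3B, deepseek_vl2), so it is probably safest to edit only model_path and leave the keys unchanged.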
Usage
1. InternVL3.5 Performance Evaluation
Run the script located at Internvl3_5/VLMEvalKit/scripts/run_internvl.sh.
Key Hyperparameters:
- REDUCTION_LAYER_IDX: The starting layer index for expert activation reduction.
- VISION_EXPERT_REDUCE_FACTOR: The ratio of experts to keep (e.g., 0.5 means halving the active experts).
- MERGE_RATIO: The merge ratio.
- KEEP_TOKEN_RATIO: Target retention ratio (e.g., 0.75, 0.5, 0.25).
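As a rough illustration of how the first two hyperparameters interact, the sketch below computes how many experts would stay active in each layer. It assumes that layers before REDUCTION_LAYER_IDX keep the model's default top-k, that later layers keep round(top_k * factor) experts with a floor of one, and that the reduction applies only to vision tokens, as the variable name suggests. The function name, the rounding rule, and the example top-k of 8 are assumptions, not taken from the FastMMoE code.

```python
def active_vision_experts(num_layers, base_top_k, reduction_layer_idx, reduce_factor):
    """Return the assumed number of active experts per layer for vision tokens."""
    if not 0.0 < reduce_factor <= 1.0:
        raise ValueError("reduce_factor must be in (0, 1]")
    # Assumed rounding rule: nearest integer, never fewer than one expert.
    reduced_top_k = max(1, round(base_top_k * reduce_factor))
    return [
        base_top_k if layer < reduction_layer_idx else reduced_top_k
        for layer in range(num_layers)
    ]


# Illustrative values: 6 layers, top-8 routing, reduction from layer 2, factor 0.5.
schedule = active_vision_experts(
    num_layers=6, base_top_k=8, reduction_layer_idx=2, reduce_factor=0.5
)
print(schedule)  # [8, 8, 4, 4, 4, 4]
```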
# Example content of run_internvl.sh
export LMUData="your/data/path"
export REDUCTION_LAYER_IDX=2
export VISION_EXPERT_REDUCE_FACTOR=0.5
export TOKEN_MERGE_STRATEGY='hybrid'
export BASE_ALPHA=0.5
export MERGE_RATIO=0.025
export TOKEN_MERGE_METHOD='mlerp'
export MERGE_LAYER_LOCS="5,8,12"
export KEEP_TOKEN_RATIO="0.75"
# Script logic automatically converts 0.75 -> "0.91,0.91,0.91" (per stage)
GPU=8
torchrun --nproc-per-node=$GPU run.py --model InternVL3_5-30B-A3B --data MMMU_DEV_VAL --verbose --work-dir './outputs/InternVL3_5-30B-A3B'
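The comment in the script says that a target KEEP_TOKEN_RATIO of 0.75 becomes 0.91 per stage. That is consistent with taking the N-th root of the target over N pruning stages, since 0.91^3 ≈ 0.75 and MERGE_LAYER_LOCS lists three layers. The sketch below reproduces that arithmetic; the function name is hypothetical and the geometric split is an assumption inferred from the comment, not read from the script.

```python
def per_stage_keep_ratio(target_ratio, num_stages):
    """Split an overall keep ratio evenly (geometrically) across pruning stages."""
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Applying the same ratio r at every stage keeps r ** num_stages overall.
    return target_ratio ** (1.0 / num_stages)


num_stages = len("5,8,12".split(","))  # three entries in MERGE_LAYER_LOCS
for target in (0.75, 0.5, 0.25):
    ratio = per_stage_keep_ratio(target, num_stages)
    print(f"{target} -> {ratio:.2f} per stage")
# 0.75 -> 0.91 per stage
# 0.5 -> 0.79 per stage
# 0.25 -> 0.63 per stage
```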
2. DeepSeek-VL2 Performance Evaluation
Run the script located at DeepSeek-VL2/VLMEvalKit/scripts/run_deepseek.sh.
# Example content of run_deepseek.sh
export LMUData="your/data/path"
export REDUCTION_LAYER_IDX=2
export VISION_EXPERT_REDUCE_FACTOR=0.5
export MERGE_LAYER_LOCS="2,5,8"
export TOKEN_MERGE_STRATEGY="hybrid"
export TOKEN_MERGE_METHOD='mlerp'
export BASE_ALPHA=0.3
export MERGE_RATIO=0.05
export KEEP_TOKEN_RATIO="0.91,0.91,0.91"
export ROUTING_SIMILARITY_WINDOW_SIZE=3
GPU=8
torchrun --nproc-per-node=$GPU run.py --model deepseek_vl2 --data OCRBench ScienceQA_TEST --verbose --work-dir ./outputs/deepseek_vl2 --reuse --judge exact_matching
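In this script, MERGE_LAYER_LOCS and KEEP_TOKEN_RATIO are comma-separated lists with one entry per pruning stage. The sketch below shows one way such values could be parsed and what they imply for the visual token count. The helper names, the length check, and the rounding rule are assumptions for illustration; the repository's model code may read these variables differently.

```python
import os


def parse_stage_config(env=os.environ):
    """Pair each merge layer with its keep ratio (hypothetical helper)."""
    layers = [int(x) for x in env["MERGE_LAYER_LOCS"].split(",")]
    ratios = [float(x) for x in env["KEEP_TOKEN_RATIO"].split(",")]
    if len(layers) != len(ratios):
        raise ValueError("MERGE_LAYER_LOCS and KEEP_TOKEN_RATIO need equal lengths")
    return list(zip(layers, ratios))


def tokens_after_each_stage(num_tokens, stages):
    """Apply each stage's keep ratio in turn (assumed rounding: nearest, min 1)."""
    counts = []
    for _layer, ratio in stages:
        num_tokens = max(1, round(num_tokens * ratio))
        counts.append(num_tokens)
    return counts


# Values copied from the run_deepseek.sh example above.
example_env = {"MERGE_LAYER_LOCS": "2,5,8", "KEEP_TOKEN_RATIO": "0.91,0.91,0.91"}
stages = parse_stage_config(example_env)
print(stages)                                 # [(2, 0.91), (5, 0.91), (8, 0.91)]
print(tokens_after_each_stage(1000, stages))  # [910, 828, 753]
```

With 1000 visual tokens as an arbitrary starting point, roughly 75% remain after the third stage, which matches the 0.75 overall target used in the InternVL3.5 example.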
3. Latency Benchmarking (lmms-eval)
We use the mmmu_val task for speed benchmarking to support batch inference.
Data Configuration:
Before running, you need to configure the dataset path in lmms-eval/lmms_eval/tasks/mmmu/mmmu_val.yaml:
# In lmms-eval/lmms_eval/tasks/mmmu/mmmu_val.yaml
dataset_path: /path/to/your/data/lmms_data/llms_MMMU # <--- Modify this to your local path
Running the Benchmark: Execute the following command to start the speed test:
bash lmms-eval/examples/models/internvl3.sh mmmu_val
Note: The accuracy metrics output by this script are not representative. Use it strictly for prefill latency measurement.
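For readers who want to sanity-check latency numbers independently, the sketch below shows a generic timing harness with warmup runs and a median over repeated calls. It is not how lmms-eval or this repository measures prefill latency; the function name and its parameters are hypothetical, and the workload here is a placeholder.

```python
import statistics
import time


def median_latency_ms(fn, warmup=2, iters=5, sync=None):
    """Time fn() several times and return the median latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs are not timed
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()  # make sure earlier asynchronous work has finished
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this call's work before reading the clock
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Placeholder workload; replace with a call that runs one prefill pass.
latency = median_latency_ms(lambda: sum(range(100_000)))
print(f"median latency: {latency:.3f} ms")
```

On a GPU, sync would typically be torch.cuda.synchronize, since CUDA kernels launch asynchronously and the wall clock would otherwise be read too early.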
Core Algorithm Implementation
The core logic for Dynamic Expert Activation and Routing-Aware Token Pruning is implemented in the following directories:
| Model Architecture | Implementation Path |
|---|---|
| InternVL3.5 | Internvl3_5/VLMEvalKit/vlmeval/vlm/internvl |
| DeepSeek-VL2 | DeepSeek-VL2/deepseek_vl2/models |
Acknowledgments
We sincerely appreciate the open-source projects this repository builds on, including VLMEvalKit, lmms-eval, InternVL3.5, and DeepSeek-VL2, which provided valuable codebases and baselines for our research.
Citation
If you find FastMMoE useful for your research, please cite our paper:
@article{xia2025fastmmoe,
  title={FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning},
  author={Xia, Guoyang and Ding, Yifeng and Li, Fengfa and Ren, Lei and Chen, Wei and Feng, Fangxiang and Wang, Xiaojie},
  journal={arXiv preprint arXiv:2511.17885},
  year={2025}
}