MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
May 25, 2025 · View on GitHub
Haojie Duanmu1,2 | Xiuhong Li3 | Zhihang Yuan3 | Size Zheng4 | Jiangfei Duan5
Xingcheng Zhang2 | Dahua Lin2,5
1SJTU, 2Shanghai AI Laboratory, 3PKU, 4Bytedance Seed, 5CUHK.
TL;DR
Can we design a quantization scheme specifically tailored to MoE models that effectively balances model accuracy and computational efficiency?
Insight:
- Linear-block level: Linear blocks exhibit varying quantization sensitivity
- Expert level: Imbalanced activation frequencies and heterogeneous computational characteristics (e.g., some experts are compute-bound while others are memory-bound).
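As a back-of-the-envelope illustration of the expert-level point (our own sketch, not code from this repo): an expert's GEMM only becomes compute-bound once enough tokens are routed to it for its arithmetic intensity to exceed the GPU's FLOPs-to-bandwidth ratio, so hot experts tend to be compute-bound while rarely activated experts stay memory-bound. The peak numbers below are rough RTX 4090 figures and the shapes are roughly Qwen1.5-MoE expert sizes; both are illustrative assumptions.

```python
# Rough roofline check (illustrative sketch, not part of MxMoE): classify one
# expert's GEMM as compute- or memory-bound from the tokens routed to it.
def expert_boundness(m_tokens, k, n, bytes_per_elem=2,
                     peak_tflops=165.0, mem_bw_gbs=1008.0):
    """[m_tokens, k] @ [k, n] GEMM for one expert (FP16, rough RTX 4090 peaks)."""
    flops = 2.0 * m_tokens * k * n
    bytes_moved = bytes_per_elem * (m_tokens * k + k * n + m_tokens * n)
    intensity = flops / bytes_moved                   # FLOPs per byte
    ridge = peak_tflops * 1e12 / (mem_bw_gbs * 1e9)   # machine balance point
    return "compute-bound" if intensity > ridge else "memory-bound"

# A frequently routed ("hot") expert vs. a rarely activated one
# (hidden=2048, expert FFN dim=1408, roughly Qwen1.5-MoE sizes).
print(expert_boundness(4096, 2048, 1408))  # -> compute-bound
print(expert_boundness(16,   2048, 1408))  # -> memory-bound
```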
Approach:
We explore the automated design of mixed-precision quantization schemes for MoE models:
- Assign bitwidth at linear-block level.
- Optimize bitwidth allocation (formulated as an ILP) by taking both model accuracy (quantization loss estimation) and computational efficiency (performance model) into consideration (see the sketch after this list).
- Generate mixed-precision GroupGEMM operator through template-based kernel generation.
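To make the ILP concrete, here is a minimal toy formulation (our own sketch using PuLP; the variable names, loss/latency numbers, and budget are illustrative and do not reflect the actual mxmoe.quant.bits_solver interface): each linear block picks exactly one quantization config so that the summed estimated quantization loss is minimized under a latency budget derived from kernel profiling.

```python
# Toy bit-width allocation ILP (illustrative sketch; the real solver is
# mxmoe.quant.bits_solver and its inputs differ). Each linear block picks
# exactly one quant config, minimizing summed quantization loss under a
# latency budget taken from the kernel performance model.
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

blocks = ["e0_gate_proj", "e0_up_proj", "e1_gate_proj", "e1_up_proj"]
configs = ["w4a4_g-1_sym", "w8a8_g-1_sym"]

# Stand-in numbers for calibrated quant loss and profiled kernel latency (us).
loss = {("e0_gate_proj", "w4a4_g-1_sym"): 0.90, ("e0_gate_proj", "w8a8_g-1_sym"): 0.20,
        ("e0_up_proj",   "w4a4_g-1_sym"): 0.30, ("e0_up_proj",   "w8a8_g-1_sym"): 0.10,
        ("e1_gate_proj", "w4a4_g-1_sym"): 0.50, ("e1_gate_proj", "w8a8_g-1_sym"): 0.10,
        ("e1_up_proj",   "w4a4_g-1_sym"): 0.20, ("e1_up_proj",   "w8a8_g-1_sym"): 0.05}
lat = {(b, "w4a4_g-1_sym"): 10.0 for b in blocks} | {(b, "w8a8_g-1_sym"): 18.0 for b in blocks}
latency_budget = 50.0  # total GroupGEMM latency allowed for these blocks

prob = LpProblem("bitwidth_allocation", LpMinimize)
x = {(b, c): LpVariable(f"x_{i}_{j}", cat=LpBinary)
     for i, b in enumerate(blocks) for j, c in enumerate(configs)}

prob += lpSum(loss[b, c] * x[b, c] for b in blocks for c in configs)   # total quant loss
for b in blocks:                                                       # one config per block
    prob += lpSum(x[b, c] for c in configs) == 1
prob += lpSum(lat[b, c] * x[b, c] for b in blocks for c in configs) <= latency_budget

prob.solve()
print({b: next(c for c in configs if x[b, c].value() == 1) for b in blocks})
```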

Results

Usage
- Prerequisites:

  ```bash
  # 0. virtual env
  conda create -n mxmoe python=3.12 -y
  # 1. source code dependencies
  git submodule update --init --recursive
  # 2. python package dependencies
  pip install -r requirements.txt
  cd mxmoe/3rdparty/fast-hadamard-transform && pip install . && cd -
  ```

- View the activation statistics of MoE models:

  ```bash
  # e.g. sample data from humaneval-x to observe qwen2_moe (in fact qwen1.5moe)
  CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.moe_tracer --model qwen2_moe --trace_gate --dataset humaneval-x
  ```

- Calibration

  - Get the quant loss of each linear block of `<MOE_MODEL>` (e.g. `qwen2_moe`) under a given `<QUANTIZATION_CONFIG>` (e.g. `w4a4_g-1_sym`):

    ```bash
    CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.quant calib --model qwen2_moe --method rtn --metric layer_out_norm --qcfg w4a4_g-1_sym
    ```

  - Solve the ILP based on the quant loss and kernel profile. The resulting quantization scheme is saved in `qconfigs`:

    ```bash
    # e.g. reproduce MxMoE w5a5
    python -m mxmoe.quant.bits_solver --model qwen2_moe --qtype gptq-had --wbits 5.0 --solve_mode layer --batch 8192 --filter_list w4a4_g-1_sym w8a8_g-1_sym
    ```

- Accuracy Eval. You can reproduce the experiments in the paper by setting the corresponding tasks and quantization configs:

  ```bash
  # e.g. evaluate qwen1.5_moe under the RTN w4a4_g-1_sym quantization config
  CUDA_VISIBLE_DEVICES=2 python -m mxmoe.quant.quant eval --model qwen2_moe --method rtn-had --qstr w4a4_g-1_sym --tasks ppl
  ```

- Performance (Computational Efficiency) Eval. After obtaining the mixed-precision scheme from the calibration step above, the corresponding GroupGEMM kernel can be generated automatically; please refer to `run_mxmoe_gg.py`:

  ```bash
  # e.g. test GroupGEMM in layer 11 of the qwen2_moe model (FP16)
  python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11
  # e.g. test GroupGEMM in layer 11 of the qwen2_moe model under mixed precision
  python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11 --qconfig <QCONFIG> --tile_config <TCONFIG>
  ```
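For intuition about what the generated operator does, here is a conceptual host-side sketch of mixed-precision GroupGEMM dispatch (our own PyTorch illustration; the real MxMoE operator is a fused CUDA kernel generated from templates, not this Python loop): expert GEMMs are bucketed by their assigned quantization config, and each bucket is executed by a kernel specialized for that precision.

```python
# Conceptual mixed-precision GroupGEMM dispatch (illustration only; MxMoE
# generates fused CUDA GroupGEMM kernels from templates rather than looping here).
from collections import defaultdict
import torch

def mixed_precision_groupgemm(acts, weights, qconfig, kernels):
    """acts[e]: [m_e, k] tokens routed to expert e; weights[e]: [k, n];
    qconfig[e]: quant config name; kernels[cfg]: GEMM callable for that precision."""
    buckets = defaultdict(list)              # group expert GEMMs by precision
    for e, cfg in qconfig.items():
        buckets[cfg].append(e)
    out = {}
    for cfg, experts in buckets.items():     # one precision-specialized (group) kernel per bucket
        for e in experts:
            out[e] = kernels[cfg](acts[e], weights[e])
    return out

# Toy usage: plain matmul stands in for the real quantized kernels.
kernels = {"w4a4_g-1_sym": torch.matmul, "w8a8_g-1_sym": torch.matmul}
acts = {0: torch.randn(128, 64), 1: torch.randn(8, 64)}   # a hot and a cold expert
wts = {0: torch.randn(64, 32), 1: torch.randn(64, 32)}
out = mixed_precision_groupgemm(acts, wts, {0: "w4a4_g-1_sym", 1: "w8a8_g-1_sym"}, kernels)
```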
Limitations
⚠️ MxMoE is mainly tested on the RTX 4090. The current implementation cannot fully exploit Hopper and later GPU architectures.
⚠️ The profiling step is somewhat time-consuming. We plan to upload our previously computed quant-loss statistics and kernel profile data soon so that you can skip this step.
Citation
If you find MxMoE useful or relevant to your project and research, please cite our paper:
@article{duanmu2025mxmoe,
title={MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design},
author={Duanmu, Haojie and Li, Xiuhong and Yuan, Zhihang and Zheng, Size and Duan, Jiangfei and Zhang, Xingcheng and Lin, Dahua},
journal={arXiv preprint arXiv:2505.05799},
year={2025}
}