MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
May 25, 2025 · View on GitHub
Haojie Duanmu1,2 | Xiuhong Li3 | Zhihang Yuan3 | Size Zheng4 | Jiangfei Duan5
Xingcheng Zhang2 | Dahua Lin2,5
1SJTU, 2Shanghai AI Laboratory, 3PKU, 4Bytedance Seed, 5CUHK.
TL;DR
Can we design a quantization scheme specifically tailored to MoE models that effectively balances model accuracy and computational efficiency?
Insight:
- Linear-block level: Linear blocks exhibit varying quantization sensitivity
- Expert level: Imbalanced activation frequencies and heterogeneous computational characteristics (e.g., some experts are compute-bound while others are memory-bound).
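As a back-of-the-envelope illustration of the expert-level point (our own sketch, not code from this repo): an expert's GEMM only becomes compute-bound once enough tokens are routed to it for its arithmetic intensity to exceed the GPU's FLOPs-to-bandwidth ratio, so hot experts tend to be compute-bound while rarely activated experts stay memory-bound. The peak numbers below are rough RTX 4090 figures and the shapes are roughly Qwen1.5-MoE expert sizes; both are illustrative assumptions.

```python
# Rough roofline check (illustrative sketch, not part of MxMoE): classify one
# expert's GEMM as compute- or memory-bound from the tokens routed to it.
def expert_boundness(m_tokens, k, n, bytes_per_elem=2,
                     peak_tflops=165.0, mem_bw_gbs=1008.0):
    """[m_tokens, k] @ [k, n] GEMM for one expert (FP16, rough RTX 4090 peaks)."""
    flops = 2.0 * m_tokens * k * n
    bytes_moved = bytes_per_elem * (m_tokens * k + k * n + m_tokens * n)
    intensity = flops / bytes_moved                   # FLOPs per byte
    ridge = peak_tflops * 1e12 / (mem_bw_gbs * 1e9)   # machine balance point
    return "compute-bound" if intensity > ridge else "memory-bound"

# A frequently routed ("hot") expert vs. a rarely activated one
# (hidden=2048, expert FFN dim=1408, roughly Qwen1.5-MoE sizes).
print(expert_boundness(4096, 2048, 1408))  # -> compute-bound
print(expert_boundness(16,   2048, 1408))  # -> memory-bound
```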
Approach:
We explore the automated design of mixed-precision quantization schemes for MoE models:
- Assign bitwidth at linear-block level.
- Optimize bitwidth allocation (formulated as an ILP) by taking both model accuracy (quantization loss estimation) and computational efficiency (performance model) into consideration (see the sketch after this list).
- Generate mixed-precision GroupGEMM operator through template-based kernel generation.
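To make the ILP concrete, here is a minimal toy formulation (our own sketch using PuLP; the variable names, loss/latency numbers, and budget are illustrative and do not reflect the actual mxmoe.quant.bits_solver interface): each linear block picks exactly one quantization config so that the summed estimated quantization loss is minimized under a latency budget derived from kernel profiling.

```python
# Toy bit-width allocation ILP (illustrative sketch; the real solver is
# mxmoe.quant.bits_solver and its inputs differ). Each linear block picks
# exactly one quant config, minimizing summed quantization loss under a
# latency budget taken from the kernel performance model.
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

blocks = ["e0_gate_proj", "e0_up_proj", "e1_gate_proj", "e1_up_proj"]
configs = ["w4a4_g-1_sym", "w8a8_g-1_sym"]

# Stand-in numbers for calibrated quant loss and profiled kernel latency (us).
loss = {("e0_gate_proj", "w4a4_g-1_sym"): 0.90, ("e0_gate_proj", "w8a8_g-1_sym"): 0.20,
        ("e0_up_proj",   "w4a4_g-1_sym"): 0.30, ("e0_up_proj",   "w8a8_g-1_sym"): 0.10,
        ("e1_gate_proj", "w4a4_g-1_sym"): 0.50, ("e1_gate_proj", "w8a8_g-1_sym"): 0.10,
        ("e1_up_proj",   "w4a4_g-1_sym"): 0.20, ("e1_up_proj",   "w8a8_g-1_sym"): 0.05}
lat = {(b, "w4a4_g-1_sym"): 10.0 for b in blocks} | {(b, "w8a8_g-1_sym"): 18.0 for b in blocks}
latency_budget = 50.0  # total GroupGEMM latency allowed for these blocks

prob = LpProblem("bitwidth_allocation", LpMinimize)
x = {(b, c): LpVariable(f"x_{i}_{j}", cat=LpBinary)
     for i, b in enumerate(blocks) for j, c in enumerate(configs)}

prob += lpSum(loss[b, c] * x[b, c] for b in blocks for c in configs)   # total quant loss
for b in blocks:                                                       # one config per block
    prob += lpSum(x[b, c] for c in configs) == 1
prob += lpSum(lat[b, c] * x[b, c] for b in blocks for c in configs) <= latency_budget

prob.solve()
print({b: next(c for c in configs if x[b, c].value() == 1) for b in blocks})
```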

Results

Usage
- Prerequisites:

  ```bash
  # 0. virtual env
  conda create -n mxmoe python=3.12 -y
  # 1. source code dependencies
  git submodule update --init --recursive
  # 2. python package dependencies
  pip install -r requirements.txt
  cd mxmoe/3rdparty/fast-hadamard-transform && pip install . && cd -
  ```

- View the activation statistics of MoE models:

  ```bash
  # e.g. sample data from humaneval-x to observe qwen2_moe (in fact qwen1.5moe)
  CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.moe_tracer --model qwen2_moe --trace_gate --dataset humaneval-x
  ```

- Calibration

  - Get the quant loss of each linear block of `<MOE_MODEL>` (e.g. `qwen2_moe`) under a given `<QUANTIZATION_CONFIG>` (e.g. `w4a4_g-1_sym`):

    ```bash
    CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.quant calib --model qwen2_moe --method rtn --metric layer_out_norm --qcfg w4a4_g-1_sym
    ```

  - Solve the ILP based on the quant loss and kernel profile. The resulting quantization scheme is saved in `qconfigs`:

    ```bash
    # e.g. reproduce MxMoE w5a5
    python -m mxmoe.quant.bits_solver --model qwen2_moe --qtype gptq-had --wbits 5.0 --solve_mode layer --batch 8192 --filter_list w4a4_g-1_sym w8a8_g-1_sym
    ```

- Accuracy Eval. You can reproduce the experiments in the paper by setting the corresponding tasks and quantization configs:

  ```bash
  # e.g. evaluate qwen1.5_moe under the RTN w4a4_g-1_sym quantization config
  CUDA_VISIBLE_DEVICES=2 python -m mxmoe.quant.quant eval --model qwen2_moe --method rtn-had --qstr w4a4_g-1_sym --tasks ppl
  ```

- Performance (Computational Efficiency) Eval. After obtaining the mixed-precision scheme from the calibration step above, the corresponding GroupGEMM kernel can be generated automatically; please refer to `run_mxmoe_gg.py`:

  ```bash
  # e.g. test GroupGEMM in layer 11 of the qwen2_moe model (FP16)
  python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11
  # e.g. test GroupGEMM in layer 11 of the qwen2_moe model under mixed precision
  python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11 --qconfig <QCONFIG> --tile_config <TCONFIG>
  ```
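For intuition about what the generated operator does, here is a conceptual host-side sketch of mixed-precision GroupGEMM dispatch (our own PyTorch illustration; the real MxMoE operator is a fused CUDA kernel generated from templates, not this Python loop): expert GEMMs are bucketed by their assigned quantization config, and each bucket is executed by a kernel specialized for that precision.

```python
# Conceptual mixed-precision GroupGEMM dispatch (illustration only; MxMoE
# generates fused CUDA GroupGEMM kernels from templates rather than looping here).
from collections import defaultdict
import torch

def mixed_precision_groupgemm(acts, weights, qconfig, kernels):
    """acts[e]: [m_e, k] tokens routed to expert e; weights[e]: [k, n];
    qconfig[e]: quant config name; kernels[cfg]: GEMM callable for that precision."""
    buckets = defaultdict(list)              # group expert GEMMs by precision
    for e, cfg in qconfig.items():
        buckets[cfg].append(e)
    out = {}
    for cfg, experts in buckets.items():     # one precision-specialized (group) kernel per bucket
        for e in experts:
            out[e] = kernels[cfg](acts[e], weights[e])
    return out

# Toy usage: plain matmul stands in for the real quantized kernels.
kernels = {"w4a4_g-1_sym": torch.matmul, "w8a8_g-1_sym": torch.matmul}
acts = {0: torch.randn(128, 64), 1: torch.randn(8, 64)}   # a hot and a cold expert
wts = {0: torch.randn(64, 32), 1: torch.randn(64, 32)}
out = mixed_precision_groupgemm(acts, wts, {0: "w4a4_g-1_sym", 1: "w8a8_g-1_sym"}, kernels)
```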
Limitations
⚠️ MxMoE is mainly tested on the RTX 4090. The current implementation cannot fully exploit Hopper and later GPU architectures.
⚠️ The profiling step is somewhat time-consuming. We plan to upload our previously computed quant-loss statistics and kernel profile data soon so that you can skip this step.
Citation
If you find MxMoE useful or relevant to your project and research, please cite our paper:
@article{duanmu2025mxmoe,
title={MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design},
author={Duanmu, Haojie and Li, Xiuhong and Yuan, Zhihang and Zheng, Size and Duan, Jiangfei and Zhang, Xingcheng and Lin, Dahua},
journal={arXiv preprint arXiv:2505.05799},
year={2025}
}