MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance

January 24, 2025

Overview

MoEQuant is a novel post-training quantization (PTQ) framework tailored for Mixture-of-Experts (MoE) large language models. It integrates Expert-Balanced Self-Sampling (EBSS) and Affinity-Guided Quantization (AGQ) to optimize both the calibration and quantization processes. MoEQuant quantizes MoE-based LLMs to low-bit precision with minimal accuracy loss, achieving near-floating-point performance and strong generalization across a variety of models. To our knowledge, this is the first comprehensive PTQ solution designed specifically for MoE architectures.

This repository accompanies our ICML 2025 paper, "MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance".

Table of Contents

  • Features
  • Installation
  • Usage
  • Contributing
  • Citation

Features

  • Expert-Balanced Self-Sampling (EBSS): constructs calibration data in which experts are activated in a balanced way, improving the calibration step of post-training quantization.
  • Affinity-Guided Quantization (AGQ): uses the affinity between tokens and experts to guide the quantization process itself.
  • Performance Optimization: quantizes MoE-based LLMs to low-bit precision with minimal accuracy loss, approaching floating-point performance across various models.
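To make the expert-balancing idea above concrete, here is a minimal, hypothetical sketch of selecting calibration samples so that the combined expert-routing histogram stays close to uniform. This is an illustrative toy, not the paper's EBSS algorithm; the function name, the greedy criterion, and the L1 distance to uniform usage are all our assumptions.

```python
import numpy as np

def expert_balanced_sample(expert_counts, k):
    """Greedily pick k candidates whose combined expert-routing histogram
    is closest to uniform. Hypothetical sketch, NOT the paper's EBSS.

    expert_counts: (n_candidates, n_experts) array counting how often each
    candidate sequence routes tokens to each expert.
    """
    n, e = expert_counts.shape
    chosen, total = [], np.zeros(e)
    for _ in range(k):
        best, best_dev = None, None
        for i in range(n):
            if i in chosen:
                continue
            hist = total + expert_counts[i]
            p = hist / hist.sum()
            # L1 distance of combined expert usage from the uniform distribution
            dev = np.abs(p - 1.0 / e).sum()
            if best_dev is None or dev < best_dev:
                best, best_dev = i, dev
        chosen.append(best)
        total += expert_counts[best]
    return chosen
```

For example, given routing counts for 8 candidate sequences over 4 experts, `expert_balanced_sample(counts, 3)` returns the indices of 3 sequences whose pooled expert usage is the most balanced under this greedy heuristic.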

Installation

Coming Soon.

Detailed instructions for setting up the MoEQuant framework will be provided once the code is released. Stay tuned for updates!

Usage

Coming Soon.

Comprehensive usage examples and tutorials will be available with the code release to help you get started with MoEQuant effortlessly.

Contributing

We welcome contributions from the research and development community! Whether you're interested in improving the existing features, adding new functionalities, or reporting issues, your input is invaluable.

Citation

If you find MoEQuant useful in your research, please consider citing our paper:

@inproceedings{moequant2025,
  title={MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance},
  ...
}