MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

May 12, 2026

MuonQ is a low-bit training framework for the Muon optimizer. This repository contains the training code, Hydra configs, and data preprocessing pipeline used for MuonQ experiments.

Highlights

  • Pure 4-bit Muon state quantization for matrix-shaped hidden-layer parameters.
  • Directional fidelity optimization through three components:
    • pre-quantization normalization,
    • structural decomposition via power iteration,
    • mu-law companding quantization.
  • Memory-efficient training: in the paper's experiments, MuonQ reduces optimizer-state memory by up to 7.3x while closely matching full-precision Muon in training loss and downstream zero-shot accuracy.
  • Hydra-based experiments for GPT-style and LLaMA-style language models.
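
The three components named above are standard techniques, and a toy version of each may help fix ideas. The sketch below is a generic illustration, not MuonQ's implementation: the function names, the per-row max-abs normalization, the value mu = 255, and the uniform 16-level code grid are all assumptions made for this example. How MuonQ actually combines the steps is defined by the repository code and the paper.

```python
import math

MU = 255.0   # companding strength; an assumed value, not taken from MuonQ
LEVELS = 16  # 4 bits give 16 integer codes


def normalize(row):
    """Scale a row into [-1, 1] by its largest magnitude (assumed scheme)."""
    scale = max(abs(v) for v in row) or 1.0
    return [v / scale for v in row], scale


def mu_law_encode(x, mu=MU, levels=LEVELS):
    """Compand x in [-1, 1], then round to an integer code in [0, levels - 1]."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return round((compressed + 1.0) / 2.0 * (levels - 1))


def mu_law_decode(code, mu=MU, levels=LEVELS):
    """Invert mu_law_encode up to rounding error."""
    compressed = code / (levels - 1) * 2.0 - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


def top_singular_triplet(mat, iters=50):
    """Estimate the leading singular value and vectors by power iteration."""
    rows, cols = len(mat), len(mat[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iters):
        # u <- M v, normalized
        u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u_norm = math.sqrt(sum(a * a for a in u)) or 1.0
        u = [a / u_norm for a in u]
        # v <- M^T u, normalized; its norm is the singular value estimate
        w = [sum(mat[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(a * a for a in w))
        v = [a / (sigma or 1.0) for a in w]
    return sigma, u, v


if __name__ == "__main__":
    row = [0.5, -2.0, 0.01, 1.0]
    unit, scale = normalize(row)
    codes = [mu_law_encode(x) for x in unit]
    recon = [mu_law_decode(c) * scale for c in codes]
    print("codes:", codes)
    print("reconstruction:", [round(r, 4) for r in recon])
    sigma, _, _ = top_singular_triplet([[3.0, 0.0], [0.0, 1.0]])
    print("leading singular value:", round(sigma, 6))
```

Mu-law companding places more of the 16 codes near zero, so small entries keep more relative precision than they would under a uniform grid.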

Environment Setup

Create and activate a Python environment. Python 3.10 or newer is recommended.

conda create -n muonq python=3.12 -y
conda activate muonq

Install PyTorch separately so the CUDA build matches your machine. For example, for CUDA 12.8:

pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu128

Install the remaining dependencies:

pip install -r requirements.txt

Create the required local environment file:

cp .env.example .env

Edit .env for your machine:

DATA_DIR=/path/to/preprocessed/data
HF_HOME=/path/to/huggingface/cache

DATA_DIR is where preprocessed binary dataset shards are written and read. HF_HOME is the Hugging Face cache directory.
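
Before preprocessing, it can be worth confirming that both variables are actually set. The following is a small standalone check, assuming .env contains plain KEY=VALUE lines; the helper names are hypothetical and this is not how the repository itself loads the file.

```python
from pathlib import Path

REQUIRED_KEYS = ("DATA_DIR", "HF_HOME")  # the two variables described above


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    for key in missing_keys(values):
        print(f"{key} is not set in .env")
```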

Data Preparation

After setting up .env, preprocess the default FineWeb-100B dataset with the Llama 2 tokenizer:

python process_data.py --name fineweb100B --tokenizer llama2

This writes binary shards under:

$DATA_DIR/fineweb100B-Llama-2-7b-hf/

That path is used by the LLaMA recipes in hydra_conf/recipe/.

Training

Training is configured through Hydra. A typical launch command is:

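GPUS=0,1,2,3      # GPU indices exposed to the job
NGPUS=4           # must equal the number of entries in GPUS
RECIPE=llama-60m
OPT=muonq
RUN_NAME=${RECIPE}_${OPT}

mkdir -p logs     # tee fails if the log directory does not exist

CUDA_VISIBLE_DEVICES=$GPUS \
torchrun \
  --standalone \
  --nproc-per-node=$NGPUS \
  run_hydra.py -cn test_hydra \
    recipe=${RECIPE} \
    optimizer_params=${OPT} \
    +logging_params.wandb.project=MuonQ \
    +logging_params.wandb.name=${RUN_NAME} \
  |& tee logs/${RUN_NAME}.log   # |& is bash shorthand for 2>&1 |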

recipe=${RECIPE} selects a config from hydra_conf/recipe/.

optimizer_params=${OPT} selects a config from hydra_conf/optimizer_params/.
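
To make the mapping from command-line overrides to files concrete, here is a rough sketch of Hydra's conventions. It assumes the group configs use a .yaml suffix and that hydra_conf is the config root; both function names are hypothetical helpers for this example, not part of Hydra or of this repository. In Hydra's override syntax, a leading + adds a key that is not already in the config, which is how the two logging_params.wandb entries in the launch command are passed.

```python
from pathlib import Path

CONFIG_ROOT = Path("hydra_conf")  # config directory named in this README


def group_config_path(override, root=CONFIG_ROOT):
    """Map 'group=option' to the file Hydra looks up (assumes a .yaml suffix)."""
    group, _, option = override.partition("=")
    return root / group / f"{option}.yaml"


def override_kind(override):
    """Classify an override by its Hydra prefix."""
    if override.startswith("++"):
        return "add or replace"
    if override.startswith("+"):
        return "add new key"
    if override.startswith("~"):
        return "delete key"
    return "override existing entry"


if __name__ == "__main__":
    for item in ("recipe=llama-60m", "optimizer_params=muonq"):
        print(item, "->", group_config_path(item).as_posix())
    print(override_kind("+logging_params.wandb.project=MuonQ"))
```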

If you use Weights & Biases logging, log in before training:

wandb login

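run.sh is a minimal wrapper around the same command. Judging from the example below, its positional arguments are the GPU list, the optimizer config, and the recipe, in that order: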

bash run.sh 0,1,2,3 muonq llama-60m
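
For readers who prefer to script launches, the sketch below shows how those three arguments could expand into the torchrun command from this section. It is an illustration only: build_launch is a hypothetical helper, the contents of run.sh are not reproduced here, and deriving the process count from the number of GPU entries is an assumption.

```python
import shlex


def build_launch(gpus, opt, recipe):
    """Return (env, argv) for the torchrun launch described in Training."""
    # Assumption: one process per listed GPU.
    ngpus = len([g for g in gpus.split(",") if g.strip()])
    run_name = f"{recipe}_{opt}"
    env = {"CUDA_VISIBLE_DEVICES": gpus}
    argv = [
        "torchrun", "--standalone", f"--nproc-per-node={ngpus}",
        "run_hydra.py", "-cn", "test_hydra",
        f"recipe={recipe}",
        f"optimizer_params={opt}",
        "+logging_params.wandb.project=MuonQ",
        f"+logging_params.wandb.name={run_name}",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch("0,1,2,3", "muonq", "llama-60m")
    # Print the command instead of running it, so it can be inspected first.
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))
```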

Citation

The paper is currently under double-blind review. Citation information will be added after release.