AdaMuon

August 27, 2025

This is the official repository for the paper AdaMuon: Adaptive Muon Optimizer.

Introduction

AdaMuon is an optimizer built on Muon. It improves training efficiency by more than 40% compared to AdamW.

Quick Start

This repository contains two projects: the GPT-2 experiments and a copy of the open-source Megatron-LM code, which is included to facilitate large-scale experiments.

To use AdaMuon in your own training pipeline with other architectures and datasets, adapt the following pseudocode:

from opt_config import configure_optimizers

# Model (placeholder: substitute your own model class)
model = Model()

# Optimizer
optimizer = configure_optimizers(model.parameters(), weight_decay=0.1, learning_rate=6e-4)

# Training (epochs and data_loader come from your own pipeline)
for epoch in range(epochs):
    for X, Y in data_loader:
        # standard training code
        logits, loss = model(X, Y)
        loss.backward()
        # ...
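
The pseudocode above leaves the model, data, and the rest of the loop open. As a rough, self-contained sketch of how those pieces might fit together in PyTorch, the example below uses a hypothetical `TinyModel`, random data, and `torch.optim.AdamW` as a stand-in optimizer, because `opt_config` is only available inside this repository. It also assumes that the object returned by `configure_optimizers` behaves like a single `torch.optim.Optimizer` with `zero_grad()` and `step()`, which this README does not confirm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyModel(nn.Module):
    """Hypothetical stand-in model that returns (logits, loss), as in the pseudocode."""

    def __init__(self, in_dim=8, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 16), nn.ReLU(), nn.Linear(16, n_classes)
        )

    def forward(self, x, y):
        logits = self.net(x)
        return logits, F.cross_entropy(logits, y)


def train(model, optimizer, data_loader, epochs):
    """Run a plain training loop and return the loss recorded at every step."""
    losses = []
    for _ in range(epochs):
        for X, Y in data_loader:
            optimizer.zero_grad(set_to_none=True)  # clear old gradients
            _, loss = model(X, Y)
            loss.backward()
            optimizer.step()  # assumes a standard Optimizer interface
            losses.append(loss.item())
    return losses


torch.manual_seed(0)
X = torch.randn(64, 8)
Y = torch.randint(0, 4, (64,))
# Four fixed mini-batches of 16 samples each, used as a toy data loader.
data_loader = [(X[i:i + 16], Y[i:i + 16]) for i in range(0, 64, 16)]

model = TinyModel()
# Stand-in optimizer (assumption): replace with the object returned by
# configure_optimizers(...) when running inside this repository.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
losses = train(model, optimizer, data_loader, epochs=20)
```

Keeping the loop in a function that only relies on `zero_grad()` and `step()` means the optimizer can be swapped without touching the rest of the training code, provided that assumption about the interface holds.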

Performance

License

This repository is licensed under the Apache 2.0 license. See the LICENSE file for more details.

Citation

@article{si2025adamuon,
  title={AdaMuon: Adaptive Muon Optimizer},
  author={Si, Chongjie and Zhang, Debing and Shen, Wei},
  journal={arXiv preprint arXiv:2507.11005},
  year={2025}
}

Contact

If you have any questions, please raise an issue or contact us at chongjiesi@sjtu.edu.cn.