RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
June 13, 2026
This repository contains the official implementation of RMNP (Row-Momentum Normalized Preconditioning), a scalable matrix-based optimizer for large model pre-training. RMNP replaces the Newton–Schulz (NS) iteration used by Muon with a simple per-row normalization of the momentum buffer, which is provably equivalent to Muon's orthogonalization step under the row-wise block-diagonal dominance regime that we observe to hold (and grow stronger) for transformer gradient momentum matrices in practice.
Algorithms
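Muon
Orthogonalizes the momentum buffer with a Newton–Schulz (NS) iteration before applying the update.
RMNP (Ours)
Replaces the NS iteration with a per-row normalization of the momentum buffer.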
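As a minimal sketch of how the two update rules differ, the pure-Python snippet below applies one momentum step followed by either preconditioner to a small matrix stored as nested lists. It is an illustration under assumptions, not the repository's implementation: the function names (`momentum_update`, `row_normalize`, `newton_schulz`, `step`) are hypothetical, the momentum rule is the plain heavy-ball form, and the NS routine is the classical cubic iteration rather than the tuned polynomial Muon uses.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def momentum_update(M, G, mu=0.95):
    """Heavy-ball momentum M <- mu * M + G (assumed form)."""
    return [[mu * m + g for m, g in zip(mr, gr)] for mr, gr in zip(M, G)]

def row_normalize(M, eps=1e-8):
    """RMNP-style direction: divide every row by its Euclidean norm."""
    out = []
    for row in M:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out

def newton_schulz(M, steps=30, eps=1e-8):
    """Muon-style direction via the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, started from M / ||M||_F."""
    fro = math.sqrt(sum(x * x for row in M for x in row))
    X = [[x / (fro + eps) for x in row] for row in M]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(xr, yr)] for xr, yr in zip(X, XXtX)]
    return X

def step(W, M, G, precondition, lr=0.02, mu=0.95):
    """One optimizer step: update momentum, precondition it, move the weights."""
    M = momentum_update(M, G, mu)
    D = precondition(M)
    W = [[w - lr * d for w, d in zip(wr, dr)] for wr, dr in zip(W, D)]
    return W, M

# Gradient with mutually orthogonal rows of very different scale.
G = [[3.0, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 2.0, 2.0]]
W0 = [[0.0] * 4 for _ in range(3)]
M0 = [[0.0] * 4 for _ in range(3)]

W_rmnp, _ = step(W0, M0, G, row_normalize)
W_muon, _ = step(W0, M0, G, newton_schulz)
gap = max(abs(a - b) for ra, rb in zip(W_rmnp, W_muon) for a, b in zip(ra, rb))
print(f"max |W_rmnp - W_muon| = {gap:.2e}")
```

The example matrix has exactly orthogonal rows, which is the idealized case in which the two directions coincide; the row normalization needs one pass over the matrix, while the NS loop performs two matrix products per iteration.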
Diagonal-Dominance Monitoring
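To verify the condition under which row normalization matches Muon's orthogonalization, we monitor the row-wise diagonal-dominance ratio of the Gram matrix $G = M M^\top$ of each momentum matrix $M$ throughout training. For row $i$ we define
$$r_i = \frac{|G_{ii}|}{\frac{1}{m-1}\sum_{j \neq i} |G_{ij}|},$$
where $m$ is the number of rows, and aggregate across rows to obtain three summary statistics. Averaging these three statistics over all matrix parameters in the network gives the global metrics, written with an overline (for example $\overline{r_{\max}}$). A value $r_i > 1$ means the diagonal entry exceeds the mean off-diagonal magnitude in row $i$; the larger the ratio, the closer $G$ is to a diagonal matrix.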
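A minimal sketch of this monitor for a single matrix, assuming the Gram matrix is formed from the rows of the momentum matrix and that the three row-level statistics are the minimum, mean, and maximum. The function name `row_dominance_stats` and the `eps` guard are illustrative and not taken from the repository:

```python
def row_dominance_stats(M, eps=1e-12):
    """Return (min, mean, max) of r_i = |G_ii| / mean_{j != i} |G_ij| for G = M M^T."""
    n = len(M)
    if n < 2:
        raise ValueError("need at least two rows to have off-diagonal entries")
    # Gram matrix of the rows: G[i][j] = <row_i, row_j>.
    G = [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]
    ratios = []
    for i in range(n):
        # Mean magnitude of the off-diagonal entries in row i.
        off_mean = sum(abs(G[i][j]) for j in range(n) if j != i) / (n - 1)
        ratios.append(abs(G[i][i]) / (off_mean + eps))
    return min(ratios), sum(ratios) / n, max(ratios)

# Rows that are nearly orthogonal give ratios well above 1.
M = [[1.0, 0.1, 0.0],
     [0.0, 1.0, 0.1],
     [0.1, 0.0, 1.0]]
r_min, r_mean, r_max = row_dominance_stats(M)
print(r_min, r_mean, r_max)
```

Averaging each of the three returned values over all matrix parameters would then give network-level metrics in the spirit of the ones described above.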

Observations. Across all six configurations and the full training trajectory, the global ratios stay comfortably above the threshold of 1, with at least one of them consistently exceeding 5, and larger models show higher $\overline{r}$ across all three statistics than their smaller counterparts. This indicates that the row-wise block-diagonal dominance underlying RMNP is not an artefact of small scale; it becomes more pronounced as models scale, making RMNP an increasingly favorable replacement for Muon's NS iteration at scale.
Key idea. When the Gram matrix $M M^\top$ of the momentum matrix $M$ is row-diagonally dominant (empirically observed and strengthening with scale), the leading singular directions of $M$ align with its rows, and the orthogonal factor $U V^\top$ from the SVD $M = U \Sigma V^\top$ is approximately the row-normalized $M$. RMNP therefore matches Muon's update direction while replacing the iterative NS polynomial (multiple matmuls per step) with a single row-wise normalization, yielding lower wall-clock cost and friendlier scaling to large hidden dimensions.
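The step from diagonal dominance to row normalization can be made explicit in the idealized case. The derivation below is a sketch under the assumptions that the momentum matrix $M \in \mathbb{R}^{m \times n}$ has full row rank with thin SVD $M = U \Sigma V^\top$ and that its Gram matrix is exactly diagonal; the symbols $M$, $D$, and $\mathrm{RowNorm}$ are notation chosen here for illustration:

```latex
\begin{aligned}
U V^\top &= \left(U \Sigma^{-1} U^\top\right)\left(U \Sigma V^\top\right) = (M M^\top)^{-1/2} M, \\
M M^\top = D = \operatorname{diag}\!\left(\lVert M_{1,:} \rVert_2^2, \dots, \lVert M_{m,:} \rVert_2^2\right)
&\;\Longrightarrow\;
U V^\top = D^{-1/2} M = \mathrm{RowNorm}(M), \\
\mathrm{RowNorm}(M)_{i,:} &= \frac{M_{i,:}}{\lVert M_{i,:} \rVert_2}.
\end{aligned}
```

When the Gram matrix is only diagonally dominant rather than exactly diagonal, the equality becomes an approximation whose quality depends on how small the off-diagonal entries are relative to the diagonal.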
Main Results
Perplexity

RMNP matches or exceeds Muon's perplexity across every model scale and dataset, consistent with the diagonal-dominance trend reported above.
Preconditioner Wall-Clock Time
RMNP's row normalization is 13×–44× faster than Muon's Newton–Schulz orthogonalization on GPT-2 models from 60M to 1.5B (measured over 100 steps with batch size 16 on a single RTX Pro 6000 GPU), and the gap widens with model size: as Newton–Schulz becomes the dominant bottleneck at scale, RMNP's lightweight preconditioner becomes increasingly attractive for very large models.
Repository Layout
RMNP/
├── GPT-2/ # GPT-2 (125M / 355M / 770M / XL) pre-training pipeline
│ ├── RMNP/ # model & training entrypoints (train_{adamw,muon,rmnp}.py)
│ ├── config/ # per-(size, optimizer) training configs
│ ├── scripts/ # ready-to-run shell launchers
│ └── data/ # OpenWebText preparation (nanoGPT-style)
└── LLaMA/ # LLaMA (60M / 135M / 350M / 1B) pre-training pipeline
    ├── optimizers/ # RMNP_optimizer.py, muon_optimizer.py
    ├── configs/ # llama_{60m,135m,350m,1b}.json model configs
    ├── scripts/ # per-(size, optimizer) launchers
    └── torchrun_main.py # distributed training entrypoint
Both sub-projects ship three optimizer baselines — AdamW, Muon, RMNP — so that results can be reproduced under matched data, schedule, and hyperparameters.
Installation
We recommend Python 3.12 with CUDA-capable GPUs. Create a fresh environment and install the pinned dependencies:
git clone https://github.com/Dominator-Index/RMNP.git
cd RMNP
conda create -n rmnp python=3.12 -y
conda activate rmnp
pip install -r requirements.txt
flash-attn requires a working CUDA toolchain and may take several minutes to build; if it fails, install it separately with pip install flash-attn --no-build-isolation after torch is in place. The setup is fully compatible with the upstream MARS repository, so its install instructions also work.
Quick Start
Each sub-project is self-contained; see its local README for dataset preparation and per-script hyperparameters:
GPT-2/README.md: GPT-2 pre-training on OpenWebText (Small / Medium / Large) and FineWeb-Edu (Small / Medium / Large / XL).
LLaMA/README.md: LLaMA pre-training (60M – 1B) with torchrun.
Once the environment is ready, launch a run with:
# GPT-2 Small with RMNP
cd GPT-2
export HF_TOKEN=... # for streaming datasets
export WANDB_API_KEY=...
bash scripts/run_rmnp_small.sh
# LLaMA 60M with RMNP (run from the repository root, not from GPT-2/)
cd LLaMA
bash scripts/train_RMNP_60m.sh
Using the RMNP Optimizer in Your Own Code
A standalone optimizer package lives at rmnp/. Install via PyPI:
pip install rmnp
Use it like any torch.optim.Optimizer. Following Muon's convention, route weight matrices (parameters with two or more dimensions) through RMNP and 1D/0D parameters (biases, LayerNorm scales) through AdamW:
import torch
from rmnp import RMNP
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
other_params = [p for p in model.parameters() if p.ndim < 2]
rmnp_opt = RMNP(matrix_params, lr=2e-2, momentum=0.95, weight_decay=0.0)
adamw_opt = torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.1)
# In the training loop: call .step() on both, .zero_grad() on both.
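The comment above can be expanded into a small helper. This is a sketch, assuming both optimizers expose the standard `torch.optim.Optimizer` methods `zero_grad()` and `step()`; the name `train_step` and the `compute_loss` callable are hypothetical and not part of the `rmnp` package:

```python
def train_step(compute_loss, optimizers):
    """Run one training step that drives several optimizers together.

    compute_loss: zero-argument callable returning a loss with .backward()
    optimizers:   iterable of optimizer objects (e.g. rmnp_opt, adamw_opt)
    """
    optimizers = list(optimizers)
    # Clear stale gradients on every parameter group first.
    for opt in optimizers:
        opt.zero_grad()
    loss = compute_loss()
    loss.backward()
    # Each optimizer then updates its own disjoint set of parameters.
    for opt in optimizers:
        opt.step()
    return loss
```

With the objects defined above, a loop body could call `train_step(lambda: loss_fn(model(x), y), [rmnp_opt, adamw_opt])`, where `loss_fn`, `x`, and `y` stand in for your own loss function and batch.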
Distributed training works out of the box: when WORLD_SIZE > 1, updates are sharded across ranks and synchronized via all_reduce.
Citation
If you find this work useful, please cite:
@article{deng2026rmnp,
title = {RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization},
author = {Deng, Shenyang and Ouyang, Zhuoli and Pang, Tianyu and Liu, Zihang and Jin, Ruochen and Yu, Shuhua and Yang, Yaoqing},
journal = {arXiv preprint arXiv:2603.20527},
year = {2026}
}
Acknowledgements
This repository is built upon MARS and GaLore. We thank the authors for open-sourcing their codebases.