[ICML 2026] OLion

June 2, 2026

[ICML 2026] OLion: Approaching the Hadamard Ideal by Intersecting Spectral and ℓ∞ Implicit Biases

Paper Overview

Many optimizers can be interpreted as steepest-descent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce OLion (Orthogonal Lion), which combines spectral control from orthogonalized update directions with ℓ∞-style coordinate control from sign updates.

At each step, we form a Lion-style momentum direction, approximately orthogonalize it via a few Newton–Schulz iterations, and then apply an entrywise sign. This provides an efficient approximation to taking a maximal step over the intersection of the spectral and ℓ∞ constraint sets—a scaled Hadamard-like set for matrix parameters. Concretely:

  1. Spectral control (from Muon): we orthogonalize the update direction via Newton–Schulz iterations, yielding a flattened singular-value profile and bounded spectral norm.
  2. ℓ∞-style coordinate control (from Lion): we apply an element-wise sign to the direction, capping each coordinate’s contribution and promoting uniform entrywise magnitudes.

In the implementation, we optionally apply a lightweight magnitude alignment (e.g., RMS scaling) to stabilize effective step sizes across layers and tensor shapes. Because OLion keeps only a momentum buffer as optimizer state, it preserves Muon's memory efficiency while incorporating the practical benefits of sign-based updates.
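The three stages described above (Lion-style momentum direction, approximate orthogonalization, entrywise sign) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation shipped in this repository: it uses the classic cubic Newton–Schulz iteration rather than any tuned polynomial, it borrows the standard Lion interpolation coefficients `beta1` and `beta2` as defaults, and the names `newton_schulz_orthogonalize`, `olion_like_step`, and `rms_scale` are hypothetical. Weight decay and per-layer details are omitted.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor of m (cubic Newton-Schulz)."""
    # Frobenius normalization keeps all singular values in (0, 1],
    # inside the convergence region of the cubic iteration.
    x = m / (np.linalg.norm(m) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # pushes every singular value toward 1
    return x.T if transposed else x

def olion_like_step(w, grad, momentum, lr=1e-3, beta1=0.9, beta2=0.99,
                    rms_scale=None):
    """One illustrative update: momentum direction -> orthogonalize -> sign."""
    # Lion-style direction: interpolate the momentum buffer and the new gradient.
    direction = beta1 * momentum + (1.0 - beta1) * grad
    # Spectral control: flatten the singular-value profile of the direction.
    ortho = newton_schulz_orthogonalize(direction)
    # Coordinate control: every entry of the update has magnitude at most 1.
    update = np.sign(ortho)
    if rms_scale is not None:
        update = update * rms_scale  # assumed form of the optional magnitude alignment
    new_w = w - lr * update
    # Lion-style momentum update; this buffer is the only optimizer state.
    new_momentum = beta2 * momentum + (1.0 - beta2) * grad
    return new_w, new_momentum

# Usage on a toy 3x2 parameter.
rng = np.random.default_rng(0)
w, m = np.zeros((3, 2)), np.zeros((3, 2))
w, m = olion_like_step(w, rng.standard_normal((3, 2)), m, lr=0.1)
```

With only a few iterations the orthogonalization is approximate, which matches the "a few Newton–Schulz iterations" description; the sign stage then fixes the entry magnitudes regardless of how far the iteration has converged.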

Our contributions:

  • Theory: Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption.
  • Practice: Across large-scale language and vision training—GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning—we show that OLion matches or outperforms AdamW and Muon under comparable tuning, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints (e.g., Llama-3.1-8B).
  • Systems: We note that the sign operation in OLion naturally supports communication-efficient (e.g., 1-bit) distributed training and is friendly to low-precision quantization.
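To make the systems point concrete, a sign-valued update can be sent as one bit per entry. The sketch below shows one possible packing scheme using NumPy's `packbits` and `unpackbits`; the helper names `pack_signs` and `unpack_signs` are hypothetical, and this is not the communication path used in this repository.

```python
import numpy as np

def pack_signs(update):
    """Pack a sign-valued array into bits (1 = non-negative, 0 = negative)."""
    bits = (update.ravel() >= 0).astype(np.uint8)  # note: exact zeros map to +1
    return np.packbits(bits), update.shape

def unpack_signs(packed, shape):
    """Recover the +1/-1 array from its packed bits."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed, count=n)  # count drops the padding bits
    return (bits.astype(np.float32) * 2.0 - 1.0).reshape(shape)

update = np.array([[1, -1, 1], [-1, -1, 1]], dtype=np.float32)
packed, shape = pack_signs(update)
restored = unpack_signs(packed, shape)
```

Here six float32 entries (24 bytes) shrink to a single byte. In a real distributed setting the reduction rule across workers (for example, majority vote) would also need to be chosen.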

Geometry Motivation

We approach the design through the geometry of intersecting constraints. The figure below illustrates how we view Muon and Lion as maximal-update methods under two norm-induced geometries; their intersection suggests a scaled Hadamard set as an idealized target for matrix-shaped updates, motivating our intersection-seeking design.

[Figure: Geometry motivation, spectral vs. ℓ∞ constraint sets and the Hadamard ideal]
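One way to see why a scaled Hadamard matrix is the natural target: an n×n Hadamard matrix H has every entry equal to ±1 and satisfies H Hᵀ = n·I, so all of its singular values equal √n. A matrix c·H therefore touches the boundary of the ℓ∞ ball of radius c in every coordinate and the boundary of the spectral ball of radius c√n in every singular direction. The check below is an illustration only: it uses the Sylvester construction, which exists only for powers of two, and the helper name `sylvester_hadamard` is hypothetical.

```python
import numpy as np

def sylvester_hadamard(k):
    """Build the 2^k x 2^k Hadamard matrix by repeated block doubling."""
    h = np.array([[1.0]])
    for _ in range(k):
        h = np.block([[h, h], [h, -h]])
    return h

h = sylvester_hadamard(3)  # 8 x 8
n = h.shape[0]
singular_values = np.linalg.svd(h, compute_uv=False)

# Every entry has magnitude 1, and every singular value equals sqrt(n).
print(np.max(np.abs(h)), np.min(np.abs(h)))
print(singular_values)
```

Both bounds on the squared Frobenius norm (n² from the entries, n·σ² from the singular values with σ = √n) are attained at once, which is why neither constraint can be tightened without affecting the other at such a point.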

Implicit Bias: Spectral and ℓ∞ Norms

A simple experiment confirms the intended bias intersection: we find that OLion maintains both a small spectral norm and a small ℓ∞ norm during training, whereas other optimizers favor only one of the two. Below we show the evolution of spectral norm and ℓ∞ norm for representative weight matrices in GPT-2 small pretraining.

[Figure panels: spectral norm (768×768), spectral norm (3072×768), ℓ∞ norm (768×768), ℓ∞ norm (3072×768)]
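The two quantities tracked in this experiment are straightforward to measure. The sketch below assumes the ℓ∞ norm means the largest absolute entry of the matrix (the entrywise reading suggested by the sign update), and the helper names are hypothetical. It also contrasts an exactly orthogonalized direction with a purely sign-based one on a random 64×64 matrix, to show that each controls only one of the two norms.

```python
import numpy as np

def spectral_norm(w):
    """Largest singular value of a 2-D array."""
    return float(np.linalg.norm(w, ord=2))

def linf_norm(w):
    """Largest absolute entry (assumed entrywise definition)."""
    return float(np.max(np.abs(w)))

rng = np.random.default_rng(0)
g = rng.standard_normal((64, 64))

u, _, vt = np.linalg.svd(g)
polar = u @ vt        # exact orthogonalization: spectral norm 1, entries uneven
signed = np.sign(g)   # sign only: entries all magnitude 1, spectral norm at least sqrt(64)

for name, mat in [("orthogonalized", polar), ("sign", signed)]:
    print(name, spectral_norm(mat), linf_norm(mat))
```

Logging these two functions on selected weight matrices during training would produce curves of the kind shown in the panels above.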

Overview

We propose OLion (Orthogonal Lion) as an efficient and effective optimizer that:

  • Combines spectral control from orthogonalized update directions (Muon-style) with ℓ∞-style coordinate control from sign updates (Lion-style).
  • Uses only momentum-level optimizer state, matching the memory footprint of Lion/Muon.
  • Improves pretraining (GPT-2, Llama-2-7B, SiT) and supervised fine-tuning (e.g., Llama-3.1-8B on math/reasoning benchmarks), and mitigates optimizer mismatch when fine-tuning AdamW-pretrained models.

Getting Started

Installation & Training Scripts

nanoGPT Setup

To run nanoGPT experiments:

cd nanoGPT
conda create -n nanogpt python=3.10
conda activate nanogpt
pip install torch numpy transformers datasets tiktoken wandb tqdm

Llama Setup

To run Llama-2-7B pretraining:

cd Llama
conda env create -f environment.yml
pip install -r requirements.txt
python -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

SiT Pretraining Setup

Please refer to the REPA repository.


Running Experiments

nanoGPT (GPT-2) Training

We use the OpenWebText dataset. Train GPT-2 with OLion:

cd nanoGPT
bash run.sh

You can change the optimizer, batch size, learning rate, and model scale in run.sh. Example validation loss curves (OLion vs baselines):

[Figure panels: validation loss for GPT-2 Small (124M), GPT-2 Medium (355M), GPT-2 Large (770M)]

Llama-2-7B Pretraining

To run Llama-2-7B pretraining with OLion:

cd Llama
bash run_llama_2_7b.sh

Training configurations (optimizer, learning rate, batch size, dataset, etc.) can be edited in Llama/train_configs/llama2_7b.toml.

[Figure panels: Llama training loss, Llama validation loss]

SiT-B/2 Image Pretraining

To run SiT-B/2 pretraining with OLion:

cd SIT
bash run.sh

Modify settings in SIT/run.sh as needed.

[Figure panels: projection loss, denoising loss]

Learning-Rate Robustness

OLion retains its advantage over Muon and AdaMuon across a wide range of learning rates (e.g., 3e-4 to 5e-3 on GPT-2 small):

[Figure: validation loss vs. learning rate for OLion, Muon, AdaMuon]

Reproducibility

  • Paper: arXiv:2602.01105
  • Code: This repository (nanoGPT, Llama, SIT) with configs under each subdirectory.
  • Figures: This README references images under the images/ directory (e.g., images/geometry.png, images/spectral_1.png). If the figure files currently sit in the parent directory, copy them into an images/ folder at the repo root so the links resolve.

Acknowledgements

Our training framework is built on nanoGPT, torchtitan, and REPA.


Citation

@misc{wang2026olionapproachinghadamardideal,
      title={OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases},
      author={Zixiao Wang and Yifei Shen and Huishuai Zhang},
      year={2026},
      eprint={2602.01105},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.01105},
}