Controlled LLM Training on Spectral Sphere

January 14, 2026 · View on GitHub

arXiv |

Megatron-LM |

WandB

1. Introduction

This repository contains the official implementation for the paper: Controlled LLM Training on Spectral Sphere.

Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (μP) provides a theoretical safeguard for width-invariant Θ(1) activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully μP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations. Megatron Code is available at SSO Pretrain.

Key Contributions:

Better Convergence: Outperforms AdamW and Muon with a substantial margin in Dense 1.7B, MoE 8B and 200-layer DeepNet, while keeping the "healthiest" model intrinsic metric
Controlled Stability: Both weights and updates satisfy μP constraints, offer tunable sphere radius, suppressed outliers and controlled activation scales in favour of low-precision training
System Efficiency: Atomic Module Sharding, Adaptive Kernel Dispatcher, Cached Singular Vectors, etc. MuonSphere variant retains equivalent activation control with minimal overhead

2. Algorithm

SSO performs steepest descent under the spectral norm, constraining both the weights and the updates to a spectral sphere of radius R = Θ(√(d_out/d_in)).

3. WandB Runs

Description	Link
Main Experiments on Dense, MoE, DeepNet	Baselines
μP Learning Rate Transfer Grid Search	MuP Search
Spectral Radius Search for Tunable Activation Scale	Radius Search

4. Evaluation

Learning Rate Transfer

Controllable Activation Scale

Dense 1.7B Eval

MoE 8B-a1B Eval

5. Usage

5.1 Megatron-LM Integration

SSO is implemented in our fork of Megatron-LM. Use --optimizer spectral_ball_dist for distributed training.

5.2 Hyperparameters

Argument	Default	Description
`--spectral-ball-momentum`	0.9	Momentum coefficient
`--spectral-ball-use-nesterov`	True	Use Nesterov-style momentum
`--spectral-ball-msign-steps`	8	Newton-Schulz iterations for matrix sign
`--spectral-ball-solver`	bisection	Lagrange multiplier solver method
`--spectral-ball-solver-tolerance-f`	1e-8	Solver tolerance
`--spectral-ball-solver-max-iterations`	20	Maximum solver iterations
`--spectral-ball-power-iteration-steps`	20	Power iteration steps for top singular vectors
`--spectral-ball-radius-mode`	spectral_mup	Mode for computing target radius R
`--spectral-ball-radius-scaler`	1.0	Scale factor for target radius
`--spectral-ball-scale-mode`	spectral_mup	LR scale mode (spectral_mup, align_adamw_rms, shape_scaling)
`--spectral-ball-retract-mode`	hard	Retraction mode: hard (project to sphere) or dynamic
`--spectral-ball-retract-alpha`	0.05	Step size for dynamic retraction

5. Module Granularity Options

Argument	Default	Description
`--spectral-mup-init`	-	Enable spectral μP initialization for weights
`--spectral-ball-no-split-qkv`	(enabled)	Disable splitting QKV parameters
`--spectral-ball-qkv-split-mode`	component	QKV split: component, group, or head
`--spectral-ball-no-split-fc1`	(enabled)	Disable splitting gate/up in SwiGLU
`--spectral-ball-no-split-moe-experts`	(enabled)	Disable per-expert splitting in MoE

5.4 Model "intrinsic Health" Monitors

We support logging metrics below for monitoring training stability. Note that MoE max-vio and module spectral norm are logged by default.

# log optimizer update rms before lr scaler
--log-per-module-update-rms

--log-per-module-grad-rms

--log-hidden-states embeddings input_layernorm attention::linear_qkv \
    attention::linear_q attention::linear_k attention::linear_v \
    attention::core_attention attention::o_proj pre_mlp_layernorm mlp

# Log parameter statistics
--log-params attention::linear_qkv attention::o_proj mlp::linear_fc1 \
    mlp::linear_fc2 input_layernorm pre_mlp_layernorm embedding lm_head

5.5 Benchmark Evaluation

We support downstream task evaluation during training:

--benchmark-eval
--benchmark-tasks "sciq_rc_0shot,piqa_rc_0shot,winogrande_rc_0shot,arc_easy_rc_0shot,boolq_rc_0shot,logiqa_rc_0shot,lambada_ppl_0shot,hellaswag_rc_5shot,arc_challenge_rc_5shot"

6. Acknowledgement

We gratefully acknowledge the developers of Emerging-Optimizers and Megatron-LM

7. License

This project is licensed under the Apache License 2.0.

8. Contact

If you have any questions, please raise an issue or contact Unakar