Controlled LLM Training on Spectral Sphere

January 14, 2026

arXiv | Megatron-LM | WandB

1. Introduction

This repository contains the official implementation for the paper: Controlled LLM Training on Spectral Sphere.

Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (μP) provides a theoretical safeguard for width-invariant Θ(1) activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully μP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations. Megatron Code is available at SSO Pretrain.

Key Contributions:

  • Better Convergence: Outperforms AdamW and Muon by a substantial margin on Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, while maintaining the "healthiest" intrinsic model metrics.
  • Controlled Stability: Both weights and updates satisfy μP constraints, with a tunable sphere radius, suppressed outliers, and controlled activation scales that favour low-precision training.
  • System Efficiency: Atomic Module Sharding, Adaptive Kernel Dispatcher, Cached Singular Vectors, etc. The MuonSphere variant retains equivalent activation control with minimal overhead.

2. Algorithm

SSO performs steepest descent under the spectral norm, constraining both the weights and the updates to a spectral sphere of radius R = Θ(√(d_out/d_in)), where d_out and d_in are the output and input dimensions of the module's weight matrix.
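To make the radius and the sphere constraint concrete, here is a minimal NumPy sketch of a hard retraction: it estimates the top singular value by power iteration and rescales the weight so its spectral norm is approximately R. This is an illustration under stated assumptions, not the repository's implementation: the constant inside Θ(·) is taken to be 1, and the function names `target_radius`, `top_singular_value`, and `retract_to_sphere` are hypothetical.

```python
import numpy as np

def target_radius(d_out, d_in, radius_scaler=1.0):
    """R = c * sqrt(d_out / d_in); c = 1 is an assumed constant inside Theta(.)."""
    return radius_scaler * np.sqrt(d_out / d_in)

def top_singular_value(W, steps=20, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(steps):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(W @ v))

def retract_to_sphere(W, radius, steps=20):
    """Rescale W so that its estimated spectral norm equals `radius`."""
    sigma = top_singular_value(W, steps)
    return W * (radius / sigma)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64))    # shape (d_out, d_in)
R = target_radius(*W.shape)           # sqrt(256 / 64) = 2.0
W = retract_to_sphere(W, R)
print(R, np.linalg.norm(W, ord=2))    # exact spectral norm, for comparison with R
```

Power iteration can only underestimate the top singular value, so after a finite number of steps the retracted spectral norm may sit marginally above R; more steps tighten the match.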

3. WandB Runs

| Description | Link |
| --- | --- |
| Main Experiments on Dense, MoE, DeepNet | Baselines |
| μP Learning Rate Transfer Grid Search | MuP Search |
| Spectral Radius Search for Tunable Activation Scale | Radius Search |

4. Evaluation

Learning Rate Transfer


Controllable Activation Scale


Dense 1.7B Eval


MoE 8B-A1B Eval


5. Usage

5.1 Megatron-LM Integration

SSO is implemented in our fork of Megatron-LM. Use --optimizer spectral_ball_dist for distributed training.

5.2 Hyperparameters

| Argument | Default | Description |
| --- | --- | --- |
| --spectral-ball-momentum | 0.9 | Momentum coefficient |
| --spectral-ball-use-nesterov | True | Use Nesterov-style momentum |
| --spectral-ball-msign-steps | 8 | Newton-Schulz iterations for matrix sign |
| --spectral-ball-solver | bisection | Lagrange multiplier solver method |
| --spectral-ball-solver-tolerance-f | 1e-8 | Solver tolerance |
| --spectral-ball-solver-max-iterations | 20 | Maximum solver iterations |
| --spectral-ball-power-iteration-steps | 20 | Power iteration steps for top singular vectors |
| --spectral-ball-radius-mode | spectral_mup | Mode for computing target radius R |
| --spectral-ball-radius-scaler | 1.0 | Scale factor for target radius |
| --spectral-ball-scale-mode | spectral_mup | LR scale mode (spectral_mup, align_adamw_rms, shape_scaling) |
| --spectral-ball-retract-mode | hard | Retraction mode: hard (project to sphere) or dynamic |
| --spectral-ball-retract-alpha | 0.05 | Step size for dynamic retraction |
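The `--spectral-ball-msign-steps` option counts Newton-Schulz iterations for the matrix sign. As a rough illustration of what such an iteration does, the sketch below uses the classical cubic Newton-Schulz scheme in NumPy. The coefficients, the Frobenius normalisation, and the function name `msign_newton_schulz` are assumptions for illustration; the repository may use different coefficients or kernels.

```python
import numpy as np

def msign_newton_schulz(G, steps=8):
    """Approximate msign(G) = U V^T, where G = U S V^T is the thin SVD of G."""
    # Frobenius normalisation puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # work on the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which pushes it towards 1.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
X = msign_newton_schulz(G, steps=8)
# Deviation of X X^T from the identity; it shrinks as `steps` grows.
print(np.abs(X @ X.T - np.eye(4)).max())
```

With a small, fixed step count, very small singular values of the input may not have reached 1 yet, so the result is an approximation of the exact matrix sign rather than a substitute for an SVD.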

5.3 Module Granularity Options

| Argument | Default | Description |
| --- | --- | --- |
| --spectral-mup-init | - | Enable spectral μP initialization for weights |
| --spectral-ball-no-split-qkv | (enabled) | Disable splitting QKV parameters |
| --spectral-ball-qkv-split-mode | component | QKV split: component, group, or head |
| --spectral-ball-no-split-fc1 | (enabled) | Disable splitting gate/up in SwiGLU |
| --spectral-ball-no-split-moe-experts | (enabled) | Disable per-expert splitting in MoE |

5.4 Model "Intrinsic Health" Monitors

The metrics below can be logged to monitor training stability. MoE max-vio and module spectral norm are logged by default.

# Log per-module optimizer update RMS (before the LR scaler)
--log-per-module-update-rms

# Log per-module gradient RMS
--log-per-module-grad-rms

# Log hidden states at the listed modules
--log-hidden-states embeddings input_layernorm attention::linear_qkv \
    attention::linear_q attention::linear_k attention::linear_v \
    attention::core_attention attention::o_proj pre_mlp_layernorm mlp

# Log parameter statistics for the listed modules
--log-params attention::linear_qkv attention::o_proj mlp::linear_fc1 \
    mlp::linear_fc2 input_layernorm pre_mlp_layernorm embedding lm_head

5.5 Benchmark Evaluation

We support downstream task evaluation during training:

--benchmark-eval
--benchmark-tasks "sciq_rc_0shot,piqa_rc_0shot,winogrande_rc_0shot,arc_easy_rc_0shot,boolq_rc_0shot,logiqa_rc_0shot,lambada_ppl_0shot,hellaswag_rc_5shot,arc_challenge_rc_5shot"
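
To show how the optimizer, monitoring, and evaluation flags documented in this section can be combined, the sketch below assembles them into a single command line. The entry point `pretrain_gpt.py` is an assumption (it is the upstream Megatron-LM script name, and this fork may use a different launcher), `build_sso_args` is a hypothetical helper, and all model, data, and parallelism arguments are omitted.

```python
import shlex

# Assumption: upstream Megatron-LM's script name; the fork's launcher may differ.
ENTRY_POINT = "pretrain_gpt.py"

def build_sso_args(radius_scaler=1.0, retract_mode="hard",
                   tasks=("sciq_rc_0shot", "piqa_rc_0shot")):
    """Hypothetical helper: collect SSO-related flags documented in this README."""
    return [
        "--optimizer", "spectral_ball_dist",
        "--spectral-ball-radius-scaler", str(radius_scaler),
        "--spectral-ball-retract-mode", retract_mode,
        "--log-per-module-update-rms",
        "--log-per-module-grad-rms",
        "--benchmark-eval",
        "--benchmark-tasks", ",".join(tasks),
    ]

# Model, data, and parallelism arguments would be appended in a real launch.
command = shlex.join(["python", ENTRY_POINT, *build_sso_args(radius_scaler=2.0)])
print(command)
```

Keeping the flags in one helper makes it easier to sweep a single value, such as the radius scaler, while leaving the rest of the configuration fixed.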

6. Acknowledgement

We gratefully acknowledge the developers of Emerging-Optimizers and Megatron-LM.

7. License

This project is licensed under the Apache License 2.0.

8. Contact

If you have any questions, please open an issue or contact Unakar.