Learning state-of-the-art deep learning optimization algorithms.

A zij (Arabic: زِيج, pronounced "zeej") is an astronomical handbook from the
Islamic golden age: a set of tables and computational methods that astronomers
consulted instead of re-deriving the field from scratch. The best known is the
Zīj al-Sindhind of Muḥammad ibn Mūsā al-Khwārizmī (محمد بن موسى الخوارزمي,
c. 820 CE). His Latinized name, Algoritmi, became the word algorithm, and his
book al-Jabr gave us algebra.
This project takes the name in that spirit. It gathers the equation, the paper,
and runnable code for the optimization methods used in machine learning.
```bash
pip install zij
```
From source, with the pinned environment:
```bash
git clone https://github.com/junaidaliop/zij.git
cd zij
conda env create -f environment.yml
conda activate zij-optim
```
```python
import zij

# assumes `model` is a torch.nn.Module and `params` is a list of its parameters

# torch.optim, vendored at tag v2.12.0
opt = zij.AdamW(model.parameters(), lr=3e-4)

# research optimizers, same interface
opt = zij.Muon([p for p in model.parameters() if p.ndim == 2], lr=2e-2)
opt = zij.Prodigy(model.parameters())  # no learning rate to set
opt = zij.SAM(model.parameters(), base_optimizer=zij.SGD, lr=0.1, rho=0.05)

# memory-efficient low-rank training (per-group rank)
opt = zij.GaLoreAdamW(
    [{"params": params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}],
    lr=1e-2,
)

# look up by name
zij.list_optimizers("adam*")
opt_cls = zij.load_optimizer("soap")
```
zij.optim mirrors torch.optim, so zij.optim.AdamW is the same class as
zij.AdamW, and zij.optim.lr_scheduler is available. Use whichever import
reads better in your code.
Note: A few families use a documented non-standard call protocol. Schedule-Free needs
opt.train() and opt.eval(); the SAM family takes a closure or an explicit
first_step / second_step pair; Adam-mini and LOMO are built from a model
rather than a parameter list. Each class docstring states which protocol applies.
The PyTorch package ships 106 ready-to-use optimizers. zij.core mirrors
torch.optim at tag v2.12.0 (Adam, AdamW, SGD, Muon, LBFGS, Adafactor, and the
rest, plus lr_scheduler and swa_utils). zij.contrib adds research methods
grouped by family: first-order, second-order, memory-efficient,
learning-rate-free, and sharpness-aware. In every Canon table below, the zij
column names the class where an implementation exists; a dash (—) means the
method is listed but not yet implemented (paper-only, or its source is under a
license that cannot be vendored).
zij is a PyTorch library today. The Canon is framework-agnostic: it covers each
method regardless of the framework of its original code. JAX and TensorFlow ports
are planned and will follow the same standards.
The Canon below covers 740 methods across 11 categories. Each row
records the canonical name, venue, paper, the best available implementation, and
the zij class where one exists.
First-order optimizers update parameters using only gradients and accumulated gradient statistics such as momentum and second-moment estimates. This page covers the stochastic gradient descent lineage, the Adam family, and more recent sign-based and variance-reduced methods. The zij column gives the class name for optimizers already implemented in the package.
| Optimizer | Venue | Paper | Code | zij |
|---|---|---|---|---|
| ASGD | SIAM Journal on Control and Optimization 1992 | Acceleration of Stochastic Approximation by Averaging | community | ASGD |
| Rprop | ICNN 1993 | A direct adaptive method for faster backpropagation learning: the RPROP algorithm | community | Rprop |
| Adagrad | JMLR 2011 | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | community | Adagrad |
| Adadelta | arXiv 2012 | ADADELTA: An Adaptive Learning Rate Method | community | Adadelta |
| RMSprop | Lecture notes 2012 | Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude | community | RMSprop |
| FTRL | KDD 2013 | Ad Click Prediction: a View from the Trenches | — | — |
| SGD | ICML 2013 | On the importance of initialization and momentum in deep learning | community | SGD |
| Adam | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adam |
| AdaMax | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adamax |
| Nadam | ICLR Workshop 2016 | Incorporating Nesterov Momentum into Adam | community | NAdam |
| LARS | arXiv 2017 | Large Batch Training of Convolutional Networks | community | LARS |
| SWATS | arXiv 2017 | Improving Generalization Performance by Switching from Adam to SGD | community | SWATS |
| A2Grad | arXiv 2018 | Optimal Adaptive and Accelerated Stochastic Gradient Descent | community | A2GradUni, A2GradInc, A2GradExp |
| AccSGD | ICLR 2018 | On the insufficiency of existing momentum schemes for Stochastic Optimization | official | AccSGD |
| AMSGrad | ICLR 2018 | On the Convergence of Adam and Beyond | community | — |
| GADAM | arXiv 2018 | GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | — | — |
| M-SVAG | ICML 2018 | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | official | — |
| PID | CVPR 2018 | A PID Controller Approach for Stochastic Optimization of Deep Networks | official | PID |
| VR-SGD | IEEE TKDE 2018 | VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning | — | — |
| Yogi | NeurIPS 2018 | Adaptive Methods for Nonconvex Optimization | community | Yogi |
| AdaBound | ICLR 2019 | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | official | AdaBound, AdaBoundW |
| AdaMod | arXiv 2019 | An Adaptive and Momental Bound Method for Stochastic Learning | official | AdaMod |
| AdamW | ICLR 2019 | Decoupled Weight Decay Regularization | official | AdamW |
| AdaShift | ICLR 2019 | AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | community | AdaShift |
| AggMo | ICLR 2019 | Aggregated Momentum: Stability Through Passive Damping | official | AggMo |
| AvaGrad | arXiv 2019 | Domain-independent Dominance of Adaptive Methods | official | AvaGrad |
| HAdam | NeurIPS Workshop 2019 | On Higher-order Moments in Adam | — | — |
| HyperAdam | AAAI 2019 | HyperAdam: A Learnable Task-Adaptive Adam for Network Training | — | — |
| Lookahead | NeurIPS 2019 | Lookahead Optimizer: k steps forward, 1 step back | community | Lookahead |
| NosAdam | IJCAI 2019 | Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate | — | — |
| NovoGrad | arXiv 2019 | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | community | NovoGrad |
| QHAdam / QHM | ICLR 2019 | Quasi-hyperbolic momentum and Adam for deep learning | official | QHAdam, QHM |
| Ranger | — | RAdam and Lookahead combination | official | Ranger |
| Sadam | arXiv 2019 | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | — | — |
| AdaBelief | NeurIPS 2020 | AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | official | AdaBelief |
| Adam+ | arXiv 2020 | Adam+: A Stochastic Method with Adaptive Variance Reduction | — | — |
| AdamBS | NeurIPS 2020 | Adam with Bandit Sampling for Deep Learning | — | — |
| AdaSGD | arXiv 2020 | AdaSGD: Bridging the gap between SGD and Adam | — | — |
| Cayley SGD | ICLR 2020 | Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform | official | — |
| clipped-SGD | NeurIPS 2020 | Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping | official | — |
| DEAM | ASONAM 2020 | DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization | — | — |
| diffGrad | IEEE TNNLS 2020 | diffGrad: An Optimization Method for Convolutional Neural Networks | official | DiffGrad |
| EAdam | arXiv 2020 | EAdam Optimizer: How ε Impact Adam | official | — |
| Fromage | NeurIPS 2020 | On the distance between two neural networks and the stability of learning | official | — |
| Gradient Centralization (GC) | ECCV 2020 | Gradient Centralization: A New Optimization Technique for Deep Neural Networks | official | — |
| LAMB | ICLR 2020 | Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | community | Lamb |
| LaProp | arXiv 2020 | LaProp: Separating Momentum and Adaptivity in Adam | official | LaProp |
| NIGT | ICML 2020 | Momentum Improves Normalized SGD | official | — |
| Padam | IJCAI 2020 | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | official | PAdam |
| signSGD | ICML 2018 | signSGD: Compressed Optimisation for Non-Convex Problems | community | SignSGD |
| pbSGD | IJCAI 2020 | pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | official | — |
| PCGrad | NeurIPS 2020 | Gradient Surgery for Multi-Task Learning | official | — |
| RAdam | ICLR 2020 | On the Variance of the Adaptive Learning Rate and Beyond | official | RAdam |
| SGD-G2 | ICPR 2020 | Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent | — | — |
| ACMo | AAAI 2021 | ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization | — | — |
| ACProp | NeurIPS 2021 | Momentum Centering and Asynchronous Update for Adaptive Gradient Methods | official | — |
| AdaL | arXiv 2021 | AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations | — | — |
| AdamD | arXiv 2021 | AdamD: Improved bias-correction in Adam | — | — |
| AdamP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | AdamP |
| Adaptive Gradient Clipping (AGC) | ICML 2021 | High-Performance Large-Scale Image Recognition Without Normalization | official | — |
| AngularGrad | arXiv 2021 | AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks | official | — |
| BGADAM | IJCNN 2021 | BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | — | — |
| Gravity | arXiv 2021 | Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning | official | Gravity |
| MADGRAD | arXiv 2021 | Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | official | MADGRAD, MirrorMADGRAD |
| MaxVA | ECML PKDD 2021 | MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients | official | — |
| Nero | ICML 2021 | Learning by Turning: Neural Architecture Aware Optimisation | official | — |
| PNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official | — |
| AdaPNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official | AdaPNM |
| Ranger21 | arXiv 2021 | Ranger21: a synergistic deep learning optimizer | official | Ranger21 |
| SGDP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | SGDP |
| AdaFamily | arXiv 2022 | AdaFamily: A family of Adam-like adaptive gradient methods | — | — |
| Adai | ICML 2022 | Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum | official | Adai |
| AdamMC | CVMI 2022 | Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks | — | — |
| Adan | arXiv 2022 | Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | official | Adan |
| AdaSmooth | arXiv 2022 | AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio | — | AdaSmooth |
| AEGDM | Annals of Applied Mathematics 2022 | An Adaptive Gradient Method with Energy and Momentum | official | — |
| Amos | arXiv 2022 | Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | official | Amos |
| GDA-AM | ICLR 2022 | GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration | official | — |
| KOALA | AAAI 2022 | KOALA: A Kalman Optimization Algorithm with Loss Adaptivity | official | — |
| RotoGrad | ICLR 2022 | RotoGrad: Gradient Homogenization in Multitask Learning | official | — |
| SRSGD | SIAM Journal on Imaging Sciences 2022 | Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent | — | — |
| Step-Tuned SGD | Neural Processing Letters 2022 | Second-order step-size tuning of SGD for non-convex optimization | — | — |
| AdaInject | IEEE TAI 2023 | AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks | official | — |
| AdaNorm | WACV 2023 | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | official | AdaNorm |
| AGD | NeurIPS 2023 | AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix | — | — |
| Aida | TMLR 2023 | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | official | — |
| Lion | NeurIPS 2023 | Symbolic Discovery of Optimization Algorithms | official | Lion |
| Lookaround | NeurIPS 2023 | Lookaround Optimizer: k steps around, 1 step average | — | — |
| MultiAdam | ICML 2023 | MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks | — | — |
| RLEKF | AAAI 2023 | RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy | — | — |
| Scheduled Weight Decay (SWD) | NeurIPS 2023 | On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective | official | — |
| SGDF | arXiv 2023 | Signal Processing Meets SGD: From Momentum to Filter | — | — |
| StableAdamW | NeurIPS 2023 | Stable and low-precision training for large-scale vision-language models | community | StableAdamW |
| AdaAct | ICDMW 2024 | An Adaptive Method Stabilizing Activations for Enhanced Generalization | — | — |
| Adam-atan2 | ICML 2024 | Scaling Exponents Across Parameterizations and Optimizers | community | AdamAtan2 |
| Adam-Rel | NeurIPS 2024 | Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps | — | — |
| AdEMAMix | arXiv 2024 | The AdEMAMix Optimizer: Better, Faster, Older | official | AdEMAMix |
| ADOPT | NeurIPS 2024 | ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate | official | ADOPT |
| AGS-GD | arXiv 2024 | Anisotropic Gaussian Smoothing for Gradient-based Optimization | — | — |
| BADM | arXiv 2024 | BADM: Batch ADMM for Deep Learning | — | — |
| CaAdam | arXiv 2024 | CaAdam: Improving Adam optimizer using connection aware methods | official | — |
| CAdam | arXiv 2024 | CAdam: Confidence-Based Optimization for Online Learning | — | — |
| Cautious Optimizers | arXiv 2024 | Cautious Optimizers: Improving Training with One Line of Code | official | — |
| EXAdam | arXiv 2024 | EXAdam: The Power of Adaptive Cross-Moments | official | EXAdam |
| FAdam | arXiv 2024 | FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | community | FAdam |
| GrokAdamW | — | AdamW variant with Grokfast-style gradient amplification | official | GrokAdamW |
| Grokfast | arXiv 2024 | Grokfast: Accelerated Grokking by Amplifying Slow Gradients | official | — |
| INNAprop | arXiv 2024 | A second-order-like optimizer with adaptive gradient scaling for deep learning | official | — |
| KATE | NeurIPS 2024 | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | official | — |
| MADA | ICML 2024 | MADA: Meta-Adaptive Optimizers through hyper-gradient Descent | — | — |
| RSGDM | CCSB 2024 | Reducing Bias in Deep Learning Optimization: The RSGDM Approach | — | — |
| SET-Adam | ECML PKDD 2024 | On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance | — | — |
| SNGM | Science China Information Sciences 2024 | Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training | — | — |
| SRMM | JMLR 2024 | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | official | — |
| TAM | arXiv 2024 | Torque-Aware Momentum | — | — |
| WarpAdam | arXiv 2024 | WarpAdam: A new Adam optimizer based on Meta-Learning approach | — | — |
| AbsSADMM | arXiv 2025 | Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization | — | — |
| AdamC | arXiv 2025 | Why Gradients Rapidly Increase Near the End of Training | — | — |
| AdamNX | arXiv 2025 | AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate | official | — |
| AdamS | EMNLP 2025 | AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training | — | — |
| adaNAPG | arXiv 2025 | Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization | — | — |
| Ano | arXiv 2025 | ANO : Faster is Better in Noisy Landscape | official | — |
| BCOS | arXiv 2025 | Stochastic Approximation with Block Coordinate Optimal Stepsizes | official | — |
| Cautious Weight Decay | arXiv 2025 | Cautious Weight Decay | community | — |
| Conda | arXiv 2025 | Conda: Column-Normalized Adam for Training Large Language Models Faster | official | — |
| Coupled Adam | ACL 2025 | Better Embeddings with Coupled Adam | — | — |
| DecGD | Machine Learning 2025 | A New Adaptive Gradient Method with Gradient Decomposition | — | — |
| DEO | arXiv 2025 | Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training | official | — |
| EmoNavi | — | An emotion-driven optimizer that feels loss and navigates accordingly | official | — |
| MARS | ICML 2025 | MARS: Unleashing the Power of Variance Reduction for Training Large Models | official | MARS |
| FOCUS | arXiv 2025 | FOCUS: First Order Concentrated Updating Scheme | official | FOCUS |
| FSGDM | ICLR 2025 | On the Performance Analysis of Momentum Method: A Frequency Domain Perspective | — | — |
| Grams | ICLR Workshop 2025 | Grams: Gradient Descent with Adaptive Momentum Scaling | official | Grams |
| HGM | arXiv 2025 | Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate | — | — |
| HVAdam | AAAI 2025 | HVAdam: A Full-Dimension Adaptive Optimizer | — | — |
| KO | arXiv 2025 | KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches | — | — |
| KOALA++ | NeurIPS 2025 | KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products | — | — |
| Kourkoutas-Beta | arXiv 2025 | Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair | official | KourkoutasSoftmaxFlex |
| MIAdam | AAAI 2025 | A Method for Enhancing Generalization of Adam by Multiple Integrations | official | — |
| μ²-SGD | ICLR 2025 | Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism | — | — |
| ⊥Grad (OrthoGrad) | ICLR 2025 | Grokking at the Edge of Numerical Stability | official | — |
| Overshoot | arXiv 2025 | Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization | official | — |
| PadamP | arXiv 2025 | Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning | — | — |
| Simplified-AdEMAMix | arXiv 2025 | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | official | — |
| LyAm | arXiv 2025 | LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments | — | — |
| NIRMAL | arXiv 2025 | Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum | — | — |
| SCSAdamW | arXiv 2025 | Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW | official | — |
| SKA-SGD | arXiv 2025 | Streaming Krylov-Accelerated Stochastic Gradient Descent | — | — |
| SoftSignSGD (S3) | arXiv 2025 | SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam | — | — |
| SPAM | arXiv 2025 | SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training | official | — |
| VSGD | TMLR 2025 | Variational Stochastic Gradient Descent for Deep Neural Networks | official | — |
| ZetA | arXiv 2025 | ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning | — | — |
| AdaGC | ICML 2026 | AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping | — | AdaGC |
| Anon | arXiv 2026 | Anon: Extrapolating Adaptivity Beyond SGD and Adam | — | — |
| C-Adam | arXiv 2026 | A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm | — | — |
| DualAdam | arXiv 2026 | Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers | official | — |
| FANoS | arXiv 2026 | FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization | official | — |
| GradPower | ICML 2026 | GradPower: Powering Gradients for Faster Language Model Pre-Training | — | — |
| HomeAdam | arXiv 2026 | HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization | — | — |
| NOVAK | arXiv 2026 | NOVAK: Unified adaptive optimizer for deep neural networks | — | — |
| PS-Clip-SGD | arXiv 2026 | Robust and Fast Training via Per-Sample Clipping | — | — |
| SparseOpt | ICML 2026 | SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training | — | — |
| Stable-SPAM / GradientStabilizer | ICML 2026 | GradientStabilizer: Fix the Norm, Not the Gradient | official | — |
| VRAdam | ICLR 2026 | A Physics-Inspired Optimizer: Velocity Regularized Adam | official | — |
| SparseAdam | — | Adam variant for sparse gradients | official | SparseAdam |
| OptMuon | arXiv 2026 | OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality | — | — |
| FOGO | arXiv 2026 | FOGO: Forgetting-aware Orthogonalization Optimizer | — | — |
| AdamO | ICML 2026 | Preserving Plasticity in Continual Learning via Dynamical Isometry | — | — |
| MAdam | arXiv 2026 | MAdam: Metric-Aware Multi-Objective Adam | — | — |
| MuCon | arXiv 2026 | MuCon: Clipped Muon Updates for LLM Training | — | — |
| NuMuon | arXiv 2026 | NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training | — | — |
| MiMuon | arXiv 2026 | MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models | — | — |
| Pion | arXiv 2026 | Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation | official | — |
| iMuon (Intrinsic Muon) | arXiv 2026 | Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds | official | — |
| Muon-OGD | arXiv 2026 | Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning | — | — |
| Newton-Muon | arXiv 2026 | The Newton-Muon Optimizer | official | — |
| MuonEq | arXiv 2026 | MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration | official | — |
| RMNP | arXiv 2026 | RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization | official | — |
| MUD | arXiv 2026 | Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training | — | — |
| NAMO | arXiv 2026 | Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum | official | — |
| SpecMuon | arXiv 2026 | Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning | — | — |
| ARO | arXiv 2026 | ARO: A New Lens On Matrix Optimization For Large Models | — | — |
| PRISM | arXiv 2026 | PRISM: Structured Optimization via Anisotropic Spectral Shaping | — | — |
| MCSD / SPEL | arXiv 2026 | Manifold constrained steepest descent | — | — |
| Variance-Adaptive Muon (Muon-NSR / Muon-VS) | arXiv 2026 | Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum | — | — |
| MuonAll | arXiv 2025 | MuonAll: Muon Variant for Efficient Finetuning of Large Language Models | official | — |
| Gluon | arXiv 2025 (also accepted at ICML 2025 HiLD workshop) | Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs) | — | — |
| LPSGD / LPSGDM | arXiv 2026 | Beyond L2-norm and L-infinity-norm: A Curvature-Inspired ℓ_p-Norm Scheme for Deep Neural Networks | — | — |
| ABSignSGD | ICLR 2026 | Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning | — | — |
| StoSignSGD | arXiv 2026 | StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models | — | — |
| Hybrid SignSGD-SGD switching | arXiv 2026 | Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy | — | — |
| SoftSignum / SoftMuon | ICML 2026 | Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling | official | — |
| Accelerated SignGD | arXiv 2025 | Norm-Constrained Flows and Sign-Based Optimization: Theory and Algorithms | — | — |
| CLion | arXiv 2026 | CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization | — | — |
| OLion | arXiv 2026 | OLion: Approaching the Hadamard Ideal by Intersecting Spectral and ℓ∞ Implicit Biases | official | — |
| MGUP | NeurIPS 2025 | MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization | official | — |
| Magma | arXiv 2026 | On Surprising Effectiveness of Masking Updates in Adaptive Optimizers | — | — |
| AGGC | ACL 2026 | AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training | official | — |
| Clipped Scion | NeurIPS 2025 | Generalized Gradient Norm Clipping & Non-Euclidean (L_0,L_1)-Smoothness | official | — |
| SPECTRA | ICML 2026 | Enhancing LLM Training via Spectral Clipping | official | — |
| Spectral Clipping (matrix-valued) | arXiv 2026 | Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters | — | — |
| SPAMP | ACM Multimedia Asia 2025 (7th ACM International Conference on Multimedia in Asia) | Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control | — | — |
| NucGD | arXiv 2026 | Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints | official | — |
| Batched / Transported Scion | arXiv 2026 | Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise | — | — |
| EMA bias-corrected iterate averaging | NeurIPS 2025 Workshop (OPT 2025) | EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes | — | — |
| RGrad-Avg | NeurIPS 2025 Workshop (OPT 2025) | On Riemannian Gradient Descent Algorithm using gradient averaging | — | — |
| SGD with adaptive preconditioning | ICLR 2026 | SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration | — | — |
| HTMuon | arXiv 2026 | HTMuon: Improving Muon via Heavy-Tailed Spectral Correction | official | — |
| MARS-M | arXiv 2025 | MARS-M: When Variance Reduction Meets Matrices | official | — |
| Drop-Muon | arXiv 2025 | Drop-Muon: Update Less, Converge Faster | — | — |
| Muon+ | arXiv 2026 | MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training | official | — |
| TrasMuon | ICLR 2026 Workshop Sci4DL | TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers | — | — |
| Adam-SHANG | arXiv 2026 | Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization | — | — |
| EMA-Nesterov | arXiv 2026 | EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization | — | — |
| S-Adam | arXiv 2026 | Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization | — | — |
| IAdaPID-ADG | arXiv 2026 | An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning | — | — |
| CT-AGD | arXiv 2026 | Accelerated Gradient Descent for Faster Convergence with Minimal Overhead | — | — |
| GPA (Generalized Primal Averaging) | arXiv 2025 | Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs | official | — |
| SNOO | arXiv 2025 | SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients | official | — |
| Riemannion | ICLR 2026 | LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters | — | — |
| Optimal Projection-Free Adaptive SGD | arXiv 2026 | Optimal Projection-Free Adaptive SGD for Matrix Optimization | — | — |
| AdamCB | ICLR 2025 | ADAM Optimization with Adaptive Batch Selection | — | — |
| Kalman-Adam | Knowledge-Based Systems 2026 | Kalman-Adam: Optimal bayesian moment estimation for memory-Efficient and generalizable deep learning | — | — |
| AdamHD (AdamHuberDecay) | NeurIPS 2025 Workshop (ScaleOpt: GPU-Accelerated and Scalable Optimization) | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training | — | — |
| MVN-Grad | arXiv 2026 | Adaptive Optimization via Momentum on Variance-Normalized Gradients | — | — |
| Compositional Muon (CM) | Tilde Research blog 2026 | Towards Compositional Steepest Descent | official | — |
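The gradient statistics that first-order methods accumulate can be made concrete with a single update rule. The following is a minimal sketch of the Adam step from the paper listed in the table, written on plain Python lists so it runs without PyTorch. The function name `adam_step` and the toy quadratic objective are illustrative assumptions and are not part of zij.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain Python lists; returns the new (param, m, v)."""
    new_param, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(param, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment (momentum)
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment
        m_hat = m_i / (1 - beta1 ** t)             # bias correction, t starts at 1
        v_hat = v_i / (1 - beta2 ** t)
        new_param.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_param, new_m, new_v

# Toy problem (an assumption for illustration): f(x) = sum(x_i^2), gradient 2 * x.
x = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 501):
    grad = [2.0 * x_i for x_i in x]
    x, m, v = adam_step(x, grad, m, v, t, lr=0.05)
final_loss = sum(x_i * x_i for x_i in x)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly `lr` regardless of the gradient's scale.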
Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
HuggingFace transformers exposes many of these methods through the optim argument of TrainingArguments. Each string value below maps to a memory-efficient optimizer; all backing libraries except the built-in Adafactor must be installed separately.
| optim value | Backing library |
|---|---|
| adafactor | transformers ships its own Adafactor implementation with relative-step and update-clipping options (Apache-2.0). |
| adamw_bnb_8bit / adamw_8bit | bitsandbytes AdamW with block-wise 8-bit quantized state (MIT). |
| paged_adamw_8bit / paged_adamw_32bit | bitsandbytes paged AdamW; optimizer state is paged between GPU and CPU memory (MIT). |
| lion_8bit / lion_32bit / paged_lion_8bit / paged_lion_32bit | bitsandbytes Lion, single momentum buffer, with 8-bit and paged variants (MIT). |
| ademamix_8bit / paged_ademamix_8bit / paged_ademamix_32bit | bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT). |
| rmsprop_bnb_8bit | bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT). |
| adamw_torch_4bit / adamw_torch_8bit | torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause). |
| galore_adamw / galore_adamw_8bit / galore_adafactor and *_layerwise variants | galore-torch, the official GaLore release (Apache-2.0). |
| apollo_adamw / apollo_adamw_layerwise | apollo-torch, the official APOLLO release (CC-BY-NC-4.0). |
| lomo / adalomo | lomo-optim, the official LOMO and AdaLomo release (MIT). |
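To show what block-wise 8-bit state quantization means in practice, here is a minimal sketch that stores one scale per block and one signed 8-bit code per value. It uses a simple linear absmax mapping as a stated simplification; the libraries listed above use their own quantization maps and fused kernels, so this is not their implementation. The function names and the block size of 4 are assumptions chosen for readability.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8-range codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(x) for x in block) or 1.0   # all-zero block: use scale 1.0
        scales.append(scale)
        codes.extend(round(x / scale * 127) for x in block)
    return codes, scales

def dequantize_blockwise(codes, scales, block_size=4):
    """Map codes back to floats using the scale of the block each code belongs to."""
    return [c * scales[i // block_size] / 127 for i, c in enumerate(codes)]

# Hypothetical optimizer state with two blocks of very different magnitude.
state = [0.02, -0.5, 0.31, 0.004, 12.0, -3.5, 0.0, 7.25]
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(state, restored))
```

Because each block carries its own scale, the large values in the second block do not degrade the resolution of the small values in the first, which is the motivation for quantizing per block rather than per tensor.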
Fractional-order optimizers generalize the integer-order gradient step with fractional-calculus operators, most commonly the Caputo, Riemann-Liouville, or Grünwald-Letnikov derivative, which weight past gradient information through power-law memory kernels. The field is young: the first neural-network training results date to 2015, convergence theory is still being settled, and most papers ship no code.
Note: FAdam (arXiv 2405.12807) is a Fisher-information variant of Adam and is unrelated to fractional calculus despite the name.
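The power-law memory kernel mentioned above can be illustrated with the Grünwald-Letnikov coefficients, which weight the current and past samples in a discretized fractional derivative of order alpha. The sketch below only computes those weights with the standard recurrence; how individual papers apply them to gradient histories varies, and the function name is an assumption for illustration.

```python
def grunwald_letnikov_weights(alpha, n_terms):
    """Coefficients w_k = (-1)^k * binom(alpha, k), computed by recurrence."""
    weights = [1.0]
    for k in range(1, n_terms):
        # w_k = w_{k-1} * (1 - (alpha + 1) / k)
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights

w_half = grunwald_letnikov_weights(0.5, 4)   # fractional order: long memory tail
w_one = grunwald_letnikov_weights(1.0, 4)    # integer order: memory cuts off
```

For alpha = 0.5 the weights are 1, -0.5, -0.125, -0.0625, so older terms keep contributing with slowly shrinking magnitude. For alpha = 1 the weights reduce to 1, -1, 0, 0, which is the ordinary first difference with no memory beyond one step.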
Optimizers in this category target training across many devices or nodes, where memory and inter-worker communication are the main bottlenecks. They shard optimizer state, compress gradient exchange, or synchronize infrequently so that training scales without a proportional increase in bandwidth. Some entries are standalone update rules, while others wrap an inner optimizer with a communication-efficient outer loop.
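Infrequent synchronization can be sketched with local SGD and periodic parameter averaging, one simple scheme of this kind. In the toy simulation below each worker holds one scalar parameter and its own quadratic objective; workers take several local steps without communicating and then average. The worker objectives, the function name, and all hyperparameters are assumptions for illustration, and no real communication library is involved.

```python
def local_sgd(targets, rounds=20, local_steps=5, lr=0.1):
    """Each worker minimizes (x - target)^2 locally; workers average once per round."""
    x = [0.0 for _ in targets]            # one scalar parameter per worker
    for _ in range(rounds):
        for _ in range(local_steps):      # no communication inside this loop
            x = [xi - lr * 2.0 * (xi - t) for xi, t in zip(x, targets)]
        mean = sum(x) / len(x)            # the only communication step per round
        x = [mean for _ in x]
    return x

workers = local_sgd(targets=[1.0, 2.0, 6.0])
```

With these settings the workers communicate once every five steps, and in this toy problem they still agree on the minimizer of the summed objective, which is the mean of the targets.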
Second-order and orthogonalized optimizers exploit curvature information or the matrix structure of gradients rather than purely elementwise first-order statistics. This group spans quasi-Newton and Hessian-diagonal methods (L-BFGS, AdaHessian, Sophia), full-matrix and Kronecker-factored preconditioning (PSGD, Shampoo, SOAP), and orthogonalized-update methods in the Muon family. Venues reflect peer-reviewed acceptance where applicable; otherwise the arXiv year is listed.
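As an illustration of an orthogonalized update, the sketch below approximates the orthogonal polar factor of a gradient matrix with the classic cubic Newton-Schulz iteration, using only matrix products. This is a simplified stand-in: Muon-family implementations typically use a tuned higher-order polynomial and run on GPU tensors. The helper names, the 2x2 example matrix, and the step count are assumptions.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the orthogonal polar factor of a full-rank matrix g."""
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]   # singular values now lie in (0, 1]
    for _ in range(steps):
        # Cubic iteration: X <- 1.5 * X - 0.5 * X X^T X
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

update = newton_schulz_orthogonalize([[2.0, 1.0], [0.0, 1.0]])
gram = matmul(update, transpose(update))   # close to the identity if update is orthogonal
```

The iteration pushes every singular value of the normalized matrix toward 1 while leaving the singular vectors unchanged, so the result keeps the gradient's directions but discards their relative magnitudes.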
Zeroth-order (gradient-free) methods train models using only function evaluations, estimating gradients from randomized perturbations of the parameters instead of backpropagation. Because they need no backward pass or activation storage, they run at roughly inference-level memory, which has made them a practical option for fine-tuning large language models on constrained hardware. The lineage runs from SPSA in classical stochastic approximation to recent variance-reduced and low-rank variants built on MeZO.
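The perturbation-based gradient estimate described above can be sketched as a two-point SPSA step. This is a minimal plain-Python example on a toy quadratic, not the MeZO or zij implementation; the names `spsa_gradient` and `spsa_minimize` and the default values are assumptions made for illustration.

```python
import random


def spsa_gradient(loss_fn, x, eps=1e-3, rng=None):
    """Two-point SPSA estimate: two function evaluations, no backward pass."""
    rng = rng or random.Random()
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + eps * d for xi, d in zip(x, delta)]
    x_minus = [xi - eps * d for xi, d in zip(x, delta)]
    scale = (loss_fn(x_plus) - loss_fn(x_minus)) / (2.0 * eps)
    # For +/-1 entries, dividing by delta_i equals multiplying by delta_i.
    return [scale * d for d in delta]


def spsa_minimize(loss_fn, x0, steps=500, lr=0.05, rng=None):
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(loss_fn, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    target = [1.0, -2.0, 0.5]
    loss = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    print(spsa_minimize(loss, [0.0, 0.0, 0.0], rng=random.Random(0)))
```

Each step costs two loss evaluations regardless of the number of parameters, which is where the inference-level memory footprint comes from.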
Privacy-preserving optimizers train models under differential privacy, typically by clipping per-sample gradients and adding calibrated noise to updates. This page lists differentially private optimization methods and reference libraries, from the original DP-SGD to later variants that reduce clipping bias, correct moment estimates, or filter privacy noise.
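The clip-then-noise recipe can be sketched in a few lines. The example below follows the usual DP-SGD structure (clip each per-sample gradient, sum, add Gaussian noise scaled by the clipping norm, average) in plain Python; the function names and arguments are hypothetical, and privacy accounting is deliberately left out.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Rescale one per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / (norm + 1e-12))
    return [g * factor for g in grad]


def dp_sgd_step(params, per_sample_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD style update on a flat parameter list."""
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian noise with standard deviation noise_multiplier * max_norm per coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    batch_size = len(per_sample_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # the first sample exceeds the clipping norm
    print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.0, rng=random.Random(0)))
```

Clipping bounds how much any single example can move the parameters, and the noise is calibrated to that bound. The privacy guarantee itself depends on an accountant that tracks the sampling rate and number of steps, which this sketch does not attempt.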
Sharpness-aware methods seek parameters that lie in neighborhoods with uniformly low loss rather than at isolated minima, which tends to improve generalization. Introduced by SAM (Foret et al., ICLR 2021), these methods wrap a base optimizer such as SGD or AdamW and add a gradient ascent perturbation step before the descent update. Later work makes the perturbation scale-invariant, closes the surrogate gap, reweights the sharpness term, amortizes the extra forward-backward cost, or extends the idea to second-order optimization.
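The ascent-then-descent structure can be sketched as a single SAM-style step. This is a minimal plain-Python illustration with SGD as the base update, not the zij.SAM implementation; `sam_step` and `grad_fn` are hypothetical names, and `rho` is the radius of the perturbation.

```python
import math


def sam_step(params, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update using plain SGD as the base optimizer."""
    grad = grad_fn(params)
    norm = math.sqrt(sum(g * g for g in grad)) + 1e-12
    # Ascent: step of length rho along the normalized gradient direction.
    perturbed = [p + rho * g / norm for p, g in zip(params, grad)]
    # Second forward-backward pass, evaluated at the perturbed point.
    sam_grad = grad_fn(perturbed)
    # Descent: apply the perturbed-point gradient to the original parameters.
    return [p - lr * g for p, g in zip(params, sam_grad)]


if __name__ == "__main__":
    # Gradient of the toy loss 0.5 * (w0**2 + 10 * w1**2).
    grad_fn = lambda w: [w[0], 10.0 * w[1]]
    print(sam_step([1.0, 0.0], grad_fn))
```

The two calls to `grad_fn` are the extra forward-backward cost that later methods in this family try to amortize.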
This page collects optimizers from two adjacent settings. The first is the optimization of variational quantum circuits, where shot noise and the quantum geometry of the parameter space drive the design of measurement-frugal, gradient-free, and natural-gradient methods. The second is quantum-inspired and quantum-hardware optimization of classical neural networks, where quantum fluctuations, adiabatic evolution, or annealer sampling replace or augment the classical training loop.
Learning-rate-free (also called parameter-free or tuning-free) optimizers select their step size automatically during training instead of requiring a manually tuned learning rate. Most methods in this family estimate a quantity such as the distance from the initial point to the solution and set the effective step size from observed gradients, while others wrap an existing base optimizer and tune its global scale factor online. The goal is to match the performance of a well-tuned baseline without a learning-rate search.
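One way to picture a distance-based step size is the "distance over gradients" rule, where the step is the largest distance travelled from the starting point divided by the root of the accumulated squared gradient norms. The sketch below is a simplified plain-Python rendering of that idea on a toy quadratic; it is not the zij implementation of any particular method, and `dog_minimize` and `r_eps` are hypothetical names.

```python
import math


def dog_minimize(grad_fn, x0, steps=300, r_eps=1e-4):
    """Gradient descent with a DoG-style step size; no learning rate is supplied."""
    x = list(x0)
    x0_norm = math.sqrt(sum(v * v for v in x0))
    # Small initial movement so the first step is nonzero.
    max_dist = r_eps * (1.0 + x0_norm)
    grad_sq_sum = 0.0
    for _ in range(steps):
        grad = grad_fn(x)
        grad_sq_sum += sum(g * g for g in grad)
        if grad_sq_sum == 0.0:
            break  # already at a stationary point
        step = max_dist / math.sqrt(grad_sq_sum)
        x = [xi - step * gi for xi, gi in zip(x, grad)]
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        max_dist = max(max_dist, dist)
    return x


if __name__ == "__main__":
    target = [3.0, -4.0]
    grad_fn = lambda x: [xi - ti for xi, ti in zip(x, target)]
    print(dog_minimize(grad_fn, [0.0, 0.0]))
```

The step starts tiny and grows as the iterate moves away from its starting point, so the only remaining knob is the initial movement `r_eps`, which is intended to matter far less than a learning rate.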
zij.core.lr_scheduler vendors the PyTorch core learning rate schedulers under their original class names. The first table lists every vendored class, including the LRScheduler base class, with the published work it derives from where one exists. The second table covers notable schedules from the literature that zij does not yet implement.
| Scheduler | Origin |
|---|---|
ChainedScheduler | — |
ConstantLR | — |
CosineAnnealingLR | Loshchilov & Hutter ICLR 2017 (SGDR) |
CosineAnnealingWarmRestarts | Loshchilov & Hutter ICLR 2017 (SGDR) |
CyclicLR | Smith WACV 2017 (cyclical learning rates) |
ExponentialLR | — |
LambdaLR | — |
LinearLR | — |
LRScheduler | — |
MultiplicativeLR | — |
MultiStepLR | — |
OneCycleLR | Smith & Topin 2019 (super-convergence) |
PolynomialLR | — |
ReduceLROnPlateau | — |
SequentialLR | — |
StepLR | — |
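For readers who want to see what a schedule in the table computes, the closed-form cosine annealing curve from SGDR can be written in a few lines. This is a standalone sketch of the formula, not the vendored class; the function names and argument names are hypothetical, and the restart variant assumes a fixed period.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, eta_min=0.0):
    """Learning rate at `step` on a half-cosine from base_lr down to eta_min."""
    cosine = math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


def cosine_warm_restarts_lr(step, period, base_lr, eta_min=0.0):
    """Same curve, restarted from base_lr every `period` steps."""
    return cosine_annealing_lr(step % period, period, base_lr, eta_min)


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_annealing_lr(step, 100, base_lr=3e-4))
```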
Schedule-Free is not a schedule on top of an optimizer but a replacement for scheduling, achieved through online iterate averaging inside the optimizer; see the learning-rate-free optimizers.
Weight averaging is available separately in zij.core.swa_utils, which provides stochastic weight averaging and exponential moving average utilities (AveragedModel, SWALR, update_bn, and the SWA/EMA averaging functions), following Averaging Weights Leads to Wider Optima and Better Generalization (Izmailov et al., UAI 2018).
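The two averaging rules can be sketched on flat parameter lists. This is a plain-Python illustration of the arithmetic only, under the assumption that parameters are flattened to a list of floats; `update_swa` and `update_ema` are hypothetical names and not the zij.core.swa_utils API.

```python
def update_swa(avg, params, num_averaged):
    """Equal-weight running mean: fold one more snapshot into the average."""
    return [a + (p - a) / (num_averaged + 1) for a, p in zip(avg, params)]


def update_ema(avg, params, decay=0.999):
    """Exponential moving average: recent snapshots count more."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]


if __name__ == "__main__":
    snapshots = [[1.0], [2.0], [3.0]]
    swa = snapshots[0]
    for n, params in enumerate(snapshots[1:], start=1):
        swa = update_swa(swa, params, num_averaged=n)
    print(swa)  # mean of the three snapshots
    print(update_ema([0.0], [1.0], decay=0.5))
```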
Two kinds of project cover this ground: curated awesome-lists, and installable
optimizer collections. zij (زِيج) is both.
| Capability | Awesome-lists | Library collections | zij |
|---|---|---|---|
| Curated reference of the whole field | Yes | — | Yes |
| Installable, tested implementations | — | Yes | Yes |
| Paper-only methods included | Yes | — | Yes |
| Update rule in standard notation | — | — | Yes |
| Per-file provenance (upstream, commit, license) | — | Partial | Yes |
| Dedicated fractional-order coverage | — | — | Yes |
| Dedicated quantum / quantum-inspired coverage | — | — | Yes |
- The Canon and the code are one project. Every Canon row links the paper and, where it exists, the implementation. Every implementation links back to its source and paper.
- Provenance is explicit. Vendored files record their upstream repository, pinned commit, and license; THIRD_PARTY_NOTICES.md aggregates the attributions. Sources under the GPL, a non-commercial license, or no license at all are listed but not vendored.
- Mathematics is explicit. Each update rule is written in standard notation. Where an official implementation diverges from its own paper, the docstring records what the code computes.
- Everything is tested. Every registered optimizer has convergence and state-dict round-trip tests.
New implementations, Canon entries, and corrections are welcome. See CONTRIBUTING.md. A Canon correction counts as much as a code change.
zij (زِيج) builds on the projects it learns from:
- APRIL-AIGC/Awesome-Optimizer: an awesome-list whose breadth helped inform this project's scope.
- kozistr/pytorch_optimizer: a comprehensive, maintained PyTorch optimizer collection, and a reference for several vendored implementations.
- jettify/pytorch-optimizer: an early community optimizer collection and the source of several classic implementations.
- timm: tested optimizer implementations and packaging conventions.
- PyTorch: the torch.optim core that zij.core mirrors.
- The optimizer authors: each method is someone's research. The canonical paper is cited in every Canon row and class docstring, and the original repository is credited per file in THIRD_PARTY_NOTICES.md.
If you use an optimizer from this library, cite two works: the original paper of
the algorithm (linked in its Canon row and docstring), and zij as the software
you ran. The paper credits the method; the software credits the implementation.
@software{raja_zij,
author = {Raja, Muhammad Junaid Ali Asif},
title = {zij: A Canon and Library of Deep Learning Optimizers},
year = {2026},
url = {https://github.com/junaidaliop/zij}
}
Machine-readable metadata is in CITATION.cff.
Apache-2.0. Vendored components retain their original licenses; see THIRD_PARTY_NOTICES.md.
Muhammad Junaid Ali Asif Raja — muhammadjunaidaliasifraja@gmail.com