zij زِيج

June 15, 2026

Learning state-of-the-art deep learning optimization algorithms.

A zij (Arabic: زِيج, pronounced "zeej") is an astronomical handbook from the Islamic golden age: a set of tables and computational methods that astronomers consulted instead of re-deriving the field from scratch. The best known is the Zīj al-Sindhind of Muḥammad ibn Mūsā al-Khwārizmī (محمد بن موسى الخوارزمي, c. 820 CE). His Latinized name, Algoritmi, became the word algorithm, and his book al-Jabr gave us algebra.

This project takes the name in that spirit. It gathers the equation, the paper, and runnable code for the optimization methods used in machine learning.

Installation

pip install zij

From source, with the pinned environment:

git clone https://github.com/junaidaliop/zij.git
cd zij
conda env create -f environment.yml
conda activate zij-optim

Quick start

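import torch
import zij

# Placeholders so the snippet runs on its own; substitute your own model.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
params = [p for p in model.parameters() if p.ndim == 2]  # weight matrices, used by the GaLore group below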

# torch.optim, vendored at tag v2.12.0
opt = zij.AdamW(model.parameters(), lr=3e-4)

# research optimizers, same interface
opt = zij.Muon([p for p in model.parameters() if p.ndim == 2], lr=2e-2)
opt = zij.Prodigy(model.parameters())                       # no learning rate to set
opt = zij.SAM(model.parameters(), base_optimizer=zij.SGD, lr=0.1, rho=0.05)

# memory-efficient low-rank training (per-group rank)
opt = zij.GaLoreAdamW(
    [{"params": params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}],
    lr=1e-2,
)

# look up by name
zij.list_optimizers("adam*")
opt_cls = zij.load_optimizer("soap")

zij.optim mirrors torch.optim, so zij.optim.AdamW is the same class as zij.AdamW, and zij.optim.lr_scheduler is available. Use whichever import reads better in your code.
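How a glob pattern such as "adam*" could resolve to optimizer names, as in the quick start, can be sketched with the standard library alone. This is an illustration only: the registry dict, the placeholder classes, and the helpers list_names and load_class are hypothetical stand-ins and do not reflect how zij implements list_optimizers or load_optimizer.

```python
from fnmatch import fnmatchcase


# Hypothetical placeholder classes standing in for real optimizers.
class Adam: ...
class AdamW: ...
class SOAP: ...


# Hypothetical registry keyed by lowercase name.
_REGISTRY = {"adam": Adam, "adamw": AdamW, "soap": SOAP}


def list_names(pattern: str = "*") -> list[str]:
    """Return registered names matching a shell-style pattern, sorted."""
    pattern = pattern.lower()
    return sorted(name for name in _REGISTRY if fnmatchcase(name, pattern))


def load_class(name: str) -> type:
    """Look up a class by case-insensitive name, with a readable error."""
    try:
        return _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"unknown optimizer {name!r}; known: {list_names()}") from None


print(list_names("adam*"))
print(load_class("SOAP").__name__)
```

Keeping lookup case-insensitive and failing with the list of known names are design choices of this sketch, not documented behavior of the package.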

Note

A few families use a documented non-standard call protocol. Schedule-Free needs opt.train() and opt.eval(); the SAM family takes a closure or an explicit first_step / second_step pair; Adam-mini and LOMO are built from a model rather than a parameter list. Each class docstring states which.
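To see why the SAM family needs two gradient evaluations per update, the first_step / second_step pair can be written out on a toy problem. The sketch below is framework-free and is not the zij implementation: the quadratic loss, the function names, and the use of plain SGD as the base update are assumptions made for illustration.

```python
import math


def loss(w):
    # Hypothetical toy objective: an elongated quadratic bowl.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy objective above.
    return [w[0], 10.0 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update written as an explicit first_step / second_step pair."""
    # first_step: move to the approximate worst point within radius rho.
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]

    # second_step: evaluate the gradient at the perturbed point, then apply
    # it to the original weights (plain SGD stands in for the base optimizer).
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
start = loss(w)
for _ in range(100):
    w = sam_step(w)
print(f"loss: {start:.3f} -> {loss(w):.6f}")
```

In a real training loop the second gradient requires another forward and backward pass, which is why these optimizers take a closure or expose the two steps separately.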

Library

The PyTorch package ships 106 ready-to-use optimizers. zij.core mirrors torch.optim at tag v2.12.0 (Adam, AdamW, SGD, Muon, LBFGS, Adafactor, and the rest, plus lr_scheduler and swa_utils). zij.contrib adds research methods grouped by family: first-order, second-order, memory-efficient, learning-rate-free, and sharpness-aware. In every Canon table below, the zij column names the class where an implementation exists; a dash (—) means the method is listed but not yet implemented (paper-only, or its source is under a license that cannot be vendored).

zij is a PyTorch library today. The Canon is framework-agnostic: it covers each method regardless of the framework of its original code. JAX and TensorFlow ports are planned and will follow the same standards.

Canon

The Canon below covers 740 methods across 11 categories. Each row records the canonical name, venue, paper, the best available implementation, and the zij class where one exists.

First-Order Optimizers

First-order optimizers update parameters using only gradients and accumulated gradient statistics such as momentum and second-moment estimates. This page covers the stochastic gradient descent lineage, the Adam family, and more recent sign-based and variance-reduced methods. The zij column gives the class name for optimizers already implemented in the package.
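As a concrete instance of "gradients plus accumulated statistics", here is a minimal scalar sketch of the Adam update, using the default hyperparameters from the Adam paper. It is a teaching aid under those assumptions, not the vendored torch.optim code; the state dict layout and the toy objective are hypothetical.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return the new value."""
    state["t"] += 1
    # Momentum: exponential moving average of gradients (first moment).
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Second-moment estimate: moving average of squared gradients.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, state = 0.0, {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(2000):
    x = adam_step(x, 2 * (x - 3), state, lr=0.01)
print(round(x, 2))
```

The two running averages m and v are the "two extra values per parameter" that the memory-efficient methods later in the Canon try to shrink or remove.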

| Optimizer | Venue | Paper | Code | zij |
| --- | --- | --- | --- | --- |
| ASGD | SIAM Journal on Control and Optimization 1992 | Acceleration of Stochastic Approximation by Averaging | community | ASGD |
| Rprop | ICNN 1993 | A direct adaptive method for faster backpropagation learning: the RPROP algorithm | community | Rprop |
| Adagrad | JMLR 2011 | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | community | Adagrad |
| Adadelta | arXiv 2012 | ADADELTA: An Adaptive Learning Rate Method | community | Adadelta |
| RMSprop | Lecture notes 2012 | Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude | community | RMSprop |
| FTRL | KDD 2013 | Ad Click Prediction: a View from the Trenches |  |  |
| SGD | ICML 2013 | On the importance of initialization and momentum in deep learning | community | SGD |
| Adam | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adam |
| AdaMax | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adamax |
| Nadam | ICLR Workshop 2016 | Incorporating Nesterov Momentum into Adam | community | NAdam |
| LARS | arXiv 2017 | Large Batch Training of Convolutional Networks | community | LARS |
| SWATS | arXiv 2017 | Improving Generalization Performance by Switching from Adam to SGD | community | SWATS |
| A2Grad | arXiv 2018 | Optimal Adaptive and Accelerated Stochastic Gradient Descent | community | A2GradUni, A2GradInc, A2GradExp |
| AccSGD | ICLR 2018 | On the insufficiency of existing momentum schemes for Stochastic Optimization | official | AccSGD |
| AMSGrad | ICLR 2018 | On the Convergence of Adam and Beyond | community |  |
| GADAM | arXiv 2018 | GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization |  |  |
| M-SVAG | ICML 2018 | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | official |  |
| PID | CVPR 2018 | A PID Controller Approach for Stochastic Optimization of Deep Networks | official | PID |
| VR-SGD | IEEE TKDE 2018 | VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning |  |  |
| Yogi | NeurIPS 2018 | Adaptive Methods for Nonconvex Optimization | community | Yogi |
| AdaBound | ICLR 2019 | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | official | AdaBound, AdaBoundW |
| AdaMod | arXiv 2019 | An Adaptive and Momental Bound Method for Stochastic Learning | official | AdaMod |
| AdamW | ICLR 2019 | Decoupled Weight Decay Regularization | official | AdamW |
| AdaShift | ICLR 2019 | AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | community | AdaShift |
| AggMo | ICLR 2019 | Aggregated Momentum: Stability Through Passive Damping | official | AggMo |
| AvaGrad | arXiv 2019 | Domain-independent Dominance of Adaptive Methods | official | AvaGrad |
| HAdam | NeurIPS Workshop 2019 | On Higher-order Moments in Adam |  |  |
| HyperAdam | AAAI 2019 | HyperAdam: A Learnable Task-Adaptive Adam for Network Training |  |  |
| Lookahead | NeurIPS 2019 | Lookahead Optimizer: k steps forward, 1 step back | community | Lookahead |
| NosAdam | IJCAI 2019 | Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate |  |  |
| NovoGrad | arXiv 2019 | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | community | NovoGrad |
| QHAdam / QHM | ICLR 2019 | Quasi-hyperbolic momentum and Adam for deep learning | official | QHAdam, QHM |
| Ranger |  | RAdam and Lookahead combination | official | Ranger |
| Sadam | arXiv 2019 | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM |  |  |
| AdaBelief | NeurIPS 2020 | AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | official | AdaBelief |
| Adam+ | arXiv 2020 | Adam+: A Stochastic Method with Adaptive Variance Reduction |  |  |
| AdamBS | NeurIPS 2020 | Adam with Bandit Sampling for Deep Learning |  |  |
| AdaSGD | arXiv 2020 | AdaSGD: Bridging the gap between SGD and Adam |  |  |
| Cayley SGD | ICLR 2020 | Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform | official |  |
| clipped-SGD | NeurIPS 2020 | Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping | official |  |
| DEAM | ASONAM 2020 | DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization |  |  |
| diffGrad | IEEE TNNLS 2020 | diffGrad: An Optimization Method for Convolutional Neural Networks | official | DiffGrad |
| EAdam | arXiv 2020 | EAdam Optimizer: How ε Impact Adam | official |  |
| Fromage | NeurIPS 2020 | On the distance between two neural networks and the stability of learning | official |  |
| Gradient Centralization (GC) | ECCV 2020 | Gradient Centralization: A New Optimization Technique for Deep Neural Networks | official |  |
| LAMB | ICLR 2020 | Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | community | Lamb |
| LaProp | arXiv 2020 | LaProp: Separating Momentum and Adaptivity in Adam | official | LaProp |
| NIGT | ICML 2020 | Momentum Improves Normalized SGD | official |  |
| Padam | IJCAI 2020 | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | official | PAdam |
| signSGD | ICML 2018 | signSGD: Compressed Optimisation for Non-Convex Problems | community | SignSGD |
| pbSGD | IJCAI 2020 | pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | official |  |
| PCGrad | NeurIPS 2020 | Gradient Surgery for Multi-Task Learning | official |  |
| RAdam | ICLR 2020 | On the Variance of the Adaptive Learning Rate and Beyond | official | RAdam |
| SGD-G2 | ICPR 2020 | Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent |  |  |
| ACMo | AAAI 2021 | ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization |  |  |
| ACProp | NeurIPS 2021 | Momentum Centering and Asynchronous Update for Adaptive Gradient Methods | official |  |
| AdaL | arXiv 2021 | AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations |  |  |
| AdamD | arXiv 2021 | AdamD: Improved bias-correction in Adam |  |  |
| AdamP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | AdamP |
| Adaptive Gradient Clipping (AGC) | ICML 2021 | High-Performance Large-Scale Image Recognition Without Normalization | official |  |
| AngularGrad | arXiv 2021 | AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks | official |  |
| BGADAM | IJCNN 2021 | BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization |  |  |
| Gravity | arXiv 2021 | Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning | official | Gravity |
| MADGRAD | arXiv 2021 | Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | official | MADGRAD, MirrorMADGRAD |
| MaxVA | ECML PKDD 2021 | MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients | official |  |
| Nero | ICML 2021 | Learning by Turning: Neural Architecture Aware Optimisation | official |  |
| PNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official |  |
| AdaPNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official | AdaPNM |
| Ranger21 | arXiv 2021 | Ranger21: a synergistic deep learning optimizer | official | Ranger21 |
| SGDP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | SGDP |
| AdaFamily | arXiv 2022 | AdaFamily: A family of Adam-like adaptive gradient methods |  |  |
| Adai | ICML 2022 | Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum | official | Adai |
| AdamMC | CVMI 2022 | Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks |  |  |
| Adan | arXiv 2022 | Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | official | Adan |
| AdaSmooth | arXiv 2022 | AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio |  | AdaSmooth |
| AEGDM | Annals of Applied Mathematics 2022 | An Adaptive Gradient Method with Energy and Momentum | official |  |
| Amos | arXiv 2022 | Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | official | Amos |
| GDA-AM | ICLR 2022 | GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration | official |  |
| KOALA | AAAI 2022 | KOALA: A Kalman Optimization Algorithm with Loss Adaptivity | official |  |
| RotoGrad | ICLR 2022 | RotoGrad: Gradient Homogenization in Multitask Learning | official |  |
| SRSGD | SIAM Journal on Imaging Sciences 2022 | Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent |  |  |
| Step-Tuned SGD | Neural Processing Letters 2022 | Second-order step-size tuning of SGD for non-convex optimization |  |  |
| AdaInject | IEEE TAI 2023 | AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks | official |  |
| AdaNorm | WACV 2023 | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | official | AdaNorm |
| AGD | NeurIPS 2023 | AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix |  |  |
| Aida | TMLR 2023 | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | official |  |
| Lion | NeurIPS 2023 | Symbolic Discovery of Optimization Algorithms | official | Lion |
| Lookaround | NeurIPS 2023 | Lookaround Optimizer: k steps around, 1 step average |  |  |
| MultiAdam | ICML 2023 | MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks |  |  |
| RLEKF | AAAI 2023 | RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy |  |  |
| Scheduled Weight Decay (SWD) | NeurIPS 2023 | On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective | official |  |
| SGDF | arXiv 2023 | Signal Processing Meets SGD: From Momentum to Filter |  |  |
| StableAdamW | NeurIPS 2023 | Stable and low-precision training for large-scale vision-language models | community | StableAdamW |
| AdaAct | ICDMW 2024 | An Adaptive Method Stabilizing Activations for Enhanced Generalization |  |  |
| Adam-atan2 | ICML 2024 | Scaling Exponents Across Parameterizations and Optimizers | community | AdamAtan2 |
| Adam-Rel | NeurIPS 2024 | Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps |  |  |
| AdEMAMix | arXiv 2024 | The AdEMAMix Optimizer: Better, Faster, Older | official | AdEMAMix |
| ADOPT | NeurIPS 2024 | ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate | official | ADOPT |
| AGS-GD | arXiv 2024 | Anisotropic Gaussian Smoothing for Gradient-based Optimization |  |  |
| BADM | arXiv 2024 | BADM: Batch ADMM for Deep Learning |  |  |
| CaAdam | arXiv 2024 | CaAdam: Improving Adam optimizer using connection aware methods | official |  |
| CAdam | arXiv 2024 | CAdam: Confidence-Based Optimization for Online Learning |  |  |
| Cautious Optimizers | arXiv 2024 | Cautious Optimizers: Improving Training with One Line of Code | official |  |
| EXAdam | arXiv 2024 | EXAdam: The Power of Adaptive Cross-Moments | official | EXAdam |
| FAdam | arXiv 2024 | FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | community | FAdam |
| GrokAdamW |  | AdamW variant with Grokfast-style gradient amplification | official | GrokAdamW |
| Grokfast | arXiv 2024 | Grokfast: Accelerated Grokking by Amplifying Slow Gradients | official |  |
| INNAprop | arXiv 2024 | A second-order-like optimizer with adaptive gradient scaling for deep learning | official |  |
| KATE | NeurIPS 2024 | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | official |  |
| MADA | ICML 2024 | MADA: Meta-Adaptive Optimizers through hyper-gradient Descent |  |  |
| RSGDM | CCSB 2024 | Reducing Bias in Deep Learning Optimization: The RSGDM Approach |  |  |
| SET-Adam | ECML PKDD 2024 | On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance |  |  |
| SNGM | Science China Information Sciences 2024 | Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training |  |  |
| SRMM | JMLR 2024 | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | official |  |
| TAM | arXiv 2024 | Torque-Aware Momentum |  |  |
| WarpAdam | arXiv 2024 | WarpAdam: A new Adam optimizer based on Meta-Learning approach |  |  |
| AbsSADMM | arXiv 2025 | Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization |  |  |
| AdamC | arXiv 2025 | Why Gradients Rapidly Increase Near the End of Training |  |  |
| AdamNX | arXiv 2025 | AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate | official |  |
| AdamS | EMNLP 2025 | AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training |  |  |
| adaNAPG | arXiv 2025 | Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization |  |  |
| Ano | arXiv 2025 | ANO : Faster is Better in Noisy Landscape | official |  |
| BCOS | arXiv 2025 | Stochastic Approximation with Block Coordinate Optimal Stepsizes | official |  |
| Cautious Weight Decay | arXiv 2025 | Cautious Weight Decay | community |  |
| Conda | arXiv 2025 | Conda: Column-Normalized Adam for Training Large Language Models Faster | official |  |
| Coupled Adam | ACL 2025 | Better Embeddings with Coupled Adam |  |  |
| DecGD | Machine Learning 2025 | A New Adaptive Gradient Method with Gradient Decomposition |  |  |
| DEO | arXiv 2025 | Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training | official |  |
| EmoNavi |  | An emotion-driven optimizer that feels loss and navigates accordingly | official |  |
| MARS | ICML 2025 | MARS: Unleashing the Power of Variance Reduction for Training Large Models | official | MARS |
| FOCUS | arXiv 2025 | FOCUS: First Order Concentrated Updating Scheme | official | FOCUS |
| FSGDM | ICLR 2025 | On the Performance Analysis of Momentum Method: A Frequency Domain Perspective |  |  |
| Grams | ICLR Workshop 2025 | Grams: Gradient Descent with Adaptive Momentum Scaling | official | Grams |
| HGM | arXiv 2025 | Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate |  |  |
| HVAdam | AAAI 2025 | HVAdam: A Full-Dimension Adaptive Optimizer |  |  |
| KO | arXiv 2025 | KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches |  |  |
| KOALA++ | NeurIPS 2025 | KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products |  |  |
| Kourkoutas-Beta | arXiv 2025 | Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair | official | KourkoutasSoftmaxFlex |
| MIAdam | AAAI 2025 | A Method for Enhancing Generalization of Adam by Multiple Integrations | official |  |
| μ²-SGD | ICLR 2025 | Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism |  |  |
| ⊥Grad (OrthoGrad) | ICLR 2025 | Grokking at the Edge of Numerical Stability | official |  |
| Overshoot | arXiv 2025 | Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization | official |  |
| PadamP | arXiv 2025 | Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning |  |  |
| Simplified-AdEMAMix | arXiv 2025 | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | official |  |
| LyAm | arXiv 2025 | LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments |  |  |
| NIRMAL | arXiv 2025 | Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum |  |  |
| SCSAdamW | arXiv 2025 | Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW | official |  |
| SKA-SGD | arXiv 2025 | Streaming Krylov-Accelerated Stochastic Gradient Descent |  |  |
| SoftSignSGD (S3) | arXiv 2025 | SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam |  |  |
| SPAM | arXiv 2025 | SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training | official |  |
| VSGD | TMLR 2025 | Variational Stochastic Gradient Descent for Deep Neural Networks | official |  |
| ZetA | arXiv 2025 | ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning |  |  |
| AdaGC | ICML 2026 | AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping |  | AdaGC |
| Anon | arXiv 2026 | Anon: Extrapolating Adaptivity Beyond SGD and Adam |  |  |
| C-Adam | arXiv 2026 | A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm |  |  |
| DualAdam | arXiv 2026 | Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers | official |  |
| FANoS | arXiv 2026 | FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization | official |  |
| GradPower | ICML 2026 | GradPower: Powering Gradients for Faster Language Model Pre-Training |  |  |
| HomeAdam | arXiv 2026 | HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization |  |  |
| NOVAK | arXiv 2026 | NOVAK: Unified adaptive optimizer for deep neural networks |  |  |
| PS-Clip-SGD | arXiv 2026 | Robust and Fast Training via Per-Sample Clipping |  |  |
| SparseOpt | ICML 2026 | SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training |  |  |
| Stable-SPAM / GradientStabilizer | ICML 2026 | GradientStabilizer: Fix the Norm, Not the Gradient | official |  |
| VRAdam | ICLR 2026 | A Physics-Inspired Optimizer: Velocity Regularized Adam | official |  |
| SparseAdam |  | Adam variant for sparse gradients | official | SparseAdam |
| OptMuon | arXiv 2026 | OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality |  |  |
| FOGO | arXiv 2026 | FOGO: Forgetting-aware Orthogonalization Optimizer |  |  |
| AdamO | ICML 2026 | Preserving Plasticity in Continual Learning via Dynamical Isometry |  |  |
| MAdam | arXiv 2026 | MAdam: Metric-Aware Multi-Objective Adam |  |  |
| MuCon | arXiv 2026 | MuCon: Clipped Muon Updates for LLM Training |  |  |
| NuMuon | arXiv 2026 | NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training |  |  |
| MiMuon | arXiv 2026 | MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models |  |  |
| Pion | arXiv preprint (cs.LG, stat.ML) 2026 | Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation | official |  |
| iMuon (Intrinsic Muon) | arXiv 2026 | Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds | official |  |
| Muon-OGD | arXiv 2026 | Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning |  |  |
| Newton-Muon | arXiv 2026 | The Newton-Muon Optimizer | official |  |
| MuonEq | arXiv 2026 | MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration | official |  |
| RMNP | arXiv 2026 | RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization | official |  |
| MUD | arXiv preprint 2026 | Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training |  |  |
| NAMO | arXiv 2026 | Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum | official |  |
| SpecMuon | arXiv 2026 | Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning |  |  |
| ARO | arXiv 2026 | ARO: A New Lens On Matrix Optimization For Large Models |  |  |
| PRISM | arXiv 2026 | PRISM: Structured Optimization via Anisotropic Spectral Shaping |  |  |
| MCSD / SPEL | arXiv 2026 | Manifold constrained steepest descent |  |  |
| Variance-Adaptive Muon (Muon-NSR / Muon-VS) | arXiv 2026 | Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum |  |  |
| MuonAll | arXiv 2025 | MuonAll: Muon Variant for Efficient Finetuning of Large Language Models | official |  |
| Gluon | arXiv 2025 (also accepted at ICML 2025 HiLD workshop) | Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs) |  |  |
| LPSGD / LPSGDM | arXiv 2026 | Beyond L2-norm and L-infinity-norm: A Curvature-Inspired ell_p-Norm Scheme for Deep Neural Networks |  |  |
| ABSignSGD | ICLR 2026 | Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning |  |  |
| StoSignSGD | arXiv 2026 | StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models |  |  |
| Hybrid SignSGD-SGD switching | arXiv 2026 | Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy |  |  |
| SoftSignum / SoftMuon | ICML 2026 | Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling | official |  |
| Accelerated SignGD | arXiv 2025 | Norm-Constrained Flows and Sign-Based Optimization: Theory and Algorithms |  |  |
| CLion | arXiv 2026 | CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization |  |  |
| OLion | arXiv 2026 | OLion: Approaching the Hadamard Ideal by Intersecting Spectral and ell_{infty} Implicit Biases | official |  |
| MGUP | NeurIPS 2025 | MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization | official |  |
| Magma | arXiv 2026 | On Surprising Effectiveness of Masking Updates in Adaptive Optimizers |  |  |
| AGGC | ACL 2026 | AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training | official |  |
| Clipped Scion | NeurIPS 2025 | Generalized Gradient Norm Clipping & Non-Euclidean (L_0,L_1)-Smoothness | official |  |
| SPECTRA | ICML 2026 | Enhancing LLM Training via Spectral Clipping | official |  |
| Spectral Clipping (matrix-valued) | arXiv 2026 | Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters |  |  |
| SPAMP | ACM Multimedia Asia 2025 (7th ACM International Conference on Multimedia in Asia) | Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control |  |  |
| NucGD | arXiv 2026 | Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints | official |  |
| Batched / Transported Scion | arXiv 2026 | Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise |  |  |
| EMA bias-corrected iterate averaging | NeurIPS 2025 Workshop (OPT 2025) | EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes |  |  |
| RGrad-Avg | OPT 2025 (17th Annual Workshop on Optimization for Machine Learning, co-located with NeurIPS 2025) | On Riemannian Gradient Descent Algorithm using gradient averaging |  |  |
| SGD with adaptive preconditioning | ICLR 2026 | SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration |  |  |
| HTMuon | arXiv 2026 | HTMuon: Improving Muon via Heavy-Tailed Spectral Correction | official |  |
| MARS-M | arXiv 2025 | MARS-M: When Variance Reduction Meets Matrices | official |  |
| Drop-Muon | arXiv 2025 | Drop-Muon: Update Less, Converge Faster |  |  |
| Muon+ | arXiv 2026 | MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training | official |  |
| TrasMuon | ICLR 2026 Workshop Sci4DL | TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers |  |  |
| Adam-SHANG | arXiv 2026 | Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization |  |  |
| EMA-Nesterov | arXiv 2026 | EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization |  |  |
| S-Adam | arXiv 2026 | Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization |  |  |
| IAdaPID-ADG | arXiv 2026 | An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning |  |  |
| CT-AGD | arXiv 2026 | Accelerated Gradient Descent for Faster Convergence with Minimal Overhead |  |  |
| GPA (Generalized Primal Averaging) | arXiv 2025 | Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs | official |  |
| SNOO | arXiv 2025 | SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients | official |  |
| Riemannion | ICLR 2026 | LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters |  |  |
| Optimal Projection-Free Adaptive SGD | arXiv 2026 | Optimal Projection-Free Adaptive SGD for Matrix Optimization |  |  |
| AdamCB | ICLR 2025 | ADAM Optimization with Adaptive Batch Selection |  |  |
| Kalman-Adam | Knowledge-Based Systems 2026 | Kalman-Adam: Optimal bayesian moment estimation for memory-Efficient and generalizable deep learning |  |  |
| AdamHD (AdamHuberDecay) | NeurIPS 2025 Workshop (ScaleOpt: GPU-Accelerated and Scalable Optimization) | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training |  |  |
| MVN-Grad | arXiv 2026 | Adaptive Optimization via Momentum on Variance-Normalized Gradients |  |  |
| Compositional Muon (CM) | Tilde Research blog 2026 | Towards Compositional Steepest Descent | official |  |

Memory-Efficient Optimizers

Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
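The size of the saving is easy to work out by counting stored values. The sketch below is back-of-the-envelope arithmetic, not a measurement: the 7B-parameter model and the 4096 x 4096 matrix are hypothetical, per-block quantization scales are ignored, and the low-rank count assumes a rows x rank projector plus two rank x cols moment buffers, following the accounting commonly given for gradient low-rank projection.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Adam keeps two values (first and second moment) per parameter."""
    return 2 * num_params * bytes_per_value


def factored_second_moment_values(rows: int, cols: int) -> int:
    """Factored second moment: one row vector plus one column vector."""
    return rows + cols


def low_rank_adam_values(rows: int, cols: int, rank: int) -> int:
    """Rank-r projection (assuming rows <= cols): a rows x rank projector
    plus two rank x cols moment buffers."""
    return rows * rank + 2 * rank * cols


GIB = 1024 ** 3
n = 7_000_000_000  # hypothetical 7B-parameter model
print(f"fp32 Adam state : {adam_state_bytes(n, 4) / GIB:.1f} GiB")
print(f"8-bit Adam state: {adam_state_bytes(n, 1) / GIB:.1f} GiB (ignoring per-block scales)")

rows, cols, rank = 4096, 4096, 128  # hypothetical weight matrix
print("full second moment values    :", rows * cols)
print("factored second moment values:", factored_second_moment_values(rows, cols))
print("full Adam state values       :", 2 * rows * cols)
print("low-rank Adam state values   :", low_rank_adam_values(rows, cols, rank))
```

For the hypothetical matrix above, factoring shrinks the second moment from about 16.8 million values to 8,192, and rank-128 projection shrinks the full Adam state by roughly a factor of 21.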

| Optimizer | Venue | Paper | Code | zij |
| --- | --- | --- | --- | --- |
| Adafactor | ICML 2018 | Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | official | Adafactor |
| SM3 | NeurIPS 2019 | Memory-Efficient Adaptive Optimization | official | SM3 |
| 8-bit Optimizers | ICLR 2022 | 8-bit Optimizers via Block-wise Quantization | official |  |
| tpSGD | arXiv 2022 | Learning with Local Gradients at the Edge |  |  |
| 4-bit Optimizers | NeurIPS 2023 | Memory Efficient Optimizers with 4-bit States | official |  |
| Adalite | GitHub 2023 | Adalite: a custom optimizer based on Adafactor and LAMB | official |  |
| AdaLomo | ACL 2024 Findings | AdaLomo: Low-memory Optimization with Adaptive Learning Rate | official | AdaLomo |
| CAME | ACL 2023 | CAME: Confidence-guided Adaptive Memory Efficient Optimization | official | CAME |
| Lion | NeurIPS 2023 | Symbolic Discovery of Optimization Algorithms | official |  |
| LOMO | ACL 2024 | Full Parameter Fine-tuning for Large Language Models with Limited Resources | official | Lomo |
| MeZO | NeurIPS 2023 | Fine-Tuning Language Models with Just Forward Passes | official |  |
| Tiger | GitHub 2023 | Tiger: A Tight-fisted Optimizer | official | Tiger |
| 4-bit Shampoo | NeurIPS 2024 | 4-bit Shampoo for Memory-Efficient Network Training | official |  |
| Adam-mini | ICLR 2025 | Adam-mini: Use Fewer Learning Rates To Gain More | official | AdamMini |
| Adapprox | arXiv 2024 | Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices |  |  |
| AdaRankGrad | ICLR 2025 | AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning |  |  |
| Addax | ICLR 2025 | Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models | official |  |
| APOLLO | MLSys 2025 | APOLLO: SGD-like Memory, AdamW-level Performance | official | APOLLO |
| BAdam | NeurIPS 2024 | BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models | official | BlockOptimizer |
| COAP | CVPR 2025 | COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection | official |  |
| Fira | NeurIPS 2025 | Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | official | FiraAdamW |
| Flora | ICML 2024 | Flora: Low-Rank Adapters Are Secretly Gradient Compressors | official |  |
| FRUGAL | ICML 2025 | FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training | official |  |
| GaLore | ICML 2024 | GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | official | GaLoreAdamW |
| GoLore | ICML 2025 | Subspace Optimization for Large Language Models with Convergence Guarantees | official |  |
| GRASS | EMNLP 2024 | Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients | official |  |
| LDAdam | ICLR 2025 | LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics | official | LDAdamW |
| LoQT | NeurIPS 2024 | LoQT: Low-Rank Adapters for Quantized Pretraining | official |  |
| LoRA-RITE | ICLR 2025 | LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization | official |  |
| MicroAdam | NeurIPS 2024 | MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | official |  |
| Muon | Blog 2024 | Muon: An optimizer for hidden layers in neural networks | official | Muon |
| Online Subspace Descent | NeurIPS 2024 | Memory-Efficient LLM Training with Online Subspace Descent | official |  |
| Q-GaLore | CPAL 2025 | Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients | official |  |
| SGD-SaI | arXiv 2024 | No More Adam: Learning Rate Scaling at Initialization is All You Need | official | SGDSaI |
| SMMF | AAAI 2025 | SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization | official |  |
| SNSM | ICML 2025 | Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees | official |  |
| SWAN | ICML 2025 | SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training |  |  |
| AlphaGrad | arXiv 2025 | AlphaGrad: Non-Linear Gradient Normalization Optimizer |  |  |
| GWT | arXiv 2025 | GWT: Scalable Optimizer State Compression for Large Language Model Training |  |  |
| MLorc | AISTATS 2026 | MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation | official |  |
| MoFaSGD | TMLR 2025 | Low-rank Momentum Factorization for Memory Efficient Training | official |  |
| RACS / Alice | arXiv 2025 | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | community |  |
| SinkGD | arXiv 2025 | Gradient Multi-Normalization for Stateless and Scalable LLM Training |  |  |
| SPAM | ICLR 2025 | SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training | official | SPAM |
| SubTrack++ | NeurIPS 2025 | SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training | official |  |
| SUMO | NeurIPS 2025 | SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training |  |  |
| TensorGRaD | arXiv 2025 | TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training |  |  |
| FlashOptim | arXiv 2026 | FlashOptim: Optimizers for Memory-Efficient Training | official |  |
| Rose | GitHub 2026 | Rose: Range-Of-Slice Equilibration optimizer | official |  |
| SAGE | ACL 2026 Findings | SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization |  |  |
| BlockLLM | arXiv 2024 | BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks | official |  |
| Natural GaLore | arXiv 2024 | Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning | official |  |
| SLTrain | NeurIPS 2024 | SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining | official |  |
| 8-bit Muon | arXiv 2025 | Effective Quantization of Muon Optimizer States |  |  |
| FFT-based Subspace Selection | ICLR 2026 | FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models | official |  |
| FOAM | arXiv 2025 | FOAM: Blocked State Folding for Memory-Efficient LLM Training | official |  |
| GaLore 2 | arXiv 2025 | GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection |  |  |
| GradientStabilizer | ICML 2026 | GradientStabilizer: Fix the Norm, Not the Gradient | official |  |
| GUM | arXiv 2025 | Unbiased Gradient Low-Rank Projection |  |  |
| I3S | NeurIPS 2025 | Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining |  |  |
| LORENZA | TMLR 2026 | LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM |  |  |
| ProjFactor (VLoRP) | arXiv 2025 | Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients |  |  |
| RSO | arXiv 2025 | A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models |  |  |
| SCALE | ICML 2026 | Memory-Efficient LLM Pretraining via Minimalist Optimizer Design |  |  |
| SlimAdam | arXiv 2025 | When Can You Get Away with Low Memory Adam? | official |  |
| LoRA-Pre | ICLR 2026 | Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation | official |  |
| Lotus | arXiv 2026 | Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching |  |  |
| POET-X | ICML 2026 | POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation | official |  |
| MuonQ | arXiv 2026 | MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization | official |  |
| 4-bit-Muon-GRASP | ICLR 2026 | Achieving low-bit Muon through subspace preservation and grid quantization | official |  |
| IO-Adam | OpenReview 2026 | IO-Adam: Rethinking Memory-Efficient Adaptive Optimizers from Gradient Computation |  |  |
| H-Fac | AISTATS 2025 | Memory-Efficient Optimization with Factorized Hamiltonian Descent |  |  |
| LiMuon | ICML 2026 | LiMuon: Light and Fast Muon Optimizer for Large Models |  |  |
| M+Adam | OPT 2025: 17th Annual Workshop on Optimization for Machine Learning (NeurIPS 2025 Workshop) | M+Adam: Stable Low-Precision Training with Combined Adam–Madam Updates |  |  |
| SMET | ICML 2026 | Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling | official |  |
| PowerStep | arXiv 2026 | PowerStep: Memory-Efficient Adaptive Optimization via ell_p-Norm Steepest Descent | official |  |
| SRON | OpenReview 2025 | SRON: State-free LLM Training via Row-wise Gradient Normalization |  |  |
| GradLite | arXiv 2025 | Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints |  |  |
| Optimal Low-Rank SGE | arXiv preprint 2026 | Optimal low-rank stochastic gradient estimation for LLM training |  |  |
| Spectral Compact Training (SCT) | arXiv 2026 | Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction | official |  |

Trainer integrations

HuggingFace transformers exposes many of these methods through the optim argument of TrainingArguments. Each string value below maps to a memory-efficient optimizer; all backing libraries except the built-in Adafactor must be installed separately.

| optim value | Backing library |
| --- | --- |
| `adafactor` | transformers ships its own Adafactor implementation with relative-step and update-clipping options (Apache-2.0). |
| `adamw_bnb_8bit` / `adamw_8bit` | bitsandbytes AdamW with block-wise 8-bit quantized state (MIT). |
| `paged_adamw_8bit` / `paged_adamw_32bit` | bitsandbytes paged AdamW; optimizer state is paged between GPU and CPU memory (MIT). |
| `lion_8bit` / `lion_32bit` / `paged_lion_8bit` / `paged_lion_32bit` | bitsandbytes Lion, single momentum buffer, with 8-bit and paged variants (MIT). |
| `ademamix_8bit` / `paged_ademamix_8bit` / `paged_ademamix_32bit` | bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT). |
| `rmsprop_bnb_8bit` | bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT). |
| `adamw_torch_4bit` / `adamw_torch_8bit` | torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause). |
| `galore_adamw` / `galore_adamw_8bit` / `galore_adafactor` and `*_layerwise` variants | galore-torch, the official GaLore release (Apache-2.0). |
| `apollo_adamw` / `apollo_adamw_layerwise` | apollo-torch, the official APOLLO release (CC-BY-NC-4.0). |
| `lomo` / `adalomo` | lomo-optim, the official LOMO and AdaLomo release (MIT). |
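
As a rough sketch of how the table can be used before launching a run, the helper below checks whether the library behind a given optim value is importable. The mapping mirrors the table; the import names (`galore_torch`, `apollo_torch`, `lomo_optim`) and the helper names `backing_module` and `is_ready` are assumptions made for illustration and are not part of transformers or zij.

```python
from importlib.util import find_spec

# optim value -> import name of the backing library (None = ships with transformers).
# Import names are ASSUMED from the package names listed in the table.
_BNB, _GALORE = "bitsandbytes", "galore_torch"
BACKING_MODULE = {
    "adafactor": None,
    "adamw_bnb_8bit": _BNB, "adamw_8bit": _BNB,
    "paged_adamw_8bit": _BNB, "paged_adamw_32bit": _BNB,
    "lion_8bit": _BNB, "lion_32bit": _BNB,
    "paged_lion_8bit": _BNB, "paged_lion_32bit": _BNB,
    "ademamix_8bit": _BNB, "paged_ademamix_8bit": _BNB, "paged_ademamix_32bit": _BNB,
    "rmsprop_bnb_8bit": _BNB,
    "adamw_torch_4bit": "torchao", "adamw_torch_8bit": "torchao",
    "galore_adamw": _GALORE, "galore_adamw_8bit": _GALORE, "galore_adafactor": _GALORE,
    "galore_adamw_layerwise": _GALORE, "galore_adamw_8bit_layerwise": _GALORE,
    "galore_adafactor_layerwise": _GALORE,
    "apollo_adamw": "apollo_torch", "apollo_adamw_layerwise": "apollo_torch",
    "lomo": "lomo_optim", "adalomo": "lomo_optim",
}


def backing_module(optim):
    """Return the import name behind `optim`, or None if no extra package is needed."""
    return BACKING_MODULE[optim]  # raises KeyError for values not in the table


def is_ready(optim):
    """True if `optim` needs no extra package or its package can be imported here."""
    module = backing_module(optim)
    return module is None or find_spec(module) is not None
```

When `is_ready("adamw_bnb_8bit")` returns True, the same string can be passed as the optim argument of TrainingArguments; when it returns False, install the library named in the table first.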

Fractional-Order Optimizers

Fractional-order optimizers generalize the integer-order gradient step with fractional-calculus operators, most commonly the Caputo, Riemann-Liouville, or Grünwald-Letnikov derivative, which weight past gradient information through power-law memory kernels. The field is young: the first neural-network training results date to 2015, convergence theory is still being settled, and most papers ship no code.
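
To make the power-law memory concrete, here is a minimal sketch of the Grünwald-Letnikov weights with unit step size. It is a generic illustration of the operator, not the update rule of any specific paper listed below; the function names `gl_coefficients` and `gl_difference` are hypothetical.

```python
def gl_coefficients(alpha, n):
    """First n Grünwald-Letnikov weights c_k = (-1)^k * binom(alpha, k)."""
    if n <= 0:
        return []
    coeffs = [1.0]
    for k in range(1, n):
        # Standard recurrence: c_k = c_{k-1} * (1 - (alpha + 1) / k)
        coeffs.append(coeffs[-1] * (1.0 - (alpha + 1.0) / k))
    return coeffs


def gl_difference(seq, alpha):
    """Order-alpha GL difference (unit step) evaluated at the last element of seq."""
    coeffs = gl_coefficients(alpha, len(seq))
    # c_0 multiplies the newest value, c_1 the one before it, and so on.
    return sum(c * x for c, x in zip(coeffs, reversed(seq)))


if __name__ == "__main__":
    print(gl_coefficients(1.0, 4))   # integer order: only the two newest values matter
    print(gl_coefficients(0.5, 4))   # fractional order: every past value keeps a weight
```

With alpha = 1 the weights collapse to the ordinary first difference, so only the two most recent values contribute. For 0 < alpha < 1 every past value keeps a nonzero weight whose magnitude decays roughly like k^-(1+alpha), which is the long memory the paragraph above refers to.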

Foundations

| Optimizer | Venue | Paper | Code |
| --- | --- | --- | --- |
| the Fractional Steepest Descent Method (FSDM) | IEEE Transactions on Neural Networks and Learning Systems 2015 | Fractional Extreme Value Adaptive Training Method: Fractional Steepest Descent Approach | |
| Caputo BP-NN FOGD (ISNN) | Advances in Neural Networks - ISNN 2017 (Lecture Notes in Computer Science) | A Caputo-Type Fractional-Order Gradient Descent Learning of BP Neural Networks | |
| Caputo CVNN FOGD | IEEE Access 2017 | Convergence Analysis of Caputo-Type Fractional Order Complex-Valued Neural Networks | |
| Caputo fractional-order gradient descent | Neural Networks 2017 | Fractional-order gradient descent learning of BP neural networks with Caputo derivative | |
| FBPTT | Circuits, Systems, and Signal Processing 2018 | A Novel Fractional Gradient-Based Learning Algorithm for Recurrent Neural Networks | |
| FGD-RBF | Circuits, Systems, and Signal Processing 2018 | A Fractional Gradient Descent-Based RBF Neural Network | |
| Fractional-Order Deep BP NN | Computational Intelligence and Neuroscience 2018 | Fractional-Order Deep Backpropagation Neural Network | official |
| Caputo-Type FOGD (Deep BP) | IEEE IMCEC 2019 | A Caputo-Type Fractional-Order Gradient Descent Learning of Deep BP Neural Networks | |
| FSGD | Electronic Markets 2019 | Fractional stochastic gradient descent for recommender systems | |
| mF-SGD | IEEE Access 2019 | Design of Momentum Fractional Stochastic Gradient Descent for Recommender Systems | |
| CFEM-LMS | Neurocomputing 2020 | Combination of fractional FLANN filters for solving the Van der Pol-Duffing oscillator | |
| FSDM | Frontiers of Information Technology & Electronic Engineering 2020 | Fractional-order global optimal backpropagation machine trained by an improved fractional-order steepest descent method | |
| Fractional Order Gradient Method | Neurocomputing 2020 | Convolutional neural networks with fractional order gradient method | |
| Normalized Fractional SGD (NFSGD) | Neural Computing and Applications 2020 | Design of normalized fractional SGD computing paradigm for recommender systems | |
| the Fractional Order Gradient Method | Journal of the Franklin Institute 2020 | Generalization of the gradient method with fractional order gradient direction | |
| Fractional Order Gradient Descent with Momentum (FOGDM) | Network: Computation in Neural Systems 2020 | Data classification based on fractional order gradient descent with momentum for RBF neural network | |
| CFGD (Caputo) | arXiv 2021 | A Caputo fractional derivative-based algorithm for optimization | |
| Fractional-Order Momentum (FCM) | Neurocomputing 2021 | Convolutional neural networks based on fractional-order momentum for parameter training | |
| FOGDM-RBF | Soft Computing 2021 | Fractional-order gradient descent with momentum for RBF neural network-based AIS trajectory restoration | |
| Caputron | Electronics (MDPI) 2022 | Exploring the Effects of Caputo Fractional Derivative in Spiking Neural Network Training | official |
| FGD (CNN BP) | arXiv 2022 | Using a novel fractional-order gradient method for CNN back-propagation | |
| FGNN | Mathematics (MDPI) 2022 | A Regularized Graph Neural Network Based on Approximate Fractional Order Gradients | |
| FracM | Neural Computing and Applications 2022 | A fractional-order momentum optimization approach of deep neural networks | community |
| GFSGD | Chaos, Solitons & Fractals 2022 | Generalized fractional strategy for recommender systems with chaotic ratings behavior | |
| Fractional Derivative Gradient Optimizers (FSGD | Applied Sciences 2022 | Fractional Derivative Gradient-Based Optimizers for Neural Networks and Human Activity Recognition | |
| Fractional LMS (FLMS) | IEEE Transactions on Signal Processing 2022 | Performance Analysis of Fractional Learning Algorithms | |
| Conformable Fractional Gradient Descent | Fuzzy Systems and Data Mining VIII 2022 | Fractional Gradient Descent Learning of Backpropagation Artificial Neural Networks with Conformable Fractional Calculus | |
| Fractional Order Gradient Descent with variable initial value | Neurocomputing 2022 | Study on fast speed fractional order gradient descent method and its application in neural networks | |
| TFGD (Time-fractional) | Axioms 2022 | Training Neural Networks by Time-Fractional Gradient Descent | |
| Variable Order Fractional Gradient Descent | Chinese Control and Decision Conference 2022 | Variable Order Fractional Gradient Descent Method and Its Application in Neural Networks Optimization | |
| CfGD / CfAdam | Neural Networks 2023 | Accelerating gradient descent and Adam via fractional gradients | |
| RFGD | Neural Networks 2023 | A fractional gradient descent algorithm robust to the initial weights of multilayer perceptron | |
| FO-RI-FedAvg | arXiv 2026 | Fractional Order Federated Learning for Battery Electric Vehicle Energy Consumption Modeling | |
| IHL-Adam | Expert Systems with Applications 2024 | Parameter training method for convolutional neural networks based on improved Hausdorff-like derivative | |

Recent advances

| Optimizer | Venue | Paper | Code |
| --- | --- | --- | --- |
| AFOGD / AFOAGD | arXiv 2023 | The Novel Adaptive Fractional Order Gradient Decent Algorithms Design via Robust Control | |
| EFSGD / EN-EFSGD | Chaos, Solitons & Fractals 2023 | Enhanced fractional prediction scheme for effective matrix factorization in chaotic feedback recommender systems | |
| FCGD_G-L | Mathematics 2023 | A Deep Learning Optimizer Based on Grünwald–Letnikov Fractional Order Definition | |
| FGDAM | Applied Mathematics and Computation 2023 | Applications of fractional gradient descent method with adaptive momentum in BP neural networks | |
| FracG | Chinese Control Conference (CCC) 2023 | Optimization Method of Neural Networks via Fractional-Order of Gradients | |
| Fractional Gradient Descent (FSGD) | Fractal and Fractional 2023 | Fractional Gradient Optimizers for PyTorch: Enhancing GAN and BERT | |
| the Improved Stochastic Fractional Order Gradient Descent algorithm | Fractal and Fractional 2023 | The Improved Stochastic Fractional Order Gradient Descent Algorithm | |
| AdaGL | Neural Processing Letters 2024 | An Adaptive Learning Rate Deep Learning Optimizer Using Long and Short-Term Gradients Based on G–L Fractional-Order Derivative | community |
| GFSGD | Heliyon 2024 | Fractional gradient optimized explainable convolutional neural network for Alzheimer's disease diagnosis | |
| FOAdam | Applied Mathematical Modelling 2024 | A novel gradient descent optimizer based on fractional order scheduler and its application in deep neural networks | |
| Adaptive Terminal Caputo Fractional Gradient Descent (AT-CFGD) | TMLR 2024 | Convergence Analysis of Fractional Gradient Descent | |
| Caputo Fractional-Order Gradient Descent | International Journal of Fuzzy Systems 2024 | A Novel Neuro-fuzzy Learning Algorithm for First-Order Takagi–Sugeno Fuzzy Model: Caputo Fractional-Order Gradient Descent Method | |
| FNGD | IEEE Access 2024 | Improving the Accuracy of Neural Network Pattern Recognition by Fractional Gradient Descent | |
| MFFGD | Neurocomputing 2024 | MFFGD: An adaptive Caputo fractional-order gradient algorithm for DNN | |
| Caputo-based SGD (L1 scheme) | OpenReview 2024 | Stochastic Fractional Gradient Descent with Caputo L1 Scheme for Deep Neural Networks | |
| C-FOG | Fractal and Fractional 2024 | Self-Organizing Optimization Based on Caputo's Fractional Order Gradients | |
| CSA-CFGD | PeerJ Computer Science 2024 | Deep ocular tumor classification model using cuckoo search algorithm and Caputo fractional gradient descent | official |
| FGD-RBFNN (UAV) | Computer Modeling in Engineering & Sciences 2024 | Fractional Gradient Descent RBFNN for Active Fault-Tolerant Control of Plant Protection UAVs | |
| FOELM | Applied Soft Computing 2024 | An interval neural network-based Caputo fractional-order extreme learning machine applied to classification | |
| MIF | Algorithms 2024 | An Integer-Fractional Gradient Algorithm for Back Propagation Neural Networks | |
| Multi-layer NN FOGD | Advanced Theory and Simulations 2024 | Convergence Analysis and Application for Multi-Layer Neural Network Based on Fractional-Order Gradient Descent Learning | |
| UCAdam | Journal of Electrical Systems 2024 | Improved Adam: Incorporating Unified Conformable Fractional Derivative for fractional-order Momentum | |
| 2SEDFOSGD | arXiv 2025 | Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems | |
| 2SEDFOSGD | arXiv 2025 | More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems | |
| AFGD (adaptive Caputo FGD for TCN) | Neurocomputing 2025 | Monotonic convergence of adaptive Caputo fractional gradient descent for temporal convolutional networks | |
| FGDSINN | International Journal of Machine Learning and Cybernetics 2025 | A smoothing interval neural networks-based Caputo fractional-order gradient learning algorithm | |
| FOSGD / FOSGDM / FOSGDME | Neural Networks 2025 | Fractional-order stochastic gradient descent method with momentum and energy for deep neural networks | |
| FracGrad | Fractal and Fractional 2025 | FracGrad: A Discretized Riemann–Liouville Fractional Integral Approach to Gradient Accumulation for Deep Learning | |
| GF-SGD | Computers in Biology and Medicine 2025 | Generalized fractional optimization-based explainable lightweight CNN model for malaria disease classification | |
| IFOGD | Neural Networks 2025 | Improved fractional-order gradient descent method based on multilayer perceptron | |
| L2O-CFGD | arXiv 2025 | Enhancing Fractional Gradient Descent with Learned Optimizers | official |
| MOAOCFGD | arXiv 2025 | An Adaptive Order Caputo Fractional Gradient Descent Method for Multi-objective Optimization Problems | |
| NCFDD / NFLightGBM | Information Fusion 2025 | Fractional light gradient boosting machine ensemble learning model: A non-causal fractional difference descent approach | |
| a Caputo fractional-order gradient descent for neural network training | Chaos, Solitons & Fractals 2025 | Fractional-order gradient approach for optimizing neural networks: A theoretical and empirical analysis | |
| Fractional-order SGD (FSGD) | arXiv 2025 | Fractional-order Jacobian Matrix Differentiation and Its Application in Artificial Neural Networks | |
| Adaptive Parameter Fractional-Order Gradient Descent Learning | European Journal of Operational Research 2025 | Novel adaptive parameter fractional-order gradient descent learning for stock selection decision support systems | |
| FAdam | Chaos, Solitons & Fractals 2025 | Parameter training methods for convolutional neural networks with adaptive adjustment method based on Caputo fractional-order differences | |
| SFM | Digital Signal Processing 2025 | A momentum-based stochastic fractional gradient optimizer with U-net model for brain tumor segmentation in MRI | |
| Caputo Fractional-order Gradient Descent for Ridge Polynomial Neural | International Conference on Electronics and Communication, Network and Computer Technology 2025 | A Novel Method for Ridge Polynomial Neural Network-based Caputo Fractional-order Gradient Descent Algorithm | |
| AOFGD | SSRN 2025 | AOFGD: Adaptive order fractional gradient descent method | |
| Frac-Adam | Mathematics 2025 | Fractional Optimizers for LSTM Networks in Financial Time Series Forecasting | |
| Caputo Fractional Gradient Descent | International Conference on Advanced Algorithms and Control Engineering 2025 | Fractional Order Gradient Descent with Caputo Derivatives for Product-Unit Neural Networks | |
| FO-STDGD | Neurocomputing 2025 | Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks | |
| Fractional Order Stochastic Gradient Descent (FOSGD) | ASME IDETC-CIE 2025 | Tail-Index-Awareness in Fractional Order Stochastic Gradient Descent | |
| λ-FAdaMax | Expert Systems with Applications 2025 | λ-FAdaMax: A novel fractional-order gradient descent method with decaying second moment for neural network training | |
| CFDNN | Scientific Reports 2026 | Conformable Fractional Deep Neural Networks (CFDNN) for high-speed cyber-attack detection | |
| CFGD (Compressed) | IEEE Transactions on Neural Networks and Learning Systems 2026 | Fractional Gradient Descent With Matrix Stepsizes for Non-Convex Optimization | official |
| FAdamWav | Fractal and Fractional 2026 | FAdamWav: A Fractional Wavelet Gradient Optimizer for Neural Networks | |
| FOFedAvg | arXiv 2026 | Fractional-Order Federated Learning | |
| Fractional-order FL with adaptive momentum | IEEE Transactions on Emerging Topics in Computational Intelligence 2026 | Communication-Efficient Federated Learning via Fractional-Order Gradient Descent With Adaptive Momentum Under Non-IID Data | |
| TFGD (Tempered) | Neural Networks 2026 | Tempered fractional gradient descent: Theory, algorithms, and robust learning applications | |
| FGD-ED | Information Processing & Management 2026 | Fractional-order gradient descent method based on fractional-order term exponential decay and its application in artificial neural networks | |
| the Caputo Fractional-Order Gradient Descent Method (FGDM) | Applied Soft Computing 2026 | A novel gradient learning algorithm based on zero-order Takagi-Sugeno fuzzy model: the caputo fractional-order gradient descent | |
| CFGD (Conformable) | Journal of Computational and Applied Mathematics 2026 | Conformable fractional gradient descent: A local optimizer for neural network training | |
| NGLFGD | Knowledge-Based Systems 2026 | Fast and accurate fractional order gradient descent algorithm and its application in Extreme Gradient Boosting | |
| FO-Elman | Neural Networks 2026 | Fractional-order gradient descent learning for Elman neural networks | |

Surveys

| Optimizer | Venue | Paper | Code |
| --- | --- | --- | --- |
| Fractional-Order Gradient Descent for Neural Networks | The European Physical Journal Special Topics 2022 | Artificial neural networks: a practical review of applications involving fractional calculus | |
| Fractional Gradient Descent (FGD) | Chaos, Solitons & Fractals 2025 | A comprehensive survey of fractional gradient descent methods and their convergence analysis | |
| the Fractional Continuous Time Method (FCTM) | Journal of Computational and Applied Mathematics 2026 | An overview of the fractional-order gradient descent method and its applications | |

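Note: the FAdam of arXiv 2405.12807 is a Fisher-information variant of Adam and is unrelated to fractional calculus despite the name. It is a different method from the Caputo-based FAdam (Chaos, Solitons & Fractals 2025) listed under Recent advances.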

Distributed and Communication-Efficient Optimizers

Optimizers in this category target training across many devices or nodes, where memory and inter-worker communication are the main bottlenecks. They shard optimizer state, compress gradient exchange, or synchronize infrequently so that training scales without a proportional increase in bandwidth. Some entries are standalone update rules, while others wrap an inner optimizer with a communication-efficient outer loop.
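
Two of these strategies fit in a few lines each. The sketch below is a single-process simulation under simplifying assumptions (plain Python lists as parameters, deterministic gradients, no real networking): `sign_majority_vote` shows one-bit gradient compression with a majority vote in the style of signSGD, and `local_sgd` shows infrequent synchronization by periodic parameter averaging. Both function names are hypothetical and neither is the zij API.

```python
def sign_majority_vote(worker_grads):
    """Each worker transmits only the sign of every gradient coordinate;
    the aggregate is the coordinate-wise majority sign (-1, 0, or +1)."""
    dim = len(worker_grads[0])
    votes = [sum((g[i] > 0) - (g[i] < 0) for g in worker_grads) for i in range(dim)]
    return [(v > 0) - (v < 0) for v in votes]


def local_sgd(grad_fns, w0, lr, local_steps, rounds):
    """Every worker takes `local_steps` SGD steps from the shared parameters,
    then all replicas are averaged: one communication per round."""
    w = list(w0)
    for _ in range(rounds):
        replicas = []
        for grad in grad_fns:                      # one entry per simulated worker
            local = list(w)
            for _ in range(local_steps):
                g = grad(local)
                local = [x - lr * gi for x, gi in zip(local, g)]
            replicas.append(local)
        # Synchronization point: average parameters across workers.
        w = [sum(col) / len(replicas) for col in zip(*replicas)]
    return w


if __name__ == "__main__":
    # Two workers with objectives 0.5 * (w - 1)^2 and 0.5 * (w - 3)^2.
    grads = [lambda w: [w[0] - 1.0], lambda w: [w[0] - 3.0]]
    print(local_sgd(grads, [0.0], lr=0.1, local_steps=5, rounds=50))
    print(sign_majority_vote([[0.5, -1.0], [2.0, -0.1], [-3.0, 0.2]]))
```

In the toy run, two workers with different minimizers communicate once every five steps and still settle near the minimizer of the averaged objective. With heterogeneous curvature this is no longer guaranteed, which is the drift problem several federated entries in the table address.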

| Optimizer | Venue | Paper | Code | zij |
| --- | --- | --- | --- | --- |
| signSGD | ICML 2018 | signSGD: Compressed Optimisation for Non-Convex Problems | official | |
| LD-SGD | arXiv 2019 | Communication-Efficient Local Decentralized SGD Methods | | |
| Local SGD | ICLR 2019 | Local SGD Converges Fast and Communicates Little | community | |
| PowerSGD | NeurIPS 2019 | PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | | |
| Qsparse-local-SGD | NeurIPS 2019 | Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations | | |
| signProx | ICASSP 2019 | signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization | | |
| APMSqueeze | arXiv 2020 | APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm | | |
| DEED-GD | arXiv 2020 | DEED: A General Quantization Scheme for Communication Efficiency in Bits | | |
| FedAC | NeurIPS 2020 | Federated Accelerated Stochastic Gradient Descent | | |
| LAGS-SGD | ECAI 2020 | Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees | | |
| rTop-k | JSAIT 2020 | rTop-k: A Statistical Estimation Approach to Distributed SGD | | |
| SCAFFOLD | ICML 2020 | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning | | |
| SlowMo | ICLR 2020 | SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum | | |
| ZeRO | SC 2020 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | official | |
| 1-bit Adam | ICML 2021 | 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed | official | |
| BVR-L-SGD | ICML 2021 | Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning | | |
| SQuARM-SGD | JSAIT 2021 | SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization | | |
| SketchedAMSGrad | ICDM 2022 | Communication-Efficient Adam-Type Algorithms for Distributed Data Mining | | |
| 0/1 Adam | ICLR 2023 | Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam | official | |
| AdaCGD | TMLR 2023 | Adaptive Compression for Communication-Efficient Distributed Training | | |
| DiLoCo | arXiv 2023 | DiLoCo: Distributed Low-Communication Training of Language Models | community | |
| Distributed Shampoo | arXiv 2023 | A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale | official | |
| SPARQ-SGD | TAC 2023 | SPARQ-SGD: Event-Triggered and Compressed Communication in Decentralized Stochastic Optimization | | |
| AdaFedAdam | TMLCN 2024 | Accelerating Fair Federated Learning: Adaptive Federated Adam | official | |
| DeMo | arXiv 2024 | DeMo: Decoupled Momentum Optimization | official | |
| FADAS | ICML 2024 | FADAS: Towards Federated Adaptive Asynchronous Optimization | official | |
| FAGH | arXiv 2024 | FAGH: Accelerating Federated Learning with Approximated Global Hessian | | |
| Fed-Sophia | ICC 2024 | Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm | | |
| FedLion | ICASSP 2024 | FedLion: Faster Adaptive Federated Optimization with Fewer Communication | official | |
| FedRepOpt | ACCV 2024 | FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning | official | |
| FedSTaS | arXiv 2024 | FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning | official | |
| FESS-GDA | AISTATS 2024 | Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization | | |
| FLeNS | BigData 2024 | FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch | official | |
| MM-PSGD / MC-PSGD | MMAsia-W 2024 | Distributed Optimization over Block-Cyclic Data | | |
| OpenDiLoCo | arXiv 2024 | OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training | official | |
| ADEF | arXiv 2025 | Accelerated Distributed Optimization with Compression and Error Feedback | | |
| DAT-SGD | ICML 2025 | Enhancing Parallelism in Decentralized Stochastic Convex Optimization | | |
| DeCo-SGD | arXiv 2025 | Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training | | |
| DES-LOC | arXiv 2025 | DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models | | |
| Dion | arXiv 2025 | Dion: Distributed Orthonormalized Updates | official | |
| DLAS-R-FTC | CDC 2025 | Distributed Optimization and Learning for Automated Stepsize Selection with Finite Time Coordination | | |
| FAdamGC | arXiv 2025 | Gradient Correction in Federated Learning with Adaptive Optimization | | |
| FedCET | arXiv 2025 | Communication Efficient Federated Learning with Linear Convergence on Heterogeneous Data | | |
| FedIvon | TMLR 2025 | Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization | | |
| FedMuon | arXiv 2025 | FedMuon: Accelerating Federated Learning with Matrix Orthogonalization | official | |
| FedOne | ICML 2025 | FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning | | |
| HybridSGD | arXiv 2025 | Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization | | |
| Kuramoto-FedAvg | arXiv 2025 | Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity | official | |
| LQ-SGD | arXiv 2025 | Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm | | |
| Muon | arXiv 2025 | Muon is Scalable for LLM Training | official | Muon |
| pFedSOP | arXiv 2025 | pFedSOP: Accelerating Training Of Personalized Federated Learning Using Second-Order Optimization | | |
| LT-ADMM | TAC 2026 | Communication-Efficient Stochastic Distributed Learning | | |
| Ringleader ASGD | ICLR 2026 | Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity | | |
| DECA | arXiv 2026 | DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data | | |
| Ringmaster LMO | arXiv 2026 | Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method | | |
| SignMuon | arXiv 2026 | SignMuon: Communication-Efficient Distributed Muon Optimization | | |
| Orth-Dion | arXiv 2026 | Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization | | |
| EF21-Muon | arXiv 2025 | Error Feedback for Muon and Friends | | |
| MuonBP | ICLR 2026 | MuonBP: Faster Muon via Block-Periodic Orthogonalization | | |
| CurvaDion | arXiv 2025 | CurvaDion: Curvature-Adaptive Distributed Orthonormalization | | |
| Quasi-Newton FL with Error Feedback | OPT 2025: Optimization for Machine Learning (NeurIPS 2025 Workshop) | Quasi-Newton Methods for Federated Learning with Error Feedback | | |
| DeMuon | arXiv 2025 | DeMuon: A Decentralized Muon for Matrix Optimization over Graphs | | |
| HeLoCo | arXiv 2026 | HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity | | |
| Decoupled DiLoCo | arXiv 2026 | Decoupled DiLoCo for Resilient Distributed Pre-training | | |
| Partial Parameter Updates | arXiv 2025 | Partial Parameter Updates for Efficient Distributed Training | | |
| SparseLoCo | arXiv 2025 | Communication Efficient LLM Pre-training with SparseLoCo | official | |
| GASLoC | arXiv 2026 | Unifying Local Communications and Local Updates for LLM Pretraining | | |
| MG-ADSGD | arXiv 2026 | Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization | | |
| Local MixVR | arXiv 2026 | Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning | | |
| LOSCAR-SGD | arXiv 2026 | LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging | | |
| HEW-Local SGD | arXiv (math.OC) 2026 | Heterogeneous-Horizon Exact-Weight Local SGD | | |
| CAPTAIN (C-ALADIN) | arXiv 2026 | A Global Convergence Analysis of Consensus ALADIN for Convex Optimization | | |
| FedPAC | arXiv 2026 | Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data | official | |
| FedAdamW | AAAI 2026 | FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models | official | |
| LoRDO | arXiv 2026 | LoRDO: Distributed Low-Rank Optimization with Infrequent Communication | | |

Second-Order and Orthogonalized Optimizers

Second-order and orthogonalized optimizers exploit curvature information or the matrix structure of gradients rather than purely elementwise first-order statistics. This group spans quasi-Newton and Hessian-diagonal methods (L-BFGS, AdaHessian, Sophia), full-matrix and Kronecker-factored preconditioning (PSGD, Shampoo, SOAP), and orthogonalized-update methods in the Muon family. Venues reflect peer-reviewed acceptance where applicable; otherwise the arXiv year is listed.
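
To illustrate what an orthogonalized update means, the sketch below approximates the orthogonal polar factor of a small gradient matrix with the textbook cubic Newton-Schulz iteration. Muon-family methods use their own tuned polynomial variants, typically applied to a momentum buffer, so this is an illustration of the idea and not the implementation behind zij.Muon. It assumes a nonzero, full-rank matrix stored as nested lists, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product for nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the orthogonal polar factor of g using the cubic
    Newton-Schulz iteration X <- 1.5 * X - 0.5 * X @ X.T @ X."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


if __name__ == "__main__":
    q = newton_schulz_orthogonalize([[3.0, 1.0], [0.0, 2.0]])
    print(matmul(q, transpose(q)))  # close to the identity matrix
```

The result keeps the singular directions of the input but pushes every singular value toward 1, so the step size no longer depends on how the gradient's energy is spread across directions. The iteration uses only matrix products, which is why it is cheap on accelerators.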

| Optimizer | Venue | Paper | Code | zij |
| --- | --- | --- | --- | --- |
| Gauss-Newton Method | Biometrika 1974 | Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method | | |
| Newton's Method | ANL Technical Report 1982 | Newton's method (ANL-82-8) | | |
| L-BFGS | Mathematical Programming 1989 | On the limited memory BFGS method for large scale optimization | official | LBFGS |
| Natural Gradient | Neural Computation 1998 | Natural Gradient Works Efficiently in Learning | | |
| K-FAC | ICML 2015 | Optimizing Neural Networks with Kronecker-factored Approximate Curvature | | |
| PSGD | IEEE TNNLS 2018 | Preconditioned Stochastic Gradient Descent | official | |
| Shampoo | ICML 2018 | Shampoo: Preconditioned Stochastic Tensor Optimization | official | Shampoo |
| AdaHessian | AAAI 2021 | ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | official | Adahessian |
| Apollo | arXiv 2020 | Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | official | |
| K-BFGS / K-BFGS(L) | NeurIPS 2020 | Practical Quasi-Newton Methods for Training Deep Neural Networks | | |
| SGN | arXiv 2020 | On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs | | |
| SpiderSQN | IEEE TNNLS 2022 | Faster Stochastic Quasi-Newton Methods | | |
| TKFAC | AAAI 2021 | A Trace-restricted Kronecker-Factored Approximation to Natural Gradient | | |
| SGDHess | NeurIPS 2022 | Better SGD using Second-order Momentum | | |
| SketchySGD | SIMODS 2024 | SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates | official | |
| Distributed Shampoo | arXiv 2023 | A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale | official | |
| mL-BFGS | TMLR 2023 | mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization | | |
| Sophia | ICLR 2024 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | official | SophiaG |
| AdaFisher | ICLR 2025 | AdaFisher: Adaptive Second Order Optimization via Fisher Information | official | |
| CRNAS | arXiv 2024 | Novel Optimization Techniques for Parameter Estimation | | |
| HesScale | ICML 2024 | Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning | official | |
| Muon | Blog post 2024 | Muon: An optimizer for hidden layers in neural networks | official | Muon |
| NysAct | IEEE BigData 2024 | NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation | | |
| OptiQ | arXiv 2024 | Second-Order Optimization via Quiescence | | |
| Q-Newton | arXiv 2024 | Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent | official | |
| SOAA | arXiv 2024 | Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods | | |
| SOAP | ICLR 2025 | SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling | official | SOAP |
| AdaDiag | arXiv 2025 | Improving Adaptive Moment Optimization via Preconditioner Diagonalization | | |
| ADAGB2 | arXiv 2025 | Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained Optimization | | |
| AdaGO | arXiv 2025 | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | | |
| AdaMuon | arXiv 2025 | AdaMuon: Adaptive Muon Optimizer | official | AdaMuon |
| ASGO | NeurIPS 2025 | ASGO: Adaptive Structured Gradient Optimization | official | |
| AuON | arXiv 2025 | AuON: A Linear-time Alternative to Orthogonal Momentum Updates | official | |
| COSMOS | arXiv 2025 | COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs | official | |
| FUSE | IEEE CAI 2025 | FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization | | |
| Hessian-aware Scaling | arXiv 2025 | First-ish Order Methods: Hessian-aware Scalings of Gradient Descent | | |
| MAC | IEEE ICDM 2025 | MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature | | |
| MuonClip | arXiv 2025 | Kimi K2: Open Agentic Intelligence | community | |
| NorMuon | ICML 2026 | NorMuon: Making Muon more efficient and scalable | official | NorMuon |
| OCAR | ICML 2025 | Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning | | |
| PolarGrad | arXiv 2025 | PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective | official | PolarGrad |
| ROOT | arXiv 2025 | ROOT: Robust Orthogonalized Optimizer for Neural Network Training | official | |
| S-BFGS | arXiv 2025 | Efficient Stochastic BFGS methods Inspired by Bayesian Principles | | |
| SASSHA | ICML 2025 | SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation | official | |
| Scion | ICML 2025 | Training Deep Learning Models with Norm-Constrained LMOs | official | Scion |
| SPlus | arXiv 2025 | A Stable Whitening Optimizer for Efficient Neural Network Training | official | SPlus |
| Muon^2 | arXiv 2026 | Muon^2: Boosting Muon via Adaptive Second-Moment Preconditioning | | |
| Nora | arXiv 2026 | Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer | | |
| Pion | arXiv 2026 | Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR | | |
| Spectral Sphere Optimizer (SSO) | arXiv 2026 | Controlled LLM Training on Spectral Sphere | official | |
| LoRA-Muon | arXiv 2026 | LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold | | |
| FOAM | arXiv 2026 | FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo | | |
| Mousse | arXiv 2026 | Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning | official | |
| FISMO | arXiv 2026 | FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer | | |
| DyKAF | arXiv 2025 | DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning | | |
| Double Preconditioning (DoPr) | arXiv 2026 | Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss | | |
| AdaCubic | TMLR 2026 | AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning | official | |
| IFNSO | arXiv 2026 | IFNSO: Iteration-Free Newton-Schulz Orthogonalization | official | |
| CAO | arXiv preprint 2025 | CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching | | |
| Turbo-Muon | arXiv 2025 | Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning | official | |
| SR1 Cubic Quasi-Newton | arXiv 2025 | Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization | | |
| KL-Shampoo | ICLR 2026 | Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization | official | |
| LLQR | arXiv 2026 | Layerwise LQR for Geometry-Aware Optimization of Deep Networks | official | |
| Freon / Kaon | arXiv 2026 | Muon is Not That Special: Random or Inverted Spectra Work Just as Well | | |
| Mano | arXiv 2026 | Mano: Restriking Manifold Optimization for LLM Training | official | |
| Atlas | OPT 2025: 17th Annual Workshop on Optimization for Machine Learning (co-located with NeurIPS 2025) | Atlas – Rethinking Optimizer Design for Stability and Speed | | |

Zeroth-Order Optimizers

Zeroth-order (gradient-free) methods train models using only function evaluations, estimating gradients from randomized perturbations of the parameters instead of backpropagation. Because they need no backward pass or activation storage, they run at roughly inference-level memory, which has made them a practical option for fine-tuning large language models on constrained hardware. The lineage runs from SPSA in classical stochastic approximation to recent variance-reduced and low-rank variants built on MeZO.
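
The core estimator is short enough to sketch. The function below takes one two-point step in the spirit of SPSA and MeZO: it perturbs the parameters along a random direction, evaluates the loss twice, and steps along that same direction. Regenerating the direction from a seed, rather than storing it, is the memory-saving trick MeZO describes. The name `zo_step`, the Rademacher directions, and the default hyperparameters are illustrative assumptions, not the zij API.

```python
import random


def zo_step(loss, theta, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order step. Updates `theta` (a list of floats)
    in place and returns the estimated directional derivative."""

    def perturb(scale):
        # Re-seeding reproduces the same direction z on every call, so z is
        # never stored alongside the parameters.
        rng = random.Random(seed)
        for i in range(len(theta)):
            z_i = rng.choice((-1.0, 1.0))   # Rademacher direction, as in SPSA
            theta[i] += scale * z_i

    perturb(+eps)
    loss_plus = loss(theta)                 # loss at theta + eps * z
    perturb(-2.0 * eps)
    loss_minus = loss(theta)                # loss at theta - eps * z
    perturb(+eps)                           # restore the original theta
    g = (loss_plus - loss_minus) / (2.0 * eps)
    perturb(-lr * g)                        # theta <- theta - lr * g * z
    return g


if __name__ == "__main__":
    def quadratic(t):
        return sum(v * v for v in t)

    theta = [1.0, -2.0, 3.0, 0.5]
    print("start:", quadratic(theta))
    for step in range(200):
        zo_step(quadratic, theta, seed=step)  # fresh direction each step
    print("end:  ", quadratic(theta))
```

Only forward evaluations of `loss` are needed, and the extra state beyond the parameters is one integer seed and one scalar. The price is variance: each step follows a single random direction, which is what the variance-reduced and low-rank variants in the table try to reduce.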

| Optimizer | Venue | Paper | Code | zij |
|---|---|---|---|---|
| SPSA | IEEE Transactions on Automatic Control 1992 | Multivariate stochastic approximation using a simultaneous perturbation gradient approximation | official | |
| Evolution Strategies | arXiv 2017 | Evolution Strategies as a Scalable Alternative to Reinforcement Learning | official | |
| ZO-AdaMM | NeurIPS 2019 | ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | official | |
| MeZO | NeurIPS 2023 | Fine-Tuning Language Models with Just Forward Passes | official | |
| DeepZero | ICLR 2024 | DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training | official | |
| LeZO | arXiv 2024 | Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models | official | |
| MeZO-SVRG | ICML 2024 | Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models | official | |
| ZO-AdaMU | AAAI 2024 | ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization | official | |
| ZoPro | CDC 2024 | A Zeroth-Order Proximal Algorithm for Consensus Optimization | | |
| Addax | ICLR 2025 | Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models | official | |
| DiZO | NeurIPS 2025 | Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning | official | |
| ElasticZO | arXiv 2025 | ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization | | |
| HELENE | EMNLP 2025 | HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization | | |
| KerZOO | arXiv 2025 | KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning | | |
| LORENZA | arXiv 2025 | LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM | | |
| LOZO | ICLR 2025 | Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures | official | |
| MaZO | arXiv 2025 | MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models | | |
| QuZO | EMNLP 2025 | QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models | official | |
| R-AdaZO | ICML 2025 | Refining Adaptive Zeroth-Order Optimization at Ease | official | |
| Sparse MeZO | NeurIPS 2025 | Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning | official | |
| SubZero | ICCV 2025 | Zeroth-Order Fine-Tuning of LLMs in Random Subspaces | official | |
| TeZO | arXiv 2025 | TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs | | |
| VAMO | arXiv 2025 | VAMO: Efficient Zeroth-Order Variance Reduction for SGD with Faster Convergence | | |
| VR-SZD | arXiv 2025 | A Structured Proximal Stochastic Variance Reduced Zeroth-order Algorithm | official | |
| ZO-SAH | arXiv 2025 | Subspace-based Approximate Hessian Method for Zeroth-Order Optimization | | |
| ZO2 | COLM 2025 | ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory | official | |
| ZOQO | ICASSP 2025 | ZOQO: Zero-Order Quantized Optimization | | |
| AdaMeZO | arXiv 2026 | AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments | official | |
| FZOO | ICLR 2026 | FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed | official | |
| MEAZO | arXiv 2026 | On Adaptivity in Zeroth-Order Optimization | | |
| QZO | ICLR 2026 | Fine-tuning Quantized Neural Networks with Zeroth-order Optimization | official | |
| GRZO | arXiv 2026 | GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning | | |
| AGZO | ICML 2026 | AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning | | |
| ZO-MOPI | arXiv 2026 | Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration | official | |
| ZO-Muon | arXiv 2026 | Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization | official | |
| RLR (Recursive Likelihood Ratio) | ICLR 2026 | Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer | official | |
| ZO Fine-tuner | arXiv (accepted to ICML 2026) 2025 | Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs | official | |

Privacy-Preserving Optimizers

Privacy-preserving optimizers train models under differential privacy, typically by clipping per-sample gradients and adding calibrated noise to updates. This section lists differentially private optimization methods and reference libraries, from the original DP-SGD to later variants that reduce clipping bias, correct moment estimates, or filter privacy noise.
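
The clip-then-noise recipe can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming Gaussian noise with standard deviation `noise_multiplier * clip_norm` added to the clipped sum. The function name `dp_sgd_step` and its parameters are made up for the example, and it performs no privacy accounting, so it says nothing about the resulting (ε, δ) guarantee.

```python
import math
import random


def dp_sgd_step(params, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update (illustrative, no privacy accounting)."""
    clipped_sum = [0.0] * len(params)
    for g in per_sample_grads:
        norm = math.sqrt(sum(gi * gi for gi in g))
        # Scale each per-sample gradient so its L2 norm is at most clip_norm.
        factor = min(1.0, clip_norm / (norm + 1e-12))
        for i, gi in enumerate(g):
            clipped_sum[i] += factor * gi
    batch_size = len(per_sample_grads)
    sigma = noise_multiplier * clip_norm
    # Add Gaussian noise to the clipped sum, then average over the batch.
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in clipped_sum]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first sample exceeds the clip norm
new_params = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
```

Clipping bounds how much any single example can move the update, which is what lets the noise scale be calibrated to `clip_norm`.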

| Optimizer | Venue | Paper | Code |
|---|---|---|---|
| DP-SGD | CCS 2016 | Deep Learning with Differential Privacy | official |
| DP-LSSGD | MSML 2020 | DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM | official |
| DP-PASGD | arXiv 2020 | Differentially Private Federated Learning for Resource-Constrained Internet of Things | |
| DP-SGD-JL | NeurIPS 2021 | Fast and Memory Efficient Differentially Private-SGD via JL Projections | |
| Opacus | arXiv 2021 | Opacus: User-Friendly Differential Privacy Library in PyTorch | official |
| A(DP)²SGD | TPAMI 2022 | A(DP)²SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy | |
| DPIS | CCS 2022 | DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling | |
| Top-DP | TCSVT 2022 | Topology-aware Differential Privacy for Decentralized Image Classification | |
| ANSGD | arXiv 2023 | Learning across Data Owners with Joint Differential Privacy | |
| DP-FedSAM | CVPR 2023 | Make Landscape Flatter in Differentially Private Federated Learning | official |
| AClipped-dpSGD | Machine Learning 2024 | Efficient Private SCO for Heavy-Tailed Data via Averaged Clipping | |
| DiceSGD | ICLR 2024 | Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach | official |
| DOPPLER | NeurIPS 2024 | DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction | |
| DP-AdamBC | AAAI 2024 | DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) | official |
| FedLAP-DP | PoPETs 2024 | FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations | official |
| DC-SGD | TIFS 2025 | DC-SGD: Differentially Private SGD with Dynamic Clipping through Gradient Norm Distribution Estimation | |
| DP-AdamW | ICML Workshop 2025 | DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning | |
| DP-MicroAdam | arXiv 2025 | DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning | |
| DPZV | arXiv 2025 | Communication-Efficient and Differentially Private Vertical Federated Learning with Zeroth-Order Optimization | |
| GeoDP | ICDE 2025 | Analyzing and Optimizing Perturbation of DP-SGD Geometrically | official |
| Interleaved-ShuffleG | arXiv 2025 | Improving the Convergence of Private Shuffled Gradient Methods with Public Data | |
| Logit-DP | ICLR 2025 | Differentially Private Optimization for Non-Decomposable Objective Functions | |
| SPARTA | KDD 2025 | SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning | official |
| DP-λCGD | arXiv 2026 | DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training | |
| PINA | ICASSP 2026 | Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation | |
| RaCO-DP | ICLR 2026 | Private Rate-Constrained Optimization with Applications to Fair Learning | official |
| DP-MacAdam | arXiv 2026 | DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum | |
| FO-DP-SGD | arXiv 2026 | Deep Learning under Fractional-Order Differential Privacy | |
| Hyperparameter-free DP optimization (GeN-DP) | ICLR 2025 | Towards hyperparameter-free optimization with differential privacy | |
| DP-Muon | arXiv 2026 | DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum | |
| TP-TopK | arXiv 2026 | When Do Fewer Coordinates Suffice in DP-SGD? | |
| DPDL | arXiv 2026 | DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data | |
| DP-SGD-RC | ICML 2026 | Efficient DP-SGD for LLMs with Randomized Clipping | |
| PRISM | ICML 2026 | PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA | |
| SMA-DP-SGD | arXiv 2026 | SMA-DP: Spectral Memory-Aware Differential Privacy for Deep Learning | |
| FiBeR | arXiv 2026 | FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction | |
| DP-KFC | ICML 2026 | DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning | official |
| DP-FedAdamW | CVPR 2026 | DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models | |
| Lap2 | IEEE CSF 2026 | Lap2: Revisiting Laplace DP-SGD for High Dimensions via Majorization Theory | official |
| Clip21-SGD2M | arXiv 2025 | Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy | |

Sharpness-Aware Optimizers

Sharpness-aware methods seek parameters that lie in neighborhoods with uniformly low loss rather than at isolated minima, which tends to improve generalization. Introduced by SAM (Foret et al., ICLR 2021), these methods wrap a base optimizer such as SGD or AdamW and add a gradient ascent perturbation step before the descent update. Later work makes the perturbation scale-invariant, closes the surrogate gap, reweights the sharpness term, amortizes the extra forward-backward cost, or extends the idea to second-order optimization.
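
The two-step structure (ascend to a nearby high-loss point, then descend using the gradient found there) can be sketched on a toy problem. This is a minimal illustration with plain SGD as the base update. `sam_step` and `grad_quadratic` are hypothetical names for this example and do not reflect the call protocol of `zij.SAM`.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update with plain SGD as the base step (illustrative)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascent: step a distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Second gradient evaluation, taken at the perturbed point.
    g_adv = grad_fn(w_adv)
    # Descent: apply the perturbed-point gradient to the original weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def grad_quadratic(w):
    """Gradient of the toy loss f(w) = 0.5 * (w0**2 + 10 * w1**2)."""
    return [w[0], 10.0 * w[1]]


w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w, grad_quadratic, lr=0.05, rho=0.05)
```

Each update costs two gradient evaluations. That doubling is the overhead the efficiency-oriented variants in the table below try to amortize.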

| Optimizer | Venue | Paper | Code | zij |
|---|---|---|---|---|
| SAM | ICLR 2021 | Sharpness-Aware Minimization for Efficiently Improving Generalization | community | SAM |
| ASAM | ICML 2021 | ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks | community | ASAM |
| ESAM | ICLR 2022 | Efficient Sharpness-aware Minimization for Improved Training of Neural Networks | | |
| GSAM | ICLR 2022 | Surrogate Gap Minimization Improves Sharpness-Aware Training | official | GSAM |
| LookSAM | CVPR 2022 | Towards Efficient and Scalable Sharpness-Aware Minimization | community | LookSAM |
| AE-SAM | ICLR 2023 | An Adaptive Policy to Employ Sharpness-Aware Minimization | | |
| bSAM | ICLR 2023 | SAM as an Optimal Relaxation of Bayes | official | |
| GAM | CVPR 2023 | Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization | | |
| WSAM | KDD 2023 | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | official | WSAM |
| AdaSAM | Neural Networks 2024 | AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks | | |
| F-SAM | CVPR 2024 | Friendly Sharpness-Aware Minimization | official | |
| FGSAM | NeurIPS 2024 | Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification | | |
| Lookbehind-SAM | ICML 2024 | Lookbehind-SAM: k steps back, 1 step forward | | |
| MSAM | arXiv 2024 | Momentum-SAM: Sharpness Aware Minimization without Computational Overhead | official | |
| SAMPa | NeurIPS 2024 | SAMPa: Sharpness-aware Minimization Parallelized | | |
| AsyncSAM | arXiv 2025 | Asynchronous Sharpness-Aware Minimization For Fast and Accurate Deep Learning | | |
| GCSAM | arXiv 2025 | GCSAM: Gradient Centralized Sharpness Aware Minimization | official | |
| LightSAM | arXiv 2025 | LightSAM: Parameter-Agnostic Sharpness-Aware Minimization | | |
| SASSHA | ICML 2025 | SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation | official | |
| SSAM | JMLR 2025 | Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy | | |
| SAM-Polyak (Adaptive SAM with Polyak step size) | ICML 2026 | Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler | official | |
| X-SAM | arXiv 2026 | X-SAM: Boosting Sharpness-Aware Minimization with Dominant-Eigenvector Gradient Correction | | |
| M-SAM (Modality-Aware SAM) | NeurIPS 2025 | Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning | | |
| ZSharp (SAM with Z-Score Gradient Filtering) | NeurIPS 2025 OPT Workshop (also accepted to ICASSP 2026) | Sharpness-Aware Minimization with Z-Score Gradient Filtering | official | |
| Focal-SAM | ICML 2025 | Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification | official | |
| Functional SAM | ICML 2025 | Avoiding spurious sharpness minimization broadens applicability of SAM | | |
| FedGMT | ICML 2025 | One Arrow, Two Hawks: Sharpness-aware Minimization for Federated Learning via Global Model Trajectory | official | |
| LE-SAM | ICML 2026 | Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization | | |

Quantum and Quantum-Inspired Optimizers

This section collects optimizers from two adjacent settings. The first is the optimization of variational quantum circuits, where shot noise and the quantum geometry of the parameter space drive the design of measurement-frugal, gradient-free, and natural-gradient methods. The second is quantum-inspired and quantum-hardware optimization of classical neural networks, where quantum fluctuations, adiabatic evolution, or annealer sampling replace or augment the classical training loop.

Optimizers for variational quantum circuits

| Optimizer | Venue | Paper | Code |
|---|---|---|---|
| SPSA | IEEE Transactions on Automatic Control 1992 | Multivariate stochastic approximation using a simultaneous perturbation gradient approximation | official |
| iCANS | Quantum 2020 | An Adaptive Optimizer for Measurement-Frugal Variational Algorithms | community |
| NFT | Physical Review Research 2020 | Sequential minimal optimization for quantum-classical hybrid algorithms | official |
| Quantum Natural Gradient | Quantum 2020 | Quantum Natural Gradient | official |
| Rosalin | arXiv 2020 | Operator Sampling for Shot-frugal Optimization in Variational Algorithms | community |
| QN-SPSA | Quantum 2021 | Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information | official |
| Rotosolve / Rotoselect | Quantum 2021 | Structure optimization for parameterized quantum circuits | community |
| Quantum Analytic Descent | Physical Review Research 2022 | Quantum Analytic Descent | official |
| SGLBO | npj Quantum Information 2022 | Stochastic gradient line Bayesian optimization for efficient noise-robust optimization of parameterized quantum circuits | community |
| SantaQlaus | arXiv 2023 | SantaQlaus: A resource-efficient method to leverage quantum shot-noise for optimization of variational quantum algorithms | |
| ExcitationSolve | Communications Physics 2025 | Fast gradient-free optimization of excitations in variational quantum eigensolvers | official |
| Kernel Descent | Scientific Reports 2025 | Introducing the kernel descent optimizer for variational quantum algorithms | |
| QUIVER | arXiv 2026 | Adaptive directional gradients for parameterised quantum circuits | |
| WSBD | AISTATS 2026 | WSBD: Freezing-Based Optimizer for Quantum Neural Networks | official |
| H-QNG | arXiv 2025 | Efficient Hamiltonian-aware Quantum Natural Gradient Descent for Variational Quantum Eigensolvers | |
| WA-QNG | Quantum Science and Technology 2026 | Weighted Approximate Quantum Natural Gradient for Variational Quantum Eigensolver | |
| CQNG | EPJ Quantum Technology 2025 | Modified Conjugate Quantum Natural Gradient | |
| Momentum-QNG | Physica A 2024 | Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm | official |
| qBang | Quantum 2024 | Optimizing Variational Quantum Algorithms with qBang: Efficiently Interweaving Metric and Momentum to Navigate Flat Energy Landscapes | official |
| EGT (Exact Geodesic Transport) | arXiv 2025 | Quantum optimization with exact geodesic transport | |
| TGF / TGFQS | arXiv 2026 | Two-Gate Extensions of Free Axis and Free Quaternion Selection for Sequential Optimization of Parameterized Quantum Circuits | |
| SGD (Superpositional Gradient Descent) | IEEE QAI 2025 | Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training | |
| Scalable On-Hardware QNN training (parallelised parameter-shift rule) | arXiv 2026 | Scalable On-Hardware Training of Quantum Neural Networks and Application to Clinical Data Imputation | |
| QM-quantization optimizer (Schrodinger gradient-flow) | arXiv 2026 | Quantum mechanical framework for quantization-based optimization: from Gradient flow to Schroedinger equation | |

Quantum-inspired and quantum-hardware methods

| Optimizer | Venue | Paper | Code |
|---|---|---|---|
| Quantum Adam | Scientific Reports 2018 | Optimization of neural networks via finite-value quantum fluctuations | |
| RBM training on a D-Wave annealer | Frontiers in Physics 2021 | Training Restricted Boltzmann Machines With a D-Wave Quantum Annealer | |
| Quantum Hamiltonian Descent (QHD) | arXiv 2023 | Quantum Hamiltonian Descent | official |
| Universal AQC neural-network training | Frontiers in Artificial Intelligence 2024 | Training neural networks with universal adiabatic quantum computing | |
| QHDOPT | INFORMS Journal on Computing 2025 | QHDOPT: A Software for Nonlinear Optimization with Quantum Hamiltonian Descent | official |
| Stochastic Quantum Hamiltonian Descent (SQHD) | arXiv 2025 | Stochastic Quantum Hamiltonian Descent | |
| QIASO | AIMS Mathematics 2026 | The quantum-inspired adaptive superposition optimization for neural network training | |

Learning-Rate-Free Optimizers

Learning-rate-free (also called parameter-free or tuning-free) optimizers select their step size automatically during training instead of requiring a manually tuned learning rate. Most methods in this family estimate a quantity such as the distance from the initial point to the solution and set the effective step size from observed gradients, while others wrap an existing base optimizer and tune its global scale factor online. The goal is to match the performance of a well-tuned baseline without a learning-rate search.
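
To make the distance-based idea concrete, here is a simplified sketch in the spirit of DoG (distance over gradients): the step size is the largest distance travelled from the starting point so far, divided by the square root of the accumulated squared gradient norms. It is a toy on plain Python lists under stated assumptions: `dog_sgd`, `grad_fn`, and the default `r_eps` are names and values chosen for this example, and the iterate averaging and per-layer variant from the paper are left out.

```python
import math


def dog_sgd(grad_fn, x0, steps=500, r_eps=1e-4):
    """Simplified DoG-style SGD with no learning rate argument (illustrative)."""
    x = list(x0)
    x0_norm = math.sqrt(sum(v * v for v in x0))
    # Small initial movement scale, used until the iterate has moved.
    max_dist = r_eps * (1.0 + x0_norm)
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        grad_sq_sum += sum(gi * gi for gi in g)
        if grad_sq_sum == 0.0:
            break  # every gradient so far is zero, so there is nothing to do
        # Step size: max distance from x0 over root of summed squared norms.
        eta = max_dist / math.sqrt(grad_sq_sum)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        dist = math.sqrt(sum((xi - si) ** 2 for xi, si in zip(x, x0)))
        max_dist = max(max_dist, dist)
    return x


def grad_fn(x):
    """Gradient of the toy loss 0.5 * ||x - target||^2 with target (3, -2)."""
    return [x[0] - 3.0, x[1] + 2.0]


solution = dog_sgd(grad_fn, [0.0, 0.0])
```

The step size starts tiny and grows as the iterate moves away from its initial point, so the only remaining knob is the initial scale `r_eps`, which is meant to be small rather than tuned.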

| Optimizer | Venue | Paper | Code | zij |
|---|---|---|---|---|
| AdGD | ICML 2020 | Adaptive Gradient Descent without Descent | official | |
| ALI-G | ICML 2020 | Training Neural Networks for and by Interpolation | official | |
| AdaBFE | arXiv 2022 | BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization | | |
| D-Adaptation | ICML 2023 | Learning-Rate-Free Learning by D-Adaptation | official | DAdaptSGD, DAdaptAdam |
| DoG | ICML 2023 | DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule | official | DoG, LDoG |
| Mechanic | NeurIPS 2023 | Mechanic: A Learning Rate Tuner | official | mechanize |
| Adam++ | arXiv 2024 | Towards Simple and Provable Parameter-Free Adaptive Gradient Methods | | |
| MoMo | ICML 2024 | MoMo: Momentum Models for Adaptive Learning Rates | official | Momo, MomoAdam |
| Prodigy | ICML 2024 | Prodigy: An Expeditiously Adaptive Parameter-Free Learner | official | Prodigy |
| AdamG | arXiv 2024 | Towards Stability of Parameter-free Optimization | community | AdamG |
| TRAC | NeurIPS 2024 | Fast TRAC: A Parameter-Free Optimizer for Lifelong Reinforcement Learning | official | TRAC |
| Accelerated GRAAL | arXiv 2025 | Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization | | |
| AutoSGD | arXiv 2025 | AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent | | |
| EAGLE | arXiv 2025 | eagle: early approximated gradient based learning rate estimator | | |
| ScheduleFree+ | arXiv 2026 | ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models | official | |
| AMUSE | arXiv 2026 | AMUSE: Anytime Muon with Stable Gradient Evaluation | | |
| Adaptive Polyak Steps (SF-SGD / SF-Adam) | arXiv 2025 | Taking the Road Less Scheduled with Adaptive Polyak Steps | | |
| GGD (Geodesic Gradient Descent) | arXiv 2026 | Geodesic Gradient Descent: A Generic and Learning-rate-free Optimizer on Objective Function-induced Manifolds | | |
| Accelerated Distance-adaptive Method (DoG-lineage) | NeurIPS 2025 | Accelerated Distance-adaptive Methods for Hölder Smooth and Convex Optimization | | |
| GeN | ICLR 2025 | Gradient descent with generalized Newton's method | official | |
| DoWG | NeurIPS 2023 | DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method | official | |
| U-DoG | COLT 2024 | Accelerated Parameter-Free Stochastic Optimization | | |
| Sign-SGD via Parameter-Free Optimization | ICLR 2026 | Sign-SGD via Parameter-Free Optimization | | |
| OptEMA | arXiv 2026 | OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality | | |

Learning Rate Schedulers

zij.core.lr_scheduler vendors the PyTorch core learning rate schedulers under their original class names. The first table lists every vendored class, including the LRScheduler base class, with the published work it derives from where one exists. The second table covers notable schedules from the literature that zij does not yet implement.

In zij

| Scheduler | Origin |
|---|---|
| ChainedScheduler | |
| ConstantLR | |
| CosineAnnealingLR | Loshchilov & Hutter ICLR 2017 (SGDR) |
| CosineAnnealingWarmRestarts | Loshchilov & Hutter ICLR 2017 (SGDR) |
| CyclicLR | Smith WACV 2017 (cyclical learning rates) |
| ExponentialLR | |
| LambdaLR | |
| LinearLR | |
| LRScheduler | |
| MultiplicativeLR | |
| MultiStepLR | |
| OneCycleLR | Smith & Topin 2019 (super-convergence) |
| PolynomialLR | |
| ReduceLROnPlateau | |
| SequentialLR | |
| StepLR | |
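
As a reference point for what these classes compute, the SGDR cosine schedule has a simple closed form. The function below is a standalone sketch of that formula. `cosine_annealing` and its argument names are illustrative, and clamping past `total_steps` is a choice made here, not necessarily what the vendored `CosineAnnealingLR` does.

```python
import math


def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Closed-form cosine decay from lr_max to lr_min (illustrative)."""
    # Clamp so the rate stays at lr_min once total_steps is reached.
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [cosine_annealing(s, 100, lr_max=0.1, lr_min=0.001) for s in (0, 50, 100)]
```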

Notable schedules elsewhere

| Scheduler | Venue | Paper | Code | zij |
|---|---|---|---|---|
| Inverse square root | NeurIPS 2017 | Attention Is All You Need | official | |
| AdaS | arXiv 2020 | AdaS: Adaptive Scheduling of Stochastic Gradients | official | |
| Untuned Warmup | AAAI 2021 | On the adequacy of untuned warmup for adaptive optimization | | |
| AutoDrop | UAI 2024 | AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop | | |
| Schedule-Free | NeurIPS 2024 | The Road Less Scheduled | official | SGDScheduleFree, AdamWScheduleFree, RAdamScheduleFree, ScheduleFreeWrapper |
| WSD (Warmup-Stable-Decay) | COLM 2024 | MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies | official | |
| GreedyLR | arXiv 2025 | Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence | | |
| Refined SF-AdamW | NeurIPS 2025 | Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training | | |
| SF-NorMuon | arXiv 2026 | Anytime Training with Schedule-Free Spectral Optimization | | |
| WSM | ICLR 2026 | WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training | | |
| Power Decay / Warmup-Stable-Decay (WSD) | arXiv 2026 | Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay | | |
| Anytime (Horizon-Free WA schedule) | arXiv 2026 | Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging | | |

Schedule-Free is not a schedule layered on top of an optimizer. It replaces scheduling altogether, through online iterate averaging inside the optimizer; see the Learning-Rate-Free Optimizers section above.
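
A toy version of that averaging, following the SGD form of the update in The Road Less Scheduled, looks roughly like this. It keeps a base sequence `z`, a uniform running average `x`, and evaluates gradients at an interpolation `y` of the two. The function name, defaults, and toy objective are assumptions for illustration, and warmup and weight decay are left out.

```python
def schedule_free_sgd(grad_fn, x0, lr=0.5, beta=0.9, steps=500):
    """Schedule-free SGD on plain lists (illustrative sketch)."""
    z = list(x0)  # base SGD sequence
    x = list(x0)  # running average of z: the point to evaluate
    for t in range(1, steps + 1):
        # Gradients are taken at an interpolation between z and x.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform online average: the weight of the newest z shrinks as 1/(t+1).
        c = 1.0 / (t + 1)
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


def grad_fn(w):
    """Gradient of the toy loss 0.5 * ||w - target||^2 with target (3, -2)."""
    return [w[0] - 3.0, w[1] + 2.0]


averaged = schedule_free_sgd(grad_fn, [0.0, 0.0])
```

The learning rate stays constant throughout. The two sequences also suggest why the Schedule-Free classes need a mode switch between training and evaluation: gradients are taken at one point and the model is evaluated at another.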

Weight averaging is available separately in zij.core.swa_utils, which provides stochastic weight averaging and exponential moving average utilities (AveragedModel, SWALR, update_bn, and the SWA/EMA averaging functions), following Averaging Weights Leads to Wider Optima and Better Generalization (Izmailov et al., UAI 2018).
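
The two averaging rules differ only in how they weight history. The sketch below writes both on plain lists of weights. `swa_update` and `ema_update` are hypothetical helper names for this example, not the functions exported by zij.core.swa_utils.

```python
def swa_update(avg, new, num_averaged):
    """Equal-weight running mean over the checkpoints seen so far."""
    return [a + (w - a) / (num_averaged + 1) for a, w in zip(avg, new)]


def ema_update(avg, new, decay=0.9):
    """Exponential moving average: recent checkpoints count more."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]


checkpoints = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
swa = checkpoints[0]
ema = checkpoints[0]
for n, weights in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, weights, n)  # n checkpoints averaged before this one
    ema = ema_update(ema, weights)
```

After the loop, `swa` holds the plain mean of the three checkpoints, while `ema` still sits close to the first checkpoint because of the high decay.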

How zij compares

Two kinds of project cover this ground: curated awesome-lists, and installable optimizer collections. zij (زِيج) is both.

| Capability | Awesome-lists | Library collections | zij |
|---|---|---|---|
| Curated reference of the whole field | Yes | | Yes |
| Installable, tested implementations | | Yes | Yes |
| Paper-only methods included | Yes | | Yes |
| Update rule in standard notation | | | Yes |
| Per-file provenance (upstream, commit, license) | | Partial | Yes |
| Dedicated fractional-order coverage | | | Yes |
| Dedicated quantum / quantum-inspired coverage | | | Yes |

Engineering standards

  • The Canon and the code are one project. Every Canon row links the paper and, where it exists, the implementation. Every implementation links back to its source and paper.
  • Provenance is explicit. Vendored files record their upstream repository, pinned commit, and license; THIRD_PARTY_NOTICES.md aggregates the attributions. Sources under GPL, non-commercial, or no license are not vendored and remain listed only.
  • Mathematics is explicit. Each update rule is written in standard notation. Where an official implementation diverges from its own paper, the docstring records what the code computes.
  • Everything is tested. Every registered optimizer has convergence and state-dict round-trip tests.

Contributing

New implementations, Canon entries, and corrections are welcome. See CONTRIBUTING.md. A Canon correction counts as much as a code change.

Acknowledgments

zij (زِيج) builds on the projects it learns from:

  • APRIL-AIGC/Awesome-Optimizer: an awesome-list whose breadth helped inform this project's scope.
  • kozistr/pytorch_optimizer: a comprehensive, maintained PyTorch optimizer collection, and a reference for several vendored implementations.
  • jettify/pytorch-optimizer: an early community optimizer collection and the source of several classic implementations.
  • timm: tested optimizer implementations and packaging conventions.
  • PyTorch: the torch.optim core that zij.core mirrors.
  • The optimizer authors: each method is someone's research. The canonical paper is cited in every Canon row and class docstring, and the original repository is credited per file in THIRD_PARTY_NOTICES.md.

Citation

If you use an optimizer from this library, cite two works: the original paper of the algorithm (linked in its Canon row and docstring), and zij as the software you ran. The paper credits the method; the software credits the implementation.

@software{raja_zij,
  author = {Raja, Muhammad Junaid Ali Asif},
  title  = {zij: A Canon and Library of Deep Learning Optimizers},
  year   = {2026},
  url    = {https://github.com/junaidaliop/zij}
}

Machine-readable metadata is in CITATION.cff.

License

Apache-2.0. Vendored components retain their original licenses; see THIRD_PARTY_NOTICES.md.

Contact

Muhammad Junaid Ali Asif Raja — muhammadjunaidaliasifraja@gmail.com