Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
June 12, 2026
Tong Zhang,
Jiangning Zhang†,
Zhucun Xue,
Juntao Jiang,
Yicheng Xu,
Chengming Xu,
Teng Hu,
Xingyu Xie,
Xiaobin Hu†,
Yabiao Wang,
Yong Liu†,
Shuicheng Yan
📖 Overview
Foundational optimization algorithms are the core driving force behind deep learning, evolving from early stochastic gradient descent (SGD) to the widely adopted Adam family. As modern foundation models grow in scale, however, this optimization paradigm is forced to expand and runs into new physical and systemic bottlenecks during large-scale training. In particular, stringent differential privacy requirements and distributed training paradigms have exposed critical limitations of conventional approaches in privacy protection and memory efficiency.
Existing reviews of optimization algorithms often focus on narrow technical areas, e.g., first-order and second-order methods, and lack a comprehensive perspective on the field's evolution, especially regarding zeroth-order and scenario-oriented paradigms.
To address these gaps, this survey provides a systematic review of the development of optimization algorithms, tracing their evolution through four major paradigms:
First-order methods → Second-order methods → Zeroth-order methods → Scenario-oriented paradigms
We conduct comprehensive theoretical analysis and standardized empirical evaluations, objectively identifying the pros, cons, and fundamental design trade-offs of various methods across different architectures. By synthesizing theoretical insights with extensive empirical evidence, we distill key developmental trends and provide actionable guidance and future research directions for designing next-generation efficient, robust, and trustworthy optimization algorithms.
🎯Contributions
1️⃣ Unified Taxonomy: Establishing a rigorous mathematical taxonomy that unifies disparate conceptual definitions across fundamental optimization primitives.
- 📊 Evolutionary Trajectory: tracing the development of foundational algorithms from First-Order to Second-Order and Zeroth-Order methods.
- 🔬 Intrinsic Connections: clarifying the complex evolutionary logic and structural relationships between different optimization approaches to provide a coherent framework for the field.
2️⃣ Scenario-Oriented Analysis: Demonstrating how foundational algorithms are fundamentally re-architected into scenario-oriented paradigms to address severe physical bottlenecks.
- 📈 Systems-Aware Engineering: highlighting the critical shift from pure algorithmic design to practical solutions that balance theoretical guarantees with strict engineering constraints.
- 🔍 Overcoming Systemic Barriers: detailing how these paradigms tackle specific, real-world challenges such as distributed communication barriers and strict differential privacy constraints.
3️⃣ Standardized Evaluation: Introducing a rigorously controlled evaluation framework that strictly separates pure algorithmic performance from large-scale engineering optimizations.
- 🚀 Extensive Benchmarking: developing a standardized testbed to evaluate 23 distinct optimizers across diverse architectural proxies, including CNN and Transformer-based models.
- 🔮 Strategic Insights: systematically isolating and examining learning rate sensitivity, long-term training scalability, and cross-architecture generalization to guide the design of next-generation optimizers.
📈Evolution
📈 A Comprehensive Analysis of Optimization Methods: This figure systematically summarizes the development trends and core characteristics of optimization methodologies across different orders.
- Key Insights: Attention to optimization algorithms has increased sharply since 2024. This growth is closely tied to the rapid development of massive models, with first-order methods maintaining a dominant position.
🚩Timeline
Timeline of prominent optimization algorithms. The evolution highlights key algorithmic milestones, associated research institutions, and publication venues over time.
🏗️ Architecture Overview
🗂️ A Comprehensive Taxonomy of Optimization Algorithms
📐 Taxonomy Overview: This framework categorizes existing works into three dominant paradigms (First-Order Methods, Second-Order Methods, and Zeroth-Order Methods) and further structures them according to their fundamental mathematical principles and evolutionary development. Key branches include:
- 🚀 First-Order Methods: Gradient-Driven (e.g., SGD); Adaptive Learning Rate (e.g., Adam); Acceleration to Automation (e.g., Adan, Nadam); Scalar to Preconditioner (e.g., Shampoo); Stability to Temporal (e.g., SPAM); Temporal to Geometry (e.g., SAM).
- ⚙️ Second-Order Methods: Deterministic Curvature to Geometry (e.g., K-FAC, AdaFisher); Approximation to Iterative Update (e.g., ADAHESSIAN).
- 📍 Zeroth-Order Methods: Perturbation Optimization (e.g., FZOO, LeZO); Adaptive to Resource-Aware (e.g., MeZO, ZO-AdaMM); Variance Reduction to Adaptive (e.g., MeZO-SVRG).
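To make the three paradigms in the taxonomy concrete, the following minimal sketch contrasts one update step of each on a toy 2-D quadratic. Everything here is an illustrative assumption rather than code from the survey or its repository: the objective, the function names (`first_order_step`, `second_order_step`, `zeroth_order_step`), and the hyperparameters are hypothetical, and the zeroth-order step is a generic two-point random-perturbation estimate, not any specific method listed above.

```python
import random

# Toy objective (an assumption for illustration): f(x) = 0.5 * x^T A x - b^T x,
# with a fixed symmetric positive-definite 2x2 matrix A.
A = [[3.0, 0.5], [0.5, 1.0]]
b = [1.0, -2.0]


def f(x):
    ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return 0.5 * (x[0] * ax[0] + x[1] * ax[1]) - (b[0] * x[0] + b[1] * x[1])


def grad(x):
    # Exact gradient of the quadratic: A x - b.
    return [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1]]


def first_order_step(x, lr=0.1):
    # Plain gradient descent: uses only the gradient.
    g = grad(x)
    return [x[0] - lr * g[0], x[1] - lr * g[1]]


def second_order_step(x):
    # Newton step: rescales the gradient by the inverse Hessian (here A itself),
    # computed with the closed-form inverse of a 2x2 matrix.
    g = grad(x)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
    return [x[0] - (inv[0][0] * g[0] + inv[0][1] * g[1]),
            x[1] - (inv[1][0] * g[0] + inv[1][1] * g[1])]


def zeroth_order_step(x, rng, lr=0.05, eps=1e-3):
    # Two-point estimate: probe f along a random Gaussian direction z and use
    # the finite difference as a directional-derivative estimate. No gradient
    # is ever evaluated, only two function values.
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    x_plus = [x[0] + eps * z[0], x[1] + eps * z[1]]
    x_minus = [x[0] - eps * z[0], x[1] - eps * z[1]]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [x[0] - lr * scale * z[0], x[1] - lr * scale * z[1]]


if __name__ == "__main__":
    rng = random.Random(0)
    x_gd, x_zo = [0.0, 0.0], [0.0, 0.0]
    for _ in range(200):
        x_gd = first_order_step(x_gd)
        x_zo = zeroth_order_step(x_zo, rng)
    print("gradient descent:", x_gd)
    print("newton (1 step): ", second_order_step([0.0, 0.0]))
    print("zeroth-order:    ", x_zo)
```

On a quadratic the Newton step lands on the minimizer in a single update, while the zeroth-order step trades gradient access for a noisier search direction, which is the trade-off the taxonomy is organized around.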
📊 Benchmark Evaluation Results on Vision Tasks
📈 Benchmark Evaluation: This comprehensive assessment evaluates 23 representative optimization algorithms across two vision architectures (ViT-S and ResNet-50) and varying training horizons:
- ⏱️ Short-Term Convergence (100 Epochs): Evaluates the rapid descent capability and initial exploration efficiency of optimizers within a constrained computational budget.
- 🏃 Long-Term Scalability (300 Epochs): Assesses the algorithm's resilience against late-stage gradient noise and its capacity to continually extract representational power over extended cycles.
- 📊 Ranking Dynamics: Tracks relative performance shifts across epochs, highlighting how algorithms dynamically navigate the trade-off between early acceleration and long-term stability.
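The ranking-dynamics idea can be sketched as a comparison of optimizer ranks at two checkpoints. This is a minimal illustration under stated assumptions: the helper names (`rank_optimizers`, `rank_shift`), the optimizer labels, and the accuracy values are hypothetical placeholders, not results or code from the survey.

```python
def rank_optimizers(acc_by_optimizer):
    """Map each optimizer name to its rank (1 = highest accuracy).

    Ties are broken alphabetically by name so the ranking is deterministic.
    """
    ordered = sorted(acc_by_optimizer, key=lambda name: (-acc_by_optimizer[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shift(acc_early, acc_late):
    """Return early_rank - late_rank per optimizer.

    A positive value means the optimizer climbed the ranking between the
    two checkpoints; a negative value means it fell.
    """
    early = rank_optimizers(acc_early)
    late = rank_optimizers(acc_late)
    return {name: early[name] - late[name] for name in early}


# Placeholder accuracies (made up for illustration only).
acc_100_epochs = {"opt_a": 70.0, "opt_b": 72.0, "opt_c": 68.0}
acc_300_epochs = {"opt_a": 76.0, "opt_b": 74.0, "opt_c": 75.0}

shifts = rank_shift(acc_100_epochs, acc_300_epochs)
for name in sorted(shifts):
    print(name, shifts[name])
```

In this made-up case the early leader falls behind over the longer horizon, which is the kind of shift between early acceleration and long-term stability that the ranking view is meant to surface.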
🔥Add Your Paper to Our Survey!
- You are welcome to open an issue or PR for your optimizer work!
Note: Given the large number of papers on arXiv, we regret that we cannot cover all of them in our survey. You can open a PR against this repo directly, and we will include your work in the next version of the survey.
🔥New
- [2026.06.12] We updated the GitHub repository to record the papers available as of 2026/6/12.
🌟Welcome everyone to follow and join the Scaling Opt community: Scaling Opt Community
- Algorithm Visualizations: Includes visualization scripts for the Rosenbrock Function and Rastrigin Function, allowing users to freely explore optimization behaviors.
- Performance Benchmarks: We primarily recommend benchmarks based on Algoperf, along with other benchmark suites and analysis articles for validating and comparing state-of-the-art optimizers.
- Papers & Blogs Recommendations: The platform curates high-quality papers and blog posts from recent years, continuously updated with the latest daily arXiv publications. Currently, the collection contains nearly one hundred resources.
- Tutorials Sharing: The platform gathers high-quality community resources and is actively developing a tutorial series titled From Classical to Modern Optimizers.
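For readers who want to experiment before opening the community's visualization scripts, here is a minimal sketch of the two test functions mentioned above together with a plain gradient-descent trajectory. It is not the community's script: the function names (`rosenbrock`, `rastrigin`, `gd_path`), the start point, and the step size are assumptions chosen for illustration.

```python
import math


def rosenbrock(p):
    # Standard 2-D Rosenbrock function; global minimum 0 at (1, 1).
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x * x) ** 2


def rosenbrock_grad(p):
    x, y = p
    return [-2.0 * (1.0 - x) - 400.0 * x * (y - x * x), 200.0 * (y - x * x)]


def rastrigin(p):
    # Standard Rastrigin function; global minimum 0 at the origin,
    # surrounded by a regular grid of local minima.
    return 10.0 * len(p) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in p)


def rastrigin_grad(p):
    return [2.0 * v + 20.0 * math.pi * math.sin(2.0 * math.pi * v) for v in p]


def gd_path(grad_fn, start, lr, steps):
    # Record every iterate of plain gradient descent so the trajectory
    # can be plotted over a contour map afterwards.
    path = [list(start)]
    point = list(start)
    for _ in range(steps):
        g = grad_fn(point)
        point = [v - lr * gv for v, gv in zip(point, g)]
        path.append(list(point))
    return path


if __name__ == "__main__":
    path = gd_path(rosenbrock_grad, start=[-1.0, 1.0], lr=1e-3, steps=5000)
    print("start loss:", rosenbrock(path[0]))
    print("final loss:", rosenbrock(path[-1]))
```

The returned `path` is a list of points, so swapping in a different update rule only requires replacing the line that computes the next `point`.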
🔨Installation
To reproduce our benchmarks, you need to clone this repository and install the required dependencies. We strongly recommend using a virtual environment (e.g., Conda).
```bash
# 1. Clone the repository
git clone https://github.com/JZhangTon/awesome-optimizer.git
cd awesome-optimizer

# 2. Install required packages
pip install -r requirements.txt
```
⚙️Usage & 📈Benchmarking
- To start a training run and reproduce our benchmark results, run the provided benchmarking script. See examples/benchmark for how to use it.
🗂️Taxonomy of Optimization Algorithms
- 🚀 First-Order Algorithms
- ⚙️ Second-Order Algorithms
- 📍 Zeroth-Order Algorithms
- 🌐 Distributed Optimization
- 🛡️ Privacy-Preserving Optimization
- ⚡ Memory-Efficient Optimization
- 🧩 Tailored Optimization Approaches
🚀 First-Order Algorithms
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| Pion | | Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR | Link | Spectral High Pass | Matrix Orthogonalization |
| C-Adam | | A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm | Link | Adaptive Learning Rate Methods | Line-of-Sight Adam Variant |
| SparseOpt | | SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training | Link | Gradient Normalization & Clipping | Sparse Training Gradient Rebalancing |
| Anon | | Anon: Extrapolating Optimizer Adaptivity Across the Real Spectrum | Link | Adaptive Learning Rate Methods | |
| PS-Clip-SGD | | Robust and Fast Training via Per-Sample Clipping | Link | Enhancing Training Stability | |
| Nora | | Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer | Link | Adaptive Step-Size Control | Matrix Orthogonalization |
| Muon^2 | | Muon^2: Boosting Muon via Adaptive Second-Moment Preconditioning | Link | Adaptive Step-Size Control | Matrix Orthogonalization |
| HomeAdam | | HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization | Link | Adaptive Step-Size Control | Second-order moment adaptation |
| FlashOptim | | FlashOptim: Optimizers for Memory-Efficient Training | Link | Memory-Efficient Optimization | Low-Memory Optimizer Design |
| FANoS | | FANoS: Friction-Adaptive Nosé–Hoover Symplectic Momentum for Stiff Objectives | Link | Accelerating Convergence Rate | Momentum Damping Mechanism |
| NOVAK | | NOVAK: Unified adaptive optimizer for deep neural networks | Link | Hybrid Methods | Gradient Smoothing Hybrid |
| AdamNX | | AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| ROOT | | ROOT: Robust Orthogonalized Optimizer for Neural Network Training | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| AuON | | AuON: A Linear-time Alternative to Orthogonal Momentum Updates | Link | Gradient Normalization & Clipping | Layer-Wise Gradient Normalization |
| ZetA | | ZETA: A HYBRID OPTIMIZER COMBINING RIEMANN ZETA SCALING WITH ADAM FOR ROBUST DEEP LEARNING | Link | Hybrid Methods | Multi-Objective Hybrid |
| NIRMAL | | COMPARATIVE ANALYSIS OF NOVEL NIRMAL OPTIMIZER AGAINST ADAM AND SGD WITH MOMENTUM | Link | Hybrid Methods | Multi-Objective Hybrid |
| SCSAdamW | | Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW | Link | Loss Landscape Optimization | Momentum Landscape Adaptation |
| adaNPAG | | Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization | Link | Momentum-Enhanced SGD | Accelerated Momentum |
| SoftSignSGD | | SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy |
| AdaMuon | | ADAMUON: ADAPTIVE MUON OPTIMIZER | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy |
| Accelerated GRAAL | | NESTEROV FINDS GRAAL: OPTIMAL AND ADAPTIVE GRADIENT METHOD FOR CONVEX OPTIMIZATION | Link | Hybrid Methods | Multi-Objective Hybrid |
| DEO | | Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training | Link | Loss Landscape Optimization | Curvature-Guided Landscape Exploration |
| LyAm | | LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments | Link | Learning Rate Scheduling | Stability-Aware Adaptive Scheduling |
| Splus | | A Stable Whitening Optimizer for Efficient Neural Network Training | Link | Preconditioned Gradient Methods | Two Metrics' Preconditioner |
| HGM | | Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rates | Link | Learning Rate Scheduling | Gradient Angle Scheduling |
| AutoSGD | | AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| AdamS | | AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training | Link | Adaptive Learning Rate Methods | Stateless Adaptation |
| LightSAM | | LightSAM: Parameter-Agnostic Sharpness-Aware Minimization | Link | Loss Landscape Optimization | Sharpness-Aware Minimization (SAM) |
| ADAGB2 | | Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained Optimization | Link | Hybrid Methods | Projection Gradient Hybrid |
| VRAdam | | A Physics-Inspired Optimizer: Velocity Regularized Adam | Link | Momentum-Enhanced SGD | Momentum Damping Mechanism |
| SKA-SGD | | STREAMING KRYLOV-ACCELERATED STOCHASTIC GRADIENT DESCENT | Link | Loss Landscape Optimization | Curvature-Guided Landscape Exploration |
| Adam-Power | | GradPower: Powering Gradients for Faster Language Model Pre-Training | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| AlphaGrad | | AlphaGrad: Non-Linear Gradient Normalization Optimizer | Link | Adaptive Learning Rate Methods; Low-Memory Optimizer Design; Stateless Optimization Methods | Stateless Adaptation; Structural Redesign; Parameter Characteristic-Driven Updates |
| AsyncSAM | | ASYNCHRONOUS SHARPNESS-AWARE MINIMIZATION FOR FAST AND ACCURATE DEEP LEARNING | Link | Loss Landscape Optimization | Sharpness-Aware Minimization (SAM) |
| ASGO | | ASGO: Adaptive Structured Gradient Optimization | Link | Preconditioned Gradient Methods | Single Metric's Preconditioner |
| AdaGC | | AdaGC: Improving Training Stability for Large Language Model Pretraining | Link | Gradient Normalization & Clipping; Robust Optimization | Noise-Robust Normalization; Dynamic Gradient Clipping; Noise-Robust Gradients |
| Adadiag | | Improving Adaptive Moment Optimization via Preconditioner Diagonalization | Link | Hybrid Methods | Projection Gradient Hybrid |
| eagle | | EAGLE: EARLY APPROXIMATED-GRADIENT-BASED LEARNING RATE ESTIMATOR | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| Hessian-aware Scaling | | First-ish Order Methods: Hessian-aware Scalings of Gradient Descent | Link | Preconditioned Gradient Methods | Single Metric's Preconditioner |
| GCSAM | | GCSAM: Gradient Centralized Sharpness Aware Minimization | Link | Gradient Normalization & Clipping | Mean-Removal Normalization |
| SGDO | | Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization | Link | Momentum-Enhanced SGD | Accelerated Momentum |
| μ²-SGD | | DO STOCHASTIC, FEEL NOISELESS: STABLE STOCHASTIC OPTIMIZATION VIA A DOUBLE MOMENTUM MECHANISM | Link | Momentum-Enhanced SGD | Double-momentum mechanism |
| Stable-SPAM | | Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam | Link | Gradient Normalization & Clipping; Privacy-Aware Gradient Clipping | Layer-Wise Gradient Normalization; Dynamic Gradient Clipping; Adaptive Clipping |
| apollo | | APOLLO: SGD-LIKE MEMORY, ADAMW-LEVEL PERFORMANCE | Link | Adaptive Learning Rate Methods; Towards LLM Training | Hybrid Adaptive Strategy; Gradient Projection Mechanism |
| SPAM | | SPAM: SPIKE-AWARE ADAM WITH MOMENTUM RESET FOR STABLE LLM TRAINING | Link | Gradient Normalization & Clipping; Optimizer State Compression | Element-Wise Gradient Scaling; Spike-Aware Gradient Clipping; Sparse State Compression |
| Coupled Adam | | Better Embeddings with Coupled Adam | Link | Adaptive Learning Rate Methods | Layer-Wise Adaptation |
| SWAN | | SWAN: SGD WITH NORMALIZATION AND WHITENING ENABLES STATELESS LLM TRAINING | Link | Towards LLM Training | Gradient Preconditioning Mechanism |
| LDAdam | | LDADAM: ADAPTIVE OPTIMIZATION FROM LOW-DIMENSIONAL GRADIENT STATISTICS | Link | Hybrid Methods | Projection Gradient Hybrid |
| SSAM | | Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy | Link | Loss Landscape Optimization | Renormalized Gradient Norm SAM |
| MARS | | MARS: Unleashing the Power of Variance Reduction for Training Large Models | Link | Momentum-Enhanced SGD; Adaptive Learning Rate Methods | Double-momentum mechanism; Momentum-based Adaptive |
| VSGD | | Variational Stochastic Gradient Descent for Deep Neural Networks | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| MIAdam | | A Method for Enhancing Generalization of Adam by Multiple Integrations | Link | Loss Landscape Optimization | Curvature-Guided Landscape Exploration |
| KOALA++ | | KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products | Link | Adaptive Learning Rate Methods | Kalman filtering based |
| PAdamP | | ADAPTIVE MOMENT ESTIMATION OPTIMIZATION ALGORITHM USING PROJECTION GRADIENT FOR DEEP LEARNING | Link | Hybrid Methods | Projection Gradient Hybrid |
| DecGD | | A New Adaptive Gradient Method with Gradient Decomposition | Link | Learning Rate Scheduling | Loss-Sensitive Scheduling |
| Grams | | Grams: Gradient Descent with Adaptive Momentum Scaling | Link | Hybrid Methods | Multi-Objective Hybrid |
| FSGDM | | ON THE PERFORMANCE ANALYSIS OF MOMENTUM METHOD: A FREQUENCY DOMAIN PERSPECTIVE | Link | Momentum-Enhanced SGD | Frequency Domain Momentum Analysis |
| AdEMAMix | | THE ADEMAMIX OPTIMIZER: BETTER, FASTER, OLDER | Link | Momentum-Enhanced SGD | Double-momentum mechanism |
| HVAdam | | HVAdam: A Full-Dimension Adaptive Optimizer | Link | Hybrid Methods | Projection Gradient Hybrid |
| SGD-SaI | | No More Adam: Learning Rate Scaling at Initialization is All You Need | Link | Learning Rate Scheduling; Stateless Optimization Methods | Initial Learning Rate Scaling; Parameter Characteristic-Driven Updates |
| Adam++ | | Towards Simple and Provable Parameter-Free Adaptive Gradient Methods | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| EXADAM | | EXADAM: THE POWER OF ADAPTIVE CROSS-MOMENTS | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy |
| Cautious Optimizers | | Cautious Optimizers: Improving Training with One Line of Code | Link | Momentum-Enhanced SGD | Momentum-Gradient Alignment |
| AGS-GD | | Anisotropic Gaussian Smoothing for Gradient-based Optimization | Link | Hybrid Methods; Auto-Designed Optimizers | Gradient Smoothing Hybrid; Automated Discovery & Theoretical Derivation |
| CAdam | | CAdam: Confidence-Based Optimization for Online Learning | Link | Hybrid Methods | Multi-Objective Hybrid |
| INNAprop | | A SECOND-ORDER-LIKE OPTIMIZER WITH ADAPTIVE GRADIENT SCALING FOR DEEP LEARNING | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| CaAdam | | CaAdam: Improving Adam optimizer using connection aware methods | Link | Adaptive Learning Rate Methods | Layer-Wise Adaptation |
| BADM | | BADM: Batch ADMM for Deep Learning | Link | Hybrid Methods | Multi-Objective Hybrid |
| FAdam | | FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | Link | Adaptive Learning Rate Methods; Hybrid Methods | Dynamic Epsilon Adjustment; Multi-Objective Hybrid |
| MSAM | | Momentum-SAM: Sharpness Aware Minimization without Computational Overhead | Link | Loss Landscape Optimization | Momentum Landscape Adaptation |
| LOMO | | Full Parameter Fine-tuning for Large Language Models with Limited Resources | Link | Towards LLM Training; Memory-Efficient Fine-Tuning for Large Models | Real-Time Computation; Stateless Fine-Tuning |
| BAdam | | BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models | Link | Hybrid Methods; Towards LLM Training | Multi-Objective Hybrid; Real-Time Computation; Block-Wise Computation |
| DP-AdamBC | | DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| Dice-SGD | | DIFFERENTIALLY PRIVATE SGD WITHOUT CLIPPING BIAS: AN ERROR-FEEDBACK APPROACH | Link | Gradient Normalization & Clipping | DP-enhanced Gradient Clipping |
| FESS-GDA | | Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization | Link | Hybrid Methods | Gradient Filtering Hybrid |
| AdaSAM | | AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks | Link | Hybrid Methods | Multi-Objective Hybrid |
| SAMPa | | SAMPa: Sharpness-aware Minimization Parallelized | Link | Loss Landscape Optimization | Sharpness-Aware Minimization (SAM) |
| | | Lookbehind-SAM: k steps back, 1 step forward | Link | Loss Landscape Optimization | Multi-Step Ascent SAM |
| F-SAM | | Friendly Sharpness-Aware Minimization | Link | Loss Landscape Optimization | Noise Injection Enhancement |
| FGSAM | | Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification | Link | Loss Landscape Optimization | Noise Injection Enhancement |
| Adan | | Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| 4-bit shampoo | | 4-bit Shampoo for Memory-Efficient Network Training | Link | Preconditioned Gradient Methods; Low-Memory Optimizer Design | Two Metrics' Preconditioner; Compression & Approximation of States |
| Muon | Blog'24 | Muon: An optimizer for hidden layers in neural networks | Link | Adaptive Learning Rate Methods | Layer-Wise Adaptation |
| ADOPT | | ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| SET-adam | | On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance | Link | Adaptive Learning Rate Methods; Hybrid Methods | Second-Order Moment Adaptation; Multi-Objective Hybrid |
| Adam-Real | | Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps | Link | Momentum-Enhanced SGD | Scheduled Momentum Reset |
| SNGM | | Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training | Link | Momentum-Enhanced SGD | Momentum Damping Mechanism |
| Schedule-Free | | The Road Less Scheduled | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| AUTODROP | | AUTODROP: TRAINING DEEP LEARNING MODELS WITH AUTOMATIC LEARNING RATE DROP | Link | Adaptive Learning Rate Methods; Learning Rate Scheduling | Stateless Adaptation; Scheduler-Free Adaptation |
| ADAACT | | AN ADAPTIVE METHOD STABILIZING ACTIVATIONS FOR ENHANCED GENERALIZATION | Link | Adaptive Learning Rate Methods; Optimizer State Compression | Neuron-Level Adaptation; State Sharing |
| MoMo | | MoMo: Momentum Models for Adaptive Learning Rates | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| RSGDM | | Reducing Bias in Deep Learning Optimization: The RSGDM Approach | Link | Momentum-Enhanced SGD | Accelerated Momentum |
| NYSACT | | NYSACT: A SCALABLE PRECONDITIONED GRADIENT DESCENT USING NYSTRÖM APPROXIMATION | Link | Preconditioned Gradient Methods | Single Metric's Preconditioner |
| SGDF | | Signal Processing Meets SGD: From Momentum to Filter | Link | Momentum-Enhanced SGD | Dynamic Momentum Weight |
| AdaLOMO | | AdaLomo: Low-memory Optimization with Adaptive Learning Rate | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| | | SGD with Large Step Sizes Learns Sparse Features | Link | Learning Rate Scheduling | Stability-Aware Adaptive Scheduling |
| look around | | Lookaround Optimizer: k steps around, 1 step average | Link | Loss Landscape Optimization | Weight Averaging |
| GAM | | Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization | Link | Loss Landscape Optimization | Curvature-Guided Landscape Exploration |
| AE-SAM | | AN ADAPTIVE POLICY TO EMPLOY SHARPNESS-AWARE MINIMIZATION | Link | Loss Landscape Optimization | Sharpness-Aware Minimization (SAM) |
| Aida | | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | Link | Adaptive Learning Rate Methods | Prediction Deviation Adaptation |
| Lion | | Symbolic Discovery of Optimization Algorithms | Link | Adaptive Learning Rate Methods; Auto-Designed Optimizers | Momentum-based Adaptive; Automated Discovery & Theoretical Derivation |
| AdamMC | | Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks | Link | Gradient Normalization & Clipping | Mean-Removal Normalization |
| MultiAdam | | MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks | Link | Gradient Normalization & Clipping | Layer-Wise Gradient Normalization |
| AdaNorm | | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | Link | Gradient Normalization & Clipping; Robust Optimization | Element-Wise Gradient Scaling; Noise-Robust Normalization; Noise-Robust Gradients |
| AGD | | AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| RLEKF | | RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy | Link | Adaptive Learning Rate Methods | Kalman filtering based |
| Amos | | Amos: AN ADAM-STYLE OPTIMIZER WITH ADAPTIVE WEIGHT DECAY TOWARDS MODEL-ORIENTED SCALE | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| AdaBFE | | BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization | Link | Learning Rate Scheduling; Stateless Optimization Methods | Gradient Angle Scheduling; Parameter Characteristic-Driven Updates |
| DP-SGD | | Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization | Link | Gradient Normalization & Clipping | Basic Fixed Gradient Clipping |
| AdaFamily | | AdaFamily: A family of Adam-like adaptive gradient methods | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy |
| SRSGD | | Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent | Link | Momentum-Enhanced SGD | Scheduled Momentum Reset |
| Step-Tuned SGD | | Second-order step-size tuning of SGD for non-convex optimization | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| AEGDM | | AN ADAPTIVE GRADIENT METHOD WITH ENERGY AND MOMENTUM | Link | Adaptive Learning Rate Methods | Stateless Adaptation |
| AdaInject | | AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| KOALA | | KOALA: A Kalman Optimization Algorithm with Loss Adaptivity | Link | Adaptive Learning Rate Methods | Kalman filtering based |
| ESAM | | EFFICIENT SHARPNESS-AWARE MINIMIZATION FOR IMPROVED TRAINING OF NEURAL NETWORKS | Link | Loss Landscape Optimization | Sharpness-Aware Minimization (SAM) |
| MADGRAD | | Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | Link | Momentum-Enhanced SGD; Hybrid Methods | Accelerated Momentum; Multi-Objective Hybrid |
| GDA-AM | | GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration | Link | Hybrid Methods | Multi-Objective Hybrid |
| AdamD | | AdamD: Improved bias-correction in Adam | Link | Adaptive Learning Rate Methods | Bias Correction Rules Adaptation |
| AdaL | | AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy |
| AngularGrad | | AngularGrad: A New Optimization Technique for Angular Convergence of Neural Networks | Link | Momentum-Enhanced SGD | Momentum-Gradient Alignment |
| SGD-G2 | | Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| SQuARM-SGD | | SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization | Link | Momentum-Enhanced SGD; Local Update Strategies; Distributed Hybrid Optimization | Accelerated Momentum; Local SGD; Local Momentum Updates; Compression & Local Updates |
| SAM | | Sharpness-Aware Minimization for Efficiently Improving Generalization | Link | Loss Landscape Optimization | Sharpness-Aware Minimization (SAM) |
| AvaGrad | | Domain-independent Dominance of Adaptive Methods | Link | Adaptive Learning Rate Methods | Decoupled Learning Rate and Adaptability |
| Madam | | MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
| ACMo | | ACMO: ANGLE-CALIBRATED MOMENT METHODS FOR STOCHASTIC OPTIMIZATION | Link | Learning Rate Scheduling | Gradient Angle Scheduling |
| AdamP / SGDP | | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | Link | Momentum-Enhanced SGD; Hybrid Methods | Momentum Damping Mechanism; Projection Gradient Hybrid |
| ACProp | | Momentum Centering and Asynchronous Update for Adaptive Gradient Methods | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy |
| Adam+ | | Adam+: A Stochastic Method with Adaptive Variance Reduction | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive |
| EAdam | | EAdam Optimizer: How ε Impact Adam | Link | Adaptive Learning Rate Methods | Dynamic Epsilon Adjustment |
| AdaSGD | | AdaSGD: Bridging the gap between SGD and Adam | Link | Adaptive Learning Rate Methods; Hybrid Methods | Hybrid Adaptive Strategy; SGD-Adam Hybrid |
| ADAS | | ADAS: ADAPTIVE SCHEDULING OF STOCHASTIC GRADIENTS | Link | Learning Rate Scheduling | Scheduler-Free Adaptation |
| LaProp | LaProp: Separating Momentum and Adaptivity in Adam | Link | Adaptive Learning Rate Methods | ||
| Multistage SGDM | An Improved Analysis of Stochastic Gradient Descent with Momentum | Link | Learning Rate Scheduling | Stability-Aware Adaptive Scheduling | |
| pbSGD | pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | Link | Gradient Normalization & Clipping | Element-Wise Gradient Scaling | |
| clipped-SGD | Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping | Link | Gradient Normalization & Clipping | Basic Fixed Gradient Clipping | |
| Cayley SGD | EFFICIENT RIEMANNIAN OPTIMIZATION ON THE STIEFEL MANIFOLD VIA THE CAYLEY TRANSFORM | Link | Hybrid Methods | Projection Gradient Hybrid | |
| NIGT | Momentum Improves Normalized SGD | Link | Gradient Normalization & Clipping | Noise-Robust Normalization | |
| AdaBelief | AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation; Prediction Deviation Adaptation | |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive | |
| AdamBS | Adam with Bandit Sampling for Deep Learning | Link | Hybrid Methods | Multi-Objective Hybrid | |
| DEAM | DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization | Link | Momentum-Enhanced SGD | Dynamic Momentum Weight | |
| LAMB | Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | Link | Adaptive Learning Rate Methods; Learning Rate Scheduling | Layer-Wise Adaptation; Batch-Aware Scheduling | |
| ADASS | ADASS: Adaptive Sample Selection for Training Acceleration | Link | Hybrid Methods | Gradient Filtering Hybrid | |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | Link | Adaptive Learning Rate Methods | Layer-Wise Adaptation | |
| AdamW / SGDW | Decoupled Weight Decay Regularization | Link | Adaptive Learning Rate Methods | ||
| QHadam | QUASI-HYPERBOLIC MOMENTUM AND ADAM FORDEEP FOR DEEP LEARNING | Link | Adaptive Learning Rate Methods | ||
| HAdam | On Higher-order Moments in Adam | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation | |
| diffGrad | diffGrad: An Optimization Method for Convolutional Neural Networks | Link | Adaptive Learning Rate Methods | Hybrid Adaptive Strategy | |
| NosAdam | Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation | |
| Lookahead | Lookahead Optimizer: k steps forward, 1 step back | Link | Hybrid Methods | Multi-Objective Hybrid | |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | Link | Learning Rate Scheduling | Element-Wise Learning Rate Scheduling | |
| LazyOptimizer | Blog'19 | Link | Momentum-Enhanced SGD | Scheduled Momentum Reset | |
| YOGI | Adaptive Methods for Nonconvex Optimization | Link | Momentum-Enhanced SGD | Double-Momentum Mechanism | |
| VR-SGD | VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning | Link | Hybrid Methods | Gradient Filtering Hybrid | |
| Shampoo | Shampoo: Preconditioned Stochastic Tensor Optimization | Link | Preconditioned Gradient Methods | Two Metrics' Preconditioner | |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation | |
| PIDOptimizer | A PID Controller Approach for Stochastic Optimization of Deep Networks | Link | Momentum-Enhanced SGD | Momentum Damping Mechanism | |
| LARS | Large Batch Training of Convolutional Networks | Link | Adaptive Learning Rate Methods | Layer-Wise Adaptation | |
| NAdam | Incorporating Nesterov Momentum into Adam | Link | Adaptive Learning Rate Methods | Momentum-based Adaptive | |
| Adam | ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION | Link | Adaptive Learning Rate Methods | ||
| SGDM | On the importance of initialization and momentum in deep learning | Link | Momentum-Enhanced SGD | Accelerated Momentum | |
| AdaDelta | ADADELTA: AN ADAPTIVE LEARNING RATE METHOD | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation | |
| AdaGrad | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | Link | Adaptive Learning Rate Methods | Second-Order Moment Adaptation |
⚙️ Second-Order Algorithms
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| C-ALADIN | A Global Convergence Analysis of Consensus ALADIN for Convex Optimization | Link | Hessian Approximation & Estimation | Distributed Newton-Type Approximation | |
| S-BFGS | EFFICIENT STOCHASTIC BFGS METHODS INSPIRED BY BAYESIAN PRINCIPLES | Link | Quasi-Newton Methods | Stochastic BFGS | |
| MAC | MAC: AN EFFICIENT GRADIENT PRECONDITIONING USING MEAN ACTIVATION APPROXIMATED CURVATURE | Link | Fisher Information Matrix Application | Curvature-Aware Approximation | |
| RACS | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation | |
| OCAR | Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation | |
| FUSE-PV | FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization | Link | Quasi-Newton Methods | Stochastic BFGS | |
| SASSHA | SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation | Link | Hessian Approximation & Estimation | Diagonal Hessian Approximation | |
| AdaFisher | ADAFISHER: ADAPTIVE SECOND ORDER OPTIMIZATION VIA FISHER INFORMATION | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation; Block-Diagonal Kronecker Approximation | |
| OptiQ | Second-Order Optimization via Quiescence | Link | Curvature-Guided Preconditioning | Hessian Diagonal Preconditioning | |
| SOAA | EFFICIENT SECOND-ORDER NEURAL NETWORK OPTIMIZATION VIA ADAPTIVE TRUST REGION METHODS | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation | |
| CRNAS | Novel Optimization Techniques for Parameter Estimation | Link | Hessian Approximation & Estimation | Diagonal Hessian Approximation | |
| Athena | Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information | Link | Hessian Approximation & Estimation | Block Hessian Approximation | |
| Q-Newton | Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton’s Gradient Descent | Link | Hessian Approximation & Estimation | Block Hessian Approximation | |
| SketchySGD | SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates | Link | Hessian Approximation & Estimation | Stochastic Hessian Sampling | |
| Sophia | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | Link | Hessian Approximation & Estimation; Curvature-Guided Preconditioning; Second-Order Moment Fusion; Privacy-Aware Gradient Clipping; Stateless Optimization Methods | Diagonal Hessian Approximation; Hessian Diagonal Preconditioning; Noise-Robust Second-Order Momentum; Real-Time Curvature Estimation | |
| Fed-Sophia | Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm | Link | Hessian Approximation & Estimation; Federated Learning Optimization | Diagonal Hessian Approximation; Federated Second-Order Optimization | |
| HesScale | Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning | Link | Hessian Approximation & Estimation | Diagonal Hessian Approximation | |
| mL-BFGS | mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization | Link | Quasi-Newton Methods | Stochastic BFGS; Low-Memory Quasi-Newton | |
| SGDHess | Better SGD using Second-order Momentum | Link | Hessian Approximation & Estimation | Gradient Difference Estimation | |
| AdaHessian | AdaHessian: An Adaptive Second Order Optimizer for Machine Learning | Link | Hessian Approximation & Estimation; Second-Order Moment Fusion | Diagonal Hessian Approximation; Noise-Robust Second-Order Momentum | |
| TKFAC | A Trace-restricted Kronecker-Factored Approximation to Natural Gradient | Link | Fisher Information Matrix Application | Trace-Preserving Fisher Approximation | |
| SGN | On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs | Link | Hessian Approximation & Estimation | Stochastic Hessian Sampling | |
| SpiderSQN | A FAST QUASI-NEWTON-TYPE METHOD FOR LARGE-SCALE STOCHASTIC OPTIMISATION | Link | Quasi-Newton Methods | Stochastic BFGS | |
| K-BFGS and K-BFGS(L) | Practical Quasi-Newton Methods for Training Deep Neural Networks | Link | Quasi-Newton Methods | Low-Memory Quasi-Newton | |
| K-FAC | Optimizing Neural Networks with Kronecker-factored Approximate Curvature | Link | Fisher Information Matrix Application | Block-Diagonal Kronecker Approximation | |
| Natural Gradient | Natural gradient works efficiently in learning | Fisher Information Matrix Application | |||
| BFGS | A limited memory algorithm for bound constrained optimization | Quasi-Newton Methods | |||
| Newton's Method | ANL'1982 | Newton's method | Hessian Approximation & Estimation | ||
| L-BFGS | Updating quasi-newton matrices with limited storage | Quasi-Newton Methods | Stochastic BFGS | ||
| Gauss-Newton Method | Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method | Hessian Approximation & Estimation |
📍 Zeroth-Order Algorithms
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| AdaMeZO | AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments | Link | Adaptive Methods | ||
| MEAZO | On Adaptivity in Zeroth-Order Optimization | Link | Memory-efficient Methods | ||
| ZO-SAH | Subspace-based Approximate Hessian Method for Zeroth-Order Optimization | Link | Adaptive Methods | Projection-based Adaptive | |
| FZOO | FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed | Link | Variance Reduction | Structured Variance Control | |
| VR-SZD | A Structured Proximal Stochastic Variance Reduced Zeroth-order Algorithm | Link | Variance Reduction | Snapshot Variance Reduction | |
| KerZOO | KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning | Link | Memory-efficient Methods | Inference-Level Memory Zeroth-Order | |
| QZO | Fine-tuning Quantized Neural Networks with Zeroth-order Optimization | Link | Memory-efficient Methods | Quantized Zeroth-Order Finetuning | |
| VAMO | VAMO: Efficient Large-Scale Nonconvex Optimization via Adaptive Zeroth Order Variance Reduction | Link | Zeroth-First Order Hybrid; Variance Reduction | Variance Reduction Hybrid; Snapshot Variance Reduction | |
| ZO2 | ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory | Link | Memory-efficient Methods | Inference-Level Memory Zeroth-Order | |
| LORENZA | LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training and Fine-Tuning via Efficient Zeroth-Order Adaptive SAM Optimization | Link | Adaptive Methods | Momentum-based Adaptive; Low-Rank Adaptive | |
| QuZO | QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models | Link | Memory-efficient Methods; Low-Rank Methods | Quantized Zeroth-Order Finetuning; Low-Rank & Quantization | |
| MaZO | MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models | Link | Memory-efficient Methods | Sparse Parameter Zeroth-Order | |
| DiZO | Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning | Link | Adaptive Methods | Projection-based Adaptive | |
| TeZO | TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs | Link | Memory-efficient Methods | Low-Rank Zeroth-Order Finetuning | |
| ELASTICZO | ELASTICZO: A MEMORY-EFFICIENT ON-DEVICE LEARNING WITH COMBINED ZEROTH- AND FIRST-ORDER OPTIMIZATION | Link | Zeroth-First Order Hybrid | Layer-Wise Hybrid | |
| LOZO | Enhancing zeroth-order fine-tuning for language models with low-rank structures | Link | Memory-efficient Methods | Low-Rank Zeroth-Order Finetuning | |
| Addax | Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models | Link | Zeroth-First Order Hybrid | Weighted Hybrid | |
| ZOQO | ZOQO: Zero-Order Quantized Optimization | Link | Memory-efficient Methods | Quantized Zeroth-Order Finetuning | |
| R-AdaZO | Refining Adaptive Zeroth-Order Optimization at Ease | Link | Adaptive Methods | Momentum-based Adaptive | |
| ZO-AdaMM | Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | Link | Adaptive Methods | Momentum-based Adaptive | |
| LeZO | SIMULTANEOUS COMPUTATION AND MEMORY EFFICIENT ZEROTH-ORDER OPTIMIZER FOR FINE-TUNING LARGE LANGUAGE MODELS | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning | |
| SuZero | Zeroth-Order Fine-Tuning of LLMs in Random Subspaces | Link | Memory-efficient Methods | Low-Rank Zeroth-Order Finetuning | |
| Sparse MeZO | Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning | |
| MeZO-SVRG | Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models | Link | Variance Reduction | Snapshot Variance Reduction | |
| ZO-AdaMU | ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization | Link | Perturbation Optimization; Memory-efficient Methods | Paired Perturbation Sampling; Inference-Level Memory Zeroth-Order | |
| ZoPro | A Zeroth-Order Proximal Algorithm for Consensus Optimization | Link | Distributed Zero-Order Optimization | Distributed Perturbation Sampling | |
| MeZO | Fine-Tuning Language Models with Just Forward Passes | Link | Memory-efficient Methods | Inference-Level Memory Zeroth-Order | |
| TOP-DP | Topology-aware Differential Privacy for Decentralized Image Classification | Link | Distributed Zero-Order Optimization; Differential Privacy Optimization | Privacy-Preserving Zeroth-Order; DP-SGD Variants; Dynamic Noise Scheduling; Privacy-Utility Balance | |
| SPSA | Global random optimization by simultaneous perturbation stochastic approximation | Perturbation Optimization | Paired Perturbation Sampling |
🌐 Distributed Optimization
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| AlignFed | AlignFed: Alignment-Aware Asynchronous Federated Fine-Tuning for Large Language Models in Heterogeneous Edge Environments | Link | Federated Learning Optimization | Asynchronous Federated Aggregation | |
| DECA | DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data | Link | Federated Learning Optimization | Block-Wise Decentralized Adam | |
| FedSIR | FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels | Link | Federated Learning Optimization | ||
| Ringleader ASGD | First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data | Link | Local Update Strategies | Local-Global Hybrid Updates | |
| FedMuon | FedMuon: Accelerating Federated Learning with Matrix Orthogonalization | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| DLAS-R-FTC | Distributed Optimization and Learning for Automated Stepsize Selection with Finite Time Coordination | Link | Decentralized Communication | Distributed Consensus Optimization | |
| DOME | Communication Efficient, Differentially Private Distributed Optimization using Correlation-Aware Sketching | Link | Gradient Compression & Quantization | Low-Rank Gradient Compression | |
| DeCo-SGD | DeCo-SGD: Joint Optimization of Delay Staleness and Gradient Compression Ratio for Distributed SGD | Link | Gradient Compression & Quantization; Local Update Strategies | Adaptive Compression Level; Local-Global Hybrid Updates | |
| TAH-QUANT | TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network | Link | Gradient Compression & Quantization | Quantization Compression | |
| LQ-SGD | Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm | Link | Gradient Compression & Quantization; Low-Rank Methods | Quantization Compression; Low-Rank & Quantization | |
| FedCurv | Blockchain-Enabled Privacy-Preserving Second-Order Federated Edge Learning in Personalized Healthcare | Link | Federated Learning Optimization | Federated Second-Order Optimization | |
| pFedSOP | pFedSOP: Accelerating Training of Personalized Federated Learning Using Second-Order Optimization | Link | Federated Learning Optimization | Federated Second-Order Optimization | |
| FedOne | FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning | Link | Federated Learning Optimization | Client Sampling Optimization | |
| DES-LOC | DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models | Link | Decentralized Communication | Distributed Consensus Optimization | |
| Kuramoto-FedAvg | Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| AbsSADMM | Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization | Link | Local Update Strategies | Adaptive Local Steps | |
| ADEF | Accelerated Distributed Optimization with Compression and Error Feedback | Link | Gradient Compression & Quantization | Compression Error Compensation | |
| FedCET | Communication Efficient Federated Learning with Linear Convergence on Heterogeneous Data | Link | Local Update Strategies | Adaptive Local Steps | |
| Interleaved-ShuffleG | The Cost of Shuffling in Private Gradient Based Optimization | Link | Decentralized Communication; Differential Privacy Optimization | Privacy-Preserving Decentralization; Privacy-Utility Balance | |
| FAdamGC | Gradient Correction in Federated Learning with Adaptive Optimization | Link | Federated Learning Optimization | Client Sampling Optimization | |
| LT-ADMM | Communication-Efficient Stochastic Distributed Learning | Link | Decentralized Communication | Distributed Consensus Optimization | |
| HybridSGD | Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization | Link | Local Update Strategies | Local-Global Hybrid Updates | |
| DAT-SGD | Enhancing Parallelism in Decentralized Stochastic Convex Optimization | Link | Decentralized Communication | Neighbor Communication Topology | |
| FedSTaS | FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning | Link | Federated Learning Optimization | Client Sampling Optimization | |
| FedIvon | Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization | Link | Federated Learning Optimization | Personalized Federated Optimization | |
| FAGH | FAGH: Accelerating Federated Learning with Approximated Global Hessian | Link | Federated Learning Optimization | Federated Second-Order Optimization | |
| AdaFedAdam | ACCELERATING FAIR FEDERATED LEARNING: ADAPTIVE FEDERATED ADAM | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| MM-PSGD | Distributed Optimization over Block-Cyclic Data | Link | Federated Learning Optimization | Personalized Federated Optimization | |
| MC-PSGD | Distributed Optimization over Block-Cyclic Data | Link | Federated Learning Optimization | Personalized Federated Optimization | |
| FedLion | FEDLION: FASTER ADAPTIVE FEDERATED OPTIMIZATION WITH FEWER COMMUNICATION | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| FADAS | FADAS: Towards Federated Adaptive Asynchronous Optimization | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| FLeNS | FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| FedRepOpt | FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| Fed-Sophia | Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm | Link | Hessian Approximation & Estimation; Federated Learning Optimization | Diagonal Hessian Approximation; Federated Second-Order Optimization | |
| FedLAP-DP | FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations | Link | Decentralized Communication | Privacy-Preserving Decentralization | |
| AdaCGD | Adaptive Compression for Communication-Efficient Distributed Training | Link | Gradient Compression & Quantization | Adaptive Compression Level | |
| 0/1 Adam | Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam | Link | Gradient Compression & Quantization | Quantization Compression | |
| SketchedAMSGrad | Communication-Efficient Adam-Type Algorithms for Distributed Data Mining | Link | Gradient Compression & Quantization | Low-Rank Gradient Compression | |
| SPARQ-SGD | SPARQ-SGD: Event-Triggered and Compressed Communication in Decentralized Stochastic Optimization | Link | Gradient Compression & Quantization | Sparsification Compression | |
| 1-bit Adam | 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed | Link | Gradient Compression & Quantization | Quantization Compression | |
| BVR-L-SGD | Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning | Link | Local Update Strategies | Local SGD | |
| A(DP)^2SGD | A(DP)^2SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy | Link | Decentralized Communication | Neighbor Communication Topology | |
| SQuARM-SGD | SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization | Link | Momentum-Enhanced SGD; Local Update Strategies; Distributed Hybrid Optimization | Accelerated Momentum; Local SGD; Local Momentum Updates; Compression & Local Updates | |
| DLCP | Domain-specific Communication Optimization for Distributed DNN Training | Link | Local Update Strategies | Local SGD | |
| APMSqueeze | APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm | Link | Gradient Compression & Quantization | Compression Error Compensation | |
| DEED-GD | DEED: A General Quantization Scheme for Communication Efficiency in Bits | Link | Gradient Compression & Quantization | Quantization Compression | |
| DP-PASGD | Differentially Private Federated Learning for Resource-Constrained Internet of Things | Link | Local Update Strategies | Adaptive Local Steps | |
| LAGS-SGD | Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees | Link | Gradient Compression & Quantization | Adaptive Compression Level | |
| FedAC | Federated Accelerated Stochastic Gradient Descent | Link | Federated Learning Optimization | Federated Momentum Fusion | |
| rTop-k | rTop-k: A Statistical Estimation Approach to Distributed SGD | Link | Gradient Compression & Quantization | Sparsification Compression | |
| Qsparse-local-SGD | Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations | Link | Distributed Hybrid Optimization | Compression & Local Updates | |
| SLOWMO | SLOWMO: IMPROVING COMMUNICATION-EFFICIENT DISTRIBUTED SGD WITH SLOW MOMENTUM | Link | Local Update Strategies | Local SGD | |
| SCAFFOLD | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning | Link | Local Update Strategies | Local SGD | |
| LD-SGD | Communication-Efficient Local Decentralized SGD Methods | Link | Decentralized Communication | Neighbor Communication Topology | |
| PowerSGD | PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | Link | Gradient Compression & Quantization | Low-Rank Gradient Compression | |
| signProx | signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization | Link | Gradient Compression & Quantization | Quantization Compression | |
| signSGD | signSGD: Compressed Optimisation for Non-Convex Problems | Link | Gradient Compression & Quantization | Quantization Compression |
🛡️ Privacy-Preserving Optimization
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| PrivCode++ | PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees | Link | Differential Privacy Optimization | DP Fine-Tuning for Code Generation | |
| DP-MacAdam | DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum | Link | Differential Privacy Optimization | Adaptive Clipping and Momentum | |
| DPSR-CG | Revisiting Privacy Amplification by Subsampling in Selective Release DPSGD | Link | Differential Privacy Optimization | Selective Release DP-SGD | |
| PINA | Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation | Link | Differential Privacy Optimization | ||
| DP-aware AdaLN-Zero | DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion | Link | Differential Privacy Optimization | Dynamic Noise Scheduling | |
| DP-λCGD | DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training | Link | Differential Privacy Optimization | DP-SGD Variants | |
| RaCO-DP | Private Rate-Constrained Optimization with Applications to Fair Learning | Link | Differential Privacy Optimization | DP-SGD Variants | |
| Interleaved-ShuffleG | The Cost of Shuffling in Private Gradient Based Optimization | Link | Decentralized Communication; Differential Privacy Optimization | Privacy-Preserving Decentralization; Privacy-Utility Balance | |
| DPZV | DPZV: Elevating the Tradeoff between Privacy and Utility in Zeroth-Order Vertical Federated Learning | Link | Gradient Noise Injection | Noise-Robust Optimization | |
| Stable-SPAM | Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam | Link | Gradient Normalization & Clipping; Privacy-Aware Gradient Clipping | Layer-Wise Gradient Normalization; Dynamic Gradient Clipping; Adaptive Clipping | |
| GeoDP-SGD | Analyzing and Optimizing Perturbation of DP-SGD Geometrically | Link | Gradient Noise Injection | Noise-Robust Optimization | |
| Logit-DP | DIFFERENTIALLY PRIVATE OPTIMIZATION FOR NONDECOMPOSABLE OBJECTIVE FUNCTIONS | Link | Privacy-Aware Gradient Clipping | Global Gradient Clipping | |
| DOPPLER | DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction | Link | Differential Privacy Optimization | DP-SGD Variants | |
| DC-SGD | DC-SGD: Differentially Private SGD with Dynamic Clipping through Gradient Norm Distribution Estimation | Link | Differential Privacy Optimization; Gradient Noise Injection; Privacy-Utility Tradeoff; Privacy-Aware Gradient Clipping | Dynamic Noise Scheduling; Dynamic Clipping Threshold; Adaptive Clipping | |
| SPARTA | SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning | Link | Differential Privacy Optimization | DP-SGD Variants | |
| DP-AdamW-BC | DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning | Link | Differential Privacy Optimization | DP-SGD Variants | |
| DP-MicroAdam | DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning | Link | Differential Privacy Optimization | DP-SGD Variants | |
| AClipped-dpSGD | Efficient Private SCO for Heavy-Tailed Data via Averaged Clipping | Link | Privacy-Aware Gradient Clipping | Global Gradient Clipping | |
| Sophia | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | Link | Hessian Approximation & Estimation; Curvature-Guided Preconditioning; Second-Order Moment Fusion; Privacy-Aware Gradient Clipping; Stateless Optimization Methods | Diagonal Hessian Approximation; Hessian Diagonal Preconditioning; Noise-Robust Second-Order Momentum; Real-Time Curvature Estimation | |
| ANSGD | Learning across Data Owners with Joint Differential Privacy | Link | Differential Privacy Optimization | DP-SGD Variants | |
| DP-Adam | DP-ADAM: CORRECTING DP BIAS IN ADAM’S SECOND MOMENT ESTIMATION | Link | Differential Privacy Optimization; Gradient Noise Injection | DP-SGD Variants; Noise-Robust Optimization | |
| DP-FedSAM | Make Landscape Flatter in Differentially Private Federated Learning | Link | Gradient Noise Injection; Federated Privacy Enhancement | Noise-Robust Optimization; Federated Noise Aggregation | |
| DPIS | DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling | Link | Differential Privacy Optimization | DP-SGD Variants | |
| DP-SGD-JL | Fast and Memory Efficient Differentially Private-SGD via JL Projections | Link | Differential Privacy Optimization | DP-SGD Variants | |
| TOP-DP | Topology-aware Differential Privacy for Decentralized Image Classification | Link | Distributed Zero-Order Optimization; Differential Privacy Optimization | Privacy-Preserving Zeroth-Order; DP-SGD Variants; Dynamic Noise Scheduling; Privacy-Utility Balance | |
| DP-LSSGD | DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM | Link | Privacy-Utility Tradeoff | Post-Processing Optimization |
⚡ Memory-Efficient Optimization
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| DL-ZO | Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs | Link | Memory-efficient Methods | Layer-Selective Fine-Tuning | |
| LQ-SGD | Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm | Link | Gradient Compression & Quantization; Low-Rank Methods | Quantization Compression; Low-Rank & Quantization | |
| SUMO | SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training | Link | Low-Rank Gradient Storage | Gradient Low-Rank Projection | |
| AlphaGrad | AlphaGrad: Non-Linear Gradient Normalization Optimizer | Link | Adaptive Learning Rate Methods; Low-Memory Optimizer Design; Stateless Optimization Methods | Stateless Adaptation; Structural Redesign; Parameter Characteristic-Driven Updates | |
| QuZO | QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models | Link | Memory-efficient Methods; Low-Rank Methods | Quantized Zeroth-Order Finetuning; Low-Rank & Quantization | |
| GWT | Wavelet Meets Adam: Compressing Gradients for Memory-Efficient Training | Link | Low-Rank Gradient Storage | Gradient Low-Rank Projection | |
| AdaRankGrad | AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning | Link | Low-Rank Methods | Projection&Adjustment | |
| Adam-mini | ADAM-MINI: USE FEWER LEARNING RATES TO GAIN MORE | Link | Low-Memory Optimizer Design | Structural Redesign | |
| SPAM | SPAM: SPIKE-AWARE ADAM WITH MOMENTUM RESET FOR STABLE LLM TRAINING | Link | Gradient Normalization & Clipping; Optimizer State Compression | Element-Wise Gradient Scaling; Spike-Aware Gradient Clipping; Sparse State Compression | |
| SGD-SaI | No More Adam: Learning Rate Scaling at Initialization is All You Need | Link | Learning Rate Scheduling; Stateless Optimization Methods | Initial Learning Rate Scaling; Parameter Characteristic-Driven Updates | |
| A-GNB | HELENE: HESSIAN LAYER-WISE CLIPPING AND GRADIENT ANNEALING FOR ACCELERATING FINETUNING LLM WITH ZEROTH-ORDER OPTIMIZATION | Link | Low-Rank Gradient Storage | Dynamic Gradient Rank | |
| LeZO | SIMULTANEOUS COMPUTATION AND MEMORY EFFICIENT ZEROTH-ORDER OPTIMIZER FOR FINE-TUNING LARGE LANGUAGE MODELS | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning | |
| Adapprox | Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices | Link | Low-Rank Methods | Projection&Adjustment | |
| Sparse MeZO | Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning | |
| LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | Link | Towards LLM Training; Memory-Efficient Fine-Tuning for Large Models | Real-Time Computation; Stateless Fine-Tuning | |
| MICROADAM | MICROADAM: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | Link | Optimizer State Compression | Sparse State Compression | |
| 4-bit shampoo | 4-bit Shampoo for Memory-Efficient Network Training | Link | Preconditioned Gradient Methods; Low-Memory Optimizer Design | Two Metrics' Preconditioner; Compression&Approximation of States | |
| ADAACT | AN ADAPTIVE METHOD STABILIZING ACTIVATIONS FOR ENHANCED GENERALIZATION | Link | Adaptive Learning Rate Methods; Optimizer State Compression | Neuron-Level Adaptation; State Sharing | |
| Sophia | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | Link | Hessian Approximation & Estimation; Curvature-Guided Preconditioning; Second-Order Moment Fusion; Privacy-Aware Gradient Clipping; Stateless Optimization Methods | Diagonal Hessian Approximation; Hessian Diagonal Preconditioning; Noise-Robust Second-Order Momentum; Real-Time Curvature Estimation | |
| tpSGD | Learning with Local Gradients at the Edge | Link | Low-Memory Optimizer Design | Structural Redesign | |
| AdaBFE | BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization | Link | Learning Rate Scheduling; Stateless Optimization Methods | Gradient Angle Scheduling; Parameter Characteristic-Driven Updates | |
| Adafactor | Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | Link | Low-Memory Optimizer Design; Optimizer State Compression | Compression&Approximation of States; State Sharing |
🧩 Tailored Optimization Approaches
| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
|---|---|---|---|---|---|
| MAdam | | MAdam: Metric-Aware Multi-Objective Adam | Link | Hybrid Methods | Multi-Objective Adaptive Strategy |
| KO | | KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches | Link | Auto-Designed Optimizers | Automated Discovery&Theoretical Derivation |
| BC-ADMM | | BC-ADMM: An Efficient Non-convex Constrained Optimizer with Robotic Applications | Link | Robust Optimization | Structure-Aware Optimization |
| AdaGC | | AdaGC: Improving Training Stability for Large Language Model Pretraining | Link | Gradient Normalization & Clipping; Robust Optimization | Noise-Robust Normalization; Dynamic Gradient Clipping; Noise-Robust Gradients |
| AGS-GD | | Anisotropic Gaussian Smoothing for Gradient-based Optimization | Link | Hybrid Methods; Auto-Designed Optimizers | Gradient Smoothing Hybrid; Automated Discovery&Theoretical Derivation |
| WarpAdam | | WarpAdam: A new Adam optimizer based on Meta-Learning approach | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| MADA | | MADA: Meta-Adaptive Optimizers through hyper-gradient Descent | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| Lion | | Symbolic Discovery of Optimization Algorithms | Link | Adaptive Learning Rate Methods; Auto-Designed Optimizers | Momentum-based Adaptive; Automated Discovery&Theoretical Derivation |
| AdaNorm | | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | Link | Gradient Normalization & Clipping; Robust Optimization | Element-Wise Gradient Scaling; Noise-Robust Normalization; Noise-Robust Gradients |
| BGADAM | | BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| HyperAdam | | HyperAdam: A Learnable Task-Adaptive Adam for Network Training | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| GADAM | | GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
🔬Future Prospects
Challenges
- ⚖️ Multi-objective Trade-off Bottlenecks: Optimizing large models often forces a choice between convergence speed, memory efficiency, and distributed scaling, which can sacrifice generalization, introduce latency, or exacerbate instability. The core challenge is breaking these interconnected bottlenecks under a unified framework.
- 💾 Memory and Computational Overheads: Dense optimizer states create severe memory bottlenecks. Structured curvature approximations that rely on operations such as matrix inversion significantly reduce end-to-end efficiency, and the added per-step latency often negates the theoretical advantage of needing fewer iterations.
- 🔊 Noise Amplification and Estimation Variance: Anisotropic loss landscapes heavily amplify stochastic mini-batch noise. Random perturbations used for directional gradients suffer from approximation variance that scales poorly with dimensionality, and privacy-preserving noise degrades gradient fidelity.
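The dimensionality issue noted in the last item can be illustrated with a minimal sketch, assuming a two-point Gaussian-perturbation estimator (in the spirit of SPSA/MeZO-style methods) on a toy quadratic. The function names `quadratic`, `zo_two_point_grad`, and `mean_squared_error` are hypothetical and chosen only for this example. For this particular objective, the expected squared error of a single estimate works out to (d + 1)·‖∇f‖², so it grows roughly linearly with the dimension d.

```python
import math
import random


def quadratic(x):
    """Toy objective f(x) = 0.5 * ||x||^2, whose exact gradient is x."""
    return 0.5 * sum(v * v for v in x)


def zo_two_point_grad(f, x, eps, rng):
    """One two-point gradient estimate along a random Gaussian direction z."""
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * zi for xi, zi in zip(x, z)]
    x_minus = [xi - eps * zi for xi, zi in zip(x, z)]
    # Central finite difference approximates the directional derivative along z.
    slope = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [slope * zi for zi in z]


def mean_squared_error(dim, n_samples=2000, eps=1e-3, seed=0):
    """Monte Carlo estimate of E||g_hat - grad||^2 at a unit-norm point."""
    rng = random.Random(seed)
    x = [1.0 / math.sqrt(dim)] * dim  # unit norm, so the true gradient has norm 1
    total = 0.0
    for _ in range(n_samples):
        g = zo_two_point_grad(quadratic, x, eps, rng)
        total += sum((gi - xi) ** 2 for gi, xi in zip(g, x))
    return total / n_samples


if __name__ == "__main__":
    for d in (10, 100):
        print(d, mean_squared_error(d))
```

Comparing `mean_squared_error(10)` with `mean_squared_error(100)` shows the error of a single-direction estimate growing with dimension. This toy setting ignores mini-batch noise and privacy noise, which would add further error terms.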
Trends
- 🤖 Automated Symbolic Discovery: Shifting from fragile heuristic tuning to the automated generation of architecture-specific optimizers, enabling models to inherently navigate complex loss landscapes without manual intervention.
- 🧮 Preconditioning and Orthogonalization: Leveraging structural gradient statistics for preconditioning or matrix orthogonalization (e.g., Kron and Muon) to overcome the representational bottlenecks of simple diagonal scaling and open novel parameter space pathways.
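As a rough illustration of the matrix orthogonalization mentioned above, the following sketch applies the classical cubic Newton-Schulz iteration to push all singular values of a gradient-like matrix toward 1. It is an assumption-laden toy: it uses the textbook update X ← 1.5·X − 0.5·(X Xᵀ)X in plain Python rather than the tuned coefficients or fused kernels of any specific optimizer, and the helper names `matmul`, `transpose`, and `newton_schulz_orthogonalize` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=25):
    """Approximate the orthogonal polar factor of g (assumes rows <= cols, full row rank)."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Classical cubic update: X <- 1.5 * X - 0.5 * (X X^T) X
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x


if __name__ == "__main__":
    q = newton_schulz_orthogonalize([[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]])
    print(matmul(q, transpose(q)))  # should be close to the 2x2 identity
```

Unlike diagonal scaling, which rescales each coordinate independently, this update equalizes the singular values of the whole matrix, which is the structural difference the item above points to. Very small singular values converge slowly under this iteration, so the step count here is a conservative choice for a well-conditioned toy input.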
Opportunities
- 🧩 Deep Integration of Multi-Order Algorithms: Moving beyond isolated algorithmic improvements by deeply integrating FO, SO, and ZO algorithms. The fundamental focus will shift from minimizing iteration complexity to improving global wall-clock efficiency.
- 🧭 Dimensionality-Robust Subspace Projection: Discovering advanced subspace projection methods that safely constrain massive search spaces without prematurely restricting algorithmic access to high-quality global solutions.
- ⚖️ State Fidelity Preservation: Designing future memory-efficient architectures to smoothly average out extreme gradient shocks without exceeding memory limits.
🔗Citation
If you find our survey and repository useful for your research project, please consider citing our paper:
@article{zhang2026evolution,
title={Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations},
author={Zhang, Tong and Zhang, Jiangning and Xue, Zhucun and Jiang, Juntao and Xu, Yicheng and Xu, Chengming and Hu, Teng and Xie, Xingyu and Hu, Xiaobin and Wang, Yabiao and others},
journal={arXiv preprint arXiv:2604.12968},
year={2026}
}
📫Contact
186368@zju.edu.cn