Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

June 12, 2026


Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu†, Yabiao Wang, Yong Liu†, Shuicheng Yan


📖 Overview

Foundational optimization algorithms are the core driving force behind deep learning, evolving from early stochastic gradient descent (SGD) to the widely adopted Adam family. However, as the scale of modern foundation models grows massively, this optimization paradigm is forced to expand, encountering new physical and systemic bottlenecks during large-scale training. In particular, stringent differential privacy requirements and distributed training paradigms have exposed critical limitations of conventional approaches regarding privacy protection and memory efficiency.

Existing reviews of optimization algorithms, however, often focus on narrow technical areas (e.g., first-order or second-order methods) and lack a comprehensive perspective on the field's evolution, especially regarding zeroth-order and scenario-oriented paradigms.

To address these gaps, this survey provides a systematic review of the development of optimization algorithms, tracing their evolution through four major paradigms:

  • First-order methods
  • Second-order methods
  • Zeroth-order methods
  • Scenario-oriented paradigms

We conduct comprehensive theoretical analysis and standardized empirical evaluations, objectively pointing out the pros, cons, and fundamental design trade-offs of various methods across different architectures. By synthesizing theoretical insights with extensive empirical evidence, we distill key developmental trends and provide actionable guidance and future research directions for designing next-generation efficient, robust, and trustworthy optimization algorithms.

🎯Contributions

1️⃣ Unified Taxonomy: Establishing a rigorous mathematical taxonomy that unifies disparate conceptual definitions across fundamental optimization primitives.

  • 📊 Evolutionary Trajectory: tracing the development of foundational algorithms from First-Order to Second-Order and Zeroth-Order methods.
  • 🔬 Intrinsic Connections: clarifying the complex evolutionary logic and structural relationships between different optimization approaches to provide a coherent framework for the field.

2️⃣ Scenario-Oriented Analysis: Demonstrating how foundational algorithms are fundamentally re-architected into scenario-oriented paradigms to address severe physical bottlenecks.

  • 📈 Systems-Aware Engineering: highlighting the critical shift from pure algorithmic design to practical solutions that balance theoretical guarantees with strict engineering constraints.
  • 🔍 Overcoming Systemic Barriers: detailing how these paradigms tackle specific, real-world challenges such as distributed communication barriers and strict differential privacy constraints.

3️⃣ Standardized Evaluation: Introducing a rigorously controlled evaluation framework that strictly separates pure algorithmic performance from large-scale engineering optimizations.

  • 🚀 Extensive Benchmarking: developing a standardized testbed to evaluate 23 distinct optimizers across diverse architectural proxies, including CNN and Transformer-based models.
  • 🔮 Strategic Insights: systematically isolating and examining learning rate sensitivity, long-term training scalability, and cross-architecture generalization to guide the design of next-generation optimizers.

📈Evolution

📈 A Comprehensive Analysis of Optimization Methods: This figure systematically summarizes the development trends and core characteristics of optimization methodologies across different orders.

  • Key Insights: Attention to optimization algorithms has risen sharply since 2024. This growth is closely tied to the rapid development of large-scale models, with first-order methods maintaining a dominant position.

🚩Timeline

Timeline of prominent optimization algorithms. The evolution highlights key algorithmic milestones, associated research institutions, and publication venues over time.


🏗️ Architecture Overview

🗂️ A Comprehensive Taxonomy of Optimization Algorithms

📐 Taxonomy Overview: This framework categorizes existing works based on three dominant paradigms, First-Order Methods, Second-Order Methods, and Zeroth-Order Methods, and further structures them according to their fundamental mathematical principles and evolutionary development. Key branches include:

  • 🚀 First-Order Methods: Gradient-Driven (e.g., SGD) → Adaptive Learning Rate (e.g., Adam) → Acceleration to Automation (e.g., Adan, Nadam) → Scalar to Preconditioner (e.g., Shampoo) → Stability to Temporal (e.g., SPAM) → Temporal to Geometry (e.g., SAM).
  • ⚙️ Second-Order Methods: Deterministic Curvature to Geometry (e.g., K-FAC, AdaFisher) → Approximation to Iterative Update (e.g., ADAHESSIAN).
  • 📍 Zeroth-Order Methods: Perturbation Optimization (e.g., FZOO, LeZO) → Adaptive to Resource-Aware (e.g., MeZO, ZO-AdaMM) → Variance Reduction to Adaptive (e.g., MeZO-SVRG).
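
To make the branches above concrete, here is a minimal NumPy sketch (not taken from this repo's benchmark code) of three textbook update rules on a toy quadratic: SGD with momentum, Adam, and a two-point random-direction zeroth-order step that only queries the loss value. The toy loss, the function names, and all hyperparameters are illustrative assumptions, and none of the named methods in the taxonomy is reproduced exactly.

```python
import numpy as np

def loss(w):
    # Toy ill-conditioned quadratic (assumption for illustration only).
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    # Exact gradient of the toy loss above.
    return np.array([w[0], 10.0 * w[1]])

def sgd_momentum(w, steps=200, lr=0.05, beta=0.9):
    # Gradient-driven branch: heavy-ball momentum accumulates past gradients.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

def adam(w, steps=200, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Adaptive-learning-rate branch: per-coordinate scaling by the
    # bias-corrected second-moment estimate.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

def zo_sgd(w, steps=2000, lr=0.01, mu=1e-3, seed=0):
    # Zeroth-order branch: estimate the gradient from two loss evaluations
    # along a random Gaussian direction z; grad() is never called.
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        z = rng.standard_normal(w.shape)
        g_hat = (loss(w + mu * z) - loss(w - mu * z)) / (2 * mu) * z
        w = w - lr * g_hat
    return w

if __name__ == "__main__":
    w0 = np.array([3.0, 2.0])
    for name, fn in [("SGD+momentum", sgd_momentum), ("Adam", adam), ("ZO-SGD", zo_sgd)]:
        print(name, "final loss:", loss(fn(w0)))
```

The zeroth-order variant needs many more steps because each estimate is a noisy projection of the true gradient onto a single random direction; in exchange it needs no backpropagation, which is the memory argument behind the resource-aware zeroth-order methods listed above.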

📊 Benchmark Evaluation Results on Vision Tasks

📈 Benchmark Evaluation: This comprehensive assessment evaluates 23 representative optimization algorithms across continuous vision architectures (ViT-S and ResNet-50) and varying training horizons:

  • ⏱️ Short-Term Convergence (100 Epochs): Evaluates the rapid descent capability and initial exploration efficiency of optimizers within a constrained computational budget.
  • 🏃 Long-Term Scalability (300 Epochs): Assesses the algorithm's resilience against late-stage gradient noise and its capacity to continually extract representational power over extended cycles.
  • 📊 Ranking Dynamics: Tracks relative performance shifts across epochs, highlighting how algorithms dynamically navigate the trade-off between early acceleration and long-term stability.

🔥Add Your Paper in our Survey!!!!!

  • You are welcome to open an issue or PR for your optimizer work!!!!!

Note: Given the huge number of papers on arXiv, we regret that we cannot cover all of them in our survey. You can submit a PR to this repo directly, and we will include your work in the next version of the survey.

🔥New

  • [2026.06.12] We updated the GitHub repository to record the papers available as of 2026/6/12.

🌟Welcome everyone to follow and join the Scaling Opt community: Scaling Opt Community

  • Algorithm Visualizations: Includes visualization scripts for the Rosenbrock Function and Rastrigin Function, allowing users to freely explore optimization behaviors.

  • Performance Benchmarks: We primarily recommend benchmarks based on Algoperf, along with other benchmark suites and analysis articles for validating and comparing state-of-the-art optimizers.

  • Papers & Blogs Recommendations: The platform curates high-quality papers and blog posts from recent years, continuously updated with the latest daily arXiv publications. Currently, the collection contains nearly one hundred resources.

  • Tutorials Sharing: The platform gathers high-quality community resources and is actively developing a tutorial series titled From Classical to Modern Optimizers.
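
As a rough idea of what exploring these test functions involves, the sketch below defines the standard Rosenbrock and Rastrigin functions with their gradients and records a plain gradient-descent trajectory. It is an independent illustration, not the community's visualization script; the helper name `descend`, the start points, and the step sizes are assumptions. Plotting is left out so the example depends only on NumPy.

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    # Narrow curved valley with its global minimum at (a, a**2).
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([
        -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2),
        2.0 * b * (y - x ** 2),
    ])

def rastrigin(p, A=10.0):
    # Highly multimodal surface with its global minimum at the origin.
    p = np.asarray(p, dtype=float)
    return A * p.size + np.sum(p ** 2 - A * np.cos(2.0 * np.pi * p))

def rastrigin_grad(p, A=10.0):
    p = np.asarray(p, dtype=float)
    return 2.0 * p + 2.0 * np.pi * A * np.sin(2.0 * np.pi * p)

def descend(grad_fn, start, lr, steps):
    # Hypothetical helper: plain gradient descent that returns every
    # iterate as an array of shape (steps + 1, dim), ready for plotting.
    path = [np.asarray(start, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] - lr * grad_fn(path[-1]))
    return np.stack(path)

if __name__ == "__main__":
    ros_path = descend(rosenbrock_grad, [-1.2, 1.0], lr=1e-3, steps=500)
    ras_path = descend(rastrigin_grad, [0.4, -0.3], lr=1e-3, steps=500)
    print("Rosenbrock:", rosenbrock(ros_path[0]), "->", rosenbrock(ros_path[-1]))
    print("Rastrigin: ", rastrigin(ras_path[0]), "->", rastrigin(ras_path[-1]))
```

Swapping the update inside `descend` for another optimizer's step and overlaying the returned paths on a contour plot is one simple way to compare optimization behaviors on the two surfaces.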


🔨Installation

To reproduce our benchmarks, you need to clone this repository and install the required dependencies. We strongly recommend using a virtual environment (e.g., Conda).

# 1. Clone the repository
git clone https://github.com/JZhangTon/awesome-optimizer.git
cd awesome-optimizer

# 2. Install required packages
pip install -r requirements.txt

⚙️Usage & 📈Benchmarking

  • To start a training run and reproduce our benchmark results, execute the provided training scripts. A benchmarking script is also provided; see examples/benchmark for usage.

🗂️Taxonomy of Optimization Algorithms

🚀 First-Order Algorithms

AbbreviationVenue & YearPaper TitleProjectSub-methodsFine-grained Methods
PionarXiv'2605Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVRLinkSpectral High PassMatrix Orthogonalization
C-AdamarXiv'2605A Theoretical and Experimental Study of a Novel Adaptive Learning AlgorithmLinkAdaptive Learning Rate MethodsLine-of-Sight Adam Variant
SparseOptarXiv'2605SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse TrainingLinkGradient Normalization & ClippingSparse Training Gradient Rebalancing
AnonarXiv'2605Anon: Extrapolating Optimizer Adaptivity Across the Real SpectrumLinkAdaptive Learning Rate Methods
PS-Clip-SGDarXiv'2605Robust and Fast Training via Per-Sample ClippingLinkEnhancing Training Stability
NoraarXiv'2605Nora: Normalized Orthogonal Row Alignment for Scalable Matrix OptimizerLinkAdaptive Step-Size ControlMatrix Orthogonalization
Muon^2arXiv'2604Muon^2: Boosting Muon via Adaptive Second-Moment PreconditioningLinkAdaptive Step-Size ControlMatrix Orthogonalization
HomeAdamarXiv'2603HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable GeneralizationLinkAdaptive Step-Size ControlSecond-order moment adaptation
FlashOptimarXiv'2602FlashOptim: Optimizers for Memory-Efficient TrainingLinkMemory-Efficient OptimizationLow-Memory Optimizer Design
FANoSarXiv'2601FANoS: Friction-Adaptive Nos´e–Hoover Symplectic Momentum for Stiff ObjectivesLinkAccelerating Convergence RateMomentum Damping Mechanism
NOVAKarXiv'2601NOVAK: Unified adaptive optimizer for deep neural networksLinkHybrid MethodsGradient Smoothing Hybrid
AdamNXarXiv'2511AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimateLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
ROOTarXiv'2511ROOT: Robust Orthogonalized Optimizer for Neural Network TrainingLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
AuONarXiv'2509AuON: A Linear-time Alternative to Orthogonal Momentum UpdatesLinkGradient Normalization & ClippingLayer-Wise Gradient Normalization
ZetAarXiv'2508ZETA: A HYBRID OPTIMIZER COMBINING RIEMANN ZETA SCALING WITH ADAM FOR ROBUST DEEP LEARNINGLinkHybrid MethodsMulti-Objective Hybrid
NIRMALarXiv'2508COMPARATIVE ANALYSIS OF NOVEL NIRMAL OPTIMIZER AGAINST ADAM AND SGD WITH MOMENTUMLinkHybrid MethodsMulti-Objective Hybrid
SCSAdamWarXiv'2507Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamWLinkLoss Landscape OptimizationMomentum Landscape Adaptation
adaNPAGarXiv'2507Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization *LinkMomentum-Enhanced SGDAccelerated Momentum
SoftSignSGDarXiv'2507SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond AdamLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
AdaMuonarXiv'2507ADAMUON: ADAPTIVE MUON OPTIMIZERLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
Accelerated GRAALarXiv'2507NESTEROV FINDS GRAAL: OPTIMAL AND ADAPTIVE GRADIENT METHOD FOR CONVEX OPTIMIZATIONLinkHybrid MethodsMulti-Objective Hybrid
DEOarXiv'2507Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network TrainingLinkLoss Landscape OptimizationCurvature-Guided Landscape Exploration
LyAmarXiv'2507LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy EnvironmentsLinkLearning Rate SchedulingStability-Aware Adaptive Scheduling
SplusarXiv'2506A Stable Whitening Optimizer for Efficient Neural Network TrainingLinkPreconditioned Gradient MethodsTwo Metrics' Preconditioner
HGMarXiv'2506Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning RatesLinkLearning Rate SchedulingGradient Angle Scheduling
AutoSGDarXiv'2505AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient DescentLinkLearning Rate SchedulingScheduler-Free Adaptation
AdamSarXiv'2505AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-trainingLinkAdaptive Learning Rate MethodsStateless Adaptation
LightSAMarXiv'2505LightSAM: Parameter-Agnostic Sharpness-Aware MinimizationLinkLoss Landscape OptimizationSharpness-Aware Minimization (SAM)
ADAGB2arXiv'2505Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained OptimizationLinkHybrid MethodsProjection Gradient Hybrid
VRAdamarXiv'2505A Physics-Inspired Optimizer: Velocity Regularized AdamLinkMomentum-Enhanced SGDMomentum Damping Mechanism
SKA-SGDarXiv'2505STREAMING KRYLOV-ACCELERATED STOCHASTIC GRADIENT DESCENTLinkLoss Landscape OptimizationCurvature-Guided Landscape Exploration
Adam-PowerarXiv'2505GradPower: Powering Gradients for Faster Language Model Pre-TrainingLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
AlphaGradarXiv'2504AlphaGrad: Non-Linear Gradient Normalization OptimizerLinkAdaptive Learning Rate Methods; Low-Memory Optimizer Design; Stateless Optimization MethodsStateless Adaptation; Structural Redesign; Parameter Characteristic-Driven Updates
AsyncSAMarXiv'2503ASYNCHRONOUS SHARPNESS-AWARE MINIMIZATION FOR FAST AND ACCURATE DEEP LEARNINGLinkLoss Landscape OptimizationSharpness-Aware Minimization (SAM)
ASGOarXiv'2503ASGO: Adaptive Structured Gradient OptimizationLinkPreconditioned Gradient MethodsSingle Metric's Preconditioner
AdaGCarXiv'2502AdaGC: Improving Training Stability for Large Language Model PretrainingLinkGradient Normalization & Clipping; Robust OptimizationNoise-Robust Normalization; Dynamic Gradient Clipping; Noise-Robust Gradients
AdadiagarXiv'2502Improving Adaptive Moment Optimization viaPreconditioner DiagonalizationLinkHybrid MethodsProjection Gradient Hybrid
eaglearXiv'2502EAGLE: EARLY APPROXIMATED-GRADIENT-BASED LEARNING RATE ESTIMATORLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
Hessian-aware ScalingarXiv'2502First-ish Order Methods: Hessian-aware Scalings of Gradient DescentLinkPreconditioned Gradient MethodsSingle Metric's Preconditioner
GCSAMarXiv'2501GCSAM: Gradient Centralized Sharpness Aware MinimizationLinkGradient Normalization & ClippingMean-Removal Normalization
SGDOarXiv'2501Overshoot: Taking advantage of future gradients in momentum-based stochastic optimizationLinkMomentum-Enhanced SGDAccelerated Momentum
μ²-SGDICLR'25DO STOCHASTIC, FEEL NOISELESS: STABLE STOCHASTIC OPTIMIZATION VIA A DOUBLE MOMENTUM MECHANISMLinkMomentum-Enhanced SGDDouble-momentum mechanism
Stable-SPAMICLR'25Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit AdamLinkGradient Normalization & Clipping; Privacy-Aware Gradient ClippingLayer-Wise Gradient Normalization; Dynamic Gradient Clipping; Adaptive Clipping
apolloMLSys'25APOLLO:SGD-LIKE MEMORY, ADAMW-LEVEL PERFORMANCELinkAdaptive Learning Rate Methods; Towards LLM TraningHybrid Adaptive Strategy; Gradient Projection Mechanism
SPAMICLR'25SPAM: SPIKE-AWARE ADAM WITH MOMENTUM RESET FOR STABLE LLM TRAININGLinkGradient Normalization & Clipping; Optimizer State CompressionElement-Wise Gradient Scaling; Spike-Aware Gradient Clipping; Sparse State Compression
Coupled AdamACL'25Better Embeddings with Coupled AdamLinkAdaptive Learning Rate MethodsLayer-Wise Adaptation
SWANICML'25SWAN: SGD WITH NORMALIZATION AND WHITENING ENABLES STATELESS LLM TRAININGLinkTowards LLM TraningGradient Preconditioning Mechanism
LDAdamICLR'25LDADAM: ADAPTIVE OPTIMIZATION FROM LOWDIMENSIONAL GRADIENT STATISTICSLinkHybrid MethodsProjection Gradient Hybrid
SSAMJMLR'25Stabilizing Sharpness-aware Minimization Through A Simple Renormalization StrategyLinkLoss Landscape OptimizationRenormalized Gradient Norm SAM
MARSICML'25MARS: Unleashing the Power of Variance Reduction for Training Large ModelsLinkMomentum-Enhanced SGD; Adaptive Learning Rate MethodsDouble-momentum mechanism;Momentum-based Adaptive
VSGDTMLR'25Variational Stochastic Gradient Descent for Deep Neural NetworksLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
MIAdamAAAI'25A Method for Enhancing Generalization of Adam by Multiple IntegrationsLinkLoss Landscape OptimizationCurvature-Guided Landscape Exploration
KOALA++NeurIPS'25KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance ProductsLinkAdaptive Learning Rate MethodsKalman filtering based
PAdamPCAMMIC'25ADAPTIVE MOMENT ESTIMATION OPTIMIZATION ALGORITHM USING PROJECTION GRADIENT FOR DEEP LEARNINGLinkHybrid MethodsProjection Gradient Hybrid
DecGDMach.Learn.'25A New Adaptive Gradient Method with Gradient DecompositionLinkLearning Rate SchedulingLoss-Sensitive Scheduling
GramsICLR'25 WSGrams: Gradient Descent with Adaptive Momentum ScalingLinkHybrid MethodsMulti-Objective Hybrid
FSGDMICLR'25ON THE PERFORMANCE ANALYSIS OF MOMENTUM METHOD: A FREQUENCY DOMAIN PERSPECTIVELinkMomentum-Enhanced SGDFrequency Domain Momentum Analysis
AdEMAMixICLR'25THE ADEMAMIX OPTIMIZER:BETTER, FASTER, OLDERLinkMomentum-Enhanced SGDDouble-momentum mechanism
HVAdamAAAI'25HVAdam: A Full-Dimension Adaptive OptimizerLinkHybrid MethodsProjection Gradient Hybrid
SGD-SaIarXiv'2412No More Adam: Learning Rate Scaling at Initialization is All You NeedLinkLearning Rate Scheduling; Stateless Optimization MethodsInitial Learning Rate Scaling; Parameter Characteristic-Driven Updates
Adam++arXiv'2412Towards Simple and Provable Parameter-Free Adaptive Gradient MethodsLinkLearning Rate SchedulingScheduler-Free Adaptation
EXADAMarXiv'2412EXADAM: THE POWER OF ADAPTIVE CROSS-MOMENTSLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
Cautious OptimizersarXiv'2411Cautious Optimizers: Improving Training with One Line of CodeLinkMomentum-Enhanced SGDMomentum-Gradient Alignment
AGS-GDarXiv'2411Anisotropic Gaussian Smoothing for Gradient-based OptimizationLinkHybrid Methods; Auto-Designed OptimizersGradient Smoothing Hybrid; Automated Discovery&Theoretical Derivation
CAdamarXiv'2411CAdam: Confidence-Based Optimization for Online LearningLinkHybrid MethodsMulti-Objective Hybrid
INNAproparXiv'2410A SECOND-ORDER-LIKE OPTIMIZER WITH ADAPTIVE GRADIENT SCALING FOR DEEP LEARNINGLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
CaAdamarXiv'2410CaAdam: Improving Adam optimizer using connection aware methodsLinkAdaptive Learning Rate MethodsLayer-Wise Adaptation
BADMarXiv'2407BADM: Batch ADMM for Deep LearningLinkHybrid MethodsMulti-Objective Hybrid
FAdamarXiv'2405FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher informationLinkAdaptive Learning Rate Methods; Hybrid MethodsDynamic Epsilon Adjustment; Multi-Objective Hybrid
MSAMarXiv'2401Momentum-SAM: Sharpness Aware Minimization without Computational OverheadLinkLoss Landscape OptimizationMomentum Landscape Adaptation
LOMOACL'24Full Parameter Fine-tuning for Large Language Models with Limited ResourcesLinkTowards LLM Traning; Memory-Efficient Fine-Tuning for Large ModelsReal-Time Computation; Staless Fine-Tuning
BAdamNeurIPS'24BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language ModelsLinkHybrid Methods; Towards LLM TraningMulti-Objective Hybrid; Real-Time Computation; Block-Wise Computation
DP-AdamBCAAAI'24DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction)LinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
Dice-SGDICLR'24DIFFERENTIALLY PRIVATE SGD WITHOUT CLIPPING BIAS: AN ERROR-FEEDBACK APPROACHLinkGradient Normalization & ClippingDP-enhanced Gradient Clipping
FESS-GDAAISTATS'24Stochastic Smoothed Gradient Descent Ascent for Federated Minimax OptimizationLinkHybrid MethodsGradient Filtering Hybrid
AdaSAMNeural Netw.'24AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural NetworksLinkHybrid MethodsMulti-Objective Hybrid
SAMPaNeurIPS'24SAMPa: Sharpness-aware Minimization ParallelizedLinkLoss Landscape OptimizationSharpness-Aware Minimization (SAM)
ICML'24Lookbehind-SAM: k steps back, 1 step forwardLinkLoss Landscape OptimizationMulti-Step Ascent SAM
F-SAMCVPR'24Friendly Sharpness-Aware MinimizationLinkLoss Landscape OptimizationNoise Injection Enhancement
FGSAMNeurIPS'24Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node ClassificationLinkLoss Landscape OptimizationNoise Injection Enhancement
AdanTPAMI'24Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep ModelsLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
4-bit shampooNeurIPS'244-bit Shampoo for Memory-Efficient Network TrainingLinkPreconditioned Gradient Methods; Low-Memory Optimizer DesignTwo Metrics' Preconditioner; Compression&Approximation of States
MuonBlog'24Muon: An optimizer for hidden layers in neural networksLinkAdaptive Learning Rate MethodsLayer-Wise Adaptation
ADOPTNeurIPS'24ADOPT: Modified Adam Can Converge with Any β2 with the Optimal RateLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
SET-adamECML'24On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation PerformanceLinkAdaptive Learning Rate Methods; Hybrid MethodsSecond-Order Moment Adaptation; Multi-Objective Hybrid
Adam-RealNeurIPS'24Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam TimestepsLinkMomentum-Enhanced SGDScheduled Momentum Reset
SNGMSCIS'24Stochastic Normalized Gradient Descent with Momentum for Large-Batch TrainingLinkMomentum-Enhanced SGDMomentum Damping Mechanism
Schedule-FreeNeurIPS'24The Road Less ScheduledLinkLearning Rate SchedulingScheduler-Free Adaptation
AUTODROPUAI'24AUTODROP: TRAINING DEEP LEARNING MODELS WITH AUTOMATIC LEARNING RATE DROPLinkAdaptive Learning Rate Methods; Learning Rate SchedulingStateless Adaptation; Scheduler-Free Adaptation
ADAACTICDMW'24AN ADAPTIVE METHOD STABILIZING ACTIVATIONS FOR ENHANCED GENERALIZATIONLinkAdaptive Learning Rate Methods; Optimizer State CompressionNeuron-Level Adaptation; State Sharing
MoMoICML'24MoMo: Momentum Models for Adaptive Learning RatesLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
RSGDMCCSB'24Reducing Bias in Deep Learning Optimization: The RSGDM ApproachLinkMomentum-Enhanced SGDAccelerated Momentum
NYSACTBigData'24NYSACT: A SCALABLE PRECONDITIONED GRADIENT DESCENT USING NYSTRÖM APPROXIMATIONLinkPreconditioned Gradient MethodsSingle Metric's Preconditioner
SGDFarXiv'2311Signal Processing Meets SGD: From Momentum to FilterLinkMomentum-Enhanced SGDDynamic Momentum Weight
AdaLOMOarXiv'2310AdaLomo: Low-memory Optimization with Adaptive Learning RateLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
ICML'23SGD with Large Step Sizes Learns Sparse FeaturesLinkLearning Rate SchedulingStability-Aware Adaptive Scheduling
look aroundNeurIPS'23Lookaround Optimizer: k steps around, 1 step averageLinkLoss Landscape OptimizationWeight Averaging
GAMCVPR'23Gradient Norm Aware Minimization Seeks First-Order Flatness and ImprovesGeneralizationLinkLoss Landscape OptimizationCurvature-Guided Landscape Exploration
AE-SAMICLR'23AN ADAPTIVE POLICY TO EMPLOY SHARPNESS-AWARE MINIMIZATIONLinkLoss Landscape OptimizationSharpness-Aware Minimization (SAM)
AidaTMLR'23A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize RangeLinkAdaptive Learning Rate MethodsPrediction Deviation Adaptation
LionNeurIPS'23Symbolic Discovery of Optimization AlgorithmsLinkAdaptive Learning Rate Methods; Auto-Designed OptimizersMomentum-based Adaptive; Automated Discovery&Theoretical Derivation
AdamMCCVMI'23Moment Centralization based Gradient Descent Optimizers for Convolutional Neural NetworksLinkGradient Normalization & ClippingMean-Removal Normalization
MultiAdamICML'23MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural NetworksLinkGradient Normalization & ClippingLayer-Wise Gradient Normalization
AdaNormWACV'23AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNsLinkGradient Normalization & Clipping; Robust OptimizationElement-Wise Gradient Scaling; Noise-Robust Normalization; Noise-Robust Gradients
AGDNeurIPS'23AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning MatrixLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
RLEKFAAAI'23RLEKF: An Optimizer for Deep Potential with Ab Initio AccuracyLinkAdaptive Learning Rate MethodsKalman filtering based
AmosarXiv'2210Amos: AN ADAM-STYLE OPTIMIZER WITH ADAPTIVE WEIGHT DECAY TOWARDS MODEL-ORIENTED SCALELinkLearning Rate SchedulingScheduler-Free Adaptation
AdaBFEarXiv'2207BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic OptimizationLinkLearning Rate Scheduling; Stateless Optimization MethodsGradient Angle Scheduling; Parameter Characteristic-Driven Updates
DP-SGDarXiv'2206Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex OptimizationLinkGradient Normalization & ClippingBasic Fixed Gradient Clipping
AdamFamilyarXiv'2203AdaFamily: A family of Adam-like adaptive gradient methodsLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
SRSGDSIAM IMS'22Scheduled Restart Momentum for Accelerated Stochastic Gradient DescentLinkMomentum-Enhanced SGDScheduled Momentum Reset
Step-Tuned SGDNPL'22Second-order step-size tuning of SGD for non-convex optimizationLinkLearning Rate SchedulingScheduler-Free Adaptation
AEGDMAAM'22AN ADAPTIVE GRADIENT METHOD WITH ENERGY AND MOMENTUMLinkAdaptive Learning Rate MethodsStateless Adaptation
AdaInjectITAI'22AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural NetworksLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
KOALAAAAI'22KOALA: A Kalman Optimization Algorithm with Loss AdaptivityLinkAdaptive Learning Rate MethodsKalman filtering based
ESAMICLR'22EFFICIENT SHARPNESS-AWARE MINIMIZATION FOR IMPROVED TRAINING OF NEURAL NETWORKSLinkLoss Landscape OptimizationSharpness-Aware Minimization (SAM)
MADGRADJMLR'22Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic OptimizationLinkMomentum-Enhanced SGD; Hybrid MethodsAccelerated Momentum; Multi-Objective Hybrid
GDA-AMICLR'22GDA-AM: On the effectiveness of solving minimax optimization via Anderson AccelerationLinkHybrid MethodsMulti-Objective Hybrid
AdamDarXiv'2110AdamD: Improved bias-correction in AdamLinkAdaptive Learning Rate MethodsBias Correction Rules Adaptaion
AdaLarXiv'2107AdaL: Adaptive Gradient Transformation Contributes to Convergences and GeneralizationsLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
AngularGradarXiv'2105AngularGrad: A New Optimization Technique for Angular Convergence of Neural NetworksLinkMomentum-Enhanced SGDMomentum-Gradient Alignment
SGD-G2ICPR'21Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descentLinkLearning Rate SchedulingScheduler-Free Adaptation
SQuARM-SGDJSAIT'21SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized OptimizationLinkMomentum-Enhanced SGD; Local Update Strategies; Distributed Hybrid OptimizationAccelerated Momentum; Local SGD; Local Momentum Updates; Compression & Local Updates
SAMICLR'21Sharpness-Aware Minimization for Efficiently Improving GeneralizationLinkLoss Landscape OptimizationSharpness-Aware Minimization (SAM)
AvaGradCVPR'21Domain-independent Dominance of Adaptive MethodsLinkAdaptive Learning Rate MethodsDecoupled Learning Rate and Adaptability
MadamECML'21MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of GradientsLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
ACMoAAAI'21ACMO: ANGLE-CALIBRATED MOMENT METHODS FOR STOCHASTIC OPTIMIZATIONLinkLearning Rate SchedulingGradient Angle Scheduling
AdamP / SGDPICLR'21AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant WeightsLinkMomentum-Enhanced SGD; Hybrid MethodsMomentum Damping Mechanism; Projection Gradient Hybrid
ACPropNeurIPS'21Momentum Centering and Asynchronous Update for Adaptive Gradient MethodsLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
Adam+arXiv'2011Adam+: A Stochastic Method with Adaptive Variance ReductionLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
EAdamarXiv'2011EAdam Optimizer: How ∈ Impact AdamLinkAdaptive Learning Rate MethodsDynamic Epsilon Adjustment
AdaSGDarXiv'2006AdaSGD: Bridging the gap between SGD and AdamLinkAdaptive Learning Rate Methods; Hybrid MethodsHybrid Adaptive Strategy; SGD-Adam Hybrid
ADASarXiv'2006ADAS: ADAPTIVE SCHEDULING OF STOCHASTIC GRADIENTSLinkLearning Rate SchedulingScheduler-Free Adaptation
LaProparXiv'2002LaProp: Separating Momentum and Adaptivity in AdamLinkAdaptive Learning Rate Methods
Multistage SGDMNeurIPS'20An Improved Analysis of Stochastic Gradient Descent with MomentumLinkLearning Rate SchedulingStability-Aware Adaptive Scheduling
pbSGDIJCAI'20pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex OptimizationLinkGradient Normalization & ClippingElement-Wise Gradient Scaling
clipped-SGDNeurIPS'20Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient ClippingLinkGradient Normalization & ClippingBasic Fixed Gradient Clipping
Cayley SGDICLR'20EFFICIENT RIEMANNIAN OPTIMIZATION ON THE STIEFEL MANIFOLD VIA THE CAYLEY TRANSFORMLinkHybrid MethodsProjection Gradient Hybrid
NIGTICML'20Momentum Improves Normalized SGDLinkGradient Normalization & ClippingNoise-Robust Normalization
AdaBeliefNeurIPS'20AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed GradientsLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation; Prediction Deviation Adaptation
RAdamICLR'20On the Variance of the Adaptive Learning Rate and BeyondLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
AdamBSNeurIPS'20Adam with Bandit Sampling for Deep LearningLinkHybrid MethodsMulti-Objective Hybrid
DEAMASONAM'20DEAM: Adaptive Momentum with Discriminative Weight for Stochastic OptimizationLinkMomentum-Enhanced SGDDynamic Momentum Weight
LAMBICLR'20Large Batch Optimization for Deep Learning: Training BERT in 76 minutesLinkAdaptive Learning Rate Methods; Learning Rate SchedulingLayer-Wise Adaptation; Batch-Aware Scheduling
ADASSarXiv'1906ADASS: Adaptive Sample Selection for Training AccelerationLinkHybrid MethodsGradient Filtering Hybrid
NovoGradarXiv'1905Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep NetworksLinkAdaptive Learning Rate MethodsLayer-Wise Adaptation
AdamW / SGDWICLR'19Decoupled Weight Decay RegularizationLinkAdaptive Learning Rate Methods
QHadamICLR'19QUASI-HYPERBOLIC MOMENTUM AND ADAM FORDEEP FOR DEEP LEARNINGLinkAdaptive Learning Rate Methods
HAdamNeurIPS'19On Higher-order Moments in AdamLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
diffGradTNNLS'19diffGrad: An Optimization Method for Convolutional Neural NetworksLinkAdaptive Learning Rate MethodsHybrid Adaptive Strategy
NosAdamIJCAI'19Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rateLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
LookaheadNeurIPS'19Lookahead Optimizer: k steps forward, 1 step backLinkHybrid MethodsMulti-Objective Hybrid
AdaBoundICLR'19Adaptive Gradient Methods with Dynamic Bound of Learning RateLinkLearning Rate SchedulingElement-Wise Learning Rate Scheduling
LazyOptimizerBlog'19LinkMomentum-Enhanced SGDScheduled Momentum Reset
YOGINeurIPS'18Adaptive Methods for Nonconvex OptimizationLinkMomentum-Enhanced SGDDouble-momentum mechanism
VR-SGDTKDE'18VR-SGD: A Simple Stochastic Variance Reduction Method for Machine LearningLinkHybrid MethodsGradient Filtering Hybrid
ShampooICML'18Shampoo: Preconditioned Stochastic Tensor OptimizationLinkPreconditioned Gradient MethodsTwo Metrics' Preconditioner
MSVAGICML'18DissectingAdam:TheSign,MagnitudeandVarianceofStochasticGradientsLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
PIDOptimizerCVPR'18A PID Controller Approach for Stochastic Optimization of Deep NetworksLinkMomentum-Enhanced SGDMomentum Damping Mechanism
LARSarXiv'1708Large batch training of Convolutional NetworkLinkAdaptive Learning Rate MethodsLayer-Wise Adaptation
NAdamICLR'16 WSIncorporating Nesterov Momentum into AdamLinkAdaptive Learning Rate MethodsMomentum-based Adaptive
AdamICLR'15ADAM: A METHOD FOR STOCHASTIC OPTIMIZATIONLinkAdaptive Learning Rate Methods
SGDMICML'13On the importance of initialization and momentum in deep learningLinkMomentum-Enhanced SGDAccelerated Momentum
AdaDeltaarXiv'1212ADADELTA: AN ADAPTIVE LEARNING RATE METHODLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation
AdaGradJMLR'11Adaptive Subgradient Methods for Online Learning and Stochastic OptimizationLinkAdaptive Learning Rate MethodsSecond-Order Moment Adaptation

⚙️ Second-Order Algorithms

| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
| --- | --- | --- | --- | --- | --- |
| C-ALADIN | arXiv'2606 | A Global Convergence Analysis of Consensus ALADIN for Convex Optimization | Link | Hessian Approximation & Estimation | Distributed Newton-Type Approximation |
| S-BFGS | arXiv'2507 | EFFICIENT STOCHASTIC BFGS METHODS INSPIRED BY BAYESIAN PRINCIPLES | Link | Quasi-Newton Methods | Stochastic BFGS |
| MAC | arXiv'2506 | MAC: AN EFFICIENT GRADIENT PRECONDITIONING USING MEAN ACTIVATION APPROXIMATED CURVATURE | Link | Fisher Information Matrix Application | Curvature-Aware Approximation |
| RACS | arXiv'2502 | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation |
| OCAR | arXiv'2502 | Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation |
| FUSE-PV | CAI'25 | FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization | Link | Quasi-Newton Methods | Stochastic BFGS |
| SASSHA | ICML'25 | SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation | Link | Hessian Approximation & Estimation | Diagonal Hessian Approximation |
| AdaFisher | ICLR'25 | ADAFISHER: ADAPTIVE SECOND ORDER OPTIMIZATION VIA FISHER INFORMATION | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation; Block-Diagonal Kronecker Approximation |
| OptiQ | arXiv'2410 | Second-Order Optimization via Quiescence | Link | Curvature-Guided Preconditioning | Hessian Diagonal Preconditioning |
| SOAA | arXiv'2410 | EFFICIENT SECOND-ORDER NEURAL NETWORK OPTIMIZATION VIA ADAPTIVE TRUST REGION METHODS | Link | Fisher Information Matrix Application | Diagonal Fisher Approximation |
| CRNAS | arXiv'2407 | Novel Optimization Techniques for Parameter Estimation | Link | Hessian Approximation & Estimation | Diagonal Hessian Approximation |
| Athena | arXiv'2405 | Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information | Link | Hessian Approximation & Estimation | Block Hessian Approximation |
| Q-Newton | arXiv'2405 | Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton’s Gradient Descent | Link | Hessian Approximation & Estimation | Block Hessian Approximation |
| SketchySGD | SIAM'24 | SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates | Link | Hessian Approximation & Estimation | Stochastic Hessian Sampling |
| Sophia | ICLR'24 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | Link | Hessian Approximation & Estimation; Curvature-Guided Preconditioning; Second-Order Moment Fusion; Privacy-Aware Gradient Clipping; Stateless Optimization Methods | Diagonal Hessian Approximation; Hessian Diagonal Preconditioning; Noise-Robust Second-Order Momentum; Real-Time Curvature Estimation |
| Fed-Sophia | ICC'24 | Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm | Link | Hessian Approximation & Estimation; Federated Learning Optimization | Diagonal Hessian Approximation; Federated Second-Order Optimization |
| HesScale | ICML'24 | Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning | Link | Hessian Approximation & Estimation | Diagonal Hessian Approximation |
| mL-BFGS | TMLR'23 | mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization | Link | Quasi-Newton Methods | Stochastic BFGS; Low-Memory Quasi-Newton |
| SGDHess | NeurIPS'22 | Better SGD using Second-order Momentum | Link | Hessian Approximation & Estimation | Gradient Difference Estimation |
| AdaHessian | AAAI'21 | AdaHessian: An Adaptive Second Order Optimizer for Machine Learning | Link | Hessian Approximation & Estimation; Second-Order Moment Fusion | Diagonal Hessian Approximation; Noise-Robust Second-Order Momentum |
| TKFAC | AAAI'21 | A Trace-restricted Kronecker-Factored Approximation to Natural Gradient | Link | Fisher Information Matrix Application | Trace-Preserving Fisher Approximation |
| SGN | arXiv'2006 | On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs | Link | Hessian Approximation & Estimation | Stochastic Hessian Sampling |
| SpiderSQN | IFAC-Pap.'20 | A FAST QUASI-NEWTON-TYPE METHOD FOR LARGE-SCALE STOCHASTIC OPTIMISATION | Link | Quasi-Newton Methods | Stochastic BFGS |
| K-BFGS and K-BFGS(L) | NeurIPS'20 | Practical Quasi-Newton Methods for Training Deep Neural Networks | Link | Quasi-Newton Methods | Low-Memory Quasi-Newton |
| K-FAC | ICML'15 | Optimizing Neural Networks with Kronecker-factored Approximate Curvature | Link | Fisher Information Matrix Application | Block-Diagonal Kronecker Approximation |
| Natural Gradient | Neural Comput.'1998 | Natural gradient works efficiently in learning | | Fisher Information Matrix Application | |
| BFGS | SIAM SC'1995 | A limited memory algorithm for bound constrained optimization | | Quasi-Newton Methods | |
| Newton's Method | ANL'1982 | Newton's method | | Hessian Approximation & Estimation | |
| L-BFGS | Math.Comput.'1980 | Updating quasi-newton matrices with limited storage | | Quasi-Newton Methods | Stochastic BFGS |
| Gauss-Newton Method | Biometrika'1974 | Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method | | Hessian Approximation & Estimation | |

📍 Zeroth-Order Algorithms

| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
| --- | --- | --- | --- | --- | --- |
| AdaMeZO | arXiv'2605 | AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments | Link | Adaptive Methods | |
| MEAZO | arXiv'2605 | On Adaptivity in Zeroth-Order Optimization | Link | Memory-efficient Methods | |
| ZO-SAH | arXiv'2507 | Subspace-based Approximate Hessian Method for Zeroth-Order Optimization | Link | Adaptive Methods | Projection-based Adaptive |
| FZOO | arXiv'2506 | FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed | Link | Variance Reduction | Structured Variance Control |
| VR-SZD | arXiv'2506 | A Structured Proximal Stochastic Variance Reduced Zeroth-order Algorithm | Link | Variance Reduction | Snapshot Variance Reduction |
| KerZOO | arXiv'2505 | KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning | Link | Memory-efficient Methods | Inference-Level Memory Zeroth-Order |
| QZO | arXiv'2505 | Fine-tuning Quantized Neural Networks with Zeroth-order Optimization | Link | Memory-efficient Methods | Quantized Zeroth-Order Finetuning |
| VAMO | arXiv'2505 | VAMO: Efficient Large-Scale Nonconvex Optimization via Adaptive Zeroth Order Variance Reduction | Link | Zeroth-First Order Hybrid; Variance Reduction | Variance Reduction Hybrid; Snapshot Variance Reduction |
| ZO2 | arXiv'2503 | ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory | Link | Memory-efficient Methods | Inference-Level Memory Zeroth-Order |
| LORENZA | arXiv'2502 | LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training and Fine-Tuning via Efficient Zeroth-Order Adaptive SAM Optimization | Link | Adaptive Methods | Momentum-based Adaptive; Low-Rank Adaptive |
| QuZO | arXiv'2502 | QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models | Link | Memory-efficient Methods; Low-Rank Methods | Quantized Zeroth-Order Finetuning; Low-Rank & Quantization |
| MaZO | arXiv'2502 | MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models | Link | Memory-efficient Methods | Sparse Parameter Zeroth-Order |
| DiZO | arXiv'2502 | Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning | Link | Adaptive Methods | Projection-based Adaptive |
| TeZO | arXiv'2501 | TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs | Link | Memory-efficient Methods | Low-Rank Zeroth-Order Finetuning |
| ELASTICZO | arXiv'2501 | ELASTICZO: A MEMORY-EFFICIENT ON-DEVICE LEARNING WITH COMBINED ZEROTH- AND FIRST-ORDER OPTIMIZATION | Link | Zeroth-First Order Hybrid | Layer-Wise Hybrid |
| LOZO | ICLR'25 | Enhancing zeroth-order fine-tuning for language models with low-rank structures | Link | Memory-efficient Methods | Low-Rank Zeroth-Order Finetuning |
| Addax | ICLR'25 | Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models | Link | Zeroth-First Order Hybrid | Weighted Hybrid |
| ZOQO | ICASSP'25 | ZOQO: Zero-Order Quantized Optimization | Link | Memory-efficient Methods | Quantized Zeroth-Order Finetuning |
| R-AdaZO | ICML'25 | Refining Adaptive Zeroth-Order Optimization at Ease | Link | Adaptive Methods | Momentum-based Adaptive |
| ZO-AdaMM | NeurIPS'25 | Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | Link | Adaptive Methods | Momentum-based Adaptive |
| LeZO | arXiv'2410 | SIMULTANEOUS COMPUTATION AND MEMORY EFFICIENT ZEROTH-ORDER OPTIMIZER FOR FINE-TUNING LARGE LANGUAGE MODELS | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning |
| SuZero | arXiv'2410 | Zeroth-Order Fine-Tuning of LLMs in Random Subspaces | Link | Memory-efficient Methods | Low-Rank Zeroth-Order Finetuning |
| Sparse MeZO | arXiv'2402 | Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning |
| MeZO-SVRG | ICLR'24 | Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models | Link | Variance Reduction | Snapshot Variance Reduction |
| ZO-AdaMU | AAAI'24 | ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization | Link | Perturbation Optimization; Memory-efficient Methods | Paired Perturbation Sampling; Inference-Level Memory Zeroth-Order |
| ZoPro | CDC'24 | A Zeroth-Order Proximal Algorithm for Consensus Optimization | Link | Distributed Zero-Order Optimization | Distributed Perturbation Sampling |
| MeZO | NeurIPS'23 | Fine-Tuning Language Models with Just Forward Passes | Link | Memory-efficient Methods | Inference-Level Memory Zeroth-Order |
| TOP-DP | IEEE Trans'21 | Topology-aware Differential Privacy for Decentralized Image Classification | Link | Distributed Zero-Order Optimization; Differential Privacy Optimization | Privacy-Preserving Zeroth-Order; DP-SGD Variants; Dynamic Noise Scheduling; Privacy-Utility Balance |
| SPSA | ACC'01 | Global random optimization by simultaneous perturbation stochastic approximation | | Perturbation Optimization | Paired Perturbation Sampling |

🌐 Distributed Optimization

| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
| --- | --- | --- | --- | --- | --- |
| AlignFed | arXiv'2606 | AlignFed: Alignment-Aware Asynchronous Federated Fine-Tuning for Large Language Models in Heterogeneous Edge Environments | Link | Federated Learning Optimization | Asynchronous Federated Aggregation |
| DECA | arXiv'2606 | DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data | Link | Federated Learning Optimization | Block-Wise Decentralized Adam |
| FedSIR | arXiv'2604 | FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels | Link | Federated Learning Optimization | |
| Ringleader ASGD | arXiv'2601 | First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data | Link | Local Update Strategies | Local-Global Hybrid Updates |
| FedMuon | arXiv'2510 | FedMuon: Accelerating Federated Learning with Matrix Orthogonalization | Link | Federated Learning Optimization | Federated Momentum Fusion |
| DLAS-R-FTC | arXiv'2508 | Distributed Optimization and Learning for Automated Stepsize Selection with Finite Time Coordination | Link | Decentralized Communication | Distributed Consensus Optimization |
| DOME | arXiv'2507 | Communication Efficient, Differentially Private Distributed Optimization using Correlation-Aware Sketching | Link | Gradient Compression & Quantization | Low-Rank Gradient Compression |
| DeCo-SGD | arXiv'2507 | DeCo-SGD: Joint Optimization of Delay Staleness and Gradient Compression Ratio for Distributed SGD | Link | Gradient Compression & Quantization; Local Update Strategies | Adaptive Compression Level; Local-Global Hybrid Updates |
| TAH-QUANT | arXiv'2506 | TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network | Link | Gradient Compression & Quantization | Quantization Compression |
| LQ-SGD | arXiv'2506 | Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm | Link | Gradient Compression & Quantization; Low-Rank Methods | Quantization Compression; Low-Rank & Quantization |
| FedCurv | arXiv'2506 | Blockchain-Enabled Privacy-Preserving Second-Order Federated Edge Learning in Personalized Healthcare | Link | Federated Learning Optimization | Federated Second-Order Optimization |
| pFedSOP | arXiv'2506 | pFedSOP: Accelerating Training Of Personalized Federated Learning Using Second-Order Optimization | Link | Federated Learning Optimization | Federated Second-Order Optimization |
| FedOne | arXiv'2506 | FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning | Link | Federated Learning Optimization | Client Sampling Optimization |
| DES-LOC | arXiv'2505 | DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models | Link | Decentralized Communication | Distributed Consensus Optimization |
| Kuramoto-FedAvg | arXiv'2505 | Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity | Link | Federated Learning Optimization | Federated Momentum Fusion |
| AbsSADMM | arXiv'2505 | Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization | Link | Local Update Strategies | Adaptive Local Steps |
| ADEF | arXiv'2503 | Accelerated Distributed Optimization with Compression and Error Feedback | Link | Gradient Compression & Quantization | Compression Error Compensation |
| FedCET | arXiv'2503 | Communication Efficient Federated Learning with Linear Convergence on Heterogeneous Data | Link | Local Update Strategies | Adaptive Local Steps |
| Interleaved-ShuffleG | arXiv'2502 | The Cost of Shuffling in Private Gradient Based Optimization | Link | Decentralized Communication; Differential Privacy Optimization | Privacy-Preserving Decentralization; Privacy-Utility Balance |
| FAdamGC | arXiv'2502 | Gradient Correction in Federated Learning with Adaptive Optimization | Link | Federated Learning Optimization | Client Sampling Optimization |
| LT-ADMM | arXiv'2501 | Communication-Efficient Stochastic Distributed Learning | Link | Decentralized Communication | Distributed Consensus Optimization |
| HybridSGD | arXiv'2501 | Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization | Link | Local Update Strategies | Local-Global Hybrid Updates |
| DAT-SGD | ICML'25 | Enhancing Parallelism in Decentralized Stochastic Convex Optimization | Link | Decentralized Communication | Neighbor Communication Topology |
| FedSTaS | arXiv'2412 | FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning | Link | Federated Learning Optimization | Client Sampling Optimization |
| FedIvon | arXiv'2411 | Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization | Link | Federated Learning Optimization | Personalized Federated Optimization |
| FAGH | arXiv'2403 | FAGH: Accelerating Federated Learning with Approximated Global Hessian | Link | Federated Learning Optimization | Federated Second-Order Optimization |
| AdaFedAdam | TMLCN'24 | ACCELERATING FAIR FEDERATED LEARNING: ADAPTIVE FEDERATED ADAM | Link | Federated Learning Optimization | Federated Momentum Fusion |
| MM-PSGD | MMAsia'24 | Distributed Optimization over Block-Cyclic Data | Link | Federated Learning Optimization | Personalized Federated Optimization |
| MC-PSGD | MMAsia'24 | Distributed Optimization over Block-Cyclic Data | Link | Federated Learning Optimization | Personalized Federated Optimization |
| FedLion | ICASSP'24 | FEDLION: FASTER ADAPTIVE FEDERATED OPTIMIZATION WITH FEWER COMMUNICATION | Link | Federated Learning Optimization | Federated Momentum Fusion |
| FADAS | ICML'24 | FADAS: Towards Federated Adaptive Asynchronous Optimization | Link | Federated Learning Optimization | Federated Momentum Fusion |
| FLeNS | BigData'24 | FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch | Link | Federated Learning Optimization | Federated Momentum Fusion |
| FedRepOpt | ACCV'24 | FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning | Link | Federated Learning Optimization | Federated Momentum Fusion |
| Fed-Sophia | ICC'24 | Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm | Link | Hessian Approximation & Estimation; Federated Learning Optimization | Diagonal Hessian Approximation; Federated Second-Order Optimization |
| FedLAP-DP | arXiv'2302 | FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations | Link | Decentralized Communication | Privacy-Preserving Decentralization |
| AdaCGD | TMLR'23 | Adaptive Compression for Communication-Efficient Distributed Training | Link | Gradient Compression & Quantization | Adaptive Compression Level |
| 0/1 Adam | ICLR'23 | Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam | Link | Gradient Compression & Quantization | Quantization Compression |
| SketchedAMSGrad | ICDM'22 | Communication-Efficient Adam-Type Algorithms for Distributed Data Mining | Link | Gradient Compression & Quantization | Low-Rank Gradient Compression |
| SPARQ-SGD | TAC'22 | SPARQ-SGD: Event-Triggered and Compressed Communication in Decentralized Stochastic Optimization | Link | Gradient Compression & Quantization | Sparsification Compression |
| 1-bit Adam | ICML'21 | 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed | Link | Gradient Compression & Quantization | Quantization Compression |
| BVR-L-SGD | ICML'21 | Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning | Link | Local Update Strategies | Local SGD |
| A(DP)^2SGD | TPAMI'21 | A(DP)^2SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy | Link | Decentralized Communication | Neighbor Communication Topology |
| SQuARM-SGD | JSAIT'21 | SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization | Link | Momentum-Enhanced SGD; Local Update Strategies; Distributed Hybrid Optimization | Accelerated Momentum; Local SGD; Local Momentum Updates; Compression & Local Updates |
| DLCP | arXiv'2008 | Domain-specific Communication Optimization for Distributed DNN Training | Link | Local Update Strategies | Local SGD |
| APMSqueeze | arXiv'2008 | APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm | Link | Gradient Compression & Quantization | Compression Error Compensation |
| DEED-GD | arXiv'2006 | DEED: A General Quantization Scheme for Communication Efficiency in Bits | Link | Gradient Compression & Quantization | Quantization Compression |
| DP-PASGD | arXiv'2003 | Differentially Private Federated Learning for Resource-Constrained Internet of Things | Link | Local Update Strategies | Adaptive Local Steps |
| LAGS-SGD | ECAI'20 | Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees | Link | Gradient Compression & Quantization | Adaptive Compression Level |
| FedAC | NeurIPS'20 | Federated Accelerated Stochastic Gradient Descent | Link | Federated Learning Optimization | Federated Momentum Fusion |
| rTop-k | JSAIT'20 | rTop-k: A Statistical Estimation Approach to Distributed SGD | Link | Gradient Compression & Quantization | Sparsification Compression |
| Qsparse-local-SGD | JSAIT'20 | Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations | Link | Distributed Hybrid Optimization | Compression & Local Updates |
| SLOWMO | ICLR'20 | SLOWMO: IMPROVING COMMUNICATION-EFFICIENT DISTRIBUTED SGD WITH SLOW MOMENTUM | Link | Local Update Strategies | Local SGD |
| SCAFFOLD | ICML'20 | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning | Link | Local Update Strategies | Local SGD |
| LD-SGD | arXiv'1910 | Communication-Efficient Local Decentralized SGD Methods | Link | Decentralized Communication | Neighbor Communication Topology |
| PowerSGD | NeurIPS'19 | PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | Link | Gradient Compression & Quantization | Low-Rank Gradient Compression |
| signProx | ICASSP'19 | signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization | Link | Gradient Compression & Quantization | Quantization Compression |
| signSGD | ICML'18 | signSGD: Compressed Optimisation for Non-Convex Problems | Link | Gradient Compression & Quantization | Quantization Compression |

🛡️ Privacy-Preserving Optimization

| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
| --- | --- | --- | --- | --- | --- |
| PrivCode++ | arXiv'2606 | PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees | Link | Differential Privacy Optimization | DP Fine-Tuning for Code Generation |
| DP-MacAdam | arXiv'2606 | DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum | Link | Differential Privacy Optimization | Adaptive Clipping and Momentum |
| DPSR-CG | arXiv'2606 | Revisiting Privacy Amplification by Subsampling in Selective Release DPSGD | Link | Differential Privacy Optimization | Selective Release DP-SGD |
| PINA | arXiv'2604 | Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation | Link | Differential Privacy Optimization | |
| DP-aware AdaLN-Zero | arXiv'2602 | DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion | Link | Differential Privacy Optimization | Dynamic Noise Scheduling |
| DP-λCGD | arXiv'2601 | DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training | Link | Differential Privacy Optimization | DP-SGD Variants |
| RaCO-DP | arXiv'2505 | Private Rate-Constrained Optimization with Applications to Fair Learning | Link | Differential Privacy Optimization | DP-SGD Variants |
| Interleaved-ShuffleG | arXiv'2502 | The Cost of Shuffling in Private Gradient Based Optimization | Link | Decentralized Communication; Differential Privacy Optimization | Privacy-Preserving Decentralization; Privacy-Utility Balance |
| DPZV | arXiv'2502 | DPZV: Elevating the Tradeoff between Privacy and Utility in Zeroth-Order Vertical Federated Learning | Link | Gradient Noise Injection | Noise-Robust Optimization |
| Stable-SPAM | ICLR'25 | Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam | Link | Gradient Normalization & Clipping; Privacy-Aware Gradient Clipping | Layer-Wise Gradient Normalization; Dynamic Gradient Clipping; Adaptive Clipping |
| GeoDP-SGD | ICDE'25 | Analyzing and Optimizing Perturbation of DP-SGD Geometrically | Link | Gradient Noise Injection | Noise-Robust Optimization |
| Logit-DP | ICLR'25 | DIFFERENTIALLY PRIVATE OPTIMIZATION FOR NONDECOMPOSABLE OBJECTIVE FUNCTIONS | Link | Privacy-Aware Gradient Clipping | Global Gradient Clipping |
| DOPPLER | NeurIPS'25 | DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction | Link | Differential Privacy Optimization | DP-SGD Variants |
| DC-SGD | TIFS'25 | DC-SGD: Differentially Private SGD with Dynamic Clipping through Gradient Norm Distribution Estimation | Link | Differential Privacy Optimization; Gradient Noise Injection; Privacy-Utility Tradeoff; Privacy-Aware Gradient Clipping | Dynamic Noise Scheduling; Dynamic Clipping Threshold; Adaptive Clipping |
| SPARTA | KDD'25 | SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning | Link | Differential Privacy Optimization | DP-SGD Variants |
| DP-AdamW-BC | ICML'25 | DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning | Link | Differential Privacy Optimization | DP-SGD Variants |
| DP-MicroAdam | NeurIPS'25 | DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning | Link | Differential Privacy Optimization | DP-SGD Variants |
| AClipped-dpSGD | Mach. Learn.'24 | Efficient Private SCO for Heavy-Tailed Data via Averaged Clipping | Link | Privacy-Aware Gradient Clipping | Global Gradient Clipping |
| Sophia | ICLR'24 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | Link | Hessian Approximation & Estimation; Curvature-Guided Preconditioning; Second-Order Moment Fusion; Privacy-Aware Gradient Clipping; Stateless Optimization Methods | Diagonal Hessian Approximation; Hessian Diagonal Preconditioning; Noise-Robust Second-Order Momentum; Real-Time Curvature Estimation |
| ANSGD | arXiv'2305 | Learning across Data Owners with Joint Differential Privacy | Link | Differential Privacy Optimization | DP-SGD Variants |
| DP-Adam | ICLR'23 | DP-ADAM: CORRECTING DP BIAS IN ADAM’S SECOND MOMENT ESTIMATION | Link | Differential Privacy Optimization; Gradient Noise Injection | DP-SGD Variants; Noise-Robust Optimization |
| DP-FedSAM | CVPR'23 | Make Landscape Flatter in Differentially Private Federated Learning | Link | Gradient Noise Injection; Federated Privacy Enhancement | Noise-Robust Optimization; Federated Noise Aggregation |
| DPIS | CCS'22 | DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling | Link | Differential Privacy Optimization | DP-SGD Variants |
| DP-SGD-JL | NeurIPS'21 | Fast and Memory Efficient Differentially Private-SGD via JL Projections | Link | Differential Privacy Optimization | DP-SGD Variants |
| TOP-DP | IEEE Trans'21 | Topology-aware Differential Privacy for Decentralized Image Classification | Link | Distributed Zero-Order Optimization; Differential Privacy Optimization | Privacy-Preserving Zeroth-Order; DP-SGD Variants; Dynamic Noise Scheduling; Privacy-Utility Balance |
| DP-LSSGD | PMLR'20 | DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM | Link | Privacy-Utility Tradeoff | Post-Processing Optimization |

⚡ Memory-Efficient Optimization

| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
| --- | --- | --- | --- | --- | --- |
| DL-ZO | arXiv'2606 | Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs | Link | Memory-efficient Methods | Layer-Selective Fine-Tuning |
| LQ-SGD | arXiv'2506 | Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm | Link | Gradient Compression & Quantization; Low-Rank Methods | Quantization Compression; Low-Rank & Quantization |
| SUMO | arXiv'2505 | SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training | Link | Low-Rank Gradient Storage | Gradient Low-Rank Projection |
| AlphaGrad | arXiv'2504 | AlphaGrad: Non-Linear Gradient Normalization Optimizer | Link | Adaptive Learning Rate Methods; Low-Memory Optimizer Design; Stateless Optimization Methods | Stateless Adaptation; Structural Redesign; Parameter Characteristic-Driven Updates |
| QuZO | arXiv'2502 | QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models | Link | Memory-efficient Methods; Low-Rank Methods | Quantized Zeroth-Order Finetuning; Low-Rank & Quantization |
| GWT | arXiv'2501 | Wavelet Meets Adam: Compressing Gradients for Memory-Efficient Training | Link | Low-Rank Gradient Storage | Gradient Low-Rank Projection |
| AdaRankGrad | ICLR'25 | AdaRankGrad: Adaptive gradient-rank and moments for memory-efficient LLMs training and fine-tuning | Link | Low-Rank Methods | Projection&Adjustment |
| Adam-mini | ICLR'25 | ADAM-MINI: USE FEWER LEARNING RATES TO GAIN MORE | Link | Low-Memory Optimizer Design | Structural Redesign |
| SPAM | ICLR'25 | SPAM: SPIKE-AWARE ADAM WITH MOMENTUM RESET FOR STABLE LLM TRAINING | Link | Gradient Normalization & Clipping; Optimizer State Compression | Element-Wise Gradient Scaling; Spike-Aware Gradient Clipping; Sparse State Compression |
| SGD-SaI | arXiv'2412 | No More Adam: Learning Rate Scaling at Initialization is All You Need | Link | Learning Rate Scheduling; Stateless Optimization Methods | Initial Learning Rate Scaling; Parameter Characteristic-Driven Updates |
| A-GNB | arXiv'2411 | HELENE: HESSIAN LAYER-WISE CLIPPING AND GRADIENT ANNEALING FOR ACCELERATING FINETUNING LLM WITH ZEROTH-ORDER OPTIMIZATION | Link | Low-Rank Gradient Storage | Dynamic Gradient Rank |
| LeZO | arXiv'2410 | SIMULTANEOUS COMPUTATION AND MEMORY EFFICIENT ZEROTH-ORDER OPTIMIZER FOR FINE-TUNING LARGE LANGUAGE MODELS | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning |
| Adapprox | arXiv'2403 | Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices | Link | Low-Rank Methods | Projection&Adjustment |
| Sparse MeZO | arXiv'2402 | Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning | Link | Perturbation Optimization; Memory-efficient Methods; Memory-Efficient Fine-Tuning for Large Models | Sparse Perturbation; Sparse Parameter Zeroth-Order; Selective Parameter Fine-Tuning |
| LOMO | ACL'24 | Full Parameter Fine-tuning for Large Language Models with Limited Resources | Link | Towards LLM Training; Memory-Efficient Fine-Tuning for Large Models | Real-Time Computation; Stateless Fine-Tuning |
| MICROADAM | NeurIPS'24 | MICROADAM: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | Link | Optimizer State Compression | Sparse State Compression |
| 4-bit Shampoo | NeurIPS'24 | 4-bit Shampoo for Memory-Efficient Network Training | Link | Preconditioned Gradient Methods; Low-Memory Optimizer Design | Two Metrics' Preconditioner; Compression&Approximation of States |
| ADAACT | ICDMW'24 | AN ADAPTIVE METHOD STABILIZING ACTIVATIONS FOR ENHANCED GENERALIZATION | Link | Adaptive Learning Rate Methods; Optimizer State Compression | Neuron-Level Adaptation; State Sharing |
| Sophia | ICLR'24 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | Link | Hessian Approximation & Estimation; Curvature-Guided Preconditioning; Second-Order Moment Fusion; Privacy-Aware Gradient Clipping; Stateless Optimization Methods | Diagonal Hessian Approximation; Hessian Diagonal Preconditioning; Noise-Robust Second-Order Momentum; Real-Time Curvature Estimation |
| tpSGD | SMC'23 | Learning with Local Gradients at the Edge | Link | Low-Memory Optimizer Design | Structural Redesign |
| AdaBFE | arXiv'2207 | BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization | Link | Learning Rate Scheduling; Stateless Optimization Methods | Gradient Angle Scheduling; Parameter Characteristic-Driven Updates |
| Adafactor | ICML'18 | Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | Link | Low-Memory Optimizer Design; Optimizer State Compression | Compression&Approximation of States; State Sharing |

🧩 Tailored Optimization Approaches

| Abbreviation | Venue & Year | Paper Title | Project | Sub-methods | Fine-grained Methods |
| --- | --- | --- | --- | --- | --- |
| MAdam | arXiv'2606 | MAdam: Metric-Aware Multi-Objective Adam | Link | Hybrid Methods | Multi-Objective Adaptive Strategy |
| KO | arXiv'2505 | KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches | Link | Auto-Designed Optimizers | Automated Discovery&Theoretical Derivation |
| BC-ADMM | arXiv'2504 | BC-ADMM: An Efficient Non-convex Constrained Optimizer with Robotic Applications | Link | Robust Optimization | Structure-Aware Optimization |
| AdaGC | arXiv'2502 | AdaGC: Improving Training Stability for Large Language Model Pretraining | Link | Gradient Normalization & Clipping; Robust Optimization | Noise-Robust Normalization; Dynamic Gradient Clipping; Noise-Robust Gradients |
| AGS-GD | arXiv'2411 | Anisotropic Gaussian Smoothing for Gradient-based Optimization | Link | Hybrid Methods; Auto-Designed Optimizers | Gradient Smoothing Hybrid; Automated Discovery&Theoretical Derivation |
| WarpAdam | arXiv'2409 | WarpAdam: A new Adam optimizer based on Meta-Learning approach | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| MADA | ICML'24 | MADA: Meta-Adaptive Optimizers through hyper-gradient Descent | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| Lion | NeurIPS'23 | Symbolic Discovery of Optimization Algorithms | Link | Adaptive Learning Rate Methods; Auto-Designed Optimizers | Momentum-based Adaptive; Automated Discovery&Theoretical Derivation |
| AdaNorm | WACV'23 | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | Link | Gradient Normalization & Clipping; Robust Optimization | Element-Wise Gradient Scaling; Noise-Robust Normalization; Noise-Robust Gradients |
| BGADAM | IJCNN'21 | BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| HyperAdam | AAAI'19 | HyperAdam: A Learnable Task-Adaptive Adam for Network Training | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |
| GADAM | arXiv'1805 | GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | Link | Auto-Designed Optimizers | Evolutionary Strategies&Meta-Adaptive Learning |

🔬Future Prospect

Challenges

  • ⚖️ Multi-objective Trade-off Bottlenecks: Optimizing large models often forces a choice between convergence speed, memory efficiency, and distributed scaling, which can sacrifice generalization, introduce latency, or exacerbate instability. The core challenge is breaking these interconnected bottlenecks under a unified framework.
  • 💾 Memory and Computational Overheads: Dense optimizer states create severe memory bottlenecks. Structured curvature approximations still rely on costly operations such as matrix inversions, which reduce end-to-end efficiency, and the added per-step latency often negates their theoretical advantage in iteration count.
  • 🔊 Noise Amplification and Estimation Variance: Anisotropic loss landscapes heavily amplify stochastic mini-batch noise. Random perturbations used for directional gradients suffer from approximation variance that scales poorly with dimensionality, and privacy-preserving noise degrades gradient fidelity.
  • 🤖 Automated Symbolic Discovery: Shifting from fragile heuristic tuning to the automated generation of architecture-specific optimizers, enabling models to inherently navigate complex loss landscapes without manual intervention.
  • 🧮 Preconditioning and Orthogonalization: Leveraging structural gradient statistics for preconditioning or matrix orthogonalization (e.g., Kron and Muon) to overcome the representational bottlenecks of simple diagonal scaling and open novel parameter space pathways.
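
The dimensionality issue noted under Noise Amplification and Estimation Variance can be seen in a small experiment. The sketch below is only an illustration and not the method of any listed paper: it assumes a two-point Gaussian-perturbation estimator (in the spirit of SPSA-style methods) and a toy quadratic objective whose true gradient is known. The helper names `quadratic`, `two_point_estimate`, and `relative_sq_error` are hypothetical.

```python
import random


def quadratic(x):
    """Toy objective f(x) = 0.5 * ||x||^2, whose exact gradient is x."""
    return 0.5 * sum(v * v for v in x)


def two_point_estimate(f, x, eps, rng):
    """Two-point zeroth-order gradient estimate along one random direction u."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    # Finite-difference slope along u, then scaled back onto u.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def relative_sq_error(dim, trials=400, eps=1e-3, seed=0):
    """Average ||g_hat - g||^2 / ||g||^2 over several random directions."""
    rng = random.Random(seed)
    x = [1.0] * dim                # evaluation point; the true gradient equals x
    true_norm_sq = float(dim)      # ||x||^2 for the all-ones vector
    total = 0.0
    for _ in range(trials):
        g_hat = two_point_estimate(quadratic, x, eps, rng)
        total += sum((gi - xi) ** 2 for gi, xi in zip(g_hat, x))
    return total / trials / true_norm_sq


if __name__ == "__main__":
    for d in (4, 16, 64):
        print(d, round(relative_sq_error(d), 2))
```

For this toy quadratic with Gaussian directions, the expected relative squared error of a single estimate is d + 1, so the printed values should grow roughly linearly with the dimension. Averaging over several directions per step reduces the error, but at the cost of additional forward passes.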

Opportunities

  • 🧩 Deep Integration of Multi-Order Algorithms: Moving beyond isolated algorithmic improvements by deeply integrating FO, SO, and ZO algorithms. The fundamental focus will shift from minimizing iteration complexity to improving global wall-clock efficiency.
  • 🧭 Dimensionality-Robust Subspace Projection: Discovering advanced subspace projection methods that safely constrain massive search spaces without prematurely restricting algorithmic access to high-quality global solutions.
  • ⚖️ State Fidelity Preservation: Designing future memory-efficient architectures to smoothly average out extreme gradient shocks without exceeding memory limits.
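
To put rough numbers on the memory limits mentioned above, the following back-of-the-envelope sketch counts optimizer-state storage for a single weight matrix. It assumes three simplified layouts that are illustrative rather than exact descriptions of any listed method: a dense Adam-style state (two moments per weight), a factored second moment that keeps only row and column statistics (the idea behind Adafactor-style compression, with no first moment), and two moments stored in a rank-r projected subspace together with the projector. The function name `optimizer_state_bytes` and its parameters are hypothetical.

```python
def optimizer_state_bytes(rows, cols, rank=None, bytes_per_value=4):
    """Estimate optimizer-state size in bytes for one rows x cols weight matrix."""
    # Dense Adam-style state: first and second moment, each shaped like the weights.
    dense_adam = 2 * rows * cols
    # Factored second moment: one statistic per row plus one per column.
    factored = rows + cols
    sizes = {
        "dense_adam": dense_adam * bytes_per_value,
        "factored_second_moment": factored * bytes_per_value,
    }
    if rank is not None:
        # Assumed low-rank layout: a rows x rank projector plus
        # two rank x cols moments kept in the projected subspace.
        projected = rows * rank + 2 * rank * cols
        sizes["low_rank_projected"] = projected * bytes_per_value
    return sizes


if __name__ == "__main__":
    for name, size in optimizer_state_bytes(4096, 4096, rank=128).items():
        print(f"{name}: {size / 2**20:.3f} MiB")
```

Under these assumptions, a 4096 x 4096 fp32 layer needs 128 MiB of dense Adam state, about 6 MiB with a rank-128 projection, and 32 KiB with a factored second moment alone. The compressed layouts discard information, which is the fidelity trade-off the last bullet refers to.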

🔗Citation

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{zhang2026evolution,
  title={Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations},
  author={Zhang, Tong and Zhang, Jiangning and Xue, Zhucun and Jiang, Juntao and Xu, Yicheng and Xu, Chengming and Hu, Teng and Xie, Xingyu and Hu, Xiaobin and Wang, Yabiao and others},
  journal={arXiv preprint arXiv:2604.12968},
  year={2026}
}

📫Contact

186368@zju.edu.cn