pytorch-optimizer

May 23, 2026

pytorch-optimizer is a production-focused optimization toolkit for PyTorch with 100+ optimizers, 10+ learning rate schedulers, and 10+ loss functions behind a consistent API.

Use it when you want fast experimentation with modern training methods without rewriting optimizer boilerplate.

Highly inspired by jettify/pytorch-optimizer.

Why pytorch-optimizer

  • Broad optimizer coverage, including many recent research variants.
  • Consistent loader APIs for optimizers, schedulers, and losses.
  • Practical features such as foreach, Lookahead, and Gradient Centralization integrations.
  • Tested and actively maintained codebase.
  • Works with optional ecosystem integrations like bitsandbytes, q-galore-torch, and torchao.

Installation

Requirements:

  • Python >=3.8
  • PyTorch >=1.10

pip install pytorch-optimizer

Optional integrations such as bitsandbytes, q-galore-torch, and torchao are not installed by default.

Quick Start

1) Use an optimizer class directly

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters(), lr=1e-3)

2) Load by name

from pytorch_optimizer import load_optimizer

model = YourModel()
optimizer = load_optimizer('adamp')(model.parameters(), lr=1e-3)

3) Build with create_optimizer()

from pytorch_optimizer import create_optimizer

model = YourModel()
optimizer = create_optimizer(
    model,
    optimizer_name='adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)
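
The use_gc and use_lookahead flags above appear to correspond to Gradient Centralization and Lookahead, both of which are listed in the optimizer table. As a rough, library-independent illustration of the two ideas, the sketch below uses plain Python lists and plain SGD on a toy quadratic. The helper names centralize_gradient and lookahead_sync are hypothetical, and this is not how pytorch-optimizer implements either technique.

```python
def centralize_gradient(grad_rows):
    """Gradient Centralization for a 2-D gradient: subtract each row's mean."""
    centered = []
    for row in grad_rows:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


def lookahead_sync(slow, fast, alpha=0.5):
    """Lookahead sync: pull slow weights toward fast ones, then restart fast from slow."""
    new_slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return new_slow, list(new_slow)


# Toy problem: minimize f(w) = sum(w_i ** 2) with plain SGD as the inner optimizer.
slow = [4.0, -2.0]
fast = list(slow)
lr, k = 0.1, 5  # inner learning rate, number of fast steps per slow step

for step in range(1, 21):
    grads = [2.0 * w for w in fast]  # gradient of sum(w_i ** 2)
    fast = [w - lr * g for w, g in zip(fast, grads)]
    if step % k == 0:  # "k steps forward, 1 step back"
        slow, fast = lookahead_sync(slow, fast)

print(centralize_gradient([[1.0, 2.0, 3.0]]))
print(slow)
```

Each centralized row sums to zero, and the slow weights move only once every k inner steps, which is the behaviour the Lookahead description "k steps forward, 1 step back" refers to.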

4) Optional: load via torch.hub

import torch

model = YourModel()
opt_cls = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt_cls(model.parameters(), lr=1e-3)

Discover Available Components

Optimizers

from pytorch_optimizer import get_supported_optimizers

all_optimizers = get_supported_optimizers()
adam_family = get_supported_optimizers('adam*')
selected = get_supported_optimizers(['adam*', 'ranger*'])

Learning Rate Schedulers

from pytorch_optimizer import get_supported_lr_schedulers

all_schedulers = get_supported_lr_schedulers()
cosine_like = get_supported_lr_schedulers('cosine*')

Loss Functions

from pytorch_optimizer import get_supported_loss_functions

all_losses = get_supported_loss_functions()
focal_related = get_supported_loss_functions('*focal*')

Supported Optimizers

You can list the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

You can also filter them with one or more wildcard patterns.

from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
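
The filters above look like shell-style wildcards. As a sketch of that matching behaviour, assuming fnmatch-style semantics (the library's actual implementation may differ), the hypothetical helper filter_names below reproduces the example results on a hand-written subset of names.

```python
from fnmatch import fnmatchcase

# Hand-written subset of names for illustration; the real registry is much larger.
NAMES = [
    'adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw',
    'lion', 'ranger', 'ranger21', 'sgdw',
]


def filter_names(names, patterns=None):
    """Return the sorted names matching any of the given wildcard patterns."""
    if patterns is None:
        return sorted(names)
    if isinstance(patterns, str):
        patterns = [patterns]  # accept a single pattern or a list of patterns
    return sorted({n for n in names if any(fnmatchcase(n, p) for p in patterns)})


print(filter_names(NAMES, 'adam*'))
print(filter_names(NAMES, ['adam*', 'ranger*']))
```

Passing a list takes the union of the individual matches, which is why the second call returns the Adam family plus both Ranger variants.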

Optimizer | Description | Official Code | Paper (Citation)
AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | paper (cite)
AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | paper (cite)
AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | paper (cite)
AdamD | Improved bias-correction in Adam | - | paper (cite)
DualAdam | Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers | github | paper (cite)
AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | paper (cite)
diffGrad | An Optimization Method for Convolutional Neural Networks | github | paper (cite)
MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic | github | paper (cite)
RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | paper (cite)
Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | paper (cite)
Ranger21 | a synergistic deep learning optimizer | github | paper (cite)
Lamb | Large Batch Optimization for Deep Learning | github | paper (cite)
Shampoo | Preconditioned Stochastic Tensor Optimization | github | paper (cite)
Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | paper (cite)
Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | paper (cite)
Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | paper (cite)
SAM | Sharpness-Aware Minimization | github | paper (cite)
ASAM | Adaptive Sharpness-Aware Minimization | github | paper (cite)
GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | paper (cite)
D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | paper (cite)
AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | paper (cite)
Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | paper (cite)
NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | paper (cite)
Lion | Symbolic Discovery of Optimization Algorithms | github | paper (cite)
Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | paper (cite)
SM3 | Memory-Efficient Adaptive Optimization | github | paper (cite)
AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | paper (cite)
RotoGrad | Gradient Homogenization in Multitask Learning | github | paper (cite)
A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | paper (cite)
AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | paper (cite)
SGDW | Decoupled Weight Decay Regularization | github | paper (cite)
ASGD | Adaptive Gradient Descent without Descent | github | paper (cite)
Yogi | Adaptive Methods for Nonconvex Optimization | - | paper (cite)
SWATS | Improving Generalization Performance by Switching from Adam to SGD | - | paper (cite)
Fromage | On the distance between two neural networks and the stability of learning | github | paper (cite)
MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | paper (cite)
AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | paper (cite)
AggMo | Aggregated Momentum: Stability Through Passive Damping | github | paper (cite)
QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | paper (cite)
PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | paper (cite)
Gravity | a Kinematic Approach on Optimization in Deep Learning | github | paper (cite)
AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | - | paper (cite)
SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | paper (cite)
AvaGrad | Domain-independent Dominance of Adaptive Methods | github | paper (cite)
PCGrad | Gradient Surgery for Multi-Task Learning | github | paper (cite)
AMSGrad | On the Convergence of Adam and Beyond | - | paper (cite)
Lookahead | k steps forward, 1 step back | github | paper (cite)
PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | paper (cite)
GC | Gradient Centralization | github | paper (cite)
AGC | Adaptive Gradient Clipping | github | paper (cite)
Stable WD | Understanding and Scheduling Weight Decay | github | paper (cite)
Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | - | paper (cite)
Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | - | paper (cite)
Norm Loss | An efficient yet effective regularization method for deep neural networks | - | paper (cite)
AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | paper (cite)
AdaDelta | An Adaptive Learning Rate Method | - | paper (cite)
Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | paper (cite)
SignSGD | Compressed Optimisation for Non-Convex Problems | github | paper (cite)
Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | paper (cite)
Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | paper (cite)
PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | paper (cite)
LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | paper (cite)
AdaLOMO | Low-memory Optimization with Adaptive Learning Rate | github | paper (cite)
LoRARite | LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization | github | paper (cite)
Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | cite
CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | paper (cite)
WSAM | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | github | paper (cite)
Aida | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | github | paper (cite)
GaLore | Memory-Efficient LLM Training by Gradient Low-Rank Projection | github | paper (cite)
Adalite | Adalite optimizer | github | paper (cite)
bSAM | SAM as an Optimal Relaxation of Bayes | github | paper (cite)
Schedule-Free | Schedule-Free Optimizers | github | paper (cite)
FAdam | Adam is a natural gradient optimizer using diagonal empirical Fisher information | github | paper (cite)
Grokfast | Accelerated Grokking by Amplifying Slow Gradients | github | paper (cite)
Kate | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | github | paper (cite)
FlashAdamW | FlashOptim: Optimizers for Memory-Efficient Training | github | paper (cite)
StableAdamW | Stable and low-precision training for large-scale vision-language models | - | paper (cite)
AdamMini | Use Fewer Learning Rates To Gain More | github | paper (cite)
TRAC | Adaptive Parameter-free Optimization | github | paper (cite)
AdamG | Towards Stability of Parameter-free Optimization | - | paper (cite)
AdEMAMix | Better, Faster, Older | github | paper (cite)
SOAP | Improving and Stabilizing Shampoo using Adam | github | paper (cite)
ADOPT | Modified Adam Can Converge with Any β2 with the Optimal Rate | github | paper (cite)
FTRL | Follow The Regularized Leader | - | paper
Cautious | Improving Training with One Line of Code | github | paper (cite)
DeMo | Decoupled Momentum Optimization | github | paper (cite)
MicroAdam | Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | github | paper (cite)
Muon | MomentUm Orthogonalized by Newton-schulz | github | paper (cite)
LaProp | Separating Momentum and Adaptivity in Adam | github | paper (cite)
APOLLO | SGD-like Memory, AdamW-level Performance | github | paper (cite)
MARS | Unleashing the Power of Variance Reduction for Training Large Models | github | paper (cite)
SGDSaI | No More Adam: Learning Rate Scaling at Initialization is All You Need | github | paper (cite)
Grams | Gradient Descent with Adaptive Momentum Scaling | - | paper (cite)
OrthoGrad | Grokking at the Edge of Numerical Stability | github | paper (cite)
Adam-ATAN2 | Scaling Exponents Across Parameterizations and Optimizers | - | paper (cite)
SPAM | Spike-Aware Adam with Momentum Reset for Stable LLM Training | github | paper (cite)
TAM | Torque-Aware Momentum | - | paper (cite)
FOCUS | First Order Concentrated Updating Scheme | github | paper (cite)
PSGD | Preconditioned Stochastic Gradient Descent | github | paper (cite)
EXAdam | The Power of Adaptive Cross-Moments | github | paper (cite)
GCSAM | Gradient Centralized Sharpness Aware Minimization | github | paper (cite)
LookSAM | Towards Efficient and Scalable Sharpness-Aware Minimization | github | paper (cite)
SCION | Training Deep Learning Models with Norm-Constrained LMOs | github | paper (cite)
COSMOS | SOAP with Muon | github | -
StableSPAM | How to Train in 4-Bit More Stably than 16-Bit Adam | github | paper
AdaGC | Improving Training Stability for Large Language Model Pretraining | - | paper (cite)
Simplified-Ademamix | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | github | paper (cite)
Fira | Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | github | paper (cite)
RACS & Alice | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | - | paper (cite)
VSGD | Variational Stochastic Gradient Descent for Deep Neural Networks | github | paper (cite)
SNSM | Subset-Norm and Subspace-Momentum: Faster Memory-Efficient Adaptive Optimization with Convergence Guarantees | github | paper (cite)
AdamC | Why Gradients Rapidly Increase Near the End of Training | - | paper (cite)
AdaMuon | Adaptive Muon Optimizer | - | paper (cite)
SPlus | A Stable Whitening Optimizer for Efficient Neural Network Training | github | paper (cite)
EmoNavi | An emotion-driven optimizer that feels loss and navigates accordingly | github | -
Refined Schedule-Free | Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training | - | paper (cite)
FriendlySAM | Friendly Sharpness-Aware Minimization | github | paper (cite)
AdaGO | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | - | paper (cite)
Conda | Column-Normalized Adam for Training Large Language Models Faster | github | paper (cite)
BCOS | Stochastic Approximation with Block Coordinate Optimal Stepsizes | github | paper (cite)
Cautious WD | Cautious Weight Decay | - | paper (cite)
Ano | Faster is Better in Noisy Landscape | github | paper (cite)
Spectral Sphere | Controlled LLM Training on Spectral Sphere | github | paper (cite)
ROSE | Stateless optimization through range-normalized gradient updates | github | paper (cite)

Supported LR Schedulers

You can list the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

You can also filter them with one or more wildcard patterns.

from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']

get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']

LR Scheduler | Description | Official Code | Paper (Citation)
Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | - | paper (cite)
Chebyshev | Acceleration via Fractal Learning Rate Schedules | - | paper (cite)
REX | Revisiting Budgeted Training with an Improved Schedule | github | paper (cite)
WSD | Warmup-Stable-Decay learning rate scheduler | github | paper (cite)

Supported Loss Functions

You can list the supported loss functions with the code below.

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()

You can also filter them with one or more wildcard patterns.

from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

Loss Function | Description | Official Code | Paper (Citation)
Label Smoothing | Rethinking the Inception Architecture for Computer Vision | - | paper (cite)
Focal | Focal Loss for Dense Object Detection | - | paper (cite)
Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | - | paper (cite)
LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | paper (cite)
Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | - | paper (cite)
Bi-Tempered | The Principle of Unchanged Optimality in Reinforcement Learning Generalization | - | paper (cite)
Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | - | paper (cite)
Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | paper (cite)

Documentation

License Notes

Most implementations are under MIT or Apache 2.0 compatible terms from their original sources. Some algorithms (for example Fromage, Nero) are tied to CC BY-NC-SA 4.0, which is non-commercial. Please verify the license of each optimizer before production or commercial use.

Contributing and Community

Citation

Please cite original optimizer authors when you use specific algorithms. If you use this repository, you can use the citation metadata in CITATION or GitHub's "Cite this repository".

@software{Kim_pytorch_optimizer_optimizer_2021,
  author = {Kim, Hyeongchan},
  title = {{pytorch_optimizer: optimizer \& lr scheduler \& loss function collections in PyTorch}},
  url = {https://github.com/kozistr/pytorch_optimizer},
  year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr