SAM-SPS
June 2, 2026
Adaptive Polyak step sizes for SAM and USAM — match or beat tuned learning rates and Cosine Annealing, with no tuning.
SAM_SPS is the official PyTorch implementation of the optimizer proposed in:
Adaptive Sharpness-Aware Minimization with a Polyak-type Step Size: A Theory-Grounded Scheduler, by Dimitris Oikonomou and Nicolas Loizou.
The package provides a single `torch.optim.Optimizer` subclass, `SAM_SPS`, that wraps the Stochastic Polyak Scheduler, an adaptive learning-rate rule derived from a Polyak-style upper-bound argument on the SAM update. One parameter, `lambd`, switches between USAM-SPS ($\lambda = 0$) and SAM-SPS ($\lambda = 1$). The result is a SAM-style optimizer with closed-form convergence guarantees (linear for strongly convex objectives, sublinear for convex ones) and competitive deep-learning performance without learning-rate tuning: at larger sharpness radii $\rho$, it remains stable while Cosine Annealing degrades sharply.
Installation

    pip install sam_sps

Or from source:

    git clone https://github.com/dimitris-oik/sam_sps.git
    cd sam_sps
    pip install -e .

Requirements: torch, numpy, scipy (only for the numpy experiments), Python 3.8+.
Quick start
Like all SAM-style optimizers, SAM_SPS performs two forward/backward passes per step and therefore requires a closure that re-evaluates the loss:

    import torch
    from sam_sps import SAM_SPS

    # MyModel and loader stand in for your own model and DataLoader.
    model = MyModel()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = SAM_SPS(
        model.parameters(),
        rho=0.1,            # sharpness radius
        lambd=1.0,          # 0.0 -> USAM-SPS, 1.0 -> SAM-SPS
        f_star=0.0,         # mini-batch lower bound; 0 for non-negative losses
        gamma_b=1.0,        # cap on the Stochastic Polyak Scheduler step size
        weight_decay=5e-4,
    )

    for x, y in loader:
        # The closure re-evaluates the loss and gradients for the current batch.
        def closure():
            loss = criterion(model(x), y)
            loss.backward()
            return loss

        optimizer.step(closure)

Unlike tuned constant-LR or Cosine-Annealing baselines, SAM_SPS needs no learning rate to be picked or scheduled: the scheduler computes the step size $\gamma_t$ from the loss and gradient at each iteration.
The algorithm
Each iteration performs an ascent step from the current iterate $x_t$ to the perturbed point $\tilde{x}_t$, then descends from $x_t$ using the gradient at $\tilde{x}_t$ with an adaptive step size $\gamma_t$:
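1. Perturbation. For mini-batch loss $f_i$ and sharpness radius $\rho$,

$$\tilde{x}_t = x_t + \rho\left(1 - \lambda + \frac{\lambda}{\|\nabla f_i(x_t)\|}\right)\nabla f_i(x_t).$$

Setting $\lambda = 0$ gives USAM's unnormalized perturbation; setting $\lambda = 1$ gives SAM's normalized perturbation.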
2. Stochastic Polyak Scheduler. The step size $\gamma_t$ minimizes the Polyak-style upper bound at the perturbed point $\tilde{x}_t$, capped by $\gamma_b$.
3. Descent. $x_{t+1} = x_t - \gamma_t \nabla f_i(\tilde{x}_t)$.
When $\rho = 0$, the rule reduces to the classical stochastic Polyak step size of Loizou et al. (2021) for SGD. Under the conditions of Proposition 2.1, the ReLU safeguard is provably redundant for smooth convex objectives.
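The three steps above can be sketched in plain Python on a toy quadratic. This is an illustration under assumptions, not the package's implementation: the step-size numerator used here, $f_i(\tilde{x}_t) - f^* - \langle \nabla f_i(\tilde{x}_t), \tilde{x}_t - x_t \rangle$, is what the standard Polyak argument gives for a convex loss, and the names `quadratic` and `sam_sps_step` are hypothetical. The exact rule is the one stated in the paper.

```python
import math

def quadratic(x, a=(1.0, 4.0)):
    """Toy convex loss f(x) = 0.5 * sum(a_i * x_i^2), minimized at the origin with value 0."""
    loss = 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
    grad = [ai * xi for ai, xi in zip(a, x)]
    return loss, grad

def sam_sps_step(x, loss_and_grad, rho=0.1, lambd=1.0, f_star=0.0, gamma_b=1.0):
    """One illustrative step; returns (new_x, gamma). ASSUMED formula, see lead-in."""
    _, g = loss_and_grad(x)
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    # 1. Perturbation: lambd=0 -> rho * g (USAM), lambd=1 -> rho * g / ||g|| (SAM).
    scale = rho * (1.0 - lambd + (lambd / g_norm if g_norm > 0.0 else 0.0))
    x_tilde = [xi + scale * gi for xi, gi in zip(x, g)]

    loss_t, g_t = loss_and_grad(x_tilde)
    g_t_sq = sum(gi * gi for gi in g_t)
    if g_t_sq == 0.0:
        return list(x), 0.0  # zero gradient at the perturbed point: nothing to do

    # 2. Step size: Polyak-style numerator, clipped at 0 (ReLU), then capped by gamma_b.
    inner = sum(gt * (xt - xi) for gt, xt, xi in zip(g_t, x_tilde, x))
    gamma = min(max(loss_t - f_star - inner, 0.0) / g_t_sq, gamma_b)

    # 3. Descent from the original point using the gradient at the perturbed point.
    return [xi - gamma * gt for xi, gt in zip(x, g_t)], gamma

x = [1.0, -2.0]
for _ in range(20):
    x, gamma = sam_sps_step(x, quadratic, lambd=0.0)  # lambd=0.0 -> USAM-SPS
```

For a convex loss, this choice of step size never increases the distance to the minimizer, which is the property the sketch is meant to make visible. Weight decay is left out for brevity.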
Theoretical guarantees
| Setting | Method | Rate |
|---|---|---|
| Strongly convex, smooth (deterministic) | USAM-SPS | linear, exact (Theorem 3.1) |
| Convex, smooth (deterministic) | USAM-SPS | sublinear, exact (Theorem 3.2) |
| Decreasing $\rho_t$ (deterministic) | USAM-SPS | see Theorem 3.4 |
| Strongly convex, smooth (stochastic) | USAM-SPS | linear, to a neighborhood (Theorem 3.5) |
| Convex, smooth (stochastic) | USAM-SPS | sublinear, to a neighborhood (Theorem 3.8) |
| Interpolated | USAM-SPS | neighborhood collapses; exact convergence (Corollary 3.6) |
The theory is developed for USAM ($\lambda = 0$) and extends naturally to SAM ($\lambda = 1$); see §4.3 of the paper.
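The one-step argument behind a Polyak-type step size at the perturbed point can be sketched as follows. This is a reconstruction under assumptions, not a quotation of the paper: it assumes $f_i$ is convex, $x^*$ is a minimizer, $\gamma \ge 0$, and the update is $x_{t+1} = x_t - \gamma \nabla f_i(\tilde{x}_t)$. The symbol $N_t$ is introduced here for illustration only, and the paper's exact statement may differ.

```latex
\begin{align*}
\|x_{t+1} - x^*\|^2
  &= \|x_t - x^*\|^2
     - 2\gamma \langle \nabla f_i(\tilde{x}_t),\, x_t - x^* \rangle
     + \gamma^2 \|\nabla f_i(\tilde{x}_t)\|^2 \\
  &\le \|x_t - x^*\|^2
     - 2\gamma \underbrace{\big[ f_i(\tilde{x}_t) - f_i(x^*)
       - \langle \nabla f_i(\tilde{x}_t),\, \tilde{x}_t - x_t \rangle \big]}_{N_t}
     + \gamma^2 \|\nabla f_i(\tilde{x}_t)\|^2 .
\end{align*}
% The inequality uses convexity at \tilde{x}_t:
%   <grad f_i(\tilde{x}_t), \tilde{x}_t - x^*>  >=  f_i(\tilde{x}_t) - f_i(x^*).
% Minimizing the right-hand side over gamma >= 0 and capping by gamma_b gives
\[
\gamma_t = \min\left\{ \frac{\max\{N_t,\, 0\}}{\|\nabla f_i(\tilde{x}_t)\|^2},\ \gamma_b \right\}.
\]
```

With $\rho = 0$ the perturbed point coincides with $x_t$, the inner-product term vanishes, and the expression becomes the classical capped Polyak step. In code, the unknown $f_i(x^*)$ would be replaced by a lower bound such as the `f_star` argument; that substitution is an assumption of this sketch.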
API reference
SAM_SPS(params, weight_decay=5e-4, rho=0.1, lambd=1.0, f_star=0.0, gamma_b=1.0)
| Argument | Type | Default | Description |
|---|---|---|---|
| `params` | iterable | (required) | Parameters to optimize. |
| `weight_decay` | float | `5e-4` | L2 weight-decay coefficient applied in the final descent step. |
| `rho` | float | `0.1` | Sharpness radius $\rho$. |
| `lambd` | float | `1.0` | Interpolation between USAM (`0.0`) and SAM (`1.0`). |
| `f_star` | float | `0.0` | Lower bound on the mini-batch loss. Typically `0.0` for non-negative losses. |
| `gamma_b` | float | `1.0` | Upper bound $\gamma_b$ on the Stochastic Polyak Scheduler step size. |
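To get a feel for what `f_star` and `gamma_b` control, it may help to look at the classical capped Polyak step of Loizou et al. (2021), which is the case without any perturbation. The helper below is hypothetical and not part of the package; it assumes a scaling constant of 1 in the denominator.

```python
def capped_polyak_step(loss, grad_sq_norm, f_star=0.0, gamma_b=1.0):
    """Classical capped Polyak step: (loss - f_star) / ||grad||^2, clipped to [0, gamma_b]."""
    if grad_sq_norm == 0.0:
        return 0.0  # zero gradient: the update would not move, so return 0 by convention
    return min(max(loss - f_star, 0.0) / grad_sq_norm, gamma_b)

typical = capped_polyak_step(loss=2.0, grad_sq_norm=16.0)            # ratio below the cap
capped = capped_polyak_step(loss=2.0, grad_sq_norm=0.5)              # ratio 4.0, cut to gamma_b
frozen = capped_polyak_step(loss=0.3, grad_sq_norm=4.0, f_star=0.5)  # f_star above the loss
```

The last call shows why `f_star` should be a genuine lower bound: if it exceeds the current loss, the numerator is clipped to zero and the step vanishes.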
Step methods
| Method | Description |
|---|---|
| `step(closure)` | Standard SAM API: performs both passes in one call. `closure` must do a full forward and backward pass and return the loss. |
| `first_step(zero_grad=False)` | (Internal) Ascent step to the perturbed point $\tilde{x}_t$. |
| `second_step(zero_grad=False)` | (Internal) Restore $x_t$, compute $\gamma_t$, then descend. |
In practice you'll only call `step(closure)`. After each call, the active scheduler value is available as `group['lr']` in every parameter group, so it can be logged directly.
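A small logging helper along these lines could read the value back after each step. It relies only on the `param_groups` list that every `torch.optim.Optimizer` exposes; the helper name and the stand-in optimizer object are hypothetical and are used here so the snippet runs without torch installed.

```python
from types import SimpleNamespace

def current_step_sizes(optimizer):
    """Return the 'lr' entry of every parameter group, in order."""
    return [group["lr"] for group in optimizer.param_groups]

# Stand-in with the same `param_groups` layout as a torch optimizer (illustration only).
fake_optimizer = SimpleNamespace(param_groups=[{"lr": 0.37}, {"lr": 0.12}])

history = []
history.append(current_step_sizes(fake_optimizer))  # call this right after optimizer.step(closure)
```

In a real training loop, `fake_optimizer` would be the `SAM_SPS` instance and `history` could be handed to whatever logger is already in use.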
Experiments
Theory validation (synthetic)
The `numpy_exps/` directory reproduces the §4.1 synthetic experiments on a strongly convex ridge-regression problem. Each experiment is run in two regimes, deterministic (full-batch, interpolated) and stochastic (mini-batch, regularized), and produces two kinds of comparison in each regime:
- Theory comparison: the Polyak Scheduler against prior USAM step-size schedules (Andriushchenko & Flammarion, 2022; Khanh et al.; Oikonomou et al.), empirically confirming the linear and sublinear rates predicted by Theorems 3.1–3.2.
- Adaptive comparison: the Polyak Scheduler against adaptive-learning-rate SAM optimizers (AdaSAM, LightSAM-I/II/III, SA-SAM).
Files:
- `numpy_exps/loss.py`: `RidgeRegression` objective with controllable conditioning.
- `numpy_exps/methods.py`: `Unified_SAM` (constant step-size baseline), `Unified_SAM_SPS` (deterministic Polyak Scheduler), `Unified_SAM_SPS_max` (Stochastic Polyak Scheduler), and the `USAM_andr` baseline (Andriushchenko & Flammarion, 2022).
- `numpy_exps/methods_ada.py`: adaptive-LR SAM baselines `AdaSAM`, `LightSAM_I` (AdaGrad-Norm), `LightSAM_II` (AdaGrad), `LightSAM_III` (Adam), and `SA_SAM`.
- `numpy_exps/exps.ipynb`: figure-generation notebook.
- `numpy_exps/figures/`: the four output PDFs (`usam_theory_det`, `usam_theory_stoch`, `ada_comparison_det`, `ada_comparison_stoch`).
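For readers who want a concrete picture of the objective, a ridge-regression loss and its gradient can be written in a few lines. This is a dependency-free sketch, not the `RidgeRegression` class from `numpy_exps/loss.py`: the function name, the `1/(2n)` normalization, and the `reg` parameter are assumptions, and the conditioning control mentioned above is not modeled.

```python
def ridge_loss_and_grad(w, X, y, reg=0.1):
    """Return f(w) = 1/(2n) * ||Xw - y||^2 + reg/2 * ||w||^2 and its gradient."""
    n = len(X)
    residuals = [sum(xj * wj for xj, wj in zip(row, w)) - yi for row, yi in zip(X, y)]
    loss = sum(r * r for r in residuals) / (2 * n) + 0.5 * reg * sum(wj * wj for wj in w)
    grad = [
        sum(r * row[j] for r, row in zip(residuals, X)) / n + reg * w[j]
        for j in range(len(w))
    ]
    return loss, grad

# Tiny made-up dataset: three samples, two features.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.5]
loss, grad = ridge_loss_and_grad([0.5, -0.5], X, y)
```

With `reg > 0` the objective is strongly convex, which is the setting the linear-rate results refer to.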
Deep-learning results
Test accuracy of SAM_SPS with ResNet-32 on CIFAR-100, varying the sharpness radius $\rho$ (bold = best at fixed $\rho$, mean ± std over 3 seeds, from Tables 3–4 of the paper):
USAM ($\lambda = 0$):
| $\rho$ | Constant USAM (tuned) | USAM + Cosine Annealing | USAM-SPS |
|---|---|---|---|
| | 90.56 ± 0.18 | 90.01 ± 0.32 | **91.81 ± 0.04** |
| | 90.45 ± 0.34 | 88.77 ± 0.26 | **92.23 ± 0.22** |
| | 90.25 ± 0.10 | 88.05 ± 0.23 | **92.24 ± 0.30** |
| | 89.56 ± 0.07 | 86.52 ± 0.04 | **92.01 ± 0.12** |
SAM ($\lambda = 1$):
| $\rho$ | Constant SAM (tuned) | SAM + Cosine Annealing | SAM-SPS |
|---|---|---|---|
| | 90.17 ± 0.11 | 90.49 ± 0.02 | **91.61 ± 0.12** |
| | 90.53 ± 0.02 | 89.03 ± 0.13 | **92.24 ± 0.07** |
| | 89.61 ± 0.10 | 87.05 ± 0.24 | **91.70 ± 0.15** |
| | 88.64 ± 0.13 | 84.61 ± 0.34 | **90.79 ± 0.16** |
Two key observations:
- No tuning, best accuracy. SAM-SPS / USAM-SPS beat both the constant learning rate tuned per $\rho$ and Cosine Annealing at every radius.
- Robustness at large $\rho$. Cosine Annealing degrades sharply as $\rho$ grows (CIFAR-100, last row of each table: USAM Cosine drops to 86.52, SAM Cosine to 84.61), while the Polyak Scheduler stays above 90.7 in both tables.
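Both observations can be checked directly from the reported means. The snippet below copies the mean accuracies from the two tables above (standard deviations omitted) and computes, per row, the margin of the Polyak Scheduler over the better baseline, plus the drop of Cosine Annealing between the first and last row. The variable and function names are illustrative only.

```python
# Mean test accuracies in table order: (constant step size, Cosine Annealing, Polyak Scheduler).
USAM_ROWS = [(90.56, 90.01, 91.81), (90.45, 88.77, 92.23), (90.25, 88.05, 92.24), (89.56, 86.52, 92.01)]
SAM_ROWS = [(90.17, 90.49, 91.61), (90.53, 89.03, 92.24), (89.61, 87.05, 91.70), (88.64, 84.61, 90.79)]

def sps_margins(rows):
    """Gap between the SPS column and the better of the two baselines, per row."""
    return [round(sps - max(constant, cosine), 2) for constant, cosine, sps in rows]

def cosine_drop(rows):
    """Accuracy lost by Cosine Annealing between the first and the last row."""
    return round(rows[0][1] - rows[-1][1], 2)

for name, rows in (("USAM", USAM_ROWS), ("SAM", SAM_ROWS)):
    print(name, "SPS margin per row:", sps_margins(rows), "| cosine drop:", cosine_drop(rows))
```

Every margin is positive, and the lowest Polyak Scheduler mean across both tables is 90.79, which is where the "above 90.7" figure comes from.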
Full CIFAR-10 / ResNet-20 results and the no-weight-decay ablation are in Appendix E of the paper.
Citation
If you use this code or build on the method, please cite:

    @inproceedings{oikonomou2026adaptive,
      title     = {Adaptive Sharpness-Aware Minimization with a Polyak-type Step Size: A Theory-Grounded Scheduler},
      author    = {Oikonomou, Dimitris and Loizou, Nicolas},
      booktitle = {ICML},
      year      = {2026},
    }

License
Released under the MIT License.