SAM-SPS

June 2, 2026

Adaptive Polyak step sizes for SAM and USAM: match or beat tuned learning rates and Cosine Annealing, with no $\gamma$ tuning.


SAM_SPS is the official PyTorch implementation of the optimizer proposed in:

Adaptive Sharpness-Aware Minimization with a Polyak-type Step Size: A Theory-Grounded Scheduler
Dimitris Oikonomou, Nicolas Loizou.

The package provides a single torch.optim.Optimizer subclass, SAM_SPS, that wraps the Stochastic Polyak Scheduler, an adaptive learning-rate rule derived from a Polyak-style upper-bound argument on the SAM update. One parameter $\lambda$ switches between USAM-SPS ($\lambda = 0$) and SAM-SPS ($\lambda = 1$). The result is a SAM-style optimizer with a closed-form step size, convergence guarantees (linear for strongly convex objectives, $O(1/T)$ for convex ones), and competitive deep-learning performance without learning-rate tuning. At larger sharpness radii $\rho$, its accuracy stays stable while Cosine Annealing degrades.



Installation

pip install sam_sps

Or from source:

git clone https://github.com/dimitris-oik/sam_sps.git
cd sam_sps
pip install -e .

Requirements: torch, numpy, scipy (only for the numpy experiments), Python 3.8+.


Quick start

Like all SAM-style optimizers, SAM_SPS performs two forward/backward passes per step and therefore requires a closure that re-evaluates the loss:

import torch
from sam_sps import SAM_SPS

model     = MyModel()    # placeholder: any torch.nn.Module you have defined
criterion = torch.nn.CrossEntropyLoss()

optimizer = SAM_SPS(
    model.parameters(),
    rho=0.1,             # sharpness radius
    lambd=1.0,           # 0.0 -> USAM-SPS, 1.0 -> SAM-SPS
    f_star=0.0,          # mini-batch lower bound; 0 for non-negative losses
    gamma_b=1.0,         # cap on the Stochastic Polyak Scheduler step size
    weight_decay=5e-4,
)

for x, y in loader:      # placeholder: any iterable of (inputs, targets) batches
    def closure():
        loss = criterion(model(x), y)
        loss.backward()
        return loss
    optimizer.step(closure)

Unlike tuned constant-LR or Cosine-Annealing baselines, SAM_SPS needs no learning rate to be picked or scheduled: the scheduler computes $\gamma_t$ from the loss and gradient at each iteration.


The algorithm

Each iteration performs an ascent step to the perturbed point $e^t$, then descends from $x^t$ using the gradient at $e^t$ with an adaptive step size:

1. Perturbation. For mini-batch loss $f_{S_t}$ and sharpness radius $\rho$,

$$e^t = x^t + \rho \left( 1 - \lambda + \frac{\lambda}{\|\nabla f_{S_t}(x^t)\|} \right) \nabla f_{S_t}(x^t).$$

Setting $\lambda = 0$ gives USAM's unnormalized perturbation; setting $\lambda = 1$ gives SAM's normalized perturbation.

2. Stochastic Polyak Scheduler. The step size minimizes the Polyak-style upper bound on $\|x^{t+1} - x^*\|^2$ at the perturbed point, capped by $\gamma_b$:

$$\gamma_t = \min\left\lbrace \frac{\big[f_{S_t}(e^t) - \ell^*_{S_t} - \langle \nabla f_{S_t}(e^t),\ e^t - x^t \rangle\big]_+}{\|\nabla f_{S_t}(e^t)\|^2},\ \gamma_b \right\rbrace.$$

3. Descent.

$$x^{t+1} = x^t - \gamma_t \nabla f_{S_t}(e^t).$$
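
A short sketch of where the step size in step 2 comes from, written for the deterministic convex case with $\ell^* = f(x^*)$. This is a simplifying assumption made here, and the paper's own argument may differ in detail. Write $g_t = \nabla f(e^t)$ and $N_t = f(e^t) - f(x^*) - \langle g_t,\ e^t - x^t \rangle$. Convexity of $f$ at $e^t$ gives $\langle g_t,\ e^t - x^* \rangle \ge f(e^t) - f(x^*)$, hence $\langle g_t,\ x^t - x^* \rangle \ge N_t$, and for any $\gamma \ge 0$:

```latex
\begin{align*}
\|x^{t+1} - x^*\|^2
  &= \|x^t - x^*\|^2 - 2\gamma \langle g_t,\ x^t - x^* \rangle + \gamma^2 \|g_t\|^2 \\
  &\le \|x^t - x^*\|^2 - 2\gamma N_t + \gamma^2 \|g_t\|^2, \\
\operatorname*{arg\,min}_{\gamma}\ \left( -2\gamma N_t + \gamma^2 \|g_t\|^2 \right)
  &= \frac{N_t}{\|g_t\|^2}.
\end{align*}
```

Clipping the numerator at zero and capping the result at $\gamma_b$ yields the rule in step 2. With that choice, $\gamma_t^2 \|g_t\|^2 \le \gamma_t N_t$, so the bound above implies $\|x^{t+1} - x^*\|^2 \le \|x^t - x^*\|^2 - \gamma_t N_t$ under these assumptions.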

When $\rho = 0$, the rule reduces to the classical Polyak step / $\mathrm{SPS}_{\max}$ (Loizou et al., 2021) for SGD. When $\lambda = 0$, the ReLU safeguard $[\cdot]_+ = \max(0, \cdot)$ is provably redundant for smooth convex objectives with $\rho \le 1/L$, where $L$ is the smoothness constant (Proposition 2.1).
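
The three steps above can be sketched in plain Python on a toy problem. This is an illustrative reimplementation, not the package's code: `sam_sps_step`, the diagonal quadratic, and the zero-gradient guards are assumptions introduced here, and the real SAM_SPS operates on PyTorch parameters.

```python
import math

def sam_sps_step(x, loss_fn, grad_fn, rho=0.1, lambd=1.0, f_star=0.0, gamma_b=1.0):
    """One (U)SAM-SPS iteration on a list of floats. Returns (x_next, gamma)."""
    g = grad_fn(x)
    g_norm = math.sqrt(sum(gi * gi for gi in g))

    # 1. Perturbation: e = x + rho * (1 - lambd + lambd / ||g||) * g
    scale = rho * (1.0 - lambd)
    if lambd != 0.0 and g_norm > 0.0:      # guard (assumption): skip normalization if g = 0
        scale += rho * lambd / g_norm
    e = [xi + scale * gi for xi, gi in zip(x, g)]

    # 2. Stochastic Polyak Scheduler step size at the perturbed point
    g_e = grad_fn(e)
    sq_norm = sum(gi * gi for gi in g_e)
    if sq_norm == 0.0:                     # guard (assumption): stationary point, no move
        return list(x), 0.0
    inner = sum(gi * (ei - xi) for gi, ei, xi in zip(g_e, e, x))
    numer = max(loss_fn(e) - f_star - inner, 0.0)   # the [.]_+ safeguard
    gamma = min(numer / sq_norm, gamma_b)

    # 3. Descent from x (not from e) along the gradient taken at e
    return [xi - gamma * gi for xi, gi in zip(x, g_e)], gamma

# Toy objective: f(x) = 0.5 * sum(a_i * x_i^2), minimized at 0 with f* = 0 and L = 4.
a = [1.0, 4.0]
loss = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

def run(lambd, steps=50):
    """Track the distance to the minimizer and the step sizes over a short run."""
    x, dists, gammas = [3.0, -2.0], [], []
    for _ in range(steps):
        dists.append(math.sqrt(sum(xi * xi for xi in x)))
        x, gamma = sam_sps_step(x, loss, grad, rho=0.1, lambd=lambd)
        gammas.append(gamma)
    return dists, gammas

dists_usam, gammas_usam = run(lambd=0.0)   # USAM-SPS
dists_sam, gammas_sam = run(lambd=1.0)     # SAM-SPS

# With rho = 0 the perturbation vanishes and gamma is the classical Polyak step.
_, gamma_polyak = sam_sps_step([3.0, -2.0], loss, grad, rho=0.0)
```

On this convex toy problem with the exact `f_star`, the distance to the minimizer never increases for either value of `lambd`, and the step size never exceeds `gamma_b`. The run makes no claim about rates; it only illustrates the mechanics of the update.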

Theoretical guarantees

| Setting | Method | Rate |
| --- | --- | --- |
| Strongly convex, smooth (deterministic) | USAM-SPS | linear, exact (Theorem 3.1) |
| Convex, smooth (deterministic) | USAM-SPS | $O(1/T)$, exact (Theorem 3.2) |
| Decreasing $\rho_t \downarrow 0$ (deterministic) | USAM-SPS | $\lVert \nabla f(x^t) \rVert \to 0$ (Theorem 3.4) |
| Strongly convex, smooth (stochastic) | USAM-SPS | linear, to a neighborhood (Theorem 3.5) |
| Convex, smooth (stochastic) | USAM-SPS | $O(1/T)$, to a neighborhood (Theorem 3.8) |
| Interpolated ($\sigma^2 = 0$) | USAM-SPS | neighborhood collapses; exact convergence (Corollary 3.6) |

The theory is developed for USAM ($\lambda = 0$) and extends naturally to SAM ($\lambda = 1$); see §4.3 of the paper.


API reference

SAM_SPS(params, weight_decay=5e-4, rho=0.1, lambd=1.0, f_star=0.0, gamma_b=1.0)

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| params | iterable | (required) | Parameters to optimize. |
| weight_decay | float | 5e-4 | L2 weight-decay coefficient applied in the final descent step. |
| rho | float | 0.1 | Sharpness radius $\rho$. |
| lambd | float | 1.0 | Interpolation between USAM (0.0) and SAM (1.0). |
| f_star | float | 0.0 | Lower bound $\ell^*_{S_t}$ on the mini-batch loss. Typically 0.0 for non-negative losses. |
| gamma_b | float | 1.0 | Upper bound $\gamma_b$ on the Stochastic Polyak Scheduler step size. |

Step methods

| Method | Description |
| --- | --- |
| step(closure) | Standard SAM API: performs both passes in one call. closure must do a full forward+backward and return the loss. |
| first_step(zero_grad=False) | (Internal) Ascent step to the perturbed point $e^t$. |
| second_step(zero_grad=False) | (Internal) Restore $x^t$, compute $\gamma_t$, then descend. |

In practice you'll only call step(closure). After each call, the active scheduler value is available as group['lr'] on every parameter group in optimizer.param_groups, so it can be logged directly.
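
A minimal logging sketch, assuming only what is stated above (the step size is stored under 'lr' in each parameter group) together with the standard `param_groups` attribute of `torch.optim.Optimizer`. The helper `current_step_sizes` is a hypothetical name, and a stub object stands in for a real optimizer so the snippet runs without torch.

```python
from types import SimpleNamespace

def current_step_sizes(optimizer):
    """Return the 'lr' entry of every parameter group of an optimizer-like object."""
    return [group["lr"] for group in optimizer.param_groups]

# Stub standing in for a SAM_SPS instance after a call to step(closure).
# The value 0.37 is made up for illustration.
stub_optimizer = SimpleNamespace(param_groups=[{"lr": 0.37}, {"lr": 0.37}])
step_sizes = current_step_sizes(stub_optimizer)
```

In a training loop, the same helper could be called right after `optimizer.step(closure)` and its result passed to whatever logger is in use.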


Experiments

Theory validation (synthetic)

The numpy_exps/ directory reproduces the §4.1 synthetic experiments on a strongly convex ridge-regression problem ($n = d = 100$, $\kappa(A) = 10$). Each experiment is run in two regimes, deterministic (full-batch, interpolated) and stochastic (mini-batch, regularized), and produces two kinds of comparison in each regime:

  • Theory comparison: the Polyak Scheduler against prior USAM step-size schedules (Andriushchenko & Flammarion 2022; Khanh et al.; Oikonomou et al.), empirically confirming the linear / $O(1/T)$ rates predicted by Theorems 3.1–3.2.
  • Adaptive comparison: the Polyak Scheduler against adaptive-learning-rate SAM optimizers (AdaSAM, LightSAM-I/II/III, SA-SAM).
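
For readers who want a feel for the synthetic setup, here is a small pure-Python sketch of a ridge-regression objective with a prescribed condition number. It is not the code in numpy_exps/loss.py: the names `make_diag_problem`, `ridge_loss`, and `ridge_grad`, the diagonal choice of $A$, the $1/(2n)$ scaling, and the small dimension are all assumptions made for illustration.

```python
import random

def make_diag_problem(d=5, kappa=10.0, seed=0):
    """Diagonal A with entries spaced linearly in [1, kappa], so cond(A) = kappa.

    b is built as A @ x_true, so the unregularized problem interpolates
    (its minimum loss is exactly zero).
    """
    rng = random.Random(seed)
    diag = [1.0 + (kappa - 1.0) * i / (d - 1) for i in range(d)]
    x_true = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = [s * xi for s, xi in zip(diag, x_true)]
    return diag, b, x_true

def ridge_loss(x, diag, b, reg=0.0):
    """f(x) = ||A x - b||^2 / (2 n) + (reg / 2) * ||x||^2 for diagonal A."""
    n = len(b)
    resid = [s * xi - bi for s, xi, bi in zip(diag, x, b)]
    return sum(r * r for r in resid) / (2 * n) + 0.5 * reg * sum(xi * xi for xi in x)

def ridge_grad(x, diag, b, reg=0.0):
    """Gradient of ridge_loss with respect to x."""
    n = len(b)
    return [s * (s * xi - bi) / n + reg * xi for s, xi, bi in zip(diag, x, b)]

diag, b, x_true = make_diag_problem(d=5, kappa=10.0, seed=0)
x0 = [0.5, -1.0, 2.0, 0.0, 1.5]
g0 = ridge_grad(x0, diag, b, reg=0.1)
```

Setting `reg=0.0` corresponds to the interpolated regime described above, while `reg > 0` gives a regularized objective whose minimum value is generally nonzero.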

Files:

  • numpy_exps/loss.py: RidgeRegression objective with controllable conditioning.
  • numpy_exps/methods.py: Unified_SAM (constant step-size baseline), Unified_SAM_SPS (deterministic Polyak Scheduler), Unified_SAM_SPS_max (Stochastic Polyak Scheduler), and the USAM_andr baseline (Andriushchenko & Flammarion, 2022).
  • numpy_exps/methods_ada.py: adaptive-LR SAM baselines AdaSAM, LightSAM_I (AdaGrad-Norm), LightSAM_II (AdaGrad), LightSAM_III (Adam), and SA_SAM.
  • numpy_exps/exps.ipynb: figure-generation notebook.
  • numpy_exps/figures/: the four output PDFs (usam_theory_det, usam_theory_stoch, ada_comparison_det, ada_comparison_stoch).

Deep-learning results

Test accuracy (%) of SAM_SPS with ResNet-32 on CIFAR-100, varying the sharpness radius $\rho$ (bold = best at fixed $\rho$; mean ± std over 3 seeds; from Tables 3–4 of the paper):

USAM ($\lambda = 0$):

| $\rho$ | Constant USAM (tuned) | USAM + Cosine Annealing | USAM-SPS |
| --- | --- | --- | --- |
| 0.1 | 90.56 ± 0.18 | 90.01 ± 0.32 | **91.81 ± 0.04** |
| 0.2 | 90.45 ± 0.34 | 88.77 ± 0.26 | **92.23 ± 0.22** |
| 0.3 | 90.25 ± 0.10 | 88.05 ± 0.23 | **92.24 ± 0.30** |
| 0.4 | 89.56 ± 0.07 | 86.52 ± 0.04 | **92.01 ± 0.12** |

SAM ($\lambda = 1$):

| $\rho$ | Constant SAM (tuned) | SAM + Cosine Annealing | SAM-SPS |
| --- | --- | --- | --- |
| 0.1 | 90.17 ± 0.11 | 90.49 ± 0.02 | **91.61 ± 0.12** |
| 0.2 | 90.53 ± 0.02 | 89.03 ± 0.13 | **92.24 ± 0.07** |
| 0.3 | 89.61 ± 0.10 | 87.05 ± 0.24 | **91.70 ± 0.15** |
| 0.4 | 88.64 ± 0.13 | 84.61 ± 0.34 | **90.79 ± 0.16** |

Two key observations:

  1. No tuning, best accuracy. SAM-SPS / USAM-SPS beat both the constant learning rate tuned per $\rho$ and Cosine Annealing at every radius.
  2. Robustness at large $\rho$. Cosine Annealing degrades sharply as $\rho$ grows (CIFAR-100, $\rho = 0.4$: USAM Cosine drops to 86.52, SAM Cosine to 84.61), while the Polyak Scheduler stays above 90.7 in both columns.
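
The two observations can be checked with a few lines of arithmetic on the mean accuracies reported in the tables above. The snippet below only restates those means; the helper name `margins` is introduced here for illustration, and the comparison ignores the reported standard deviations.

```python
# Mean test accuracy (%) per rho: (constant tuned, cosine annealing, SPS).
usam = {
    0.1: (90.56, 90.01, 91.81),
    0.2: (90.45, 88.77, 92.23),
    0.3: (90.25, 88.05, 92.24),
    0.4: (89.56, 86.52, 92.01),
}
sam = {
    0.1: (90.17, 90.49, 91.61),
    0.2: (90.53, 89.03, 92.24),
    0.3: (89.61, 87.05, 91.70),
    0.4: (88.64, 84.61, 90.79),
}

def margins(table):
    """SPS mean accuracy minus the better of the two baselines, per rho."""
    return {rho: round(sps - max(const, cos), 2)
            for rho, (const, cos, sps) in table.items()}

usam_margins = margins(usam)
sam_margins = margins(sam)

# Lowest SPS mean accuracy across both tables and all radii.
worst_sps = min(row[2] for table in (usam, sam) for row in table.values())
```

Every margin is positive, and the gap to the best baseline widens as $\rho$ grows: from 1.25 to 2.45 points for USAM and from 1.12 to 2.15 points for SAM. The lowest SPS mean is 90.79 (SAM-SPS at $\rho = 0.4$).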

Full CIFAR-10 / ResNet-20 results and the no-weight-decay ablation are in Appendix E of the paper.


Citation

If you use this code or build on the method, please cite:

@inproceedings{oikonomou2026adaptive,
  title  = {Adaptive Sharpness-Aware Minimization with a Polyak-type Step Size: A Theory-Grounded Scheduler},
  author = {Oikonomou, Dimitris and Loizou, Nicolas},
  booktitle = {ICML},
  year   = {2025},
}

License

Released under the MIT License.