ClippedScion
April 1, 2026
Code accompanying the paper Generalized Gradient Norm Clipping & Non-Euclidean $(L_0, L_1)$-Smoothness.
The paper is a follow-up to Training Deep Learning Models with Norm-Constrained LMOs, and this code builds on the Scion codebase.
Repository structure
- clippedscion.py: Contains the UnconstrainedClippedScion and ClippedScion reference implementations along with various norm choices.
  - Algorithm 3 corresponds to UnconstrainedClippedScion.
  - Algorithm 4 (Variant 2) corresponds to ClippedScion. For simplicity, we control in practice.
- examples/: Example usage containing airbench, nanoGPT, and DeiT experiments with and without weight sharing.
Notes
The ClippedScion optimizer exposes the following hyperparameters:
- momentum: This parameter equals 1 - usual_momentum, where usual_momentum is the momentum of, e.g., the PyTorch implementation of SGD with momentum. A good default is 0.1. Higher values (e.g. 0.5) seem to work better for short training runs with low noise, as also supported by theory.
- scale: Controls the per-layer constraint radius factor. The layerwise radius can be tuned on a small proxy model, similarly to the input and output scaling factors of µP.
- lr: The learning rate (γ in the paper). It can similarly be tuned on a small proxy model.
- unconstrained: When set to False, the constrained variant of ClippedScion is used, which guarantees that the iterates stay bounded.
- rho: The clipping threshold in Algorithms 3 and 4.
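The momentum convention above can be made concrete with a short sketch. It assumes the momentum buffer is an exponential moving average of gradients in which the new gradient receives weight momentum; that form is an assumption for illustration and is not taken from clippedscion.py. The helper names to_clippedscion_momentum and ema_update are hypothetical.

```python
def to_clippedscion_momentum(usual_momentum: float) -> float:
    """Convert an SGD-style momentum (e.g. 0.9) to the 1 - usual_momentum convention."""
    return 1.0 - usual_momentum


def ema_update(buffer: float, grad: float, momentum: float) -> float:
    """One averaging step (assumed form): the new gradient gets weight `momentum`."""
    return (1.0 - momentum) * buffer + momentum * grad


# An SGD-style momentum of 0.9 maps to roughly 0.1, the suggested default.
alpha = to_clippedscion_momentum(0.9)

# Feed a constant gradient of 1.0 for three steps, starting from an empty buffer.
buf = 0.0
for g in [1.0, 1.0, 1.0]:
    buf = ema_update(buf, g, alpha)
```

Under this assumed form, a larger momentum value makes the buffer track recent gradients more closely, which matches the remark that higher values suit short, low-noise runs.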
Architectural changes:
- Scale activation functions (ReLU, GELU) by √2 to maintain the input variance.
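As a rough sanity check of the √2 factor, the sketch below estimates second moments for ReLU only, assuming zero-mean, unit-variance Gaussian inputs (an assumption made here, not stated in the repository). The function names are hypothetical, and the sketch uses only the standard library rather than the activation modules in examples/.

```python
import math
import random


def scaled_relu(x: float) -> float:
    """ReLU multiplied by sqrt(2)."""
    return math.sqrt(2.0) * max(x, 0.0)


def second_moment(values) -> float:
    """Mean of the squared values."""
    values = list(values)
    return sum(v * v for v in values) / len(values)


# Assumed input distribution: standard normal samples with a fixed seed.
rng = random.Random(0)
inputs = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

m_in = second_moment(inputs)                               # close to 1
m_relu = second_moment(max(x, 0.0) for x in inputs)        # close to 1/2
m_scaled = second_moment(scaled_relu(x) for x in inputs)   # close to 1
```

Plain ReLU zeroes out about half of a symmetric input, halving the second moment; multiplying the output by √2 doubles it back to roughly the input level.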
Examples
For runnable examples see examples/.
Below are some pseudocode configurations for different architectures and domains (see Appendix C for exact parameter choices):
- nanoGPT with weight sharing (see examples/modded-nanogpt):

      radius = 50.0
      threshold = 600
      optim_groups = [{
          'params': model.transformer.h.parameters(),
          'norm': 'Spectral',
          'norm_kwargs': {},
          'scale': radius,
      }, {
          'params': model.lm_head.parameters(),
          'norm': 'Sign',
          'norm_kwargs': {},
          'scale': radius*60.0,
      }]
      optimizer = UnconstrainedClippedScion(optim_groups, lr=2**-12, momentum=0.1, rho=threshold)

- CNN (see examples/airbench for further details):

      radius = 8.0
      threshold = 1600
      optim_groups = [{
          'params': remaining_parameters,
          'norm': 'Auto',  # Picks layerwise norm based on the parameter shape
          'norm_kwargs': {},
          'scale': radius,
      }, {
          'params': output_layer,
          'norm': 'Sign',
          'norm_kwargs': {'normalized': True},
          'scale': radius*16,
      }]
      optimizer = UnconstrainedClippedScion(optim_groups, lr=2**-4, momentum=0.5, rho=threshold)

- DeiT:

      radius = 25
      threshold = 8000
      optim_groups = [{
          'params': other_params,
          'norm': 'Auto',
          'norm_kwargs': {},
          'scale': radius,
      }, {
          'params': head_weights,
          'norm': 'Sign',
          'norm_kwargs': {},
          'scale': radius*20,
      }, {
          'params': [pos_embed_param, cls_token_param],
          'norm': 'BiasRMS',
          'norm_kwargs': {},
          'scale': radius,
      }]
      optimizer = UnconstrainedClippedScion(optim_groups, lr=8e-5, momentum=0.1, rho=threshold)
Citation
If you find this work useful, please cite it as follows:
@article{pethick2025generalized,
title={Generalized Gradient Norm Clipping \& Non-Euclidean $(L_0, L_1)$-Smoothness},
author={Pethick, Thomas and Xie, Wanyun and Erdogan, Mete and Antonakopoulos, Kimon and Silveti-Falls, Tony and Cevher, Volkan},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}