GeN: Generalized Newton's Method for Learning-Rate-Free Optimization 🚀

June 21, 2025


Paper: Gradient Descent with Generalized Newton's Method (ICLR 2024)


📦 Repository Overview

This repository contains the code and examples for the Generalized Newton's method (GeN), a learning-rate-free optimizer. It supports a wide range of models and tasks, including:

  • 🖼️ Image classification (CIFAR10/CIFAR100/ImageNet... datasets with ViT/ResNet models)
  • 📝 Natural language generation (E2E/DART... datasets with GPT2 models)
  • 📊 Natural language understanding (SST2/QNLI/MNLI... datasets with BERT/RoBERTa models)
  • 🕵️‍♂️ Object detection / Instance segmentation
  • 🎯 Recommendation system

Example scripts for each task are provided in the examples/ directory. The core implementation of the GeN optimizer is in GeN/; it has roughly the same speed and memory cost as the base optimizers.

⚡ Quickstart

🛠️ Installation

Install the package from PyPI:

pip install gen-optim

Alternatively, install the latest version directly from GitHub:

pip install git+https://github.com/ShiyunXu/gen-optim

🏃 Minimal Training Loop

To use GeN in your PyTorch training loop, add two lines between backward() and optimizer.step():

import torch.nn.functional as F
from torch.optim import AdamW
from GeN import lr_parabola

optimizer = AdamW(model.parameters(), lr=1e-4)
tr_iter = iter(train_loader)  # iterator handed to lr_parabola

# Standard training pipeline; lazy_freq and scale are set by the user (see notes below)
for batch_idx, (batch, labels) in enumerate(train_loader):
    loss = F.cross_entropy(model(batch), labels)
    loss.backward()
    # The two added lines: refresh the learning rate every lazy_freq-th batch
    if (batch_idx + 1) % lazy_freq == 0:
        lr_parabola(model, optimizer, tr_iter=tr_iter, task='image_cls', scale=scale)
    optimizer.step()
    optimizer.zero_grad()

  • scale enables the horizon-aware learning rate (e.g., np.linspace(1,0,epochs+1)).
  • For efficiency, call lr_parabola infrequently (a.k.a. lazy update) by setting lazy_freq>=4.
  • Each task value uses a different forward pass, and this can be customized.
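
The notes on `scale` and `lazy_freq` can be sketched numerically. This is a minimal sketch, assuming one `scale` value is picked per epoch; how `lr_parabola` consumes `scale` is not stated here, so that indexing is an assumption, and `epochs` and `steps_per_epoch` are hypothetical values.

```python
import numpy as np

epochs = 5            # hypothetical number of training epochs
steps_per_epoch = 8   # hypothetical number of batches per epoch
lazy_freq = 4         # refresh the learning rate every 4th batch

# One-to-zero decay with epochs + 1 evenly spaced points, as in the note above.
scale_schedule = np.linspace(1, 0, epochs + 1)


def should_update_lr(batch_idx: int, lazy_freq: int) -> bool:
    """Same check as `(batch_idx + 1) % lazy_freq == 0` in the training loop."""
    return (batch_idx + 1) % lazy_freq == 0


for epoch in range(epochs):
    # Assumption: a single scalar per epoch is what gets passed as `scale`.
    scale = scale_schedule[epoch]
    update_steps = [i for i in range(steps_per_epoch) if should_update_lr(i, lazy_freq)]
    print(f"epoch {epoch}: scale={scale:.2f}, lr refreshed at batches {update_steps}")
```

With `lazy_freq = 4`, only a quarter of the batches trigger the extra learning-rate computation.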

🧩 Function Overview

The main function is lr_parabola, which adapts the learning rate by fitting a quadratic curve to the loss landscape, with minimal code changes and computational overhead. This enables learning-rate-free optimization and leverages Hessian information, as in the Newton–Raphson method.
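
To illustrate the quadratic curve fitting idea, here is a minimal sketch on a one-dimensional toy loss. It is not the `lr_parabola` implementation: the helper `fit_parabola_lr`, the probe step `h`, and the toy `loss`/`grad` functions are hypothetical names chosen for illustration. The sketch evaluates the loss at three step sizes, fits a parabola through them, and takes the parabola's minimizer as the learning rate.

```python
def fit_parabola_lr(loss_at, h):
    """Fit L(eta) ~ a*eta^2 + b*eta + c through eta in {-h, 0, +h}.

    Returns the minimizer -b / (2a), or None if the fitted parabola is not convex.
    """
    loss_minus, loss_zero, loss_plus = loss_at(-h), loss_at(0.0), loss_at(h)
    a = (loss_plus + loss_minus - 2.0 * loss_zero) / (2.0 * h * h)
    b = (loss_plus - loss_minus) / (2.0 * h)
    if a <= 0.0:
        return None  # no finite minimizer; a caller could keep its current lr
    return -b / (2.0 * a)


# Toy quadratic loss with curvature 3 and minimum at w = 2 (illustrative only).
def loss(w):
    return 1.5 * (w - 2.0) ** 2


def grad(w):
    return 3.0 * (w - 2.0)


w = 0.0
g = grad(w)
# Loss as a function of the step size eta along the update direction g.
eta = fit_parabola_lr(lambda e: loss(w - e * g), h=0.1)
w_next = w - eta * g
print(eta, w_next)
```

On this quadratic toy loss the fitted step reaches the minimizer in a single update, which is what a Newton step would do.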

Mathematically, we turn any base optimizer (e.g. SGD or AdamW) into the GeN optimizer by

$$w_{t+1} = w_t - \frac{G_t^\top g_t}{g_t^\top H_t g_t}\, g_t$$

where w_t denotes the model parameters at step t, g_t is the stochastic pre-conditioned gradient, G_t is the oracle gradient and H_t is the oracle Hessian.
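
The step size in front of g_t can be motivated by a second-order Taylor expansion of the loss along the update direction. The following is a sketch under stated assumptions: L denotes the loss, w_t the parameters at step t, and \eta a scalar step size, none of which are named in the text above.

```latex
% Second-order expansion of the loss along the direction g_t
L(w_t - \eta g_t) \approx L(w_t) - \eta\, G_t^\top g_t + \frac{\eta^2}{2}\, g_t^\top H_t g_t

% Setting the derivative with respect to \eta to zero (valid when g_t^\top H_t g_t > 0)
-\,G_t^\top g_t + \eta\, g_t^\top H_t g_t = 0
\quad\Longrightarrow\quad
\eta_t^{*} = \frac{G_t^\top g_t}{g_t^\top H_t g_t}

% Sanity check: with g_t = H_t^{-1} G_t (and H_t symmetric), \eta_t^{*} = 1,
% which recovers the Newton--Raphson step w_{t+1} = w_t - H_t^{-1} G_t.
```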

To enable horizon-aware GeN, analogous to cosine or linear learning-rate decay, we use a hyperparameter-free one-to-zero decay (controlled by `scale`):

[Two equation images, both captioned "Update rule"]

✨ Highlights

🧪 Synthetic data

Figure 1: Beale (convex) trajectories (all).
Figure 2: Beale (convex) ours vs non-ours.
Figure 3: Rosenbrock (non-convex) trajectories (all).
Figure 4: Rosenbrock (non-convex) ours vs non-ours.
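
For readers who want to reproduce trajectories like these, the two test functions have standard textbook definitions, sketched below. The exact variants, starting points, and optimizers behind the figures are not stated here, so treat the default constants (`a=1`, `b=100` for Rosenbrock) as assumptions.

```python
def beale(x, y):
    """Standard Beale function; global minimum 0 at (3, 0.5)."""
    return (
        (1.5 - x + x * y) ** 2
        + (2.25 - x + x * y ** 2) ** 2
        + (2.625 - x + x * y ** 3) ** 2
    )


def rosenbrock(x, y, a=1.0, b=100.0):
    """Standard Rosenbrock function; global minimum 0 at (a, a**2)."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


# Both functions vanish at their known minimizers.
print(beale(3.0, 0.5), rosenbrock(1.0, 1.0))
```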

🖼️ Image Classification

📊 Natural Language Understanding

📝 Natural Language Generation

🕵️‍♂️ Object Detection & Instance Segmentation

🎯 Recommendation System

📚 Citation

If you use GeN in your research, please cite:

@inproceedings{bu2024gradient,
  title={Gradient descent with generalized newton's method},
  author={Bu, Zhiqi and Xu, Shiyun},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2024}
}