GeN: Generalized Newton's Method for Learning-Rate-Free Optimization
June 21, 2025
Paper: Gradient Descent with Generalized Newton's Method (ICLR 2024)
Repository Overview
This repository contains the code and examples for the Generalized Newton's method (GeN), a learning-rate-free optimizer. It supports a wide range of models and tasks, including:
- Image classification (CIFAR10/CIFAR100/ImageNet... datasets with ViT/ResNet models)
- Natural language generation (E2E/DART... datasets with GPT2 models)
- Natural language understanding (SST2/QNLI/MNLI... datasets with BERT/RoBERTa models)
- Object detection / instance segmentation
- Recommendation system
Example scripts are provided for each task in the examples/ directory. The core implementation of the GeN optimizer is in GeN/; it has roughly the same speed and memory cost as the base optimizers.
Quickstart
Installation
Install the package from PyPI:
pip install gen-optim
Alternatively, install the latest version directly from GitHub:
pip install git+https://github.com/ShiyunXu/gen-optim
Minimal Training Loop
To use GeN in your PyTorch training loop, simply add two lines between backward() and optimizer.step():
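import torch.nn.functional as F
from torch.optim import AdamW
from GeN import lr_parabola
optimizer = AdamW(model.parameters(), lr=1e-4)
tr_iter = iter(train_loader)  # iterator handed to lr_parabola
# Standard training pipeline
for batch_idx, (batch, labels) in enumerate(train_loader):
    loss = F.cross_entropy(model(batch), labels)
    loss.backward()
    # GeN: refresh the learning rate every lazy_freq iterations
    if (batch_idx + 1) % lazy_freq == 0:
        lr_parabola(model, optimizer, tr_iter=tr_iter, task='image_cls', scale=scale)
    optimizer.step()
    optimizer.zero_grad()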
- `scale` can be used to enable the horizon-aware learning rate (e.g., `np.linspace(1, 0, epochs+1)`).
- Call `lr_parabola` infrequently (a.k.a. lazy update) by setting `lazy_freq>=4` for efficiency.
- Different `task` values need different forward passes, which can be customized.
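As a rough illustration of how these options interact, the sketch below mimics a training loop without any model. It builds a linear one-to-zero schedule in plain Python (the same values as `np.linspace(1, 0, epochs+1)`) and lists the batch indices at which a lazy learning-rate update would fire. The helper names `linear_scale` and `lazy_update_steps`, and the choice to index `scale` by epoch, are assumptions made for this example and are not part of the GeN API.

```python
def linear_scale(epochs):
    """Linear one-to-zero schedule with epochs+1 points, like np.linspace(1, 0, epochs+1)."""
    return [1.0 - i / epochs for i in range(epochs + 1)]


def lazy_update_steps(num_batches, lazy_freq):
    """Return the batch indices at which the learning rate would be refreshed."""
    return [b for b in range(num_batches) if (b + 1) % lazy_freq == 0]


epochs, num_batches, lazy_freq = 4, 10, 4
scales = linear_scale(epochs)
steps = lazy_update_steps(num_batches, lazy_freq)

for epoch in range(epochs):
    # Hypothetical usage: one scale value per epoch would be passed as `scale=`.
    print(f"epoch {epoch}: scale={scales[epoch]:.2f}, lr refreshed at batches {steps}")
```

With `lazy_freq = 4` and 10 batches per epoch, only two of the ten iterations pay for the extra learning-rate computation.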
Function Overview
The main function is lr_parabola, which adapts the learning rate by fitting a quadratic curve to the loss landscape, with minimal code changes and computational overhead. This enables learning-rate-free optimization and leverages Hessian information, like the Newton–Raphson method.
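To make the curve-fitting idea concrete, here is a minimal, dependency-free sketch. It evaluates a toy loss at three trial step sizes along the update direction, fits a parabola through those points, and takes the parabola's minimizer as the learning rate. The names `fit_parabola_lr`, `toy_loss`, and `loss_at`, and the choice of trial points, are assumptions for illustration; this is not the code of `lr_parabola`.

```python
def fit_parabola_lr(loss_at, eta):
    """Fit a parabola through the loss at step sizes -eta, 0, +eta; return its minimizer.

    loss_at(s) must return the loss after moving by s along the update direction.
    Returns None when the fitted curve is not convex (no finite minimizer).
    """
    l_minus, l_zero, l_plus = loss_at(-eta), loss_at(0.0), loss_at(eta)
    slope = (l_plus - l_minus) / (2 * eta)                    # first-order coefficient
    curvature = (l_plus - 2 * l_zero + l_minus) / (eta ** 2)  # second-order coefficient
    if curvature <= 0:
        return None
    return -slope / curvature


def toy_loss(w):
    # Ill-conditioned quadratic: L(w) = 0.5 * (w0^2 + 10 * w1^2)
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


w = [1.0, 1.0]
g = [w[0], 10.0 * w[1]]  # gradient of toy_loss at w, used as the update direction


def loss_at(s):
    # Loss after the step w - s * g
    return toy_loss([wi - s * gi for wi, gi in zip(w, g)])


lr = fit_parabola_lr(loss_at, eta=0.01)
print(lr)
```

Because the toy loss is exactly quadratic, the fitted value matches the exact minimizer along this direction, 101/1001 (about 0.1009), up to floating-point error. Only loss evaluations are needed, so no explicit Hessian is formed.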
Mathematically, we turn any base optimizer (e.g., SGD or AdamW) into a GeN optimizer by replacing its fixed learning rate with the minimizer of the fitted quadratic.
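A sketch of the underlying calculation, under assumed notation that does not come from this README: let $w_t$ be the parameters, $g_t$ the base optimizer's update direction, $G_t$ the gradient of the loss $L$ at $w_t$, and $H_t$ its Hessian. A second-order expansion of the loss along the update direction reads

$$L(w_t - \eta\, g_t) \approx L(w_t) - \eta\, G_t^\top g_t + \frac{\eta^2}{2}\, g_t^\top H_t\, g_t ,$$

and, when $g_t^\top H_t\, g_t > 0$, minimizing the right-hand side over $\eta$ gives

$$\eta_t^* = \frac{G_t^\top g_t}{g_t^\top H_t\, g_t}, \qquad w_{t+1} = w_t - \eta_t^*\, g_t .$$

As a sanity check, choosing $g_t = H_t^{-1} G_t$ yields $\eta_t^* = 1$, which is the Newton–Raphson step.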
To enable horizon-aware GeN, analogous to cosine or linear learning-rate decay, we use a hyperparameter-free one-to-zero decay (controlled by `scale`).
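In symbols, as an assumed sketch rather than the repository's exact formula: write $\eta_t^*$ for the learning rate obtained from the quadratic fit at epoch (or step) $t$ out of $T$ in total. A linear one-to-zero decay would then use

$$\eta_t = s_t\, \eta_t^*, \qquad s_t = 1 - \frac{t}{T}, \qquad t = 0, 1, \dots, T ,$$

so that $s_0 = 1$ leaves the fitted rate unchanged at the start and $s_T = 0$ brings it to zero at the end of training. The values $s_0, \dots, s_T$ are the entries of `np.linspace(1, 0, T+1)`.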
Highlights
Synthetic data
Figure 1: Beale (convex) trajectories (all).
Figure 2: Beale (convex) ours vs non-ours.
Figure 3: Rosenbrock (non-convex) trajectories (all).
Figure 4: Rosenbrock (non-convex) ours vs non-ours.
Image Classification
Natural Language Understanding
Natural Language Generation
Object Detection & Instance Segmentation
Recommendation System
Citation
If you use GeN in your research, please cite:
@inproceedings{bu2024gradient,
title={Gradient descent with generalized Newton's method},
author={Bu, Zhiqi and Xu, Shiyun},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2024}
}