ASGO: Adaptive Structured Gradient Optimization
October 29, 2025 · View on GitHub
Official implementation of the paper "ASGO: Adaptive Structured Gradient Optimization" by Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang.
Paper: https://arxiv.org/abs/2503.20762
Overview
ASGO is an optimizer that combines adaptive preconditioning with structured gradient optimization for training deep neural networks. This repository provides implementations of ASGO and related optimizers for language model training.
Repository Structure
ASGO/
├── optimizers/ # Core optimizer implementations
│ ├── asgo.py # Adaptive Structured Gradient Optimization
│ ├── dasgo.py # Diagonal Adaptive Structured Gradient Optimization
│ ├── muon.py # Muon optimizer
│ └── shampoo.py # Shampoo optimizer
├── LLMmodels/ # Language model training scripts
│ ├── config/ # Hydra configuration files
│ ├── utils/ # Training utilities
│ └── main.py # Main training script
└── requirements.txt # Python dependencies
Installation
pip install -r requirements.txt
Quick Start
To run the training script:
cd LLMmodels
bash script.sh
The script will interactively prompt for:
- CUDA_VISIBLE_DEVICES (e.g., 0,1,2,3)
- Optimizer choice (e.g., asgo, muon, shampoo)
- Weights & Biases project name
- Dataset Path to download and load the Data
Manual Training
For more control, you can run the training script directly:
python -m torch.distributed.run \
--standalone \
--nproc_per_node=4 \
LLMmodels/main.py \
dataset=openwebtext \
model=gpt2 \
optimizer=asgo \
train.batch_size=32 \
train.train_steps=2400
Available Optimizers
asgo: Adaptive Structured Gradient Optimization (main method)dasgo: Diagonal ASGO variantmuon: Muon optimizershampoo: Shampoo optimizeradamw: adamw optimizer
Configuration
Training configurations are managed using Hydra. Configuration files are located in LLMmodels/config/:
model/: Model architectures (gpt2, nanogpt)optimizer/: Optimizer hyperparametersdataset/: Dataset configurationstrain_config.yaml: Main training configuration
Override any parameter from the command line:
python LLMmodels/main.py optimizer.learning_rate=0.01 train.batch_size=64
Citation
@inproceedings{an2025asgo,
title={ASGO: Adaptive Structured Gradient Optimization},
author={An, Kang and Liu, Yuxing and Pan, Rui and Ren, Yi and Ma, Shiqian and Goldfarb, Donald and Zhang, Tong},
booktitle={NeurIPS},
year={2025},
url={https://arxiv.org/abs/2503.20762}
}
License
MIT License