MoFaSGD: Low-rank Momentum Factorization for Memory Efficient Training
July 14, 2025
This repository contains the official implementation for the paper "Low-rank Momentum Factorization for Memory Efficient Training" (TMLR, 2025), which introduces Momentum Factorized SGD (MoFaSGD).
MoFaSGD is a memory-efficient optimizer that enables full-parameter model updates with a memory footprint comparable to parameter-efficient fine-tuning (PEFT) methods like LoRA. It achieves this by maintaining and dynamically updating a low-rank factorization of the first-order momentum at each training step.
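To make the memory claim concrete, here is a rough back-of-the-envelope sketch of momentum storage for a single weight matrix. It assumes the low-rank momentum is kept as two thin factors plus `r` scaling values (`U` of shape m×r, `V` of shape n×r); this layout and the helper names are illustrative assumptions, not taken from the repository code. The LoRA line counts only the adapter parameters for the same rank.

```python
def full_momentum_elems(m: int, n: int) -> int:
    """Elements needed to store a dense momentum buffer for an m x n weight."""
    return m * n


def factored_momentum_elems(m: int, n: int, r: int) -> int:
    """Elements for a rank-r factorization: U (m x r), V (n x r), r scalars.

    Assumed layout for illustration only.
    """
    return r * (m + n + 1)


def lora_adapter_elems(m: int, n: int, r: int) -> int:
    """Trainable elements in a rank-r LoRA adapter: A (r x n) and B (m x r)."""
    return r * (m + n)


# Example sizes chosen for illustration: a 4096 x 4096 layer at rank 16.
m, n, r = 4096, 4096, 16
full = full_momentum_elems(m, n)
low_rank = factored_momentum_elems(m, n, r)
lora = lora_adapter_elems(m, n, r)

print(f"dense momentum:    {full:,} elements")
print(f"factored momentum: {low_rank:,} elements")
print(f"LoRA adapter:      {lora:,} elements")
print(f"reduction factor:  {full / low_rank:.1f}x")
```

Under these assumptions the factored momentum is about the same size as a LoRA adapter of equal rank, which is the sense in which the two footprints are comparable.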
This codebase is built upon the modded-nanogpt repository and is designed to run the key experiments from the paper.
Installation
- Clone the Repository
  Clone the repository and initialize the GaLore submodule, which is used as a baseline in the paper.
  git clone --recurse-submodules https://github.com/pmahdavi/MoFaSGD.git
  cd MoFaSGD
  If you have already cloned the repository without the submodules, you can initialize them with:
  git submodule update --init --recursive
- Create and Activate the Conda Environment
  The provided environment.yml file contains all the necessary dependencies, including the specific PyTorch nightly build required for a consistent experimental setup.
  conda env create -f environment.yml
  conda activate mofasgd
  The conda env create command creates a new conda environment named mofasgd and installs all packages, including the local GaLore submodule.
Running Paper Experiments
This repository runs the NanoGPT pre-training experiments from the paper.
How it Works
The run.py script automatically loads the base configuration for the chosen optimizer from the corresponding YAML file in configs/optimizers/. You can then override any of these settings using the --config argument with a JSON string. The base configurations are set for rank 32 runs.
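The override behaviour described above can be pictured as a merge of two dictionaries. The sketch below is not the code in run.py: the function name `merge_overrides`, the recursive handling of nested keys, and the placeholder base values are all assumptions made for illustration. A plain dict stands in for the parsed YAML file.

```python
import copy
import json


def merge_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` applied on top.

    Nested dicts are merged key by key (an assumption; the real script
    may replace nested sections wholesale instead).
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Placeholder stand-in for a parsed configs/optimizers/*.yaml file.
# Only the rank-32 default comes from the README; other values are made up.
base_config = {"lr": 0.001, "rank": 32, "group_params": {"rank": 32, "scale": 0.25}}

# The JSON string that would be passed via --config.
override_json = '{"rank": 16, "group_params": {"rank": 16}}'

config = merge_overrides(base_config, json.loads(override_json))
print(config)
```

Keys absent from the JSON string keep their base values, so an override only needs to list the settings that differ from the rank-32 defaults.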
Example Commands
MoFaSGD (Ours)
This command runs the MoFaSGD experiment with rank 16 on 4 GPUs. It loads the base settings from configs/optimizers/mfsgd.yaml and applies the hyperparameter overrides for the rank-16 setting.
python run.py --optimizer mfsgd --num-gpus 4 \
--config '{"lr": 0.0009, "rank": 16, "beta": 0.95, "eta1": 0.25, "eta2": 0, "use_current_projection": true, "use_ones_for_nonzero_s": false, "nesterov": false, "eps": 1e-6, "max_value": 1000, "warmup_steps": 300, "beta_start": 0.75, "beta_end": 0.95, "cooldown_frac": 0.4}'
GaLore (Baseline)
This command runs the GaLore baseline experiment with rank 16 on 4 GPUs. It loads the base settings from configs/optimizers/galore.yaml and applies the overrides for the rank-16 setting.
python run.py --optimizer galore --num-gpus 4 \
--config '{"lr": 0.008, "group_params": {"rank": 16, "update_proj_gap": 150, "scale": 0.25, "proj_type": "std"}}'
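Long JSON strings are easy to break with shell quoting, so when launching many runs it can help to build the command from a Python dict. The sketch below uses only the flags shown in the commands above (`--optimizer`, `--num-gpus`, `--config`) and the GaLore override values from this README; the wrapper itself is a hypothetical convenience and is not part of the repository.

```python
import json
import shlex

# Override values copied from the GaLore example command above.
overrides = {
    "lr": 0.008,
    "group_params": {
        "rank": 16,
        "update_proj_gap": 150,
        "scale": 0.25,
        "proj_type": "std",
    },
}

# json.dumps produces valid JSON (e.g. Python True becomes true),
# so the string can be passed directly as the --config value.
cmd = [
    "python", "run.py",
    "--optimizer", "galore",
    "--num-gpus", "4",
    "--config", json.dumps(overrides),
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The same list can be handed to `subprocess.run(cmd, check=True)` to launch the run without going through a shell at all.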
LLaMA-Factory Implementation
For the instruction-tuning experiments on LLaMA-3.1, as detailed in the paper, please see our other repository: pmahdavi/llama-factory-mfsgd. That repository contains the implementation of MoFaSGD within the LLaMA-Factory framework.
Citation
If you find our work useful in your research, please consider citing our paper:
@article{mahdavinia2025lowrank,
title={Low-rank Momentum Factorization for Memory Efficient Training},
author={Pouria Mahdavinia and Mehrdad Mahdavi},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=W3D3TVo9a3},
note={}
}
License
This project is licensed under the MIT License. See the LICENSE file for details.