LLM fine-tuning experiments
May 14, 2026
Our code is primarily based on ZO-Muon, LOZO, HiZOO and MeZO.
Installation
conda create -n zo python==3.9.19
conda activate zo
pip install -r requirements.txt
This environment supports fine-tuning the OPT, Llama3 and Gemma2 models.
Usage
Our proposed methods
Below is an example command for fine-tuning OPT-13B on RTE with our proposed ZO-MOPI.
CUDA_VISIBLE_DEVICES=0 MODEL=facebook/opt-13b TASK=RTE MODE=ft LR=1e-2 BS=16 EPS=1e-3 RANK=64 STEP_INTERVAL=500 MULTIPLE_SAMPLE=True NUM_SAMPLES=8 STEPS=4000 EVAL_STEPS=1000 bash scripts/zo_mopi.sh
Zeroth-Order Baselines
Run MeZO via:
CUDA_VISIBLE_DEVICES=0 MODEL=facebook/opt-13b TASK=RTE MODE=ft LR=1e-7 BS=16 EPS=1e-3 STEPS=20000 EVAL_STEPS=20000 bash scripts/mezo.sh
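For orientation, EPS and LR in the command above play the roles of the perturbation scale and the step size in a two-point zeroth-order (SPSA-style) gradient estimate, which is the kind of estimate MeZO builds on. The following is a minimal pure-Python sketch of that idea on a toy quadratic, not the code in scripts/mezo.sh: the names zo_step and quadratic are hypothetical, and the toy values of lr and eps are chosen for the toy problem only.

```python
import random


def zo_step(params, loss_fn, lr, eps, seed):
    """One two-point zeroth-order step on a flat list of floats (illustrative only)."""
    # Seeding the generator makes the direction z reproducible from the seed alone.
    # MeZO uses this property to avoid storing z; this sketch keeps z in a list
    # for readability.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]

    # Two forward passes: loss at theta + eps*z and at theta - eps*z.
    loss_plus = loss_fn([p + eps * zi for p, zi in zip(params, z)])
    loss_minus = loss_fn([p - eps * zi for p, zi in zip(params, z)])

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD-style update along the sampled direction.
    return [p - lr * projected_grad * zi for p, zi in zip(params, z)]


def quadratic(params):
    """Toy loss standing in for a model's training loss (assumption)."""
    return sum(p * p for p in params)


theta = [1.0, -2.0, 0.5]
initial_loss = quadratic(theta)
for step in range(2000):
    theta = zo_step(theta, quadratic, lr=1e-2, eps=1e-3, seed=step)
final_loss = quadratic(theta)
```

Only forward evaluations of the loss are needed, which is why these methods fit a 13B model on a single GPU in the commands above, while the first-order runs use four.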
Run LOZO:
CUDA_VISIBLE_DEVICES=0 MODEL=facebook/opt-13b TASK=RTE MODE=ft LR=1e-7 BS=16 EPS=1e-3 RANK=4 STEP_INTERVAL=100 STEPS=20000 EVAL_STEPS=20000 bash scripts/lozo.sh
For the ZO-Muon variant that computes the gradient orthogonalization via SVD, set $OPT to muon_svd:
CUDA_VISIBLE_DEVICES=0 MODEL=facebook/opt-13b TASK=RTE MODE=ft LR=1e-2 BS=16 EPS=1e-3 RANK=64 STEP_INTERVAL=100 OPT='muon_svd' MULTIPLE_SAMPLE=True NUM_SAMPLES=4 STEPS=8000 EVAL_STEPS=8000 bash scripts/lowdim.sh
Comparing with the Same Runtime
We can compare different ZO methods under the same training runtime by setting $MAX_TIME, which is measured in seconds. For example:
CUDA_VISIBLE_DEVICES=0 MODEL=facebook/opt-13b TASK=RTE MODE=ft LR=1e-2 BS=16 EPS=1e-3 RANK=64 STEP_INTERVAL=100 OPT='muon' MULTIPLE_SAMPLE=True NUM_SAMPLES=4 STEPS=8000 EVAL_STEPS=8000 MAX_TIME=5000 bash scripts/lowdim.sh
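One plausible reading of how STEPS and MAX_TIME interact is that training stops at whichever limit is reached first. The sketch below illustrates that behavior under this assumption; it is not taken from scripts/lowdim.sh, and train_with_budget, FakeClock and the injected clock argument are hypothetical names used only for illustration.

```python
import time


def train_with_budget(step_fn, max_steps, max_time=None, clock=time.monotonic):
    """Call step_fn until max_steps is reached or max_time seconds have elapsed."""
    start = clock()
    steps_done = 0
    for _ in range(max_steps):
        # Check the wall-clock budget before starting another step.
        if max_time is not None and clock() - start >= max_time:
            break
        step_fn()
        steps_done += 1
    return steps_done


class FakeClock:
    """Deterministic stand-in for time.monotonic, so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


# Each simulated training step takes one second of fake time.
fake_clock = FakeClock()
steps_with_budget = train_with_budget(
    lambda: fake_clock.advance(1.0), max_steps=10, max_time=3, clock=fake_clock
)

fake_clock = FakeClock()
steps_without_budget = train_with_budget(
    lambda: fake_clock.advance(1.0), max_steps=10, clock=fake_clock
)
```

With a budget like this, methods whose steps cost more (for example, several perturbation samples per step) complete fewer steps in the same time, which is what makes the comparison runtime-matched rather than step-matched.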
First-Order Methods
Full Adam fine-tuning:
CUDA_VISIBLE_DEVICES=0,1,2,3 MODEL=facebook/opt-13b TASK=SST2 MODE=ft LR=1e-5 bash finetune.sh
LoRA fine-tuning:
CUDA_VISIBLE_DEVICES=0,1,2,3 MODEL=facebook/opt-13b TASK=SST2 MODE=lora LR=1e-5 bash finetune.sh