Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

November 3, 2025

Implementation of the NeurIPS 2025 paper DiZO.

Installation

pip install -r requirements.txt

Usage

cd to ./large_models first.

Use run.py for all functions (zero-shot/ICL/fine-tuning/MeZO/DiZO):

python run.py {ARGUMENTS}

Please read run.py for a complete list of arguments. We introduce some of the most important ones below.

The first group of arguments is shared with MeZO.

  • --num_train : Number of training examples. For ICL, this is the number of demonstrations.

  • --num_dev : Number of validation examples.

  • --num_test : Number of testing examples.

  • --model_name : HuggingFace model name or path.

  • --task_name : Task name.

  • --trainer : can be none (zero-shot/ICL), regular (fine-tuning), or zo (MeZO).

  • --zo_eps : ZO hyperparameter epsilon (perturbation scale) for the weight update.

  • --prefix_tuning : use prefix-tuning.

  • --lora : use LoRA.


The second group contains the arguments newly introduced in DiZO.

  • --enhanced : whether to add projection (DiZO) to ZO training; requires --trainer=zo.

  • --interval: training step interval at which the projection is updated ($\kappa$).

  • --zo_eps_projection: ZO hyperparameter epsilon for projection update.

  • --step_size_projection: step size for projection.

  • --clip_range: $\tau$ for projection clipping.
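
To make the meaning of --clip_range more concrete, here is a hedged NumPy sketch of one possible per-layer projection with clipping. It assumes that the projection rescales a layer's distance from its pretrained weights to a target magnitude $\gamma$, and that $\gamma$ is clipped to within a fraction $\tau$ of the current distance. Both assumptions, and the names project_layer, theta, and theta_pretrained, are illustrative only; the actual rule is the one marked "DiZO added" in trainer.py.

```python
import numpy as np

def project_layer(theta, theta_pretrained, gamma, clip_range=0.2):
    """Rescale a layer's offset from the pretrained weights (assumed form).

    gamma is the target norm of the offset. It is clipped to
    [(1 - clip_range) * norm, (1 + clip_range) * norm] so that one
    projection cannot change the offset magnitude by more than clip_range.
    """
    delta = theta - theta_pretrained
    norm = float(np.linalg.norm(delta))
    if norm == 0.0:
        # Layer has not moved from the pretrained weights: nothing to rescale.
        return theta.copy(), 0.0
    low, high = (1.0 - clip_range) * norm, (1.0 + clip_range) * norm
    gamma_clipped = float(np.clip(gamma, low, high))
    projected = theta_pretrained + (gamma_clipped / norm) * delta
    return projected, gamma_clipped

theta_pretrained = np.zeros(3)
theta = np.array([3.0, 4.0, 0.0])        # offset norm is 5
projected, gamma_used = project_layer(theta, theta_pretrained,
                                      gamma=10.0, clip_range=0.2)
print(gamma_used, projected)
```

In this toy case the requested $\gamma = 10$ exceeds the allowed upper bound of $1.2 \times 5 = 6$, so the offset is stretched to norm 6 instead of 10. Under the same assumptions, --interval would control how often such a projection is applied during training.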

We provide example scripts below for reproducing our experiments.

# no $\gamma$ projection (original MeZO)
MODEL=facebook/opt-2.7b TASK=SST2 MODE=ft LR=1e-6 EPS=1e-3 STEPS=4000 bash dizo.sh

# use zeroth-order optimization for the $\gamma$ projection search
MODEL=facebook/opt-2.7b TASK=SST2 MODE=ft LR=1e-6 EPS=1e-3 STEPS=4000 ENHANCED=zo ZO_EPS_PROJECTION=0.1 STEP_SIZE_PROJECTION=2.0 CLIP_RANGE=0.2 bash dizo.sh

# use first-order optimization for the $\gamma$ projection search
MODEL=facebook/opt-2.7b TASK=SST2 MODE=ft LR=1e-6 EPS=1e-3 STEPS=4000 ENHANCED=fo INTERVAL=50 bash dizo.sh

Zeroth-order optimization is sensitive to the choice of hyperparameters. Our recommended search ranges for the newly introduced hyperparameters are as follows. Empirically, more aggressive projection works better for smaller models and easier tasks, and vice versa.

| DiZO hyperparameter | Suggested values |
| --- | --- |
| interval | 50/100/200/400 |
| zo_eps_projection | 0.1/0.05 |
| step_size_projection | 2.0/1.0 |
| clip_range | 0.1/0.2/0.3 |
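
The suggested values above span 4 × 2 × 2 × 3 = 48 combinations. The sketch below enumerates them as environment-variable prefixes in the style of the example scripts. The helper names iter_configs and as_env_prefix are hypothetical, and whether dizo.sh reads all four variables in both ENHANCED modes is an assumption to check against the script.

```python
from itertools import product

# Suggested search ranges, keyed by the variable names used with dizo.sh.
SEARCH_SPACE = {
    "INTERVAL": [50, 100, 200, 400],
    "ZO_EPS_PROJECTION": [0.1, 0.05],
    "STEP_SIZE_PROJECTION": [2.0, 1.0],
    "CLIP_RANGE": [0.1, 0.2, 0.3],
}

def iter_configs(space):
    """Yield one dict per point of the Cartesian product of the ranges."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def as_env_prefix(config):
    """Format a config as NAME=value pairs to put in front of a command."""
    return " ".join(f"{name}={value}" for name, value in config.items())

configs = list(iter_configs(SEARCH_SPACE))
print(len(configs), "configurations")
for config in configs[:3]:
    print(as_env_prefix(config), "bash dizo.sh")
```

A full sweep is rarely necessary; starting from the values used in the example commands and varying one hyperparameter at a time is a cheaper first pass.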

We also provide two log files (DiZO and MeZO on SST2) so you can verify that your code is running correctly. The exact numbers may vary slightly across devices.

How to incorporate MeZO or DiZO

Please refer to trainer.py for details. We edited the _inner_training_loop function; replace the original training loop with the edited one. For DiZO, search for "DiZO added" to see where we made changes.

Acknowledgement

Thanks to MeZO (Memory-efficient Zeroth-order Optimization) for open-sourcing their code with a detailed README. If you run into bugs when running DiZO, you may find a solution in MeZO's repo.

Citation

@article{tan2025harmony,
  title={Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning},
  author={Tan, Qitao and Liu, Jun and Zhan, Zheng and Ding, Caiwei and Wang, Yanzhi and Ma, Xiaolong and Lee, Jaewoo and Lu, Jin and Yuan, Geng},
  journal={arXiv preprint arXiv:2502.03304},
  year={2025}
}