RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
May 27, 2025
This repository contains the code for RoSTE, introduced in our work "RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models".



Installation
Install the environment
git clone https://github.com/Hong-Lab-UMN-ECE/RoSTE
cd RoSTE
conda create -n roste python==3.10 -y
conda activate roste
pip install -r requirements.txt
Install the Fast Hadamard Transform package
git clone https://github.com/Dao-AILab/fast-hadamard-transform
cd fast-hadamard-transform
pip install .
Usage
We provide the details of TL;DR summarization experiments for the Pythia and Qwen models.
Dataset: TL;DR
Models:
Pre-Processing (optional)
To preserve computational invariance under rotation, Pythia models must first be converted to the Llama/Qwen-style architecture. Specifically, we modify the Q, K, and V linear layers of the multi-head self-attention (MHSA) module in Pythia. The converted Pythia models are also compatible with QuaRot and SpinQuant.
python convert_pythia_to_llama_format.py --legacy_model_dir EleutherAI/pythia-1b-deduped --new_model_dir ./save/pythia-1b/ckpt/pythia-1b-deduped-new
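As a rough illustration of what such a conversion involves, the sketch below splits a fused QKV projection into separate Q, K, and V matrices and checks that the outputs match. It assumes the fused weight stores each head's query, key, and value rows contiguously (the layout used by GPT-NeoX-style attention); the function name `split_fused_qkv` and the toy sizes are hypothetical, and this is not the code in `convert_pythia_to_llama_format.py`.

```python
import numpy as np

def split_fused_qkv(fused_weight, num_heads):
    """Split a fused QKV weight of shape (3 * hidden, hidden) into Q, K, V.

    Assumption: rows are laid out per head as [q_h, k_h, v_h].
    """
    three_hidden, hidden = fused_weight.shape
    head_dim = hidden // num_heads
    assert three_hidden == 3 * hidden and head_dim * num_heads == hidden
    # View the rows as (head, {q, k, v}, head_dim) and pick one slot at a time.
    w = fused_weight.reshape(num_heads, 3, head_dim, hidden)
    q = w[:, 0].reshape(hidden, hidden)
    k = w[:, 1].reshape(hidden, hidden)
    v = w[:, 2].reshape(hidden, hidden)
    return q, k, v

rng = np.random.default_rng(0)
hidden, num_heads, n_tokens = 8, 2, 5
head_dim = hidden // num_heads
fused = rng.standard_normal((3 * hidden, hidden))
x = rng.standard_normal((n_tokens, hidden))

q_w, k_w, v_w = split_fused_qkv(fused, num_heads)

# Reference path: project with the fused weight, then slice per head.
fused_out = (x @ fused.T).reshape(n_tokens, num_heads, 3 * head_dim)
q_ref = fused_out[..., :head_dim].reshape(n_tokens, hidden)
k_ref = fused_out[..., head_dim:2 * head_dim].reshape(n_tokens, hidden)
v_ref = fused_out[..., 2 * head_dim:].reshape(n_tokens, hidden)

print(np.allclose(x @ q_w.T, q_ref),
      np.allclose(x @ k_w.T, k_ref),
      np.allclose(x @ v_w.T, v_ref))
```

With separate Q, K, and V matrices, each projection can be rotated or quantized independently, which is the form rotation-based methods typically expect.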
Optimal Rotation Configuration Search

We treat R1 as an offline rotation that is merged into the weights, and R2, R3, and R4 as online rotations during training. Before inference, R2 can also be merged into the weights.
We first fuse the normalization layers (LayerNorm / RMSNorm) into the weights and apply R1.
python rotate_model_r1.py --model_dir Qwen/Qwen2.5-0.5B --is_tldr_data --is_rotate_R1 --is_save --rotated_model_dir ./save/qwen2.5-0.5b/ckpt/qwen2.5-0.5b-r1
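The reason the norm is fused before R1 is applied can be seen in a small numerical check. The sketch below is a toy example, not the logic of `rotate_model_r1.py`: it uses a single RMSNorm followed by one linear layer, a random orthogonal matrix standing in for R1, and made-up dimensions. It shows that once the RMSNorm scale is folded into the weight, rotating the activations and counter-rotating the weight leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 16, 8, 4

x = rng.standard_normal((n_tokens, d_in))
gamma = rng.standard_normal(d_in)        # RMSNorm scale (toy values)
W = rng.standard_normal((d_in, d_out))   # linear layer, y = h @ W

def rms_normalize(x, eps=1e-6):
    """RMSNorm without the learned scale."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

# Original computation: scaled RMSNorm followed by the linear layer.
y_ref = (rms_normalize(x) * gamma) @ W

# Step 1: fold the norm scale into the weight, leaving a scale-free norm.
W_fused = gamma[:, None] * W
y_fused = rms_normalize(x) @ W_fused

# Step 2: rotate activations with an orthogonal R1 and merge R1^T into the weight.
R1, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
W_rot = R1.T @ W_fused
y_rot = (rms_normalize(x) @ R1) @ W_rot

# A scale-free RMSNorm commutes with R1, so the residual stream itself can be rotated.
y_rot_stream = rms_normalize(x @ R1) @ W_rot

print(np.allclose(y_ref, y_fused), np.allclose(y_ref, y_rot),
      np.allclose(y_ref, y_rot_stream))
```

The last check only holds because the per-channel scale was moved out of the norm first; an elementwise scale does not commute with a general rotation.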
Then we compute the quantization error.
python rotate_model_r234_quant_error.py --model_dir ./save/qwen2.5-0.5b/ckpt/qwen2.5-0.5b-r1 --output_folder ./rotation_config/qwen/
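To give a sense of what a quantization error comparison measures, the following sketch quantizes a synthetic activation matrix to 4 bits with and without a Hadamard rotation and reports the mean squared error of each. Everything here is an assumption for illustration: the symmetric per-token quantizer, the 4-bit width, the synthetic data with one outlier channel, and the helper names. The actual metric and quantizer in `rotate_model_r234_quant_error.py` may differ.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_per_token(x, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def quant_error(x, bits=4):
    """Mean squared error between x and its quantized version."""
    return float(np.mean((x - quantize_per_token(x, bits)) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
x[:, 0] *= 50.0  # one outlier channel, a common pattern in LLM activations

H = hadamard(64)
err_plain = quant_error(x)
err_rotated = quant_error(x @ H)  # H is orthogonal, so errors are comparable
print(f"no rotation: {err_plain:.4f}, with rotation: {err_rotated:.4f}")
```

The rotation spreads the outlier channel across all dimensions, so the per-token scale shrinks and the remaining values are no longer rounded to zero.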
Next, we search for the optimal rotation configuration based on the two quantization error logs.
python rotate_model_r234_search_config.py --output_folder ./rotation_config/qwen/
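One simple way to think about this search is as a per-site comparison of the two logs. The sketch below assumes, purely for illustration, that one log holds errors without rotation and the other holds errors with rotation, keyed by a site name; the keys, the numbers, and the function `select_rotation_config` are all hypothetical and do not reflect the actual log format or search rule used by `rotate_model_r234_search_config.py`.

```python
def select_rotation_config(err_without, err_with):
    """Enable a rotation only where it strictly lowers the logged error."""
    if err_without.keys() != err_with.keys():
        raise KeyError("both logs must cover the same rotation sites")
    return {name: err_with[name] < err_without[name] for name in err_without}

# Hypothetical error logs (made-up values for illustration only).
err_without = {"layer0.R2": 0.42, "layer0.R4": 0.10, "layer1.R2": 0.05}
err_with = {"layer0.R2": 0.11, "layer0.R4": 0.12, "layer1.R2": 0.05}

config = select_rotation_config(err_without, err_with)
for name, use_rotation in config.items():
    print(name, "rotate" if use_rotation else "keep")
```

Ties are resolved in favor of not rotating here, since an online rotation adds compute without a measured benefit.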
Training
We provide three training methods: SFT, QA-SFT with STE, and QA-SFT with RoSTE.
SFT
accelerate launch \
--config_file configs/ds_z3.yaml \
train_sft.py \
--config configs/recipes/qwen2.5_7b_sft.yaml
QA-SFT with STE
accelerate launch \
--config_file configs/ds_z3.yaml \
train_qa_sft_ste.py \
--config configs/recipes/qwen2.5_7b_qa_sft_ste.yaml
QA-SFT with RoSTE
accelerate launch \
--config_file configs/ds_z3.yaml \
train_qa_sft_roste.py \
--config configs/recipes/qwen2.5_7b_qa_sft_roste.yaml
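For readers unfamiliar with the straight-through estimator (STE) used in quantization-aware training, the sketch below shows the idea on a toy least-squares problem: the forward pass uses quantized weights, while the backward pass treats the quantizer as the identity so the full-precision weights still receive a gradient. The 4-bit per-tensor quantizer, the learning rate, and the synthetic data are assumptions for illustration; this is not the training loop in `train_qa_sft_ste.py` or `train_qa_sft_roste.py`.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (toy quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def ste_step(w, x, y, lr=0.05, bits=4):
    """One gradient step on 0.5 * mean ||x @ Q(w) - y||^2 using the STE."""
    w_q = fake_quantize(w, bits)              # forward pass uses quantized weights
    residual = x @ w_q - y
    loss = 0.5 * np.mean(np.sum(residual ** 2, axis=-1))
    grad_wq = x.T @ residual / x.shape[0]     # exact gradient w.r.t. Q(w)
    grad_w = grad_wq                          # STE: treat dQ(w)/dw as the identity
    return w - lr * grad_w, loss

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16))
w_true = rng.standard_normal((16, 4))
y = x @ w_true

w = np.zeros((16, 4))                         # full-precision "latent" weights
losses = []
for _ in range(200):
    w, loss = ste_step(w, x, y)
    losses.append(loss)

print(f"initial loss: {losses[0]:.3f}, final loss: {losses[-1]:.3f}")
```

Because the scale is set from the maximum absolute weight, nothing is clipped in this toy quantizer; implementations with a fixed clipping range often also zero the gradient for clipped weights.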
Evaluation
We evaluate the final models on the TL;DR test dataset. The evaluation script supports multi-GPU inference.
accelerate launch eval_tldr.py --model_dir Qwen/Qwen2.5-0.5B --method base --batch_size 8
Acknowledgement
Our implementation builds on the open-source projects TL;DR Summarization, Tulu 3, and Hugging Face TRL. The rotation implementation is based on QuaRot and SpinQuant. We sincerely thank these teams for their contributions to open-source research and development.
Citation
If you find our work useful in your research, please consider citing our paper:
@article{wei2025roste,
title={RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models},
author={Wei, Quan and Yau, Chung-Yiu and Wai, Hoi-To and Zhao, Katie Yang and Kang, Dongyeop and Park, Youngsuk and Hong, Mingyi and others},
journal={arXiv preprint arXiv:2502.09003},
year={2025}
}