RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models

May 27, 2025

This repository contains the code for RoSTE, introduced in our paper "RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models".

Installation

Install the environment:

git clone https://github.com/Hong-Lab-UMN-ECE/RoSTE
cd RoSTE

conda create -n roste python==3.10 -y
conda activate roste
pip install -r requirements.txt

Install the Fast Hadamard Transform package:

git clone https://github.com/Dao-AILab/fast-hadamard-transform
cd fast-hadamard-transform
pip install .

Usage

We provide the details of TL;DR summarization experiments for the Pythia and Qwen models.

Dataset: TL;DR

Models:

To preserve computational invariance under rotation, Pythia models must first be converted to a Llama/Qwen-style architecture. Specifically, we modify the Q, K, and V linear layers of the multi-head self-attention (MHSA) module in Pythia. The converted Pythia models are also compatible with QuaRot and SpinQuant.

python convert_pythia_to_llama_format.py --legacy_model_dir EleutherAI/pythia-1b-deduped --new_model_dir ./save/pythia-1b/ckpt/pythia-1b-deduped-new
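Computational invariance here means that inserting an orthogonal rotation R and its transpose around a linear layer leaves the layer output unchanged. The following is a minimal pure-Python sketch of that property, not the repository's implementation: the helpers `hadamard`, `matmul`, and `transpose` and the toy values are illustrative assumptions, and a normalized Hadamard matrix stands in for whichever rotation the scripts actually apply.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

# Toy activation row (with one outlier) and a toy 4x2 weight matrix.
x = [[0.5, -1.0, 8.0, 0.25]]
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.7, -0.6]]

R = hadamard(4)                       # orthogonal: R @ R^T = I
y_ref = matmul(x, W)                  # original layer: x W
x_rot = matmul(x, R)                  # rotated activations: x R
W_rot = matmul(transpose(R), W)       # rotation merged into weights: R^T W
y_rot = matmul(x_rot, W_rot)          # (x R)(R^T W) = x W

max_diff = max(abs(a - b) for ra, rb in zip(y_ref, y_rot) for a, b in zip(ra, rb))
print(max_diff)  # differs from zero only by floating-point rounding
```

The sketch uses the row-vector convention `y = x W`; frameworks that store weights as (out_features, in_features) need the corresponding transposes.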

Optimal Rotation Configuration Search

Optimal Rotation Configuration

We keep R1 as an offline, mergeable rotation and R2, R3, and R4 as online rotations during training. Before inference, R2 can be merged into the weights.

We first fuse the normalization layers (LayerNorm / RMSNorm) into the weights and apply R1.

python rotate_model_r1.py --model_dir Qwen/Qwen2.5-0.5B --is_tldr_data --is_rotate_R1 --is_save --rotated_model_dir ./save/qwen2.5-0.5b/ckpt/qwen2.5-0.5b-r1
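To illustrate why the norm is fused before rotating, the sketch below folds an RMSNorm scale vector into the following weight matrix and then applies a rotation to both the input and the fused weights. It is a toy pure-Python example under stated assumptions, not the code in `rotate_model_r1.py`: all helper names and values are hypothetical, only RMSNorm is covered (LayerNorm additionally involves mean subtraction and a bias), and a normalized Hadamard matrix is assumed as the rotation.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def vecmat(x, W):
    """Row vector times matrix."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def rmsnorm(x, gamma, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]

x = [0.5, -1.0, 8.0, 0.25]
gamma = [1.5, 0.5, 2.0, 1.0]          # toy RMSNorm scale
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.7, -0.6]]
ones = [1.0] * len(x)

# Reference: RMSNorm with scale, then the linear layer.
y_ref = vecmat(rmsnorm(x, gamma), W)

# Step 1: fold gamma into the weights, leaving a scale-free norm.
W_fused = [[g * w for w in row] for g, row in zip(gamma, W)]
y_fused = vecmat(rmsnorm(x, ones), W_fused)

# Step 2: rotate the input by R and the fused weights by R^T.
# A scale-free RMSNorm commutes with R because R preserves the vector norm.
R = hadamard(4)
W_rot = [vecmat(row, W_fused) for row in zip(*R)]   # rows of R^T times W_fused
y_rot = vecmat(rmsnorm(vecmat(x, R), ones), W_rot)

diff_fused = max(abs(a - b) for a, b in zip(y_ref, y_fused))
diff_rot = max(abs(a - b) for a, b in zip(y_ref, y_rot))
```

Without step 1, the elementwise scale `gamma` would sit between the rotation and the weights, and the rotation could no longer be merged exactly.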

Then we compute the quantization error:

python rotate_model_r234_quant_error.py --model_dir ./save/qwen2.5-0.5b/ckpt/qwen2.5-0.5b-r1 --output_folder ./rotation_config/qwen/
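As a rough illustration of the quantity being measured, the sketch below compares the squared quantization error of a toy activation vector with and without a rotation. It assumes a symmetric per-tensor 4-bit round-to-nearest quantizer and a normalized Hadamard rotation; the quantizer, granularity, and error metric used by `rotate_model_r234_quant_error.py` may differ, and all names here are hypothetical.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def rotate(x, R):
    """Row vector times rotation matrix."""
    return [sum(xi * ri for xi, ri in zip(x, col)) for col in zip(*R)]

def fake_quant(x, bits=4):
    """Symmetric per-tensor round-to-nearest quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in x]

def quant_error(x, bits=4):
    """Sum of squared differences between x and its quantized version."""
    return sum((v - q) ** 2 for v, q in zip(x, fake_quant(x, bits)))

x = [0.5, -1.0, 8.0, 0.25]            # toy activations with one outlier
x_rot = rotate(x, hadamard(4))        # rotation spreads the outlier out

err_plain = quant_error(x)
err_rot = quant_error(x_rot)
print(err_plain, err_rot)
```

In this toy case the rotated vector has the smaller error because the outlier no longer dominates the quantization scale. Rotation is not guaranteed to help for every tensor, which is why the errors are logged and compared.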

Next, we search for the optimal rotation configuration based on the two quantization error logs.

python rotate_model_r234_search_config.py --output_folder ./rotation_config/qwen/
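One simple way to picture this search, assuming the two logs hold per-layer errors with and without rotation, is to pick whichever option gives the lower error for each layer. This is only a sketch: the log format, the layer names, the numbers, and the selection rule below are hypothetical and are not taken from `rotate_model_r234_search_config.py`.

```python
# Hypothetical per-layer quantization errors (made-up values for illustration).
err_without_rotation = {
    "layers.0.self_attn.o_proj": 0.333,
    "layers.0.mlp.down_proj": 0.120,
    "layers.1.self_attn.o_proj": 0.045,
}
err_with_rotation = {
    "layers.0.self_attn.o_proj": 0.102,
    "layers.0.mlp.down_proj": 0.150,
    "layers.1.self_attn.o_proj": 0.030,
}

def choose_rotation(err_plain, err_rotated):
    """Return {layer: True/False}, True meaning 'rotate this layer'."""
    return {name: err_rotated[name] < err_plain[name] for name in err_plain}

rotation_config = choose_rotation(err_without_rotation, err_with_rotation)
for name, use_rotation in rotation_config.items():
    print(name, "rotate" if use_rotation else "keep")
```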

Training

We provide three training methods: SFT, QA-SFT with STE, and QA-SFT with RoSTE.

SFT

accelerate launch \
    --config_file configs/ds_z3.yaml \
    train_sft.py \
    --config configs/recipes/qwen2.5_7b_sft.yaml

QA-SFT with STE

accelerate launch \
    --config_file configs/ds_z3.yaml \
    train_qa_sft_ste.py \
    --config configs/recipes/qwen2.5_7b_qa_sft_ste.yaml
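For readers unfamiliar with the straight-through estimator (STE), the sketch below trains a single scalar weight through a quantizer. The forward pass uses the quantized weight, while the backward pass treats the quantizer as the identity so the gradient reaches the full-precision weight. It is a toy pure-Python example, not the logic of `train_qa_sft_ste.py`; the quantizer, step size, and function names are assumptions.

```python
def fake_quant(w, scale=0.25, qmax=7):
    """Round w to the nearest point on a signed 4-bit grid with step `scale`."""
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale

def ste_step(w, x, y, lr=0.05):
    """One SGD step on L = (fake_quant(w) * x - y)^2 using the STE."""
    w_q = fake_quant(w)               # forward pass uses the quantized weight
    err = w_q * x - y
    grad_wq = 2.0 * err * x           # exact gradient with respect to w_q
    grad_w = grad_wq                  # STE: treat d(w_q)/d(w) as 1
    return w - lr * grad_w            # update the full-precision weight

w = 0.1                               # full-precision ("latent") weight
for _ in range(50):
    w = ste_step(w, x=1.0, y=1.0)

print(w, fake_quant(w))
```

The true derivative of the rounding function is zero almost everywhere, so without the STE the latent weight would never move.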

QA-SFT with RoSTE

accelerate launch \
    --config_file configs/ds_z3.yaml \
    train_qa_sft_roste.py \
    --config configs/recipes/qwen2.5_7b_qa_sft_roste.yaml

Evaluation

We evaluate the final models on the TL;DR test set. The evaluation script supports multi-GPU inference.

accelerate launch eval_tldr.py --model_dir Qwen/Qwen2.5-0.5B --method base --batch_size 8

Acknowledgement

Our implementation builds on the open-source projects TL;DR Summarization, Tulu 3, and Hugging Face TRL. The rotation implementation is based on QuaRot and SpinQuant. We sincerely thank these teams for their contributions to open-source research and development.

Citation

If you find our work useful in your research, please consider citing our paper:

@article{wei2025roste,
  title={RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models},
  author={Wei, Quan and Yau, Chung-Yiu and Wai, Hoi-To and Zhao, Katie Yang and Kang, Dongyeop and Park, Youngsuk and Hong, Mingyi and others},
  journal={arXiv preprint arXiv:2502.09003},
  year={2025}
}