OSTQuant

April 8, 2025

Official code for ICLR2025 paper OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting

Motivation

Fig1: Transformation of a batch of data $X \sim \mathcal{N}(\mu, \Sigma)$ using different methods. Eigenvalues $\lambda_1$ and $\lambda_2$ represent the spread of the distribution along the principal axes after eigenvalue decomposition of $\Sigma$. (a) shows the original distribution, while (b), (c), and (d) illustrate the effects of the Smooth-based, Rotate-based, and our OST-based methods, respectively, on QSUR.
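To build intuition for why such transformations raise quantization-space utilization, here is a toy NumPy sketch (not from the repo, and using a fixed Hadamard rotation as a stand-in for the learned orthogonal transform): a single outlier channel forces a huge per-tensor INT4 scale, so most values collapse onto the zero level, while an orthogonal rotation spreads the outlier energy across dimensions and the quantized codes occupy far more levels. We measure utilization here as the entropy of the code histogram, which is an illustrative proxy rather than the paper's QSUR definition.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix (n = power of 2).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_codes(x):
    # Symmetric per-tensor INT4 quantization: integer levels in [-7, 7].
    scale = np.abs(x).max() / 7
    return np.round(x / scale).astype(int)

def code_entropy(codes):
    # Entropy (bits) of the level histogram: higher = better utilization.
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))
X[:, 0] *= 100.0                      # one outlier channel stretches the range

H = hadamard(64)                      # orthogonal: H @ H.T == I
e_plain = code_entropy(int4_codes(X))
e_rot = code_entropy(int4_codes(X @ H))
print(e_plain, e_rot)                 # rotated codes use many more levels
```

Because the rotation is orthogonal, no information is lost in the change of basis; only the shape of the distribution seen by the quantizer changes.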

OSTQuant Diagram

Fig2: The overall flow diagram of OSTQuant. The top section of the figure illustrates how the global orthogonal transformation $R_{res}$, along with the two scaling transformations $S_{attn}$ and $S_{ffn}$, collaborate within each block to adjust the distributions across the entire network while maintaining computational invariance. The bottom section highlights four equivalent transformation pairs applied to the FFN and Self-Attention layers. Each fully-connected (FC) layer's activation and weight are influenced by one or more of these transformation pairs. During runtime, these transformation pairs are fused with the weights, ensuring minimal runtime overhead.
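The computational invariance behind these fused transformation pairs can be sketched in a few lines of NumPy (an illustrative toy, not the repo's implementation; the paper learns the scaling and orthogonal matrices, whereas here they are random): applying a pair $(S^{-1}R, R^{\top}S)$ to an activation and its weight leaves the FC output unchanged, which is why the pairs can be folded into the weights offline.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=(3, d_in))          # a small batch of activations
W = rng.normal(size=(d_in, d_out))      # an FC layer's weight

# A transformation pair: diagonal scaling S and an orthogonal R
# (here a random orthogonal matrix from QR, purely for illustration).
S = np.diag(rng.uniform(0.5, 2.0, size=d_in))
R = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]

x_t = x @ np.linalg.inv(S) @ R          # transformed activation (quantized at runtime)
W_t = R.T @ S @ W                       # pair fused into the weight offline

# x_t @ W_t = x S^{-1} R R^T S W = x W, since R is orthogonal.
assert np.allclose(x_t @ W_t, x @ W)
```

At runtime only `x_t` and `W_t` exist, so the reshaped distributions come at essentially no extra cost beyond any transforms that cannot be fused.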

Evaluate with our optimized transformation matrix weights

We provide the optimized transformation matrix weights on Google Drive.

You can load these checkpoints to evaluate directly:

```shell
python main.py --output_dir output/llama2_w4a16kv16 --model weights/Llama-2-7b-hf \
  --loss_type=kl_top --post_attn=True \
  --rotate_ov=True --rotate_post_rope=False --online_qk_hadamard=False --smooth_qk=True --smooth_ov=True --smooth_up_down=True --smooth_norm_linear=True --bf16=True --lm_eval=True --per_device_train_batch_size=4 \
  --max_steps=100 --a_bits=16 --v_bits=16 --k_bits=16 --down_bits=16 --w_clip=True \
  --train_enable_wquant=True --sub_mean False --train_rotate=False --resume_path=checkpoints/llama2_7b_w4a16kv16.bin
```

Reproduce from scratch

You can reproduce our results with the following commands:

```shell
export CUDA_VISIBLE_DEVICES="0,1,2,3"
sh scripts/w4a16kv16.sh
```

Main Results

Table: Comparison of perplexity on WikiText2 and average accuracy on nine zero-shot tasks. Results for SmoothQuant, GPTQ, OmniQuant, AWQ, and QuaRot were obtained with their official code; SpinQuant results for LLaMA-2/3 use the official weights, and LLaMA-1 results were obtained with the official code.

Each cell reports the average accuracy over nine zero-shot tasks (0-shot⁹ Avg., ↑) / WikiText2 perplexity (Wiki, ↓).

| #Bits (W-A-KV) | Method | LLaMA-3 8B | LLaMA-2 7B | LLaMA-2 13B | LLaMA 7B | LLaMA 13B | LLaMA 30B |
|---|---|---|---|---|---|---|---|
| 16-16-16 | FloatingPoint | 68.09 / 6.14 | 65.21 / 5.47 | 67.61 / 4.88 | 64.48 / 5.68 | 66.67 / 5.09 | 70.00 / 4.10 |
| 4-16-16 | RTN | 63.70 / 8.13 | 61.27 / 7.02 | 60.24 / 6.39 | 62.67 / 7.94 | 63.45 / 8.60 | 65.69 / 6.13 |
| 4-16-16 | SmoothQuant | 62.79 / 8.12 | 58.88 / 8.03 | 62.03 / 5.86 | 62.24 / 7.46 | 62.69 / 18.75 | 65.69 / 5.80 |
| 4-16-16 | GPTQ | 61.03 / 7.43 | 60.86 / 9.84 | 64.71 / 5.79 | 60.15 / 7.93 | 64.36 / 6.58 | 66.95 / 5.26 |
| 4-16-16 | OmniQuant | 65.66 / 7.19 | 63.19 / 5.74 | 66.38 / 5.02 | 63.42 / 5.86 | 66.22 / 5.21 | 69.07 / 4.25 |
| 4-16-16 | AWQ | 67.03 / 7.36 | 63.89 / 5.83 | 66.25 / 5.07 | 63.30 / 5.97 | 65.58 / 5.28 | 69.44 / 4.28 |
| 4-16-16 | QuaRot | 67.27 / 6.53 | 64.30 / 5.62 | 66.95 / 5.00 | 63.40 / 5.83 | 65.91 / 5.20 | 69.73 / 4.27 |
| 4-16-16 | SpinQuant | 66.54 / 6.49 | 63.59 / 5.58 | 67.14 / 5.00 | 63.94 / 5.76 | 66.32 / 5.16 | 69.62 / 4.21 |
| 4-16-16 | OSTQuant | 67.80 / 6.53 | 64.37 / 5.64 | 67.31 / 4.94 | 64.13 / 5.81 | 66.62 / 5.21 | 69.84 / 4.19 |
| 4-4-16 | RTN | 33.42 / 6e2 | 32.44 / nan | 30.86 / 8e3 | 32.51 / 7e3 | 31.63 / 3e4 | 31.57 / 2e3 |
| 4-4-16 | SmoothQuant | 33.04 / 1e3 | 32.13 / nan | 34.26 / 1e3 | 34.42 / 3e2 | 33.29 / 6e2 | 34.64 / 1e3 |
| 4-4-16 | GPTQ | 32.98 / 5e2 | 32.72 / nan | 30.11 / 4e3 | 32.12 / 1e3 | 31.51 / 3e3 | 30.88 / 2e3 |
| 4-4-16 | QuaRot | 61.69 / 8.02 | 61.87 / 6.05 | 65.13 / 5.35 | 61.76 / 6.22 | 64.46 / 5.50 | 68.14 / 4.57 |
| 4-4-16 | SpinQuant | 64.11 / 7.28 | 57.37 / 6.78 | 63.23 / 5.24 | 61.82 / 6.08 | 64.59 / 5.36 | 68.08 / 4.53 |
| 4-4-16 | OSTQuant | 65.14 / 7.24 | 63.90 / 5.60 | 66.24 / 5.14 | 62.72 / 6.04 | 65.80 / 5.40 | 68.52 / 4.43 |
| 4-4-4 | RTN | 33.18 / 7e2 | 32.67 / nan | 30.93 / 7e3 | 32.87 / 1e4 | 31.33 / 3e4 | 31.64 / 2e3 |
| 4-4-4 | SmoothQuant | 32.96 / 1e3 | 32.12 / nan | 33.36 / 1e3 | 33.32 / 3e2 | 33.28 / 5e2 | 34.65 / 1e3 |
| 4-4-4 | GPTQ | 33.71 / 6e2 | 33.52 / nan | 27.85 / 5e3 | 31.80 / 2e3 | 30.63 / 3e3 | 31.07 / 2e3 |
| 4-4-4 | OmniQuant | 32.33 / 4e2 | 48.40 / 14.26 | 50.35 / 12.30 | 48.46 / 11.26 | 45.63 / 10.87 | 45.04 / 12.35 |
| 4-4-4 | QuaRot | 61.38 / 8.18 | 61.48 / 6.11 | 65.16 / 5.39 | 61.22 / 6.26 | 64.59 / 5.53 | 68.08 / 4.60 |
| 4-4-4 | SpinQuant | 64.10 / 7.35 | 62.01 / 5.96 | 64.13 / 5.74 | 61.32 / 6.12 | 64.95 / 5.39 | 68.14 / 4.55 |
| 4-4-4 | OSTQuant | 65.37 / 7.29 | 63.18 / 5.91 | 65.41 / 5.25 | 62.55 / 6.07 | 65.43 / 5.40 | 68.20 / 4.42 |

Contributing

We welcome contributions from the research and development community! Whether you're interested in improving existing features, adding new functionality, or reporting issues, your input is invaluable.

Future Work

We plan to extend OSTQuant to FullyQuant, which aims to quantize all activations within a Transformer block. Fig3 below shows our design for improving the QSUR of activations across all layers.

Fig3: OSTQuant for FullyQuant.

Citation

If you find OSTQuant useful in your research, please consider citing our paper:

```bibtex
@inproceedings{hu2025ostquant,
  title={{OSTQ}uant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting},
  author={Xing Hu and Yuan Cheng and Dawei Yang and Zhixuan Chen and Zukang Xu and Jiangyong Yu and Chen Xu and Zhihang Yuan and Zhe Jiang and Sifan Zhou},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=rAcgDBdKnP}
}
```

Star History

Star History Chart