AutoRound Quantization

March 2, 2026

llm-compressor supports AutoRound, a quantization technique designed for high accuracy at low bit widths. The quantized models are fully compatible with compressed-tensors and can be served directly with vLLM.

AutoRound introduces three trainable parameters: V, which adjusts how each weight is rounded, and α and β, which adjust the clipping range. The method processes the decoder layers one at a time and tunes these parameters to minimize the block-wise output reconstruction error. This combines the efficiency of post-training quantization with the adaptability of parameter tuning, and is intended to compress large language models while preserving accuracy.
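To make the role of V, α, and β concrete, here is a minimal NumPy sketch of a quantize-dequantize step with a rounding offset and a scaled clipping range. It is an illustration only, not the llm-compressor or auto-round implementation: the function name `fake_quantize`, the 4-bit integer grid, and the per-tensor scale are assumptions made for the example (NVFP4 uses a floating-point grid), and the training loop that would tune the parameters is omitted.

```python
import numpy as np

def fake_quantize(w, v, alpha, beta, bits=4):
    """Quantize then dequantize weights (hypothetical helper, integer grid assumed)."""
    qmax = 2 ** bits - 1
    w_max = max(float(w.max()), 0.0) * alpha   # alpha scales the upper clipping bound
    w_min = min(float(w.min()), 0.0) * beta    # beta scales the lower clipping bound
    scale = max((w_max - w_min) / qmax, 1e-8)  # guard against an all-zero tensor
    zero_point = np.round(-w_min / scale)
    # v shifts each weight before rounding, so it can flip a round-up into a round-down
    q = np.clip(np.round(w / scale + v) + zero_point, 0, qmax)
    return scale * (q - zero_point)

w = np.array([-1.0, -0.3, 0.2, 0.8])

# v = 0 and alpha = beta = 1 reduces to plain round-to-nearest
rtn = fake_quantize(w, v=np.zeros_like(w), alpha=1.0, beta=1.0)

# a negative offset on the third weight makes it round down instead of up
nudged = fake_quantize(w, v=np.array([0.0, 0.0, -0.4, 0.0]), alpha=1.0, beta=1.0)
```

In this sketch the third weight dequantizes to about 0.24 with plain rounding and about 0.12 with the offset. A tuning procedure would pick the values of `v`, `alpha`, and `beta` that reduce the reconstruction error of the block's output rather than the error of individual weights.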

Installation

To get started, install llm-compressor from source:

git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .

Quickstart

The examples below are end-to-end scripts that apply the AutoRound quantization algorithm.

Llama 3.1 Example

python3 llama3.1_example.py

The resulting model Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound is ready to be loaded into vLLM.

Evaluate Accuracy

With the model created, we can now load and run it in vLLM (after installing vLLM).

from vllm import LLM
model = LLM("./Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound")

Note: quantized models can be sensitive to the presence of the BOS (beginning-of-sequence) token. lm_eval does not add a BOS token by default, so make sure to include the add_bos_token=True argument when running your evaluations.

Run the following to test accuracy on GSM-8K:

lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 'auto'

meta-llama/Meta-Llama-3.1-8B-Instruct (Baseline)

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.7710	±	0.0116
		strict-match	5	exact_match	↑	0.7043	±	0.0126

Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier)

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.7248	±	0.0123
		strict-match	5	exact_match	↑	0.6611	±	0.0130

Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0)

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.7362	±	0.0121
		strict-match	5	exact_match	↑	0.6702	±	0.0129

Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.7210	±	0.0124
		strict-match	5	exact_match	↑	0.6945	±	0.0127

Note: quantized model accuracy may vary slightly due to nondeterminism.
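One way to read the Llama 3.1 tables is as accuracy recovery, meaning the quantized score as a percentage of the unquantized score. The short sketch below computes it from the GSM-8K values reported above; the dictionary layout and the `recovery` helper are assumptions made for this example and are not part of lm_eval or llm-compressor.

```python
# GSM-8K exact_match scores copied from the tables above
baseline = {"flexible-extract": 0.7710, "strict-match": 0.7043}
quantized = {
    "NVFP4 (QuantizationModifier)": {"flexible-extract": 0.7248, "strict-match": 0.6611},
    "NVFP4-AutoRound (iters=0)": {"flexible-extract": 0.7362, "strict-match": 0.6702},
    "NVFP4-AutoRound (iters=200)": {"flexible-extract": 0.7210, "strict-match": 0.6945},
}

def recovery(score, reference):
    """Quantized score as a percentage of the reference score."""
    return 100.0 * score / reference

recoveries = {
    name: {f: round(recovery(s, baseline[f]), 1) for f, s in scores.items()}
    for name, scores in quantized.items()
}

for name, rec in recoveries.items():
    print(name, rec)
```

With these numbers, strict-match recovery is about 93.9% for QuantizationModifier, 95.2% for AutoRound at iters=0, and 98.6% at iters=200, while flexible-extract recovery is about 94.0%, 95.5%, and 93.5% respectively.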

Qwen3-VL Example

python3 qwen3_vl_example.py

The resulting model Qwen3-VL-8B-Instruct-NVFP4-AutoRound is ready to be loaded into vLLM.

Evaluate Accuracy

Run the following to test accuracy on GSM-8K and ChartQA:

lm_eval --model vllm-vlm \
  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 'auto'

lm_eval --model vllm-vlm \
  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
  --tasks chartqa \
  --batch_size 'auto' \
  --apply_chat_template

Qwen/Qwen3-VL-8B-Instruct (Baseline)

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8628	±	0.0095
		strict-match	5	exact_match	↑	0.8453	±	0.0100

Tasks	Version	Filter	Metric		Value		Stderr
chartqa	0	none	anywhere_accuracy	↑	0.7908	±	0.0081
		none	exact_match	↑	0.5592	±	0.0099
		none	relaxed_accuracy	↑	0.7696	±	0.0084

Qwen3-VL-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8415	±	0.0101
		strict-match	5	exact_match	↑	0.8408	±	0.0101

Tasks	Version	Filter	Metric		Value		Stderr
chartqa	0	none	anywhere_accuracy	↑	0.8220	±	0.0077
		none	exact_match	↑	0.5748	±	0.0099
		none	relaxed_accuracy	↑	0.8044	±	0.0079

Note: quantized model accuracy may vary slightly due to nondeterminism.
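Because run-to-run variation is expected, it can help to express each difference from the baseline in units of the reported standard error. The sketch below does this for the Qwen3-VL numbers above. It is a rough reading aid rather than a significance test: it treats the two runs as independent, which is an assumption (both are scored on the same test items), and the `gap_in_stderr` helper is a name invented for this example.

```python
import math

# (value, stderr) pairs copied from the Qwen3-VL tables above
baseline = {
    "gsm8k flexible-extract": (0.8628, 0.0095),
    "gsm8k strict-match": (0.8453, 0.0100),
    "chartqa anywhere_accuracy": (0.7908, 0.0081),
    "chartqa exact_match": (0.5592, 0.0099),
    "chartqa relaxed_accuracy": (0.7696, 0.0084),
}
autoround = {
    "gsm8k flexible-extract": (0.8415, 0.0101),
    "gsm8k strict-match": (0.8408, 0.0101),
    "chartqa anywhere_accuracy": (0.8220, 0.0077),
    "chartqa exact_match": (0.5748, 0.0099),
    "chartqa relaxed_accuracy": (0.8044, 0.0079),
}

def gap_in_stderr(quantized, reference):
    """Difference between two scores divided by their combined standard error."""
    (q, q_se), (r, r_se) = quantized, reference
    return (q - r) / math.sqrt(q_se ** 2 + r_se ** 2)

gaps = {name: round(gap_in_stderr(autoround[name], baseline[name]), 2) for name in baseline}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
```

On these numbers the GSM-8K strict-match gap is well under one combined standard error, and all three ChartQA scores for the quantized model are above the baseline.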

Questions or Feature Requests?

Please open an issue on vllm-project/llm-compressor or intel/auto-round.