Step-by-Step

This document provides step-by-step instructions for AutoRound LLM quantization. Refer to the VLMs user guide for VLM quantization and the diffusion models user guide for diffusion model quantization.

1 Prerequisite

Install auto-round from PyPI, or install it from source:

pip install auto-round

2 Prepare Calibration Dataset

Default Dataset

NeelNanda/pile-10k on Hugging Face is adopted as the default calibration dataset and will be downloaded automatically from the Hub. Other available datasets include:

  • swift/pile-val-backup from ModelScope, for working around HF network issues
  • BAAI/CCI3-HQ for Chinese
  • codeparrot/github-code-clean for code
  • HuggingFaceH4/ultrachat_200k for chat data
  • madao33/new-title-chinese for Chinese
  • mbpp for code
  • openbmb/Ultra-FineWeb

Customized Dataset

  • Option 1: Pass a local JSON file path to the dataset argument.

  • Option 2: Register your dataset following the code and pass the new dataset and split args when initializing the AutoRound object, e.g. autoround = AutoRound(dataset="NeelNanda/pile-10k:train", ...)

  • Option 3: Pass a list of strings or a list of input_ids to dataset.

    def customized_data():
        # Important notice: AutoRound drops samples shorter than args.seqlen and truncates samples to args.seqlen
        data = ["AutoRound is an advanced quantization algorithm for low-bits LLM inference" * 240]
        return data
    
    
    def customized_data_with_tokenizer(tokenizer, seqlen=2048):
        # Important notice: AutoRound drops samples shorter than args.seqlen
        data = ["AutoRound is an advanced quantization algorithm for low-bits LLM inference" * 240]
        tokens = []
        for d in data:
            token = tokenizer(d, truncation=True, max_length=seqlen, return_tensors="pt").data
            tokens.append(token)
        return tokens
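
    The returned list can then be passed straight to the dataset argument; a minimal sketch reusing customized_data() from above:

    from auto_round import AutoRound

    ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16", dataset=customized_data())
    ar.quantize_and_save("./tmp_autoround")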
    

Dataset operations

Dataset combination: We support combining different datasets and parameterizing calibration datasets via --dataset ./tmp.json,NeelNanda/pile-10k:num=256,mbpp:num=128. Both local calibration files and Hugging Face datasets are supported. You can specify the splits of a dataset by setting split=split1+split2.

Samples concatenation: An optional setting allows users to concatenate calibration samples using --dataset NeelNanda/pile-10k:concat=True. All samples will be concatenated first, then split into chunks of seqlen length.

Apply chat template: Using --dataset NeelNanda/pile-10k:apply_chat_template enables applying a chat template to the calibration data before tokenization. This is commonly used for instruct-style models during generation. To customize the system prompt, use: --dataset 'NeelNanda/pile-10k:apply_chat_template:system_prompt="You are a helpful assistant."'

Note: If the concatenation option is not enabled, samples shorter than args.seqlen will be dropped.

Please use ',' to separate datasets, ':' to separate the parameters of a dataset, and '+' to join multiple values for one parameter.
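
A minimal sketch of these rules applied through the API (the local file path is illustrative, and we assume the API's dataset argument accepts the same combination syntax as the CLI flag):

from auto_round import AutoRound

# ',' separates datasets; ':' separates per-dataset parameters
dataset = "./tmp.json,NeelNanda/pile-10k:num=256,mbpp:num=128"
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16", dataset=dataset)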

3 Quantization

Supported Quantization Schemes

AutoRound supports several schemes:

  • W4A16 (bits: 4, group_size: 128, sym: True, act_bits: 16)
  • W8A16 (bits: 8, group_size: 128, sym: True, act_bits: 16)
  • W6A16 (bits: 6, group_size: 128, sym: True, act_bits: 16; mlx format only)
  • W5A16 (bits: 5, group_size: 128, sym: True, act_bits: 16; mlx format only)
  • W3A16 (bits: 3, group_size: 128, sym: True, act_bits: 16)
  • W2A16 (bits: 2, group_size: 128, sym: True, act_bits: 16)
  • GGUF:Q4_K_M (all Q*_K, Q*_0 and Q*_1 types provided by llama.cpp are supported)
  • Mixed-bit weight-only
  • NVFP4 (experimental feature; we recommend exporting to the llm_compressor format. data_type nvfp4, act_data_type nvfp4, static_global_scale, group_size 16)
  • MXFP4 (research feature, no real kernel; standard MXFP4. data_type mxfp, act_data_type mxfp, bits 4, act_bits 4, group_size 32)
  • MXINT4 (research feature, no real kernel; standard MXINT4. data_type mxint, act_data_type mxint, bits 4, act_bits 4, group_size 32)
  • MXFP4_RCEIL (research feature, no real kernel; NVIDIA's variant. data_type mxfp, act_data_type mxfp_rceil, bits 4, act_bits 4, group_size 32)
  • MXFP8 (research feature, no real kernel. data_type mxfp, act_data_type mxfp_rceil, group_size 32)
  • FPW8A16 (research feature, no real kernel. data_type fp8, group_size 0 -> per tensor)
  • FP8_STATIC (research feature, no real kernel. data_type fp8, act_data_type fp8, group_size -1 -> per channel, act_group_size 0 -> per tensor)

Besides, you can modify group_size, bits, sym and many other configs as needed, though there may be no real kernels for such combinations.
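
As a sketch, individual fields can be overridden on top of a base scheme through the API; the keyword overrides below mirror the CLI flags --bits/--group_size/--asym and are an assumption if your version differs:

from auto_round import AutoRound

# W4A16 base scheme, overridden to group size 32 and asymmetric quantization
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16", group_size=32, sym=False)
ar.quantize_and_save("./tmp_autoround")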

Supported Export Formats

You can use the command auto_round list format to show all supported formats along with their supported schemes.

AutoRound Format: This format is well-suited for CPU, Intel GPU, CUDA and HPU devices, 2-bit quantization, and mixed-precision inference. [2,3,4,8] bits are supported. Please set --format auto_round.

GGUF Format: Experimental feature. This format is well-suited for CPU devices and is widely adopted by the community. q*_k, q*_0 and q*_1 are supported. Please set --format gguf:q4_k_m, --format gguf:q2_k_s, etc.

AutoGPTQ Format: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community; [2,3,4,8] bits are supported. However, the asymmetric kernel has issues that can cause considerable accuracy drops, particularly with 2-bit quantization and small models. In addition, 3-bit models may currently have accuracy issues in Transformers. Please set --format auto_gptq.

AutoAWQ Format: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community; only 4-bit quantization is supported. Please set --format auto_awq.

LLM-Compressor Format: NVFP4, MXFP4 (kernel in progress) and MXFP8 are supported. Please set --format llm_compressor.

MLX Format: This format targets Apple Silicon (M1/M2/M3/...) and is loaded directly by mlx-lm (text-only LLM) or mlx-vlm (vision/audio + language).

  • Supports 2, 3, 4, 5, 6 and 8 bits (5/6 bits are MLX-exclusive; GPTQ/AWQ have no standard packing for them).
  • Native mixed-bit / mixed-group_size via layer_config or AutoScheme (--target_bits 3.5 --options "...").
  • Use --format mlx for a native MLX checkpoint; use --format auto_round:mlx if you want Hugging Face Transformers + AutoRound to load it (post-init repacks each layer into MLX QuantLinear on Darwin). See the export sketch below.
  • Limitation: embedding layer quantization is not supported.
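
A minimal export sketch covering both targets (the model name is a placeholder):

from auto_round import AutoRound

ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16")
# "mlx" writes a native MLX checkpoint for mlx-lm;
# "auto_round:mlx" keeps the HF layout and repacks into MLX QuantLinear at load time on Darwin
ar.quantize_and_save("./tmp_mlx", format="mlx")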

Format and scheme support matrix

Gray indicates the absence of a kernel or the presence of only an inefficient/reference kernel. BF16 is mainly for AutoScheme.

| Format | Supported Schemes |
|---|---|
| auto_round | W4A16, W2A16, W3A16, W8A16, W2A16G64, W2A16G32, MXFP4, MXFP8, MXFP4_RCEIL, MXFP8_RCEIL, NVFP4, FPW8A16, FP8_STATIC, FP8_BLOCK, BF16, MXINT4 |
| auto_awq | W4A16, BF16 |
| auto_gptq | W4A16, W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
| llm_compressor | NVFP4, MXFP4, MXFP8, FPW8A16, FP8_STATIC, FP8_BLOCK |
| mlx / auto_round:mlx | W2A16, W3A16, W4A16, W5A16, W6A16, W8A16, BF16, mixed-bit / mixed-group_size (Apple Silicon only) |
| gguf | GGUF:Q4_K_M, GGUF:Q2_K_S, GGUF:Q3_K_S, GGUF:Q3_K_M, GGUF:Q3_K_L, GGUF:Q4_K_S, GGUF:Q5_K_S, GGUF:Q5_K_M, GGUF:Q6_K, GGUF:Q4_0, GGUF:Q4_1, GGUF:Q5_0, GGUF:Q5_1, GGUF:Q8_0 |
| fp8 | FP8_BLOCK |
| fake | all schemes (only for research) |

Hardware Compatibility

CPU, Intel GPU, HPU and CUDA are supported for both quantization and inference. The MLX format is exclusive to Apple Silicon (macOS / Darwin) at inference time; quantization (export) itself can be done on any platform.

Environment Configuration

Before starting quantization, you may want to configure AutoRound's environment variables for optimal performance. For detailed information about available environment variables (logging levels, ModelScope integration, workspace settings, etc.), please refer to the Environment Variables Guide.

Command Line Usage

  • AutoRound recipe:

    This setting offers a better trade-off between accuracy and tuning cost, and is recommended in all scenarios.

    auto-round --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_gptq,auto_awq,auto_round"
    
  • AutoRoundBest recipe:

    This setting provides the best accuracy in most scenarios but is 4–5× slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available.

    auto-round-best --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_gptq,auto_awq,auto_round"
    
  • AutoRoundLight Settings:

    This setting offers the best speed (2-3X faster than AutoRound), but it may cause a significant accuracy drop for small models and 2-bit quantization. It is recommended for 4-bit settings and models larger than 3B.

    auto-round-light --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_gptq,auto_awq,auto_round"
    

API usage

AutoRound API Usage

This setting offers a better trade-off between accuracy and tuning cost, and is recommended in all scenarios.

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(
    model_name_or_path,
    scheme="W4A16",
    # enable_torch_compile=True,
)

output_dir = "./tmp_autoround"
# format= 'auto_round'(default), 'auto_gptq', 'auto_awq'
ar.quantize_and_save(output_dir, format="auto_gptq,auto_awq,auto_round")

Mixed Bits Usage

AutoRound (>0.8) offers AutoScheme to generate mixed-bit recipes automatically; please refer to the AutoScheme section for more details.

Auto-GPTQ and Auto-AWQ only support a limited set of mixed-bit configurations. If you're unsure about the details, we recommend using the AutoRound format.

vLLM and SGLang fuse MoE and QKV layers, so it's recommended not to assign different bit widths to these layers.

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"

layer_config = {  # Supports both full layer names and fuzzy (partial) matching
    "model.layers.6.self_attn.o_proj": {"bits": 8, "group_size": 32},
    "model.layers.*k_proj": {"bits": 2, "group_size": 32},
}
ar = AutoRound(
    model_name_or_path,
    layer_config=layer_config,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")

AutoRoundBest recipe

This setting provides the best accuracy in most scenarios but is 4–5× slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available.

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(model=model_name_or_path, scheme="W4A16", nsamples=512, iters=1000, low_gpu_mem_usage=True)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")

AutoRoundLight recipe

This setting offers the best speed (2 - 3X faster than AutoRound), but it may cause a significant accuracy drop for small models and 2-bit quantization. It is recommended for 4-bit settings and models larger than 3B.

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"

ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=50,
    lr=5e-3,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")

Recipe recommendation

In conclusion, we recommend using auto-round for W4A16 and auto-round-best for W2A16. However, you may adjust the configuration to suit your specific requirements and available resources.

Recipe Configuration Details
| Recipe | batch_size | iters | seqlen | nsamples | lr |
|---|---|---|---|---|---|
| default | 8 | 200 | 2048 | 128 | None |
| best | 8 | 1000 | 2048 | 512 | None |
| light | 8 | 50 | 2048 | 128 | 5e-3 |

W4G128 Average Accuracy of 13 tasks and Time Cost Results (testing was conducted on the Nvidia A100 80G using PyTorch 2.6.0 with enable_torch_compile):

| Model | Qwen2.5-0.5B-Instruct | Falcon3-3B | Qwen2.5-7B-Instruct | Meta-Llama-3.1-8B-Instruct | Falcon3-10B | Qwen2.5-72B-Instruct |
|---|---|---|---|---|---|---|
| 16bits | 0.4192 | 0.5203 | 0.6470 | 0.6212 | 0.6151 | 0.7229 |
| Best | 0.4137 (7m) | 0.5142 (23m) | 0.6426 (58m) | 0.6116 (65m) | 0.6092 (81m) | 0.7242 (575m) |
| Default | 0.4129 (2m) | 0.5133 (6m) | 0.6441 (13m) | 0.6106 (13m) | 0.6080 (18m) | 0.7252 (118m) |
| Light | 0.4052 (2m) | 0.5108 (3m) | 0.6453 (5m) | 0.6104 (6m) | 0.6063 (6m) | 0.7243 (37m) |
W2G64 Average Accuracy of 13 tasks and Time Cost Results (same setup: Nvidia A100 80G, PyTorch 2.6.0 with enable_torch_compile). We recommend using higher precision for the head, tail, and non-expert modules to alleviate the significant accuracy drop; see the layer_config sketch after the table.

| Model | Qwen2.5-0.5B-Instruct | Falcon3-3B | Qwen2.5-7B-Instruct | Falcon3-10B | Qwen2.5-72B-Instruct |
|---|---|---|---|---|---|
| 16bits | 0.4192 | 0.5203 | 0.6470 | 0.6151 | 0.7229 |
| Best | 0.2989 (6m) | 0.4267 (24m) | 0.5343 (56m) | 0.5207 (79m) | 0.6715 (564m) |
| Default | 0.2878 (2m) | 0.4219 (6m) | 0.5209 (13m) | 0.5133 (18m) | 0.6713 (122m) |
| Light | 0.2760 (2m) | 0.4063 (3m) | 0.4764 (5m) | 0.4810 (7m) | 0.6581 (38m) |
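
A layer_config sketch of that recommendation, assuming Llama-style module names (the patterns and bit choices are illustrative, not a tuned recipe):

from auto_round import AutoRound

# Keep the first and last decoder blocks plus lm_head at higher precision in a W2A16G64 run
layer_config = {
    "model.layers.0.": {"bits": 4},   # head block (fuzzy/partial match)
    "model.layers.31.": {"bits": 4},  # tail block of a 32-block model
    "lm_head": {"bits": 8},
}
ar = AutoRound("meta-llama/Llama-3.1-8B-Instruct", scheme="W2A16G64", layer_config=layer_config)
ar.quantize_and_save("./tmp_autoround")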

AutoScheme

AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes. For accuracy results, please refer to this doc.

Please note that mixed data types are supported during tuning, but cannot be exported to real models at this time.

CLI Usage

Use iters=200 to enable tuning; the example below uses iters=0.

auto_round \
  --model_name  $model_name \
  --avg_bits 6 \
  --options "mxfp4,mxfp8" \
  --ignore_scale_zp_bits \
  --iters 0 \
  --format fake 

API Usage

from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-0.6B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("W2A16G64", "W4A16", "W8A16"))
ar = AutoRound(model=model_name, scheme=scheme, iters=0, nsamples=1)
ar.quantize_and_save()

Hyperparameters in AutoScheme

avg_bits (float): Target average bits for the whole model; only layers to be quantized are counted in the average-bits calculation.

options (Union[str, list[Union[QuantizationScheme, str]]]): The quantization schemes to choose from. It can be a string like "W4A16", or a list of strings or QuantizationScheme objects.

ignore_scale_zp_bits (bool): Whether to ignore the bits of scale and zero point in the average-bits calculation. Default is False.

device_map (Optional[str, dict, torch.device]): Only supported in the API for now. AutoScheme uses more VRAM than AutoRound tuning, so you can set a different device_map for it.

shared_layers (Optional[Iterable[Iterable[str]]]): Only supported in the API for now.

batch_size (Optional[int]): Can be set to 1 to reduce VRAM at the cost of more time.

low_gpu_mem_usage (bool = True): Whether to reduce GPU memory usage at the cost of more time.

In some serving frameworks, certain layers (e.g., QKV or MoE) are fused to accelerate inference. These fused layers may require the same data type and bit configuration. The shared_layers option simplifies this setup by supporting both regex and full-name matching. Note that regex matching is applied in a block-wise manner.

from auto_round import AutoRound, AutoScheme

shared_layers = [
    ["*.self_attn.k_proj", "v_proj", "q_proj", "out_proj"],
    ("model.decoder.layers.6.fc1", "model.decoder.layers.6.fc2"),
    ("fc1", "fc2"),
]
target_bits = 5.0
model_name = "Qwen/Qwen3-0.6B"
scheme = AutoScheme(avg_bits=target_bits, options=("W4A16", "MXFP8"), shared_layers=shared_layers)
ar = AutoRound(model=model_name, scheme=scheme, iters=0, nsamples=1)
model, layer_config = ar.quantize()

Besides, if you want to fix the scheme for some layers, you could set it via layer_config in AutoRound API.

from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()

AutoScheme Cost

We tested it on Nvidia A100 80G using torch v2.8.

We will try to optimize the RAM usage in the future; currently it is about 1.1-1.5x of the model's BF16 size.

| Models | Scheme | VRAM Cost | Time Cost |
|---|---|---|---|
| Qwen3-8B | W2A16 / W4A16 / W8A16 | 14G | 60s * len of options |
| Qwen3-8B | MXFP4 / MXFP8 | 18G | 60s * len of options |
| Qwen3-8B | GGUF* | 14G | 80s * len of options |
| Qwen3-32B | W2A16 / W4A16 / W8A16 | 29G | 180s * len of options |
| Qwen3-32B | MXFP4 / MXFP8 | 29G | 180s * len of options |
| Qwen3-32B | GGUF* | 18G | 300s * len of options |
| Llama-3.3-70B | W2A16 / W4A16 / W8A16 | 32G | 420s * len of options |

Cost w/o low_gpu_mem_usage:

| Models | Scheme | VRAM Cost (torch compile) | Time Cost (torch compile) | VRAM Cost (w/o torch compile) | Time Cost (w/o torch compile) |
|---|---|---|---|---|---|
| Qwen3-8B | W2A16/W4A16/W8A16 | 34G | 30s * len of options | 61G | 40s * len of options |
| Qwen3-8B | MXFP4/MXFP8 | 36G | 60s * len of options | 54G | 120s * len of options |
| Qwen3-8B | GGUF* | 54G | 30s * len of options | 50G | 23s * len of options |
| Qwen3-32B | W2A16/W4A16/W8A16 | OOM with 240G | --- | OOM with 240G | --- |
| Qwen3-32B | MXFP4/MXFP8 | 160G | 200s * len of options | 200G | 240s * len of options |
| Qwen3-32B | GGUF* | 210G | 80s * len of options | 200G | 60s * len of options |

Limitations

The embedding layer is not supported in AutoScheme; it will use the best scheme among the options.

OPT RTN Mode

AutoRound also supports an optimized RTN (Round-To-Nearest) mode for fast, calibration-free baseline quantization. Set iters=0 to enable it; we recommend group_size=32 for better results. Check the accuracy comparison between RTN and OPT RTN mode.

For the GGUF format, we have optimized the RTN algorithm, inspired by llama.cpp. To use the original (pure) RTN algorithm instead, enable the --disable_opt_rtn option.

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=0,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")
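
To use the original (pure) RTN algorithm from the API, the CLI's --disable_opt_rtn flag has a keyword counterpart (assumed here to carry the same name):

from auto_round import AutoRound

ar = AutoRound(model="Qwen/Qwen3-0.6B", scheme="W4A16", iters=0, disable_opt_rtn=True)
ar.quantize_and_save("./tmp_autoround", format="auto_round")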

Model-Free Mode

Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.

Auto-enabled by default. As of v0.13, when you pass --iters 0 --disable_opt_rtn together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. This is bit-exactly equivalent to the regular --iters 0 --disable_opt_rtn flow but uses far less memory. Use --disable_model_free to opt out and force the original flow.

Key features:

  • No model object required – only config.json and safetensors files are needed
  • Low disk usage (if there are no local model files) – downloads and quantizes one shard at a time, deleting the source shard after processing
  • Per-layer configuration – supports --layer_config for per-layer bit-width overrides and --ignore_layers to keep specific layers in full precision
  • Predefined ignore layers – automatically skips model-specific layers (e.g., MoE gates, MTP layers) based on config detection
  • Bit-exact parity with the standard --iters 0 --disable_opt_rtn flow for all supported schemes

Supported schemes

Model-free mode currently supports the following integer weight-only preset schemes (packed in the auto_round:auto_gptq format):

| Preset | Bits | Group size | Sym |
|---|---|---|---|
| W2A16 | 2 | 128 | true |
| W2A16G32 | 2 | 32 | true |
| W2A16G64 | 2 | 64 | true |
| W4A16 (default) | 4 | 128 | true |
| W4A16_MIXED | 4 | 128 | true |
| W8A16 | 8 | 128 | true |
All of the above presets also support asymmetric quantization (sym=False) for the 2-bit and 8-bit variants (W2A16, W2A16G32, W2A16G64, W8A16), producing auto_round:auto_gptq-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization, the regular flow uses auto_round:auto_awq packing; use the standard AutoRound flow for that case.

You can also pass a custom QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16) with bits ∈ {2, 4, 8} and any group_size / sym configuration.

Schemes that require special packing kernels (W3A16, FPW8A16, BF16, MXFP4, MXFP8, MXINT4, NVFP4, FP8_BLOCK, FP8_STATIC, INT8_W8A8, GGUF:*, ...) are not supported in model-free mode and will raise ValueError. Use the regular AutoRound flow for those.
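
A sketch of the custom-scheme path described above (the QuantizationScheme import path is an assumption; adjust it to your installed version):

from auto_round import AutoRound
from auto_round.schemes import QuantizationScheme  # import path assumed

# A 2-bit asymmetric weight-only scheme, following the signature shown above
scheme = QuantizationScheme(bits=2, group_size=64, sym=False, data_type="int", act_bits=16)
AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme=scheme,
    model_free=True,
).quantize_and_save("./int2-llama")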

CLI Usage

# Easiest: --iters 0 --disable_opt_rtn auto-routes to model-free
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn \
  --output_dir ./int4-llama

# Equivalent explicit invocation
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --output_dir ./int4-llama

# Opt out of auto-routing and use the regular flow instead
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn --disable_model_free \
  --output_dir ./int4-llama

# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --group_size 32 \
  --asym \
  --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
  --ignore_layers "mlp" \
  --output_dir ./int4-llama

API Usage

from auto_round import AutoRound

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme="W4A16",  # Or a QuantizationScheme instance for custom group_size / sym.
    layer_config={
        ".*k_proj": {"bits": 8, "group_size": 32},
        ".*v_proj": {"bits": 8, "group_size": 32},
    },
    ignore_layers="mlp",
    model_free=True,
).quantize_and_save("./int4-llama")

Note: Model-free mode only supports the auto_round output format and uses RTN (no calibration data, no iterative tuning). For higher-quality quantization or schemes outside the supported list, use the standard AutoRound flow.

GGUF format

Experimental feature. This format is well-suited for CPU devices and is widely adopted by the community.

The optimized RTN mode (--iters 0) is suggested for all bit widths other than 3-bit.

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(
    model=model_name_or_path,
)
output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="gguf:q4_k_m")  # also gguf:q*_k_s, gguf:q*_0, gguf:q*_1, ...

Quantization Costs

Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note that data loading and packing costs have been excluded from the evaluation. We recommend enabling torch.compile for PyTorch versions 2.6 and above.

To optimize GPU memory usage, in addition to activating low_gpu_mem_usage, you can set gradient_accumulate_steps=8 and a batch_size=1, though this may increase tuning time.

The 3B and 14B models were evaluated on Qwen 2.5, the 8X7B model is Mixtral, while the remaining models utilized LLaMA 3.1.

| Torch version / Config (W4G128) | 3B | 8B | 14B | 70B | 8X7B |
|---|---|---|---|---|---|
| 2.6 with torch compile | 7min, 10GB | 12min, 18GB | 23min, 22GB | 120min, 42GB | 28min, 46GB |
| 2.6 with torch compile, low_gpu_mem_usage=True | 12min, 6GB | 19min, 10GB | 33min, 11GB | 140min, 25GB | 38min, 36GB |
| 2.6 with torch compile, low_gpu_mem_usage=True, gradient_accumulate_steps=8, bs=1 | 15min, 3GB | 25min, 6GB | 45min, 7GB | 187min, 19GB | 75min, 36GB |
| 2.5 w/o torch compile | 8min, 10GB | 16min, 20GB | 30min, 25GB | 140min, 49GB | 50min, 49GB |

W4G128 Quantization Time and Memory Usage (Intel GPU B60 24G): Testing used the release version of PyTorch 2.11.0+xpu. Please note that data loading and packing costs have been excluded from the evaluation. Time and memory usage were measured on Qwen2.5-series models.

| Torch version / Config (W4G128) | 0.5B | 1.5B | 3B | 7B |
|---|---|---|---|---|
| 2.11.0+xpu with torch compile | 6min, 2.9GB | 13min, 5.4GB | 22min, 7.1GB | 40min, 14.9GB |
| 2.11.0+xpu with torch compile, low_gpu_mem_usage=True | 10min, 1.7GB | 17min, 3.3GB | 30min, 4.3GB | 50min, 8.5GB |
| 2.11.0+xpu with torch compile, low_gpu_mem_usage=True, gradient_accumulate_steps=8, bs=1 | 14min, 0.4GB | 22min, 1.1GB | 38min, 1.5GB | 1h 4min, 4.1GB |
| 2.11.0+xpu w/o torch compile | 6min, 2.9GB | 14min, 5.7GB | 26min, 7.6GB | 51min, 15.5GB |

Device/Multi-GPU setting in Quantization

The tuning device is specified via the device_map argument in the AutoRound API, not through the device_map parameter used by Transformers' from_pretrained.

AutoRound tunes the model in a block-by-block manner. Although the block size is much smaller than the model size, it still requires a significant amount of GPU memory for tuning—typically 10 times the block size. This can lead to out-of-memory (OOM) errors when working with extremely large models.

For strategies to reduce GPU memory usage, please refer to the Reduced GPU Memory Usage tips in the Adjust Hyperparameters section below.

If adjusting hyperparameters does not resolve the issue, a simple solution is to add more devices via device_map, for example:

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(
    model=model_name_or_path,
    device_map="0,1,2,3"
)

or

CUDA_VISIBLE_DEVICES=0,1,2,3 auto-round --model "Qwen/Qwen3-0.6B" --scheme "W4A16" --device_map "auto"

There are typically two scenarios that require multi-GPU tuning: one is the calibration phase mainly for lm-head quantization, and the other is quantizing extremely large models (e.g., models larger than 100 GB).

Enable multi-GPU calibration for lm-head quantization

For lm-head tuning, AutoRound needs to cache the inputs to the lm-head, which requires the entire model to reside on the GPU for efficient calibration. If there is not enough VRAM, some layers will fall back to RTN mode.

Manually set the device_map

Customized device map: If device_map="auto" does not correctly map the model, we also support mapping different layers within a block to different devices by setting the `device_map` argument in the AutoRound API. For reference, we provide an example of quantizing the DeepSeekV3-BF16 (1.4T) model using five 80GB GPUs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opensourcerelease/DeepSeek-R1-bf16"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")

block = model.model.layers
device_map = {}

for n, m in block.named_modules():
    if type(m) == torch.nn.Linear:
        if "experts" in n and ("shared_experts" not in n) and int(n.split(".")[-2]) < 63:
            device = "cuda:1"
        elif (
            "experts" in n
            and ("shared_experts" not in n)
            and int(n.split(".")[-2]) >= 63
            and int(n.split(".")[-2]) < 128
        ):
            device = "cuda:2"
        elif (
            "experts" in n
            and ("shared_experts" not in n)
            and int(n.split(".")[-2]) >= 128
            and int(n.split(".")[-2]) < 192
        ):
            device = "cuda:3"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split(".")[-2]) >= 192:
            device = "cuda:4"
        else:
            device = "cuda:0"
        n = n[2:]

        device_map.update({n: device})

from auto_round import AutoRound

autoround = AutoRound(
    model=model,
    tokenizer=tokenizer,
    device_map=device_map,
    nsamples=512,
    batch_size=4,
    low_gpu_mem_usage=True,
    seqlen=2048,
)
autoround.quantize()
autoround.save_quantized(format="auto_awq", output_dir="tmp_autoround")

Adjust Hyperparameters

  • Reduced GPU Memory Usage (see the combined sketch after this list):

    • set enable_torch_compile to True

    • enable low_gpu_mem_usage (more tuning cost)

    • set --bs 1 --gradient_accumulate_steps 8 (more tuning cost)

    • reduce the bs to 4 (potential accuracy drop)

    • reduce the seqlen to 512 (potential accuracy drop)

    • or combine them

  • Reduced CPU Memory Usage :

    • Enable low_cpu_mem_usage (experimental): Only one export format is supported. The quantized model is saved immediately after each block is packed, reducing peak CPU memory usage.

    • Trigger immediate packing: Packing will be triggered immediately when using the command-line interface or the quantize_and_save API, as long as only one export format is specified.

  • Speedup the tuning:

    • set enable_torch_compile to True

    • use auto-round-light configuration

    • reduce the seqlen to 512 (potential large accuracy drop in some scenarios)

    • reduce the training bs to 4 (slight accuracy drop)

    • or combine them

  • Enable quantized lm-head:

    Currently, this configuration is only supported for inference with the AutoRound format:

    auto-round --model_name Qwen/Qwen3-0.6B  --scheme "W4A16" --quant_lm_head --format "auto_round"
    
  • Utilize the AdamW Optimizer:

    Include the flag --adam. Note that AdamW is less effective than sign gradient descent in many scenarios we tested.
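
A combined sketch of the memory-saving knobs above (values mirror the bullets; expect longer tuning time):

from auto_round import AutoRound

ar = AutoRound(
    model="Qwen/Qwen3-0.6B",
    scheme="W4A16",
    enable_torch_compile=True,
    low_gpu_mem_usage=True,      # more tuning cost
    batch_size=1,
    gradient_accumulate_steps=8,
    seqlen=512,                  # optional: potential accuracy drop
)
ar.quantize_and_save("./tmp_autoround")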

Hadamard Transform

AutoRound supports Hadamard transform as an optional weight/activation transformation technique, which can improve quantization accuracy by rotating the weight/activation matrix. This is particularly useful for certain quantization scenarios.

Overview

The Hadamard transform is particularly useful in scenarios where activation outliers hurt quantization accuracy. In practice, it helps suppress such outliers, making it especially effective when act_bits < 8. Users can enable this feature when they need more stable activation distributions and better accuracy in low‑bit quantization settings.

Implementation

AutoRound provides two types of Hadamard transforms:

  1. Deterministic Hadamard Transform (hadamard): Uses Sylvester's construction to create a deterministic Hadamard matrix. The size must be a power of 2.

  2. Random Hadamard Transform (random_hadamard): Uses known Hadamard matrices from N. J. A. Sloane's Library of Hadamard Matrices. Supports non-power-of-2 sizes and deterministic seeding.

Quantization with Hadamard Transform

from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "meta-llama/Llama-3.1-8B-Instruct"
output_dir = "./Llama-3.1-8B-Instruct-mxfp4-ht"

# rotation_config="default": block_size=32, hadamard_type="hadamard"
ar = AutoRound(model_name_or_path, scheme="MXFP4", rotation_config="default")

ar.quantize_and_save(output_dir=output_dir, format="auto_round")

Transform Classes

| Class | Description |
|---|---|
| HadamardTransform | Applies a deterministic Hadamard transform |
| RandomHadamardTransform | Applies a random Hadamard transform with optional seeding |

Parameters

| Parameter | Description |
|---|---|
| block_size | Size of the transformation block (default: 32) |
| seed | Random seed (for RandomHadamardTransform) |

4 Inference

AutoRound automatically selects the best available backend based on the installed libraries and prompts the user to install additional libraries when a better backend is found.

Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.

CPU

Supports 2, 4, and 8 bits. We recommend using auto-round-lib (ark) for inference. When using the ark backend, ensure that your PyTorch version is >= 2.8.0.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

Intel GPU

Supports 4 and 8 bits. We recommend using auto-round-lib (ark) for inference. When using the ark backend, PyTorch >= 2.8.0 is required.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="xpu", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

CUDA

Supports 2, 3, 4, and 8 bits. We recommend using GPTQModel for 4 and 8 bits inference.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

HPU

docker image with Gaudi Software Stack is recommended. More details can be found in Gaudi Guide.

import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.hpu as hthpu
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Intel/Qwen2-7B-int4-inc"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("hpu").to(torch.bfloat16)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

Specify Inference Backend

AutoRound automatically selects the backend for each layer based on compatibility. In general, the priority order is Marlin > ExLLaMAV2 > Triton, but the final choice depends on factors such as group size, bit width, packing format, hardware device, and other implementation details.

The automatically selected backend may not always be the most suitable for certain devices. You can specify your preferred backend, such as "ark" for CPU and Intel GPU, or "marlin"/"exllamav2"/"triton" for CUDA, according to your needs or hardware compatibility. Please note that additional corresponding libraries may be required.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
quantization_config = AutoRoundConfig(backend="ark")
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cpu", quantization_config=quantization_config, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

| Name | Devices | Bits | Dtypes | Priority | Packing format | Requirements |
|---|---|---|---|---|---|---|
| ark | cpu | 2,4,8 | FP32/FP16/BF16 | 6 | gptq/gptq_zp+-1 | auto-round-lib, torch>=2.8.0 |
| ark | cpu | 4 | FP32/FP16/BF16 | 6 | awq | auto-round-lib, torch>=2.8.0 |
| ark | xpu | 4,8 | FP32/FP16/BF16 | 6 | gptq/gptq_zp+-1 | auto-round-lib, torch>=2.8.0 |
| ark | xpu | 4 | FP32/FP16/BF16 | 6 | awq | auto-round-lib, torch>=2.8.0 |
| marlin | cuda | 4,8 | BF16/FP16 | 6 | gptq/gptq_zp+-1 | gptqmodel |
| exllamav2 or gptqmodel:exllamav2 | cuda | 4 | BF16/FP16 | 5 | gptq/gptq_zp+-1 | gptqmodel |
| exllamav2 or gptq:exllamav2 | cuda | 4 | FP16 | 3 | gptq_zp+-1 | auto-gptq, transformers<5.0.0 |
| gptq:cuda | cuda | 2,3,4,8 | FP16 | 1 | gptq_zp+-1 | auto-gptq, transformers<5.0.0 |
| triton | xpu/cuda | 2,4,8 | BF16/FP16 | 2 | gptq/gptq_zp+-1 | auto-round |
| awq | cuda | 4 | FP16 | 5 | awq | auto-awq, transformers<4.57.0 |
| gptqmodel:awq or gptqmodel:awq_exllamav2 | cuda | 4 | BF16/FP16 | 6 | awq | gptqmodel |
| gptqmodel:awq_marlin | cuda | 4,8 | FP16 | 5 | awq | gptqmodel |
| gptqmodel:awq_gemm | cuda | 4 | FP16 | 3 | awq | gptqmodel |
| gptqmodel:awq_torch | cuda/cpu | 4 | FP16 | 2 | awq | gptqmodel |
| hpu | hpu | 4 | BF16 | 0 | gptq/gptq_zp+-1 | auto-round |
| torch | xpu/cpu/cuda | 2,3,4,8 | BF16/FP16 | 0 | gptq/gptq_zp+-1 | auto-round |

Convert GPTQ/AWQ to AutoRound

Most GPTQ/AWQ models can be converted to the AutoRound format for better compatibility and support with Intel devices. Please note that the quantization config will be changed if the model is serialized.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "ybelkada/opt-125m-gptq-4bit"
quantization_config = AutoRoundConfig()
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cpu", quantization_config=quantization_config, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

5 Evaluation

AutoRound leverages lm-eval-harness for evaluation. If --tasks is not specified, a set of default tasks (typically 10+ common benchmarks) will be automatically used.

Single GPU Evaluation

HF Backend (default):

auto-round --model Qwen/Qwen3-0.6B --bits 4 --format "auto_round,auto_gptq" --tasks mmlu

vLLM Backend:

auto-round --model Qwen/Qwen3-0.6B --bits 4 --format "auto_round,auto_gptq" --tasks mmlu --eval_backend vllm

Multi-GPU Evaluation

HF Backend:

auto-round --model="your_model_path" --eval --device_map 0,1 --tasks lambada_openai --eval_bs 16

vLLM Backend (Option 1 - using --device_map):

auto-round "your_model_path" --eval --device_map 0,1 --tasks lambada_openai --eval_backend vllm

vLLM Backend (Option 2 - manual configuration):

CUDA_VISIBLE_DEVICES=0,1 auto-round "your_model_path" --eval --tasks lambada_openai --eval_backend vllm --vllm_args="tensor_parallel_size=2,gpu_memory_utilization=0.8"

Important Notes

  • Use the --eval flag to evaluate models directly. This supports both original and quantized models.
  • The --eval_task_by_task option helps handle task failures by evaluating tasks sequentially. This only applies to the HF backend.
  • When multiple formats are exported, the last format in the list will be used for evaluation.
  • For vLLM backend, you can use --device 0,1,2 to specify GPU devices. This will automatically set CUDA_VISIBLE_DEVICES and configure tensor_parallel_size based on the number of devices. Alternatively, you can manually set these via environment variables and --vllm_args.

6 Known Issues

Randomness in quantization may affect tuning results for some models; set enable_deterministic_algorithms=True to ensure reproducibility.

Some VLMs require manual support.

Mamba is not supported.