AutoRound for MLLMs

May 20, 2026

This feature is experimental and may change, including bug fixes, API modifications, and adjustments to default parameters.

Quantization

API Usage (Gaudi2/CPU/GPU) Recommended

By default, AutoRound quantizes only the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, set quant_nontext_module to True; note that support for this feature is limited.

from auto_round import AutoRound  # same entry point as for LLMs; AutoRound detects MLLMs automatically

model_name = "Qwen/Qwen2-VL-2B-Instruct"

# Quantize the model
autoround = AutoRound(model_name, scheme="W4A16", dataset="NeelNanda/pile-10k", quant_nontext_module=False)
output_dir = "./tmp_autoround"
autoround.quantize_and_save(output_dir, format="auto_round")

dataset: the calibration dataset used for quantization tuning. Currently only NeelNanda/pile-10k, llava_conv_58k, llava_instruct_80k, and llava_instruct_150k are supported. Note that the Llava calibration datasets have only been validated on five models so far.
quant_nontext_module: whether to quantize non-text modules, e.g. the vision component.

For more hyperparameters, please refer to Homepage Detailed Hyperparameters.
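As a hedged illustration of the two parameters above, the sketch below checks a dataset name against the datasets named in this document and then requests quantization of the non-text modules as well. The helper names `build_quant_kwargs` and `quantize_full_model` are hypothetical; only `AutoRound`, `quantize_and_save`, and the argument names come from the example above. The full `liuhaotian/...` identifiers are taken from the Calibration Dataset section.

```python
# Dataset names mentioned in this document. This sketch rejects anything else,
# although the library itself may accept more (for example a local file path).
SUPPORTED_DATASETS = (
    "NeelNanda/pile-10k",
    "liuhaotian/llava_conv_58k",
    "liuhaotian/llava_instruct_80k",
    "liuhaotian/llava_instruct_150k",
)


def build_quant_kwargs(dataset="NeelNanda/pile-10k", quant_nontext_module=False):
    """Hypothetical helper: collect the two keyword arguments described above."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(f"unsupported calibration dataset: {dataset!r}")
    return {"dataset": dataset, "quant_nontext_module": bool(quant_nontext_module)}


def quantize_full_model(model_name, output_dir, **kwargs):
    """Hypothetical wrapper around the API call shown earlier."""
    # Imported lazily so build_quant_kwargs works without auto_round installed.
    from auto_round import AutoRound

    autoround = AutoRound(model_name, scheme="W4A16", **build_quant_kwargs(**kwargs))
    autoround.quantize_and_save(output_dir, format="auto_round")
```

Calling `quantize_full_model` with `quant_nontext_module=True` asks for the vision component to be quantized too. Whether the exported model can run inference depends on the model family, as the support matrix below indicates.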

Basic Usage

Run auto-round -h in a terminal for a user guide listing all supported arguments. Set the export format with --format; exporting to multiple formats is supported. Only five model families are supported for now.

auto-round \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --scheme w4a16 \
    --format "auto_round" \
    --output_dir ./tmp_autoround
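When the same command has to be run for several models, it can help to assemble it programmatically. The following is a minimal sketch that uses only the four flags shown above; `build_cli_command` is a hypothetical helper, not part of AutoRound.

```python
import shlex


def build_cli_command(model, scheme="w4a16", fmt="auto_round", output_dir="./tmp_autoround"):
    """Hypothetical helper: assemble the auto-round invocation shown above as an argument list."""
    return [
        "auto-round",
        "--model", model,
        "--scheme", scheme,
        "--format", fmt,
        "--output_dir", output_dir,
    ]


cmd = build_cli_command("Qwen/Qwen2-VL-2B-Instruct")
# Print a copy-pasteable shell line. To execute it instead, pass the list to
# subprocess.run(cmd, check=True) on a machine where auto-round is installed.
print(shlex.join(cmd))
```

Building an argument list rather than a single string avoids shell quoting problems when model names or output paths contain spaces.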

VLM & Omni Support Matrix

For most VLMs, we support the default quantization configuration, which quantizes only the language component and leaves the visual component unquantized. We also support quantizing the non-text modules of models that follow the Hugging Face standard, i.e., those with a typical processor, though inference may have issues due to model architecture or kernel limitations. Audio models (MiMo-Audio, Qwen3-TTS, etc.) use text-only calibration by default.

Model	calibration dataset	quant nontext module	Quantized Model Link
allenai/Molmo	pile	X	Molmo-7B-D-0924-int4-sym, Molmo-72B-0924-int4-sym-gptq, Molmo-72B-0924-int4-sym
deepseek-ai/deepseek-vl2	pile/llava	√	deepseek-vl2-int4-sym-gptq
fancyfeast/llama-joycaption-beta-one-hf-llava	pile	X	NeoChen1024-int4-gptq
google/gemma-3	pile/llava	√	gemma-3-12b-it-AutoRound-gguf-q4-0, gemma-3-27b-it-AutoRound-gguf-q4-0, gemma-3-12b-it-int4-AutoRound, gemma-3-27b-it-int4-AutoRound
HuggingFaceTB/SmolVLM	pile/llava	√	SmolVLM-Instruct-int4-sym
ibm-granite/granite-vision-3.2	pile/llava	-
liuhaotian/Llava-v1.5	pile/llava	X	llava-v1.5-7b-int4-sym
meta-llama/Llama-3.2-Vision	llava	√	Llama-3.2V-11B-cot-int4-sym, Llama-3.2-11B-Vision-Instruct-qvision-int4-sym, Llama-3.2-90B-Vision-Instruct-int4-sym, Llama-3.2-11B-Vision-Instruct-int4-sym
microsoft/Phi3.5-Vision	pile/llava	√	Phi-3.5-vision-instruct-int4-sym, Phi-3.5-vision-instruct-qvision-int4-sym
mistralai/Mistral-Small-3.1	pile/llava	X	Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-gptq-sym, Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym
moonshotai/Kimi-VL	pile/llava	√
Qwen/Qwen2-VL	pile/llava	-	Qwen2-VL-7B-Instruct-int4-sym, Qwen2-VL-72B-Instruct-int4-sym, Qwen2-VL-72B-Instruct-int2-sym
Qwen/Qwen2.5-Omni	pile	√	Qwen2.5-Omni-7B
Qwen/Qwen3-Omni	pile	√	Qwen3-Omni-30B-A3B-Instruct
Qwen/Qwen3-TTS	pile/audiocaps	X
XiaomiMiMo/MiMo-Audio	pile/audiocaps	X
rhymes-ai/Aria	pile/llava	√
THUDM/CogVLM2	pile/llava	√	cogvlm2-llama3-chat-19B-int4-sym, cogvlm2-llama3-chat-19B-qvision-int4-sym
THUDM/glm-4v	pile	X	glm-4v-9b-int4-sym

√ means supported, - means the model can be exported but inference does not work, X means not supported.
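One way to act on the legend above is to encode the "quant nontext module" column and consult it before enabling quant_nontext_module. The sketch below copies a handful of rows from the table; the names `NontextSupport`, `SUPPORT_MATRIX`, and `can_quantize_nontext` are hypothetical and not part of AutoRound.

```python
from enum import Enum


class NontextSupport(Enum):
    SUPPORTED = "√"      # non-text modules can be quantized
    EXPORT_ONLY = "-"    # export works, inference does not
    UNSUPPORTED = "X"    # non-text modules cannot be quantized


# A few rows copied from the table above; extend with the remaining rows as needed.
SUPPORT_MATRIX = {
    "allenai/Molmo": NontextSupport.UNSUPPORTED,
    "google/gemma-3": NontextSupport.SUPPORTED,
    "ibm-granite/granite-vision-3.2": NontextSupport.EXPORT_ONLY,
    "Qwen/Qwen2-VL": NontextSupport.EXPORT_ONLY,
    "Qwen/Qwen2.5-Omni": NontextSupport.SUPPORTED,
    "THUDM/glm-4v": NontextSupport.UNSUPPORTED,
}


def can_quantize_nontext(family):
    """Return True only when the matrix marks the model family with √."""
    return SUPPORT_MATRIX.get(family) is NontextSupport.SUPPORTED
```

Families missing from the dictionary are treated as unsupported, which is the conservative choice for an experimental feature.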

Calibration Dataset

For MLLMs, we use a text-only calibration dataset (NeelNanda/pile-10k) by default. If the model type does not support plain-text calibration (e.g. Llama-3.2-vision), AutoRound automatically switches to a llava dataset and adjusts the hyperparameters.

Through the --dataset argument (text file), users can select other datasets such as "liuhaotian/llava_conv_58k", "liuhaotian/llava_instruct_80k", or "liuhaotian/llava_instruct_150k", or pass a file path to use a local file.
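The selection rule described in this section can be sketched as follows. This illustrates the described behavior and is not AutoRound's implementation: the function name is hypothetical, the choice of llava_conv_58k as the fallback is an assumption (the text only says "llava dataset"), and so is the assumption that an explicit user choice always wins. The set of model types without text-only support is also assumed; "mllama" is the Hugging Face model_type of Llama-3.2-Vision.

```python
DEFAULT_TEXT_DATASET = "NeelNanda/pile-10k"
# Assumption: this document does not say which llava dataset is the fallback.
ASSUMED_LLAVA_FALLBACK = "liuhaotian/llava_conv_58k"
# Assumption: model types that cannot be calibrated with plain text.
ASSUMED_NO_TEXT_ONLY = {"mllama"}


def choose_calibration_dataset(model_type, user_dataset=None):
    """Hypothetical sketch of the default and fallback rule described above."""
    if user_dataset is not None:
        # A named dataset or a local file path passed via --dataset.
        return user_dataset
    if model_type in ASSUMED_NO_TEXT_ONLY:
        # Plain-text calibration is not possible, so fall back to an image-text dataset.
        return ASSUMED_LLAVA_FALLBACK
    return DEFAULT_TEXT_DATASET
```

The sketch leaves out the hyperparameter adjustment that accompanies the switch, since its details are not given here.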

Nontext Module Quantization

New Models Support

Template

AutoRound uses a Template to customize model-specific operations for MLLMs. Users can register a template to support a new model that is not in the support list.

from auto_round.compressors.mllm.processor import PROCESSORS
from auto_round.compressors.mllm.template import _register_template

# `model` is assumed to be an already loaded model whose config exposes `model_type`
model_type = model.config.model_type
_register_template(model_type=model_type, default_dataset="NeelNanda/pile-10k", processor=PROCESSORS["hf"])

Processor

Processor is a callback interface for invoking the different processors an MLLM needs, such as text or image processors. Users can define their own processor and declare it with the registration function. For more information, refer to the relevant code in auto_round/compressors/mllm/processor.py.
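To make the "define and register" pattern concrete, here is a toy, self-contained sketch. Every name in it (`TOY_PROCESSORS`, `register_toy_processor`, `TextOnlyProcessor`) is hypothetical and deliberately different from AutoRound's own; the real base class and registration function are in auto_round/compressors/mllm/processor.py and may have a different interface.

```python
TOY_PROCESSORS = {}  # hypothetical registry mapping a name to a processor class


def register_toy_processor(name):
    """Hypothetical registration decorator; not the library's function."""
    def decorator(cls):
        TOY_PROCESSORS[name] = cls
        return cls
    return decorator


@register_toy_processor("text_only")
class TextOnlyProcessor:
    """Toy processor that turns one calibration sample into model inputs."""

    def __init__(self, tokenize):
        self.tokenize = tokenize  # any callable mapping str to a list of token ids

    def __call__(self, sample):
        # Ignore any image field and tokenize the text only.
        return {"input_ids": self.tokenize(sample["text"])}


# Look the class up by name, as a template would, and build an instance
# with a stand-in tokenizer that maps characters to code points.
processor = TOY_PROCESSORS["text_only"](tokenize=lambda s: [ord(c) for c in s])
```

Looking processors up by name lets a template refer to one with a string key, much as the template example above selects `PROCESSORS["hf"]`.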