Post-training quantization (PTQ) for Vision Language Models

March 24, 2026

To learn more about the quantization feature, please refer to the documentation.

Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.
Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4 and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs.
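To give a feel for one of these algorithms, below is a minimal pure-Python sketch of the core SmoothQuant idea: per-channel activation outliers are migrated into the weights via a scale s_j = max|X_j|^α / max|W_j|^(1-α), so that (X / s) @ (diag(s) W) is mathematically identical to X @ W but has more balanced activation ranges and is therefore easier to quantize. This is an illustrative sketch, not Model Optimizer code; the matrices, α value, and function names are made up for the example.

```python
# Illustrative sketch of SmoothQuant-style scale migration (not Model Optimizer code).
# For each input channel j, s_j = max|X[:, j]|**a / max|W[j, :]|**(1 - a) moves
# activation outliers into the weights while leaving the matmul result unchanged.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, alpha=0.5):
    n_in = len(W)  # number of input channels
    s = []
    for j in range(n_in):
        act_max = max(abs(X[i][j]) for i in range(len(X)))
        w_max = max(abs(w) for w in W[j])
        s.append((act_max ** alpha) / (w_max ** (1 - alpha)))
    X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]  # X / s
    W_s = [[w * s[j] for w in W[j]] for j in range(n_in)]       # diag(s) @ W
    return X_s, W_s

X = [[10.0, 0.1], [-8.0, 0.2]]    # channel 0 carries an activation outlier
W = [[0.05, -0.02], [1.5, 0.8]]
X_s, W_s = smooth(X, W)

# The product is preserved, but the smoothed activations are far more balanced.
ref, out = matmul(X, W), matmul(X_s, W_s)
assert all(abs(r - o) < 1e-9 for rr, ro in zip(ref, out) for r, o in zip(rr, ro))
```

In the real algorithm the same migration is applied per linear layer before quantization, with α chosen to balance activation and weight ranges.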

This section focuses on post-training quantization (PTQ) for Vision Language Models (VLMs), a technique that reduces model precision after training to improve inference efficiency without requiring retraining.
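As a rough intuition for where the size reduction comes from, the sketch below (pure Python, illustrative values only, not Model Optimizer code) performs symmetric per-tensor INT8 quantization: each FP32 weight (4 bytes) is stored as one INT8 value (1 byte) plus a single shared FP32 scale, so large tensors approach the 4x compression mentioned above with a bounded round-trip error.

```python
# Illustrative sketch (not Model Optimizer code): symmetric per-tensor INT8
# quantization. Each FP32 weight becomes a 1-byte integer; one FP32 scale is
# shared per tensor, so compression approaches 4x as the tensor grows.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0     # map the max |w| to the int8 range
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.31, -0.87, 0.02, 1.19, -0.44, 0.65]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4                    # quantized values + one scale
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12                  # error bounded by half a quantization step
```

Model Optimizer's formats (per-channel/block scales, NVFP4, FP8) are more sophisticated, but the size/accuracy trade-off works on the same principle.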

| Section | Description | Link | Docs |
| --- | --- | --- | --- |
| Pre-Requisites | Required & optional packages to use this technique | [Link] | |
| Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | [Link] | [docs] |
| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | [Link] | |
| Framework Scripts | Example scripts demonstrating quantization techniques for optimizing Hugging Face / Megatron-Bridge / Megatron-LM models | [Link] | |
| Pre-Quantized Checkpoints | Ready-to-deploy Hugging Face pre-quantized checkpoints | [Link] | |
| Resources | Extra links to relevant resources | [Link] | |

Pre-Requisites

Please refer to the llm_ptq/README.md for the prerequisites.

Getting Started

Please refer to the llm_ptq/README.md for getting-started instructions.

Support Matrix

Supported Models

| Model | fp8 | int8_sq¹ | int4_awq | w4a8_awq² | nvfp4³ |
| --- | --- | --- | --- | --- | --- |
| Llava | ✅ | ✅ | ✅ | ✅ | - |
| VILA | ✅ | ✅ | ✅ | ✅ | - |
| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen2, 2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemma3 | ✅ | - | - | - | - |

1. Only TensorRT-LLM checkpoint export is supported; not compatible with the TensorRT-LLM torch backend.
2. w4a8_awq is an experimental quantization scheme that may incur a higher accuracy penalty.
3. A selected set of popular models is tested internally; the actual list of supported models may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.

For detailed TensorRT-LLM torch backend multimodal support, please refer to this doc.

The accuracy loss after PTQ varies with the model and the quantization method, and is usually more significant when the base model is small. If the accuracy after PTQ does not meet your requirements, try modifying hf_ptq.py to disable KV cache quantization, or use QAT instead.
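As a rough intuition for why the accuracy penalty depends on the chosen format, the sketch below (pure Python, made-up values, not Model Optimizer code) fake-quantizes the same weights at 8 and 4 bits and compares round-trip error: fewer quantization levels mean a coarser grid and larger error.

```python
# Illustrative only: round-trip error of symmetric fake-quantization at
# different bit widths. Fewer levels -> coarser grid -> larger error, one
# reason the PTQ accuracy penalty depends on the quantization format.

def fake_quant(weights, bits):
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

weights = [0.31, -0.87, 0.02, 1.19, -0.44, 0.65]
err = {bits: max(abs(w - q) for w, q in zip(weights, fake_quant(weights, bits)))
       for bits in (8, 4)}
assert err[4] > err[8]                               # coarser format, larger error
```

Real-model behavior also depends on outliers, calibration data, and algorithms such as AWQ that reduce exactly this error, which is why measuring accuracy after PTQ is always recommended.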

Framework Scripts

Please refer to the llm_ptq/README.md for details on model quantization.

The following scripts provide an all-in-one, step-by-step model quantization example for the supported Hugging Face multimodal models. The quantization format and the number of GPUs are supplied as inputs to these scripts.

Hugging Face Example Script

```shell
scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
```

Megatron-Bridge Example

Please refer to examples/megatron_bridge/ for example scripts for PTQ with Megatron-Bridge.

Pre-Quantized Checkpoints

Resources