Post-training quantization (PTQ) for Vision Language Models

March 24, 2026

To learn more about the quantization feature, please refer to the documentation.

Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.
Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4 and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs.
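To give a feel for one of these algorithms, below is a minimal pure-Python sketch of the core SmoothQuant idea: per-channel activation outliers are migrated into the weights via a scale s_j = max|X_j|^α / max|W_j|^(1-α), so that (X / s) @ (diag(s) W) is mathematically identical to X @ W but has more balanced activation ranges and is therefore easier to quantize. This is an illustrative sketch, not Model Optimizer code; the matrices, α value, and function names are made up for the example.

```python
# Illustrative sketch of SmoothQuant-style scale migration (not Model Optimizer code).
# For each input channel j, s_j = max|X[:, j]|**a / max|W[j, :]|**(1 - a) moves
# activation outliers into the weights while leaving the matmul result unchanged.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, alpha=0.5):
    n_in = len(W)  # number of input channels
    s = []
    for j in range(n_in):
        act_max = max(abs(X[i][j]) for i in range(len(X)))
        w_max = max(abs(w) for w in W[j])
        s.append((act_max ** alpha) / (w_max ** (1 - alpha)))
    X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]  # X / s
    W_s = [[w * s[j] for w in W[j]] for j in range(n_in)]       # diag(s) @ W
    return X_s, W_s

X = [[10.0, 0.1], [-8.0, 0.2]]    # channel 0 carries an activation outlier
W = [[0.05, -0.02], [1.5, 0.8]]
X_s, W_s = smooth(X, W)

# The product is preserved, but the smoothed activations are far more balanced.
ref, out = matmul(X, W), matmul(X_s, W_s)
assert all(abs(r - o) < 1e-9 for rr, ro in zip(ref, out) for r, o in zip(rr, ro))
```

In the real algorithm the same migration is applied per linear layer before quantization, with α chosen to balance activation and weight ranges.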

This section focuses on post-training quantization (PTQ) for Vision Language Models (VLMs), a technique that reduces model precision after training to improve inference efficiency without requiring retraining.
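As a rough intuition for where the size reduction comes from, the sketch below (pure Python, illustrative values only, not Model Optimizer code) performs symmetric per-tensor INT8 quantization: each FP32 weight (4 bytes) is stored as one INT8 value (1 byte) plus a single shared FP32 scale, so large tensors approach the 4x compression mentioned above with a bounded round-trip error.

```python
# Illustrative sketch (not Model Optimizer code): symmetric per-tensor INT8
# quantization. Each FP32 weight becomes a 1-byte integer; one FP32 scale is
# shared per tensor, so compression approaches 4x as the tensor grows.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0     # map the max |w| to the int8 range
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.31, -0.87, 0.02, 1.19, -0.44, 0.65]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4                    # quantized values + one scale
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12                  # error bounded by half a quantization step
```

Model Optimizer's formats (per-channel/block scales, NVFP4, FP8) are more sophisticated, but the size/accuracy trade-off works on the same principle.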

| Section | Description | Link | Docs |
| --- | --- | --- | --- |
| Pre-Requisites | Required & optional packages to use this technique | [Link] | |
| Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | [Link] | [docs] |
| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | [Link] | |
| Framework Scripts | Example scripts demonstrating quantization techniques for optimizing Hugging Face / Megatron-Bridge / Megatron-LM models | [Link] | |
| Pre-Quantized Checkpoints | Ready-to-deploy Hugging Face pre-quantized checkpoints | [Link] | |
| Resources | Extra links to relevant resources | [Link] | |

Pre-Requisites

Please refer to the llm_ptq/README.md for the prerequisites.

Getting Started

Please refer to the llm_ptq/README.md for getting-started instructions.

Support Matrix

Supported Models

| Model | fp8 | int8_sq¹ | int4_awq | w4a8_awq² | nvfp4³ |
| --- | --- | --- | --- | --- | --- |
| Llava | ✅ | ✅ | ✅ | ✅ | - |
| VILA | ✅ | ✅ | ✅ | ✅ | - |
| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen2, 2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemma3 | ✅ | - | - | - | - |

1. Only TensorRT-LLM checkpoint export is supported; not compatible with the TensorRT-LLM torch backend.
2. w4a8_awq is an experimental quantization scheme that may incur a higher accuracy penalty.
3. A selected set of popular models is tested internally; the actual list of supported models may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.

For detailed TensorRT-LLM torch backend multimodal support, please refer to this doc.

The accuracy loss after PTQ varies with the model and the quantization method, and is usually more significant when the base model is small. If the accuracy after PTQ does not meet your requirements, try modifying hf_ptq.py to disable KV cache quantization, or use QAT instead.
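As a rough intuition for why the accuracy penalty depends on the chosen format, the sketch below (pure Python, made-up values, not Model Optimizer code) fake-quantizes the same weights at 8 and 4 bits and compares round-trip error: fewer quantization levels mean a coarser grid and larger error.

```python
# Illustrative only: round-trip error of symmetric fake-quantization at
# different bit widths. Fewer levels -> coarser grid -> larger error, one
# reason the PTQ accuracy penalty depends on the quantization format.

def fake_quant(weights, bits):
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

weights = [0.31, -0.87, 0.02, 1.19, -0.44, 0.65]
err = {bits: max(abs(w - q) for w, q in zip(weights, fake_quant(weights, bits)))
       for bits in (8, 4)}
assert err[4] > err[8]                               # coarser format, larger error
```

Real-model behavior also depends on outliers, calibration data, and algorithms such as AWQ that reduce exactly this error, which is why measuring accuracy after PTQ is always recommended.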

Framework Scripts

Please refer to the llm_ptq/README.md for details on model quantization.

The following scripts provide an all-in-one, step-by-step model quantization example for the supported Hugging Face multimodal models. The quantization format and the number of GPUs are supplied as inputs to these scripts.

Hugging Face Example Script

```shell
scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
```

Megatron-Bridge Example

Please refer to examples/megatron_bridge/ for example scripts for PTQ with Megatron-Bridge.

Pre-Quantized Checkpoints

Resources