Chaining Multiple Optimization Techniques
December 17, 2025
This directory demonstrates how to chain multiple optimization techniques, such as pruning, distillation, and quantization, to achieve the best performance on a given model.
HuggingFace BERT Pruning + Distillation + Quantization
This example shows how to compress a Hugging Face BERT large model for Question Answering
using the combination of modelopt.torch.prune, modelopt.torch.distill, and modelopt.torch.quantize. More specifically, we will:
- Prune the BERT large model to 50% FLOPs with the GradNAS algorithm and fine-tune it with distillation
- Quantize the fine-tuned model to INT8 precision with Post-Training Quantization (PTQ), followed by Quantization Aware Training (QAT) with distillation
- Export the quantized model to ONNX format for deployment with TensorRT
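Across these stages, the intermediate result is passed along as a ModelOpt-aware checkpoint. The sketch below shows that handoff, assuming the modelopt.torch.opt save/restore API; the checkpoint file name is hypothetical, and the real flow lives in bert_prune_distill_quantize.py.

```python
# Sketch: handing a ModelOpt-modified model from one stage to the next.
# mto.save/mto.restore persist the ModelOpt state (pruned architecture,
# quantizer settings) together with the weights. File name is hypothetical.
import modelopt.torch.opt as mto
from transformers import BertForQuestionAnswering

CKPT = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(CKPT)

# End of one stage: save weights plus ModelOpt modifications.
mto.save(model, "stage_output.pth")

# Start of the next stage: rebuild the modified model from the base model.
model = BertForQuestionAnswering.from_pretrained(CKPT)
model = mto.restore(model, "stage_output.pth")
```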
The main Python file is bert_prune_distill_quantize.py, and scripts for running all 3 steps are available in the scripts directory.
NOTE: This example has been tested on 8 x 24GB A5000 GPUs with PyTorch 2.4 and CUDA 12.4. It takes about 2 hours to complete all the stages of the optimization. Most of the time is spent on fine-tuning and QAT.
Prerequisites
Install Model Optimizer with optional torch and huggingface dependencies:
pip install "nvidia-modelopt[hf]"
pip install -r requirements.txt
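If the installation succeeded, the three ModelOpt modules this example relies on should import cleanly (note that the quantization APIs live under modelopt.torch.quantization). A quick sanity check:

```python
# Smoke test: all three imports should succeed after installation.
import modelopt.torch.distill as mtd
import modelopt.torch.prune as mtp
import modelopt.torch.quantization as mtq

print("ModelOpt torch modules loaded:", mtp.__name__, mtd.__name__, mtq.__name__)
```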
Running the example
To run the example, execute the following scripts in order:
- First, we prune the BERT large model to 50% FLOPs with the GradNAS algorithm. Then we fine-tune the pruned model with distillation from the unpruned teacher model to recover 99+% of the initial F1 score (93.15). We recommend using multiple GPUs for fine-tuning. Note that we fine-tune for more epochs than the 2 epochs used originally to fine-tune BERT without distillation: distillation needs more epochs to converge, but achieves much better results. (A sketch of the underlying API calls appears after this list.)

  bash scripts/1_prune.sh
- Next, we quantize the fine-tuned model to INT8 precision and run calibration (PTQ). PTQ causes a slight drop in F1 score, which we then recover with QAT; the QAT step also uses distillation from the unpruned teacher model. (See the second sketch after this list.)

  bash scripts/2_int8_quantize.sh
- Finally, we export the quantized model to ONNX format for deployment with TensorRT. (See the third sketch after this list.)

  bash scripts/3_onnx_export.sh
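For reference, here is a minimal sketch of what scripts/1_prune.sh drives: GradNAS pruning followed by distillation fine-tuning. The synthetic batch and loss function are placeholders for the real SQuAD pipeline, and the mtp.prune / mtd.convert signatures (including the layer-pair criterion syntax) follow the ModelOpt documentation rather than the example's exact code; in particular, the real example wires the distillation loss to the QA start/end logits differently.

```python
# Sketch of step 1: GradNAS pruning, then distillation fine-tuning.
import modelopt.torch.distill as mtd
import modelopt.torch.prune as mtp
import torch
from transformers import BertForQuestionAnswering

CKPT = "bert-large-uncased-whole-word-masking-finetuned-squad"
teacher = BertForQuestionAnswering.from_pretrained(CKPT)
student = BertForQuestionAnswering.from_pretrained(CKPT)

# Stand-in for a real SQuAD dataloader: one synthetic batch.
ids = torch.randint(0, student.config.vocab_size, (2, 128))
batch = {
    "input_ids": ids,
    "attention_mask": torch.ones_like(ids),
    "start_positions": torch.zeros(2, dtype=torch.long),
    "end_positions": torch.zeros(2, dtype=torch.long),
}
train_dataloader = [batch]

def loss_func(output, data):
    # GradNAS scores pruning choices by gradients of this task loss.
    return output.loss

# Prune the student to ~50% of the original FLOPs with GradNAS.
student, _ = mtp.prune(
    model=student,
    mode="gradnas",
    constraints={"flops": "50%"},
    dummy_input=batch["input_ids"],
    config={"data_loader": train_dataloader, "loss_func": loss_func},
)

# Wrap the pruned student with the frozen teacher for knowledge distillation.
# Matching the "qa_outputs" layers (the final QA logits projection in
# BertForQuestionAnswering) is an illustrative assumption.
kd_model = mtd.convert(
    student,
    mode=[("kd_loss", {
        "teacher_model": teacher,
        "criterion": {("qa_outputs", "qa_outputs"): mtd.LogitsDistillationLoss()},
        "loss_balancer": mtd.StaticLossBalancer(),
    })],
)

# One illustrative fine-tuning step: combine task loss and KD loss.
out = kd_model(**batch)
loss = kd_model.compute_kd_loss(student_loss=out.loss)
loss.backward()
```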
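Continuing from the step 1 sketch, step 2 (scripts/2_int8_quantize.sh) boils down to one mtq.quantize call for PTQ, with QAT being ordinary training afterwards. The calibration loader is a placeholder (in practice a few hundred real batches are used), and mtd.export unwrapping the student follows the ModelOpt docs.

```python
# Sketch of step 2: INT8 PTQ calibration, then QAT with distillation.
import modelopt.torch.distill as mtd
import modelopt.torch.quantization as mtq

# Unwrap the fine-tuned student from the distillation wrapper of step 1.
student = mtd.export(kd_model)

calib_dataloader = train_dataloader  # placeholder; use real SQuAD batches

def forward_loop(model):
    # Calibration: run representative data through the model so each
    # inserted quantizer can record activation ranges.
    for data in calib_dataloader:
        model(**data)

# PTQ: insert INT8 fake-quant nodes and calibrate them in a single call.
student = mtq.quantize(student, mtq.INT8_DEFAULT_CFG, forward_loop)

# QAT: re-wrap with the teacher as in step 1 and keep training; the weights
# adapt to the quantization noise and the F1 score recovers.
```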
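Finally, step 3 (scripts/3_onnx_export.sh) exports the quantized model with standard torch.onnx.export; the fake-quant nodes become ONNX Q/DQ nodes that TensorRT fuses into INT8 kernels at engine build time. The input/output names, shapes, and opset below are illustrative, not the example's exact export settings.

```python
# Sketch of step 3: export the quantized model to ONNX with Q/DQ nodes.
import torch

student.eval()
dummy = {
    "input_ids": torch.randint(0, 30522, (1, 384)),
    "attention_mask": torch.ones(1, 384, dtype=torch.long),
    "token_type_ids": torch.zeros(1, 384, dtype=torch.long),
}
torch.onnx.export(
    student,
    (dummy,),  # a trailing dict is passed to the model as keyword arguments
    "bert_large_pruned_int8.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```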