README.md

September 17, 2025

MedPTQ overview

MedPTQ: A Practical Toolkit for Real Post-Training Quantization
in 3D Medical Image Segmentation


We introduce MedPTQ, an open-source toolkit for real post-training quantization (PTQ) that implements true 8-bit (INT8) inference on state-of-the-art (SOTA) 3D medical segmentation models.

News

  • [2025-09-17] 🔥 New — INT8 quantized U-Net and TransUNet have been released (see MedPTQ Models).

Method


Overview of MedPTQ. The top row illustrates the original FP32 pipeline, where both the activation X and the weight W are in full precision and pass through Conv–BN–ReLU sequentially. The middle row shows the simulated quantization stage: QuantizeLinear and DequantizeLinear nodes are inserted after both activations and weights to simulate INT8 quantization semantics, while the model still executes in FP32. The bottom row shows the real INT8 TensorRT engine, where TensorRT fuses the FP32 weights with their associated QuantizeLinear nodes into INT8 weights, and merges the activation DequantizeLinear, weight DequantizeLinear, convolution, BN, and ReLU into a single fused convolution block. This fusion enables optimized INT8 convolution kernels, reducing memory traffic and improving efficiency while preserving accuracy.
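The simulated stage in the middle row can be sketched in NumPy. This is a minimal illustration, not MedPTQ's own code: it assumes symmetric per-tensor INT8 quantization with a zero point of 0 and a scale taken from the maximum absolute value, and the helper names `quantize_linear`, `dequantize_linear`, and `fake_quantize` are hypothetical. The arithmetic follows the ONNX QuantizeLinear and DequantizeLinear definitions: divide by the scale, round, saturate to [-128, 127], then multiply back by the scale.

```python
import numpy as np

def quantize_linear(x, scale, zero_point=0):
    """ONNX-style QuantizeLinear: round(x / scale) + zero_point, saturated to INT8."""
    q = np.round(x / scale) + zero_point  # np.round rounds half to even, like ONNX
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_linear(q, scale, zero_point=0):
    """ONNX-style DequantizeLinear: map INT8 codes back to FP32 values."""
    return ((q.astype(np.int32) - zero_point) * scale).astype(np.float32)

def fake_quantize(x):
    """Q/DQ pair: the output is FP32 but only takes values INT8 can represent.

    Assumption (not from the README): symmetric per-tensor scale = max|x| / 127.
    """
    scale = float(np.abs(x).max()) / 127.0
    return dequantize_linear(quantize_linear(x, scale), scale), scale

# Toy 3D convolution weight: (out_channels, in_channels, depth, height, width)
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3, 3)).astype(np.float32)

w_dq, scale = fake_quantize(w)
max_err = float(np.abs(w - w_dq).max())
print(f"scale={scale:.6f}  max abs error={max_err:.6f}")
```

With this scheme the round-trip error per element is bounded by about half the scale. That bound is what the Q/DQ nodes expose to the FP32 graph before TensorRT replaces them with real INT8 kernels.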

MedPTQ Models

| Model | Download | Dataset |
| --- | --- | --- |
| U-Net | FP32 ONNX, INT8 ONNX | BTCV |
| TransUNet | FP32 ONNX, INT8 ONNX | BTCV |

Note

Release progress: 🟩🟩⬜⬜⬜⬜⬜ 2/7
Released: U-Net, TransUNet • Coming soon: UNesT, VISTA3D, SegResNet, SwinUNETR, nnU-Net

Getting Started

Performance

BTCV (N = 20, C = 13)

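| Model | Size FP32 (MB) | Size INT8 (MB) | Latency FP32 (ms) | Latency INT8 (ms) | mDSC FP32 | mDSC INT8 |
| --- | --- | --- | --- | --- | --- | --- |
| U-Net | 23.11 | 6.61 | 2.62 | 1.05 | 0.822 | 0.822 |
| TransUNet | 351.85 | 91.90 | 4.09 | 1.74 | 0.816 | 0.816 |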

Whole Brain Segmentation (N = 50, C = 133)

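| Model | Size FP32 (MB) | Size INT8 (MB) | Latency FP32 (ms) | Latency INT8 (ms) | mDSC FP32 | mDSC INT8 |
| --- | --- | --- | --- | --- | --- | --- |
| UNesT | 349.41 | 96.72 | 5.59 | 2.72 | 0.702 | 0.701 |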

TotalSegmentator V2 (N = 200, C = 104)

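| Model | Size FP32 (MB) | Size INT8 (MB) | Latency FP32 (ms) | Latency INT8 (ms) | mDSC FP32 | mDSC INT8 |
| --- | --- | --- | --- | --- | --- | --- |
| nnU-Net | 107.84 | 33.97 | 2.99 | 1.25 | 0.901 | 0.895 |
| SwinUNETR | 247.96 | 70.18 | 9.85 | 3.59 | 0.878 | 0.877 |
| SegResNet | 170.44 | 50.29 | 5.14 | 2.06 | 0.882 | 0.879 |
| VISTA3D | 264.57 | 71.18 | 4.59 | 1.93 | 0.893 | 0.891 |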

Quantization results of SOTA medical segmentation models. We evaluate MedPTQ on seven models (U-Net, TransUNet, UNesT, nnU-Net, SwinUNETR, SegResNet, VISTA3D) across three datasets with different numbers of samples (N) and classes (C): BTCV (N = 20, C = 13), Whole Brain Segmentation (N = 50, C = 133), and TotalSegmentator V2 (N = 200, C = 104). All models are compiled to TensorRT for both FP32 and INT8; we report Model Size (MB), Latency (ms), and mDSC. Compared with FP32, INT8 consistently compresses model size by 3.17×–3.83× and reduces latency by 2.06×–2.74×, while maintaining accuracy (absolute ΔmDSC ≤ 0.006).
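The ranges quoted in the caption can be re-derived from the reported numbers. The short script below is only a sanity check: the values are copied from the performance tables, while the dictionary layout and variable names are illustrative.

```python
# Per model: (size_fp32_mb, size_int8_mb, latency_fp32_ms, latency_int8_ms,
#             mdsc_fp32, mdsc_int8), copied from the performance tables.
results = {
    "U-Net":     (23.11, 6.61, 2.62, 1.05, 0.822, 0.822),
    "TransUNet": (351.85, 91.90, 4.09, 1.74, 0.816, 0.816),
    "UNesT":     (349.41, 96.72, 5.59, 2.72, 0.702, 0.701),
    "nnU-Net":   (107.84, 33.97, 2.99, 1.25, 0.901, 0.895),
    "SwinUNETR": (247.96, 70.18, 9.85, 3.59, 0.878, 0.877),
    "SegResNet": (170.44, 50.29, 5.14, 2.06, 0.882, 0.879),
    "VISTA3D":   (264.57, 71.18, 4.59, 1.93, 0.893, 0.891),
}

# Ratios are FP32 / INT8, so larger means more compression or more speedup.
compression = {m: r[0] / r[1] for m, r in results.items()}
speedup = {m: r[2] / r[3] for m, r in results.items()}
mdsc_drop = {m: abs(r[4] - r[5]) for m, r in results.items()}

print(f"size:    {min(compression.values()):.2f}x to {max(compression.values()):.2f}x")
print(f"latency: {min(speedup.values()):.2f}x to {max(speedup.values()):.2f}x")
print(f"max |delta mDSC|: {max(mdsc_drop.values()):.3f}")
```

The extremes come from nnU-Net and TransUNet for model size, UNesT and SwinUNETR for latency, and nnU-Net for the largest mDSC change.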

Citation

If you find MedPTQ useful, please cite:

@article{qu2025post,
  title={Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines},
  author={Qu, Chongyu and Zhao, Ritchie and Yu, Ye and Liu, Bin and Yao, Tianyuan and Zhu, Junchao and Landman, Bennett A and Tang, Yucheng and Huo, Yuankai},
  journal={arXiv preprint arXiv:2501.17343},
  year={2025}
}

Acknowledgments

This research was supported by NIH R01DK135597 (Huo), DoD HT9425-23-1-0003 (HCY), NSF 2434229 (Huo), and the KPMP Glue Grant. This work was also supported by the Vanderbilt Seed Success Grant, the Vanderbilt Discovery Grant, and the VISE Seed Grant. This project was supported by The Leona M. and Harry B. Helmsley Charitable Trust grants G-1903-03793 and G-2103-05128. This research was also supported by NIH grants R01EB033385, R01DK132338, REB017230, R01MH125931, and NSF 2040462. We thank NVIDIA for its support through the NVIDIA hardware grant. This work was also supported by NSF NAIRR Pilot Award NAIRR240055.