[WACV 2025 (Oral)] PTQ4VM: Post-training Quantization for Visual Mamba

April 7, 2025

This is the official code for the PTQ4VM paper.

PTQ4VM can be applied to various Visual Mamba backbones, converting the pretrained model to a quantized format in under 15 minutes without notable quality degradation.

Updates

Apr. 6th, 2025: We fixed the VMamba code. There was a slight performance drop at 4-bit. We have updated the paper on arXiv; please check it out.
Mar. 3rd, 2025: We released the VMamba code.

Install

Set up the conda environment

conda create -n ptq4vm python=3.10 -y
conda activate ptq4vm

Clone the PTQ4VM repository

git clone https://github.com/YoungHyun197/ptq4vm
cd ptq4vm

Install the dependencies

pip install -r requirements.txt
pip install causal-conv1d==1.1.1
pip install mamba-ssm==1.2.0.post1

Replace core implementation of Mamba

cp -rf mamba-1p1p1/mamba_ssm /opt/conda/lib/python3.10/site-packages
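The copy command above hard-codes /opt/conda/lib/python3.10/site-packages, which may not match where your active environment keeps its packages. The sketch below is one way to find the right target directory with the standard library and build the equivalent command. The helper names `site_packages_dir` and `build_copy_command` are hypothetical and not part of this repository.

```python
import sysconfig
from pathlib import Path


def site_packages_dir() -> Path:
    """Return the site-packages directory of the running interpreter."""
    # "purelib" is the install location for pure-Python packages.
    return Path(sysconfig.get_paths()["purelib"])


def build_copy_command(src: str = "mamba-1p1p1/mamba_ssm") -> list[str]:
    """Build the cp command that targets the active environment (hypothetical helper)."""
    return ["cp", "-rf", src, str(site_packages_dir())]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(" ".join(build_copy_command()))
```

Run it inside the activated ptq4vm environment so the printed path belongs to that environment.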

Install the CUDA kernel

python ./cuda_measure/setup_vim_GEMM.py install

How to use PTQ4VM

Here we use the Vision Mamba (Vim) model as an example. Before applying PTQ4VM, prepare a pre-trained model. You can download the model from this url.

You can check the VMamba example from this url.

Generate activation smoothing scale

torchrun --nproc_per_node 1 generate_act_scale.py --resume [model-path] --model vim_tiny_patch16_224_bimambav2_final_pool_mean_abs_pos_embed_with_midclstok_div2 --data-path [imagenet-path] --batch-size 256
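For intuition about what an activation smoothing scale does, here is a minimal sketch of SmoothQuant-style smoothing. It assumes the scale is computed per input channel from the maximum absolute activation and weight values, balanced by a migration strength `alpha`. That formula is an assumption made for illustration: it is not taken from generate_act_scale.py, and the function names are hypothetical.

```python
# Hypothetical SmoothQuant-style sketch; not the code in generate_act_scale.py.

def smoothing_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """Per-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def apply_smoothing(x, w, scales):
    """Divide activations and multiply weights by the same per-channel scale."""
    x_s = [xi / s for xi, s in zip(x, scales)]
    w_s = [wi * s for wi, s in zip(w, scales)]
    return x_s, w_s


def dot(a, b):
    """Plain dot product, standing in for one output of a linear layer."""
    return sum(ai * bi for ai, bi in zip(a, b))


# One activation vector with an outlier channel and one weight row.
x = [16.0, 1.0]
w = [1.0, 4.0]
scales = smoothing_scales([abs(v) for v in x], [abs(v) for v in w], alpha=0.5)
x_s, w_s = apply_smoothing(x, w, scales)

# The layer output is unchanged, while the activation outlier shrinks.
print(scales, x_s, w_s, dot(x, w), dot(x_s, w_s))
```

Because the activations are divided and the weights multiplied by the same scale, the product is mathematically unchanged. Only the value ranges that the quantizer sees are rebalanced.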

Joint Learning of Smoothing Scale and Step size (JLSS)

torchrun --nproc_per_node 1 quant.py --eval --resume [model-path] --model vim_tiny_patch16_224_bimambav2_final_pool_mean_abs_pos_embed_with_midclstok_div2 --data-path [imagenet-path] --act_scales [smoothing-path] --batch-size 256 --qmode ptq4vm --train-batch 256 --n-lva 16 --n-lvw 16 --alpha 0.5 --epochs 100 --lr-a 5e-4 --lr-w 5e-4 --lr-s 1e-2

n-lva (n-lvw): number of activation (weight) quantization levels (8/6/4-bit: 256/64/16)
- Refer to the initialize() function of the Q_Linear and Q_Act classes in ptq4vm/quantizer.py
lr-a (lr-w, lr-s): learning rates for the activation step size (weight step size, smoothing scale)

For experimental details and hyper-parameters, please refer to the paper and the quant.py file.
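The level counts passed to n-lva and n-lvw follow from the bit width: a b-bit quantizer has 2^b levels. The sketch below shows that mapping along with a generic uniform quantizer driven by a step size. The quantizer is a textbook illustration under assumed signed, symmetric clipping. It is not the implementation of Q_Linear or Q_Act, and the function names are hypothetical.

```python
import math


def bits_to_levels(bits: int) -> int:
    """Number of quantization levels for a given bit width (2**bits)."""
    return 2 ** bits


def fake_quantize(values, step, n_levels):
    """Generic uniform quantizer: scale, round, clip, rescale (illustrative only)."""
    # Assumed signed integer grid, e.g. [-8, 7] for 16 levels.
    q_min = -(n_levels // 2)
    q_max = n_levels // 2 - 1
    out = []
    for v in values:
        q = math.floor(v / step + 0.5)   # round half up
        q = max(q_min, min(q_max, q))    # clip to the representable range
        out.append(q * step)             # map back to real values
    return out


if __name__ == "__main__":
    for bits in (8, 6, 4):
        print(bits, "bit ->", bits_to_levels(bits), "levels")
    # With 16 levels and step 0.5, anything above 3.5 is clipped.
    print(fake_quantize([0.26, -1.0, 10.0], step=0.5, n_levels=bits_to_levels(4)))
```

So `--n-lva 16 --n-lvw 16` in the command above corresponds to 4-bit activations and 4-bit weights.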

Speedup using CUDA kernel

Check the layer-wise acceleration

python cuda_sandbox.py
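If you want to cross-check a reported speedup yourself, a generic timing pattern looks like the sketch below. It is not taken from cuda_sandbox.py. The helper names are hypothetical, and the `sync` hook is an assumption meant for asynchronous backends: for CUDA workloads you would typically pass `torch.cuda.synchronize`, so that the timer waits for the kernels to finish.

```python
import statistics
import time


def median_runtime(fn, warmup=3, repeats=10, sync=None):
    """Median wall-clock time of fn() in seconds (hypothetical helper)."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:         # wait for asynchronous work to finish
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, optimized_fn, **kwargs):
    """Ratio of baseline time to optimized time; above 1.0 means faster."""
    return median_runtime(baseline_fn, **kwargs) / median_runtime(optimized_fn, **kwargs)


if __name__ == "__main__":
    # Stand-in CPU workloads; replace them with the two layers to compare.
    slow = lambda: sum(i * i for i in range(200_000))
    fast = lambda: sum(i * i for i in range(50_000))
    print(f"speedup: {speedup(slow, fast):.2f}x")
```

Taking the median over several repeats makes the result less sensitive to one-off stalls than a single measurement.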

Check the end-to-end acceleration

torchrun --nproc_per_node 1 quant.py --eval --time_compare --resume [model-path] --model vim_tiny_patch16_224_bimambav2_final_pool_mean_abs_pos_embed_with_midclstok_div2 --data-path [imagenet-path] --act_scales [smoothing-path] --batch-size 256 --qmode ptq4vm --train-batch 256 --n-lva 16 --n-lvw 16 --alpha 0.5 --epochs 100 --lr-a 5e-4 --lr-w 5e-4 --lr-s 1e-2

Reference

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

This example code is based on Vim.

Cite

If you find our code or PTQ4VM paper useful for your research, please consider citing:

@article{cho2024ptq4vm,
  title={PTQ4VM: Post-Training Quantization for Visual Mamba},
  author={Cho, Younghyun and Lee, Changhun and Kim, Seonggon and Park, Eunhyeok},
  journal={arXiv preprint arXiv:2412.20386},
  year={2024}
}