[WACV 2025 (Oral)] PTQ4VM: Post-training Quantization for Visual Mamba
April 7, 2025
This is the official code for the paper PTQ4VM.
PTQ4VM can be applied to various Visual Mamba backbones, converting the pretrained model to a quantized format in under 15 minutes without notable quality degradation.
Updates
- Apr. 6th, 2025: We fixed the VMamba code. There is a slight performance drop at 4-bit. The arXiv version has been updated; please check it out.
- Mar. 3rd, 2025: We released the VMamba code.
Install
- Set up the conda environment
conda create -n ptq4vm python=3.10 -y
conda activate ptq4vm
- Clone the PTQ4VM repository
git clone https://github.com/YoungHyun197/ptq4vm
cd ptq4vm
- Install the dependencies
pip install -r requirements.txt
pip install causal-conv1d==1.1.1
pip install mamba-ssm==1.2.0.post1
- Replace core implementation of Mamba
cp -rf mamba-1p1p1/mamba_ssm /opt/conda/lib/python3.10/site-packages
- Install the CUDA kernel
python ./cuda_measure/setup_vim_GEMM.py install
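The `cp` command above hard-codes `/opt/conda/lib/python3.10/site-packages`, which may not match every environment. As a small sketch, assuming the `ptq4vm` conda environment is active, the standard library can print the site-packages directory of the current interpreter so the destination path can be adjusted if needed:

```python
import sysconfig

# Directory where pure-Python packages are installed for the active interpreter.
# Use this as the destination of the `cp -rf mamba-1p1p1/mamba_ssm ...` step
# if it differs from /opt/conda/lib/python3.10/site-packages.
site_packages = sysconfig.get_paths()["purelib"]
print(site_packages)
```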
How to use PTQ4VM
Here we use the Vision Mamba (Vim) model as an example. Before applying PTQ4VM, prepare a pre-trained model. You can download the model from this url.
You can check the VMamba example from this url.
Generate activation smoothing scale
torchrun --nproc_per_node 1 generate_act_scale.py --resume [model-path] --model vim_tiny_patch16_224_bimambav2_final_pool_mean_abs_pos_embed_with_midclstok_div2 --data-path [imagenet-path] --batch-size 256
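For intuition about what an activation smoothing scale does, here is a minimal sketch. It assumes a SmoothQuant-style rule, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), which is not confirmed to be what generate_act_scale.py computes. The function name `smoothing_scales`, the `alpha` and `eps` parameters, and the toy values are all hypothetical.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """Per-channel scale s_j = a_j**alpha / w_j**(1 - alpha).

    Assumption: SmoothQuant-style rule, NOT taken from the PTQ4VM code.
    act_absmax / weight_absmax hold per-channel maximum absolute values.
    """
    scales = []
    for a, w in zip(act_absmax, weight_absmax):
        # eps guards against zero-valued channels
        s = (max(a, eps) ** alpha) / (max(w, eps) ** (1.0 - alpha))
        scales.append(s)
    return scales


# Toy per-channel maxima: channel 1 has an activation outlier.
act_absmax = [1.0, 64.0, 4.0]
weight_absmax = [1.0, 1.0, 4.0]
scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)

# Dividing activations by s and multiplying weights by s leaves the
# product unchanged while shrinking the activation outlier.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
print(scales, smoothed_act, smoothed_weight)
```

In this toy case the outlier channel's activation range drops from 64 to 8 while its weight range grows from 1 to 8, so both tensors end up with a comparable range.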
Joint Learning of Smoothing Scale and Step size (JLSS)
torchrun --nproc_per_node 1 quant.py --eval --resume [model-path] --model vim_tiny_patch16_224_bimambav2_final_pool_mean_abs_pos_embed_with_midclstok_div2 --data-path [imagenet-path] --act_scales [smoothing-path] --batch-size 256 --qmode ptq4vm --train-batch 256 --n-lva 16 --n-lvw 16 --alpha 0.5 --epochs 100 --lr-a 5e-4 --lr-w 5e-4 --lr-s 1e-2
- n-lva (n-lvw): activation (weight) quantization levels (8/6/4-bit: 256/64/16)
  - Refer to the initialize() function of the Q_Linear and Q_Act classes in ptq4vm/quantizer.py
- lr-a (lr-w, lr-s): learning rates for the activation step size (weight step size, smoothing scale)
For experimental details and hyper-parameters, please refer to the paper and the quant.py file.
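The relation between bit-width and the `--n-lva` / `--n-lvw` values is simply levels = 2^bits. The sketch below shows that mapping, together with a generic uniform quantizer driven by a step size. The signed symmetric grid and the helper names `levels_for_bits` and `fake_quantize` are assumptions for illustration; the actual quantizer in ptq4vm/quantizer.py may differ.

```python
def levels_for_bits(bits):
    """Number of quantization levels for a given bit-width (2**bits)."""
    return 2 ** bits


def fake_quantize(x, step, n_levels, signed=True):
    """Round x to a uniform grid with the given step size, then clamp.

    Assumption: a plain uniform quantizer, not the PTQ4VM implementation.
    """
    if signed:
        q_min, q_max = -(n_levels // 2), n_levels // 2 - 1
    else:
        q_min, q_max = 0, n_levels - 1
    q = round(x / step)             # nearest integer grid index
    q = max(q_min, min(q_max, q))   # clamp to the representable range
    return q * step                 # map back to the real-valued domain


# 8/6/4-bit correspond to 256/64/16 levels.
for bits in (8, 6, 4):
    print(bits, levels_for_bits(bits))

# With 16 levels and step 0.1, the signed grid covers [-0.8, 0.7].
in_range = fake_quantize(0.26, step=0.1, n_levels=levels_for_bits(4))
clamped = fake_quantize(5.0, step=0.1, n_levels=levels_for_bits(4))
print(in_range, clamped)
```

A larger step size widens the representable range at the cost of coarser resolution, which is why the step size is worth tuning rather than fixing by hand.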
Speedup using CUDA kernel
- Check the layer-wise acceleration
python cuda_sandbox.py
- Check the end-to-end acceleration
torchrun --nproc_per_node 1 quant.py --eval --time_compare --resume [model-path] --model vim_tiny_patch16_224_bimambav2_final_pool_mean_abs_pos_embed_with_midclstok_div2 --data-path [imagenet-path] --act_scales [smoothing-path] --batch-size 256 --qmode ptq4vm --train-batch 256 --n-lva 16 --n-lvw 16 --alpha 0.5 --epochs 100 --lr-a 5e-4 --lr-w 5e-4 --lr-s 1e-2
Reference
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
This example code is based on Vim.
Cite
If you find our code or PTQ4VM paper useful for your research, please consider citing:
@article{cho2024ptq4vm,
title={PTQ4VM: Post-Training Quantization for Visual Mamba},
author={Cho, Younghyun and Lee, Changhun and Kim, Seonggon and Park, Eunhyeok},
journal={arXiv preprint arXiv:2412.20386},
year={2024}
}