Quantize DeepSeek models to FP4

June 27, 2026

This example demonstrates how to quantize DeepSeek models to FP4 and export a unified checkpoint that can be deployed with TRT-LLM.

Setup

Because of the model size, quantizing the FP8 model currently requires 8xH200 or 16xH100 GPUs. This example uses 8xH200.

Directory Layout

  • deepseek_v3/: DeepSeek V3, R1, V3.1, and V3.2 FP4 quantization.
  • deepseek_v4/: DeepSeek V4 routed-expert NVFP4 quantization.

DeepSeek V3 FP4

Convert the HF checkpoint for DeepSeek FP8 inference

# set up variables to run the example
export HF_FP8_CKPT={path_to_downloaded_hf_checkpoint}
export DS_CKPT={path_to_save_converted_checkpoint}
export FP4_QUANT_PATH={path_to_save_quantization_results}
export HF_FP4_PATH={path_to_save_the_final_FP4_checkpoint}

DeepSeek V3, R1, V3.1

# download the FP8 checkpoint from Hugging Face. This example uses DeepSeek-R1
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir $HF_FP8_CKPT

# clone the DeepSeek-V3 (base model of R1) GitHub repository for FP8 inference
git clone https://github.com/deepseek-ai/DeepSeek-V3.git && cd DeepSeek-V3 && git checkout 9b4e978

[Experimental] DeepSeek V3.2

# download the FP8 checkpoint from Hugging Face
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Exp --local-dir $HF_FP8_CKPT

# clone the DeepSeek-V3.2 GitHub repository for FP8 inference
git clone https://github.com/deepseek-ai/DeepSeek-V3.2-Exp.git && cd DeepSeek-V3.2-Exp && git checkout 87e509a

# Install requirements
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git
pip install -r inference/requirements.txt

Convert the Checkpoint

# convert the HF checkpoint to the format used by the DeepSeek inference code
python inference/convert.py --hf-ckpt-path $HF_FP8_CKPT --save-path $DS_CKPT --n-experts 256 --model-parallel 8
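To give a feel for what --n-experts 256 --model-parallel 8 implies, here is a minimal sketch that assumes the routed experts are split into equal, contiguous index ranges, one range per rank. Both that partitioning scheme and the helper name local_expert_range are assumptions made for illustration; check convert.py for the actual behavior.

```python
def local_expert_range(n_experts: int, model_parallel: int, rank: int) -> range:
    """Return the expert indices one rank would own under an even, contiguous split.

    Assumption: experts are divided evenly and contiguously across ranks.
    """
    if n_experts % model_parallel != 0:
        raise ValueError("n_experts must be divisible by model_parallel")
    if not 0 <= rank < model_parallel:
        raise ValueError("rank out of range")
    per_rank = n_experts // model_parallel
    return range(rank * per_rank, (rank + 1) * per_rank)


if __name__ == "__main__":
    # 256 experts over 8 ranks: each rank would hold 32 experts.
    for rank in range(8):
        owned = local_expert_range(256, 8, rank)
        print(f"rank {rank}: experts {owned.start}..{owned.stop - 1}")
```

Under this assumption, the same arithmetic gives 48 experts per rank for a 384-expert model at model-parallel 8.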

Post-training quantization

Run the calibration scripts

DeepSeek V3, R1, V3.1

torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH

DeepSeek V3.2

torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH

MoE expert calibration

By default, calibration uses the model's native top-k routing and then runs a post-calibration sync that sets every expert's input_quantizer.amax (w1/w2/w3) to the per-layer global peer max (all-reduced across EP ranks). weight_quantizer.amax stays per-expert; any uncalibrated expert falls back to a compute path over the dequantized FP8 weight. This mirrors the layer_sync_moe_local_experts_amax flow that mtq runs automatically for QuantSequentialMLP-derived MoEs.
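The post-calibration sync described above can be pictured with a small single-process sketch. The dictionary layout, the function name sync_expert_input_amax, the use of plain floats, and taking the maximum separately for each projection are all illustrative assumptions; the real script operates on TensorQuantizer modules and all-reduces across EP ranks.

```python
from typing import Dict, Optional

PROJECTIONS = ("w1", "w2", "w3")


def sync_expert_input_amax(
    layer: Dict[int, Dict[str, Optional[float]]],
) -> Dict[int, Dict[str, Optional[float]]]:
    """Set each projection's input amax to the max over all experts in one layer.

    layer[expert_id][proj] is the calibrated input amax, or None when the
    expert received no tokens during calibration (hypothetical layout).
    The input mapping is left unmodified; a synced copy is returned.
    """
    synced = {eid: dict(per_proj) for eid, per_proj in layer.items()}
    for proj in PROJECTIONS:
        seen = [v[proj] for v in layer.values() if v.get(proj) is not None]
        if not seen:
            continue  # no expert was calibrated for this projection
        peak = max(seen)
        for eid in synced:
            synced[eid][proj] = peak  # uncalibrated experts inherit the peak
    return synced


if __name__ == "__main__":
    layer = {
        0: {"w1": 1.0, "w2": 2.0, "w3": 1.0},
        1: {"w1": 3.0, "w2": 0.5, "w3": 3.0},
        2: {"w1": None, "w2": None, "w3": None},  # never routed to
    }
    print(sync_expert_input_amax(layer))
```

In this sketch an expert that saw no tokens ends up with the same input amax as its peers, which is the point of the sync: no expert is left with a missing activation scale.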

To restore the original behavior, which forces every token through every expert during calibration (slower, ~2x forwards, no post-calibration sync), pass --calib_all_experts:

torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH --calib_all_experts

A summary of every TensorQuantizer is written to $FP4_QUANT_PATH/.quant_summary.txt.

Quantize the FP8 HF checkpoint to FP4

We provide a one-step script that will:

  • Quantize the weights to NVFP4
  • Copy miscellaneous files to the quantized checkpoint

./deepseek_v3/quantize_fp8_to_nvfp4.sh --amax_path $FP4_QUANT_PATH --fp4_output_path $HF_FP4_PATH --fp8_hf_path $HF_FP8_CKPT --world_size 8

DeepSeek V4 routed-expert NVFP4

DeepSeek V4 uses a mixed native checkpoint layout. The V4 recipe quantizes only the routed experts to NVFP4 W4A4 and leaves attention projections, the router gate, shared experts, embeddings, and the LM head in their original formats.

Prepare the MP checkpoint

Keep experts in MXFP4 when resharding with DeepSeek's own convert.py:

export DS_V4=/path/to/DeepSeek-V4-Pro
export MP=8
export MP_CKPT=/path/to/DeepSeek-V4-Pro-mp${MP}-mxfp4
export AMAX=/path/to/amax-nvfp4-experts
export HF_NVFP4_PATH=/path/to/DeepSeek-V4-Pro-nvfp4-experts

python ${DS_V4}/inference/convert.py \
    --hf-ckpt-path ${DS_V4} \
    --save-path ${MP_CKPT} \
    --n-experts 384 \
    --model-parallel ${MP}

Calibrate routed experts

Single node:

torchrun --nproc-per-node ${MP} --master_port 12346 deepseek_v4/ptq.py \
    --model_path ${MP_CKPT} \
    --config ${DS_V4}/inference/config.json \
    --dsv4_inference_dir ${DS_V4}/inference \
    --output_path ${AMAX}

Two 4-GPU nodes for MP=8:

# node 0
torchrun --nnodes=2 --node_rank=0 --master_addr=<ip> --master_port=12346 \
    --nproc-per-node 4 deepseek_v4/ptq.py \
    --model_path ${MP_CKPT} \
    --config ${DS_V4}/inference/config.json \
    --dsv4_inference_dir ${DS_V4}/inference \
    --output_path ${AMAX}

# node 1
torchrun --nnodes=2 --node_rank=1 --master_addr=<ip> --master_port=12346 \
    --nproc-per-node 4 deepseek_v4/ptq.py \
    --model_path ${MP_CKPT} \
    --config ${DS_V4}/inference/config.json \
    --dsv4_inference_dir ${DS_V4}/inference \
    --output_path ${AMAX}

Export back to HF shard layout

deepseek_v4/quantize_to_nvfp4.py operates on the original HF-style V4 checkpoint and produces a new HF-style checkpoint with routed expert weights replaced by NVFP4 tensors plus weight_scale, weight_scale_2, and input_scale.

python deepseek_v4/quantize_to_nvfp4.py \
    --amax_path ${AMAX} \
    --source_ckpt ${DS_V4} \
    --output_ckpt ${HF_NVFP4_PATH}
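To make the roles of weight_scale and weight_scale_2 concrete, here is a simplified sketch of two-level NVFP4 weight scaling. It assumes the common NVFP4 convention of 16-element blocks, an E2M1 value grid with maximum 6, and a per-tensor scale of amax / (6 * 448). It does not round the per-block scale to E4M3 or pack nibbles, and the function names are hypothetical; it is not the export script's implementation.

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
E2M1_MAX = 6.0
E4M3_MAX = 448.0
BLOCK = 16  # assumed NVFP4 block size


def round_to_e2m1(x: float) -> float:
    """Round to the nearest E2M1 magnitude (ties go to the smaller one here)."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize_nvfp4(weights, tensor_amax):
    """Return (values, weight_scale, weight_scale_2) for a flat list of floats."""
    if len(weights) % BLOCK != 0:
        raise ValueError("length must be a multiple of the block size")
    if tensor_amax <= 0:
        raise ValueError("tensor_amax must be positive")
    weight_scale_2 = tensor_amax / (E2M1_MAX * E4M3_MAX)  # per-tensor scale
    values, weight_scale = [], []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        block_amax = max(abs(w) for w in block)
        # Per-block scale, expressed relative to the per-tensor scale.
        scale = block_amax / (E2M1_MAX * weight_scale_2) if block_amax > 0 else 1.0
        weight_scale.append(scale)
        step = scale * weight_scale_2
        values.extend(round_to_e2m1(w / step) for w in block)
    return values, weight_scale, weight_scale_2


def dequantize_nvfp4(values, weight_scale, weight_scale_2):
    """Reconstruct floats: value * per-block scale * per-tensor scale."""
    return [v * weight_scale[i // BLOCK] * weight_scale_2 for i, v in enumerate(values)]
```

In this picture, input_scale plays the analogous per-tensor role for activations and comes from the calibrated input amax rather than from the weights.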

The output includes an updated model.safetensors.index.json, a config.json with quantization_config.moe_quant_algo = "NVFP4", and hf_quant_config.json describing the mixed NVFP4 expert layers.
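A quick sanity check on the exported directory might look like the sketch below. It relies only on the file names and the quantization_config.moe_quant_algo key mentioned above; the helper name check_nvfp4_export is hypothetical, and this check is not part of the example scripts.

```python
import json
from pathlib import Path

EXPECTED_FILES = (
    "model.safetensors.index.json",
    "config.json",
    "hf_quant_config.json",
)


def check_nvfp4_export(ckpt_dir: str) -> list:
    """Return a list of human-readable problems found in the export directory."""
    root = Path(ckpt_dir)
    problems = [f"missing {name}" for name in EXPECTED_FILES if not (root / name).is_file()]
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        algo = config.get("quantization_config", {}).get("moe_quant_algo")
        if algo != "NVFP4":
            problems.append(f"unexpected moe_quant_algo: {algo!r}")
    return problems


if __name__ == "__main__":
    import os

    # HF_NVFP4_PATH is the variable exported earlier in this example.
    for problem in check_nvfp4_export(os.environ.get("HF_NVFP4_PATH", ".")):
        print(problem)
```

An empty result only means the expected files and key are present; it says nothing about the tensor contents.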

When the source routed experts are MXFP4 (as in the V4 release), add --cast_mxfp4_to_nvfp4 for a lossless weight conversion. This is recommended over the default lossy dequant/re-quant path. See below.

Lossless MXFP4 → NVFP4 weight cast (--cast_mxfp4_to_nvfp4)

The routed experts in the source checkpoint are already MXFP4 (E2M1 nibbles plus a power-of-two E8M0 scale per 32-element block). Without the flag, the export dequantizes them to BF16 and re-quantizes to NVFP4 using the calibrated per-tensor weight amax, which re-derives the per-block scales from the data and is therefore lossy.

With --cast_mxfp4_to_nvfp4, the per-tensor scale_2 is pinned to 2^m with m = k_max - 8, where k_j is the E8M0 exponent of source block j and k_max is the largest such exponent. Each per-block E4M3 scale is set to 2^(k_j - m) straight from the source E8M0 scales, so per_block_scale * scale_2 = 2^k_j and the NVFP4 nibbles equal the source MXFP4 nibbles bit-for-bit. This holds for every block whose k_j lands in E4M3's representable window; the rare out-of-range block falls back to a data-derived scale.

The flag only affects routed-expert weights; activation input_scale still comes from ${AMAX} calibration. The run prints a [cast] lossless MXFP4->NVFP4 blocks: … summary. This mirrors the GPTOSS cast in examples/hf_ptq/cast_mxfp4_to_nvfp4.py; the V4 twist is that w1/w3 share one scale_2 (fused GEMM1), so k_max is taken over both projections.
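The scale arithmetic of the cast can be sketched in a few lines. Here m stands for k_max - 8, so that scale_2 = 2^m and each per-block scale is 2^(k_j - m). The sketch assumes that E4M3 can represent powers of two from 2^-9 (its smallest subnormal) up to 2^8, and that the k_j values are unbiased exponents; the function name plan_cast and the use of None for fallback blocks are hypothetical, and this is not the code in quantize_to_nvfp4.py.

```python
E4M3_MIN_POW2 = -9  # assumed: smallest E4M3 subnormal is 2**-9
E4M3_MAX_POW2 = 8   # largest power of two not exceeding the E4M3 max of 448


def plan_cast(block_exponents):
    """Derive NVFP4 scales from unbiased E8M0 block exponents k_j.

    Returns (scale_2, per_block_scales, lossless_mask). A per-block scale of
    None marks a block that would need a data-derived fallback scale.
    """
    k_max = max(block_exponents)
    m = k_max - 8
    scale_2 = 2.0 ** m  # per-tensor scale pinned to a power of two
    per_block, lossless = [], []
    for k_j in block_exponents:
        e = k_j - m  # exponent the E4M3 per-block scale must carry
        ok = E4M3_MIN_POW2 <= e <= E4M3_MAX_POW2
        lossless.append(ok)
        per_block.append(2.0 ** e if ok else None)
    return scale_2, per_block, lossless


if __name__ == "__main__":
    exps = [-3, 0, -20, -1]
    scale_2, per_block, lossless = plan_cast(exps)
    for k_j, s, ok in zip(exps, per_block, lossless):
        if ok:
            # The product reproduces the source scale exactly.
            assert s * scale_2 == 2.0 ** k_j
    print(scale_2, per_block, lossless)
```

To model the shared scale_2 for w1/w3, one would pass the exponents of both projections together so that k_max is taken over both, as the text describes.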