Fast3D

July 15, 2025

Fast3D is a plug-and-play visual token pruning framework for accelerating 3D MLLMs (e.g., Chat-Scene). It reaches a 90% visual token pruning ratio with a negligible performance drop (98.82% of Chat-Scene's score with GAP+SAP in the comparison below) through two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network is trained to predict the aggregated attention map from all layers of the target model, enabling efficient token importance estimation for precise pruning guidance; and (2) Sample-Adaptive visual token Pruning (SAP), which dynamically adjusts the token budget based on input complexity to improve the overall accuracy-efficiency trade-off.

Performance Comparison

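| Method | ScanRefer (Acc@0.5) | Multi3dRefer (F1@0.5) | Scan2Cap (B-4@0.5) | ScanQA (B-4) | SQA3D (EM-R) | Score Ratio |
| --- | --- | --- | --- | --- | --- | --- |
| Chat-Scene | 50.40 | 53.21 | 35.92 | 13.55 | 56.83 | 100% |
| w/ FastV 35% | 49.65 | 52.64 | 35.77 | 13.80 | 56.74 | 99.74% |
| w/ FastV 65% | 22.91 | 28.26 | 29.36 | 12.84 | 56.01 | 74.72% |
| w/ FastV 90% | 3.69 | 8.91 | 21.92 | 10.47 | 50.64 | 50.29% |
| w/ Fast3D(GAP) 35% | 50.84 | 53.55 | 35.32 | 13.29 | 56.86 | 99.60% |
| w/ Fast3D(GAP) 65% | 50.89 | 53.68 | 35.01 | 13.44 | 56.34 | 99.53% |
| w/ Fast3D(GAP) 90% | 50.02 | 51.09 | 32.64 | 12.79 | 55.42 | 95.61% |
| w/ Fast3D(GAP+SAP) ~90% | 50.94 | 53.06 | 34.60 | 13.29 | 56.22 | 98.82% |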

* w/ method x% indicates results with an average visual token pruning ratio of x%.
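The README does not define Score Ratio. The reported values are consistent with the mean of the per-benchmark ratios to the Chat-Scene baseline, so the sketch below uses that reading as an assumption. The function name `score_ratio` is hypothetical; the numbers are copied from the table.

```python
# Metric order: ScanRefer, Multi3dRefer, Scan2Cap, ScanQA, SQA3D.
# Assumption: Score Ratio = mean of (method metric / baseline metric), in percent.
BASELINE = [50.40, 53.21, 35.92, 13.55, 56.83]  # Chat-Scene row


def score_ratio(metrics, baseline=BASELINE):
    """Mean of per-benchmark ratios to the baseline, as a percentage."""
    ratios = [m / b for m, b in zip(metrics, baseline)]
    return 100.0 * sum(ratios) / len(ratios)


fastv_35 = [49.65, 52.64, 35.77, 13.80, 56.74]       # w/ FastV 35%
fast3d_gap_90 = [50.02, 51.09, 32.64, 12.79, 55.42]  # w/ Fast3D(GAP) 90%

print(f"{score_ratio(fastv_35):.2f}")       # 99.74
print(f"{score_ratio(fast3d_gap_90):.2f}")  # 95.61
```

Under this reading, a method can exceed 100% on one benchmark (FastV 35% on ScanQA) and still land below 100% overall.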

Preparation

  • Prepare the environment:
wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n fast3d python=3.9.17
conda activate fast3d
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
wget --quiet https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
cd transformers
pip install -e .
  • Download LLM backbone:

    • We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in config.py to the path of vicuna-7b-v1.5.

  • Annotations and extracted features:

    Please follow the instructions in preprocess.

  • Download Chat-Scene's pretrained checkpoint:

    We provide the pretrained checkpoint in Google Drive. Download it from either Link 1 or Link 2.

Chat-Scene Vanilla Inference

val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
  • Run: bash scripts/eval.sh

Inference with FastV

val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
# batch eval pruning ratios: 90%, 65%, 35%
rank_list=(15 60 90) # keep from 300 visual tokens
Ks=(2 6 16) # from which layer of 32 layers
  • Run: bash scripts/batch_eval_fastv.sh
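The pairing of `rank_list` and `Ks` with the 90%, 65%, and 35% ratios can be checked with a little arithmetic. The sketch below assumes that the first K layers see all 300 visual tokens and the remaining layers see only the kept ones, and that the pruning ratio is averaged over the 32 layers. That averaging rule is an assumption inferred from the comments in the script, and `average_pruning_ratio` is a hypothetical helper, not part of the repository.

```python
def average_pruning_ratio(kept, k, num_tokens=300, num_layers=32):
    """Layer-averaged share of visual tokens removed, in percent.

    Assumes layers [0, k) process all `num_tokens` tokens and
    layers [k, num_layers) process only `kept` tokens.
    """
    avg_tokens = (k * num_tokens + (num_layers - k) * kept) / num_layers
    return 100.0 * (1.0 - avg_tokens / num_tokens)


# (kept tokens, pruning layer K) pairs from rank_list and Ks
for kept, k in zip((15, 60, 90), (2, 6, 16)):
    print(f"keep {kept:>2} from layer {k:>2}: "
          f"{average_pruning_ratio(kept, k):.1f}% pruned on average")
```

With these assumptions the three settings come out at roughly 89%, 65%, and 35%, which matches the order given in the script comment.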

Inference with Fast3D (GAP)

1. Extract global attention maps from Chat-Scene to serve as the training targets for the GAP network.

(You can skip this step and use our provided infer_attn_maps in Google Drive)

train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
  • Run: bash scripts/extract_attn_maps.sh
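As a rough illustration of what a "global attention map" could look like, the sketch below averages attention weights over all layers, heads, and query positions, then reads off one score per visual token. The exact aggregation used by `extract_attn_maps.sh` is not described here, so the averaging rule, the tensor layout, and the name `aggregate_attention` are all assumptions.

```python
import numpy as np


def aggregate_attention(attn, visual_slice):
    """Collapse attention weights into one importance score per visual token.

    attn: array of shape (layers, heads, queries, keys), rows sum to 1.
    visual_slice: positions of the visual tokens along the key axis.
    Assumption: importance = mean attention received across layers,
    heads, and query positions.
    """
    per_key = attn.mean(axis=(0, 1, 2))  # shape: (keys,)
    return per_key[visual_slice]


# Toy input: 4 layers, 2 heads, sequence length 10, visual tokens at 2..7.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2, 10, 10))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row softmax

scores = aggregate_attention(attn, slice(2, 8))
print(scores.shape)  # (6,)
```

A real extraction pass would more likely restrict the query axis to text or answer tokens and respect the causal mask; the toy input above ignores both.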
2. Train the GAP network.

(You can skip this step and use our provided trained_gap_network in Google Drive)

attn_maps_root: /path/to/infer_attn_maps(extracted_attn_maps)
train_tags: scanrefer#scan2cap#scanqa#sqa3d#multi3dref
val_tags: scanrefer#scan2cap#scanqa#sqa3d#multi3dref
  • We use roberta-base in our experiments, which can be downloaded from Hugging Face. Then modify Fast3dNetConfig.roberta_path in modeling_fast3d.py.

  • We use 4× NVIDIA RTX 3090 GPUs to train the GAP network. Run:

accelerate config
cd fast3d
bash train_fast3d.sh
3. Get predicted attention maps from the trained GAP network.

(You can skip this step and use our provided pred_attn_maps in Google Drive)

eval_only: True
pretrained_model_path: /path/to/checkpoint_best.pth
save_attn_maps: True
attn_maps_root: /path/to/infer_attn_maps(extracted_attn_maps)
val_tags: scanrefer#scan2cap#scanqa#sqa3d#multi3dref
  • Run:
accelerate config
cd fast3d
bash test_fast3d.sh
4. Run Chat-Scene inference with the predicted attention maps.

(Quick start: You can skip the above steps and use our provided predicted attention maps in Google Drive)

val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
use_fast_v=False
use_fast_v_oracle=True
use_external_attn_maps=True
use_a_map_ori=False
val_attn_maps_path="/path/to/predicted_attn_maps"
# batch eval pruning ratios: 90%, 65%, 35%
rank_list=(15 60 90)
Ks=(2 6 16)
  • Run: bash scripts/batch_eval_fast3d_pred_attn.sh
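With a fixed budget from `rank_list`, pruning reduces to keeping the highest-scoring visual tokens from the predicted attention map. The following is a minimal sketch of that selection, assuming one score per visual token; `keep_top_tokens` is a hypothetical helper, not the repository's implementation.

```python
import numpy as np


def keep_top_tokens(scores, rank):
    """Indices of the `rank` highest-scoring tokens, in original order."""
    rank = min(rank, len(scores))
    top = np.argsort(scores)[::-1][:rank]  # highest scores first
    return np.sort(top)                    # restore positional order


scores = np.array([0.05, 0.30, 0.10, 0.40, 0.15])
kept = keep_top_tokens(scores, rank=2)
print(kept)  # [1 3]
```

Sorting the kept indices preserves the original token order, which matters if the model relies on positional information for the surviving tokens.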

Inference with Fast3D (GAP+SAP)

  • Manually search for the total attention score threshold alpha. Modify tasks/search_alpha.py:

    alpha = 0.21
    target_pruning_ratio = 90
    tolerance = 2
    pred_attn_maps_path = "/path/to/predicted_attn_maps"
    

    Then run python tasks/search_alpha.py to check whether the alpha is valid.

  • Modify batch_eval_fast3d_pred_attn_adaptive.sh: (We provide the predicted attention maps in Google Drive)

    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=True
    pretrained_path="/path/to/pretrained_model.pth"
    use_fast_v=False
    use_fast_v_oracle=True
    use_external_attn_maps=True
    use_a_map_ori=False
    val_attn_maps_path="/path/to/predicted_attn_maps"
    alpha_list=(0.21)
    Ks=(0)
    
  • Run: bash scripts/batch_eval_fast3d_pred_attn_adaptive.sh
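One plausible reading of the alpha check is sketched below: for each sample, keep the smallest set of top-scoring tokens whose scores sum to at least alpha, then test whether the average pruning ratio over all samples lands within `tolerance` of `target_pruning_ratio`. This selection rule is an assumption based on the phrase "total attention score threshold", and the function names are hypothetical; the actual logic lives in tasks/search_alpha.py.

```python
import numpy as np


def tokens_kept(scores, alpha):
    """Smallest number of top-scoring tokens whose scores sum to >= alpha."""
    cumulative = np.cumsum(np.sort(scores)[::-1])
    k = int(np.searchsorted(cumulative, alpha, side="left")) + 1
    return min(k, len(scores))  # alpha above the total keeps every token


def average_pruning_ratio(score_maps, alpha):
    """Mean percentage of tokens pruned across samples."""
    ratios = [1.0 - tokens_kept(s, alpha) / len(s) for s in score_maps]
    return 100.0 * float(np.mean(ratios))


def alpha_is_valid(score_maps, alpha, target_pruning_ratio, tolerance):
    """True if the average pruning ratio is within tolerance of the target."""
    avg = average_pruning_ratio(score_maps, alpha)
    return abs(avg - target_pruning_ratio) <= tolerance


# Toy data: a peaked sample needs fewer tokens than a flat one.
peaked = np.array([0.50, 0.20, 0.10, 0.10, 0.10])
flat = np.full(10, 0.10)
print(tokens_kept(peaked, 0.25), tokens_kept(flat, 0.25))  # 1 3
```

Because the number of kept tokens depends on how concentrated each sample's scores are, the budget varies per sample while alpha stays fixed, which is the sample-adaptive behaviour SAP describes.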

📄 Citation

If you find this project useful in your research, please consider citing:

@misc{huang2025fast3daccelerating3dmultimodal,
      title={Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding}, 
      author={Wencan Huang and Daizong Liu and Wei Hu},
      year={2025},
      eprint={2507.09334},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.09334}, 
}

😊 Acknowledgement

Thanks to the following open-source projects: Chat-Scene, FastV, and Vil3dref.