RRT-MVS

November 4, 2025 · View on GitHub

Paper | Pretrained Models | Point Clouds

RRT-MVS: Recurrent Regularization Transformer for Multi-View Stereo
Authors: Jianfei Jiang, Liyong Wang, Haochen Yu, Tianyu Hu, Jiansheng Chen, Huimin Ma^*
Institute: University of Science and Technology Beijing
AAAI 2025

Abstract

Learning-based multi-view stereo methods aim to predict depth maps for reconstructing dense point clouds. These methods rely on regularization to reduce redundancy in the cost volume. However, existing methods have limitations: CNN-based regularization is restricted to local receptive fields, while Transformer-based regularization struggles with handling depth discontinuities. These limitations often result in inaccurate depth maps with significant noise, particularly noticeable in the boundary and background regions. In this paper, we propose a Recurrent Regularization Transformer for Multi-View Stereo (RRT-MVS), which addresses these limitations by regularizing the cost volume separately for depth and spatial dimensions. Specifically, we introduce Recurrent Self-Attention (R-SA) to aggregate global matching costs within and across the cost maps and filter out noisy feature correlations. Additionally, we present Depth Residual Attention (DRA) to aggregate depth correlations within the cost volume and a Positional Adapter (PA) to enhance 3D positional awareness in each 2D cost map, further augmenting the effectiveness of R-SA. Experimental results demonstrate that RRT-MVS achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets. Notably, RRT-MVS ranks first on both the Tanks-and-Temples intermediate and advanced benchmarks among all published methods.

Recurrent Regularization Transformer

Installation

conda create -n rrtmvsnet python=3.10.8
conda activate rrtmvsnet
pip install -r requirements.txt
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 -f https://download.pytorch.org/whl/torch_stable.html

Data Preparation

1. DTU Dataset

Training data. We use the same DTU training data as mentioned in MVSNet and CasMVSNet. Download DTU training data and Depth raw. Unzip and organize them as:

dtu_training                     
    ├── Cameras                
    ├── Depths   
    ├── Depths_raw
    └── Rectified

Testing Data. Download DTU testing data. Unzip it as:

dtu_testing                                       
    ├── scan1   
    ├── scan4
    ├── ...

2. BlendedMVS Dataset

Download BlendedMVS and unzip it as:

blendedmvs                          
    ├── 5a0271884e62597cdee0d0eb                
    ├── 5a3ca9cb270f0e3f14d0eddb   
    ├── ...
    ├── training_list.txt
    ├── ...

3. Tanks and Temples Dataset

Download Tanks and Temples processed data by ET-MVSNet and MVSFormer++ and unzip it as:

tanksandtemples_1                          
       ├── advanced                 
       │   ├── Auditorium       
       │   ├── ...  
       └── intermediate
           ├── Family       
           ├── ...

Training

Training on DTU (NVIDIA RTX 3090 GPUs, 24G)

To train the model on DTU, specify DTU_TRAINING in ./scripts/train_dtu.sh first and then run:

bash scripts/train_dtu.sh exp_name

After training, you will get model checkpoints in ./checkpoints/dtu.

Finetune on BlendedMVS (NVIDIA RTX A6000 GPUs, 48G)

To fine-tune the model on BlendedMVS, you need specify BLD_TRAINING and BLD_CKPT_FILE in ./scripts/train_bld.sh first, then run:

bash scripts/train_bld.sh exp_name

Testing

Testing on DTU

For DTU testing, we use the model (pretrained model) trained on DTU training dataset. Specify DTU_TESTPATH and DTU_CKPT_FILE in ./scripts/test_dtu.sh first, then run the following command to generate point cloud results.

bash scripts/test_dtu.sh exp_name

For quantitative evaluation, download SampleSet and Points from DTU's website. Unzip them and place Points folder in SampleSet/MVS Data/. The structure is just like:

SampleSet
├──MVS Data
      └──Points

Specify datapath, plyPath, resultsPath in evaluations/dtu/BaseEvalMain_web.m and datapath, resultsPath in evaluations/dtu/ComputeStat_web.m, then run the following command to obtain the quantitative metics.

cd evaluations/dtu
matlab -nodisplay
BaseEvalMain_web 
ComputeStat_web

The matlab evaluation code is slow, we recommend Fast DTU Evaluation Using GPU with Python for fast validation, and the final result reported in paper is obtained by official matlab code.

Testing on Tanks and Temples

We recommend using the finetuned model (pretrained model) to test on Tanks and Temples benchmark. Similarly, specify TNT_TESTPATH and TNT_CKPT_FILE in scripts/test_tnt_inter.sh and scripts/test_tnt_adv.sh. To generate point cloud results, just run:

bash scripts/test_tnt_inter.sh exp_name

bash scripts/test_tnt_adv.sh exp_name

For quantitative evaluation, you can upload your point clouds to Tanks and Temples benchmark.

DTU	Acc. ↓	Comp. ↓	Overall ↓
Ours	0.309	0.261	0.285

T&T (Inter.)	Mean ↑	Family	Francis	Horse	Lighthouse	M60	Panther	Playground	Train
Ours	68.16	82.54	72.31	61.44	69.89	65.35	68.88	64.45	60.48

T&T (Adv.)	Mean ↑	Auditorium	Ballroom	Courtroom	Museum	Palace	Temple
Ours	43.29	30.95	46.42	41.13	55.46	37.63	48.12

You can download reconstructed point clouds on both DTU and T&T here.

Citation

If you find this work useful in your research, please consider citing the following:

@inproceedings{jiang2025rrt,
  title={RRT-MVS: Recurrent Regularization Transformer for Multi-View Stereo},
  author={Jiang, Jianfei and Wang, Liyong and Yu, Haochen and Hu, Tianyu and Chen, Jiansheng and Ma, Huimin},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={4},
  pages={3994--4002},
  year={2025}
}

Acknowledgements

Our work is partially based on these opening source work:

We appreciate their contributions to the MVS community.