MonoMVSNet

December 16, 2025

arXiv | Pretrained Models

MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network
Authors: Jianfei Jiang, Qiankun Liu*, Haochen Yu, Hongyuan Liu, Liyong Wang, Jiansheng Chen, Huimin Ma*
Institute: University of Science and Technology Beijing
ICCV 2025

Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks.
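
The alignment step mentioned above can be pictured with a small example. Monocular depth is only defined up to an unknown scale and shift, so a common way to bring it into the range of a multi-view estimate is a least-squares fit of those two numbers. The sketch below is an illustration under that assumption, not the alignment procedure used in MonoMVSNet; the function name `align_scale_shift` and the sample values are hypothetical, and whether the fit is done in depth or inverse-depth space is not stated in this README.

```python
def align_scale_shift(mono, ref):
    """Fit s, t minimising sum((s * mono_i + t - ref_i)^2) in closed form.

    `mono` and `ref` are flat sequences of per-pixel values of equal length.
    """
    n = len(mono)
    if n == 0 or n != len(ref):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_m = sum(mono) / n
    mean_r = sum(ref) / n
    # Spread of the monocular values and their co-variation with the reference.
    var_m = sum((m - mean_m) ** 2 for m in mono)
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mono, ref))
    if var_m == 0:
        raise ValueError("monocular depth is constant; scale is undefined")
    s = cov / var_m
    t = mean_r - s * mean_m
    return s, t


# Hypothetical values: a relative depth map and a reference depth that
# differs from it by scale 2.0 and shift 0.5.
mono = [0.1, 0.4, 0.7, 1.0]
target = [2.0 * m + 0.5 for m in mono]
s, t = align_scale_shift(mono, target)
aligned = [s * m + t for m in mono]
print(s, t)
```

In a real pipeline the same fit would be vectorised over valid pixels with NumPy or PyTorch; the pure-Python version only shows the arithmetic.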

🚀Installation

conda create -n monomvsnet python=3.10.8
conda activate monomvsnet
pip install -r requirements.txt

To reproduce the GPU memory consumption reported in the paper, you need to install xformers.
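
As a quick sanity check after installation, you can confirm that xformers is importable in the active environment. This is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and is not part of this repository.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Report whether the optional dependency is available.
print("xformers available:", has_module("xformers"))
```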

⭐Data Preparation

Please refer to RRT-MVS for dataset preparation.

You also need to download the pretrained weights depth_anything_v2_vits and TEED_model, then place them in the folder pre_trained_weights.
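
Before training or testing, it can help to verify that both weight files are in place. The sketch below matches files by name prefix because the exact file extensions are not given here; the helper `missing_weights` is hypothetical, and the folder is assumed to sit in the directory you run the script from.

```python
from pathlib import Path

# Name prefixes taken from the instructions above; extensions are not assumed.
EXPECTED_PREFIXES = ("depth_anything_v2_vits", "TEED_model")


def missing_weights(folder="pre_trained_weights", prefixes=EXPECTED_PREFIXES):
    """Return the prefixes for which no file in `folder` starts with that name."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir()] if folder.is_dir() else []
    return [pre for pre in prefixes if not any(n.startswith(pre) for n in names)]


missing = missing_weights()
print("missing weights:", missing if missing else "none")
```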

🦴Training

Training on DTU

To train the model on DTU, specify DTU_TRAINING in ./scripts/train_dtu.sh first and then run:

bash scripts/train_dtu.sh

After training, you will get model checkpoints in ./checkpoints/dtu.

Fine-tuning on BlendedMVS

To fine-tune the model on BlendedMVS, first specify BLD_TRAINING and BLD_CKPT_FILE in ./scripts/train_bld.sh, then run:

bash scripts/train_bld.sh

After fine-tuning, you will get model checkpoints in ./checkpoints/bld_ft.
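
A common mistake is launching a script while one of its path variables is still empty. The sketch below scans script text for plain `NAME=value` assignments and reports the names that are unset. It assumes the scripts define these variables as simple shell assignments without trailing comments, which is not confirmed here; the helper `unset_variables` and the example path are hypothetical.

```python
import re


def unset_variables(script_text, names):
    """Return the names that have no non-empty `NAME=value` line in the script text."""
    missing = []
    for name in names:
        # Match an optional `export`, then NAME=, and capture the rest of the line.
        pattern = rf"^\s*(?:export\s+)?{re.escape(name)}=(.*)$"
        match = re.search(pattern, script_text, flags=re.MULTILINE)
        value = match.group(1).strip().strip("\"'") if match else ""
        if not value:
            missing.append(name)
    return missing


# Hypothetical script content: one variable filled in, one left empty.
example = 'BLD_TRAINING="/data/blendedmvs"\nBLD_CKPT_FILE=""\n'
print(unset_variables(example, ["BLD_TRAINING", "BLD_CKPT_FILE"]))
```

To check a real script, read it with `pathlib.Path("scripts/train_bld.sh").read_text()` and pass the text to the function.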

👀Testing

Testing on DTU

For DTU testing, we use the model (dtu_best) trained on the DTU training set. Place it in the folder ./checkpoints/dtu, specify DTU_TESTPATH and DTU_CKPT_FILE in ./scripts/test_dtu.sh, then run the following command to generate point cloud results:

bash scripts/test_dtu_dypcd.sh

For the ablation study in Table 3 of the paper, we use:

bash scripts/test_dtu_pcd.sh

Testing on Tanks and Temples

We recommend using the fine-tuned model (bld_best) to test on the Tanks and Temples benchmark. Place it in the folder ./checkpoints/bld_ft. Similarly, specify TNT_TESTPATH and TNT_CKPT_FILE in scripts/test_tnt_inter.sh and scripts/test_tnt_adv.sh. To generate point cloud results, run:

bash scripts/test_tnt_inter.sh

bash scripts/test_tnt_adv.sh

For quantitative evaluation, you can upload your point clouds to the Tanks and Temples benchmark.

💪Results

Quantitative Results on DTU

DTU	Acc. ↓	Comp. ↓	Overall ↓
Ours (N=5)	0.313	0.243	0.278
Ours (N=9)	0.302	0.248	0.275
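
In DTU evaluation, Overall is the arithmetic mean of Acc. and Comp. The short check below recomputes it from the two rows above; the variable and function names are illustrative only.

```python
# (Acc., Comp., reported Overall) copied from the table above.
rows = {
    "Ours (N=5)": (0.313, 0.243, 0.278),
    "Ours (N=9)": (0.302, 0.248, 0.275),
}


def overall(acc, comp):
    """DTU Overall score: the mean of accuracy and completeness."""
    return (acc + comp) / 2


for name, (acc, comp, reported) in rows.items():
    # Allow for rounding of the reported values to three decimals.
    consistent = abs(overall(acc, comp) - reported) < 5e-4
    print(name, f"{overall(acc, comp):.4f}", "consistent" if consistent else "mismatch")
```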

Quantitative Results on Tanks-and-Temples

Inter.	Mean ↑	Family	Francis	Horse	Lighthouse	M60	Panther	Playground	Train
Ours	68.63	82.38	72.89	62.80	70.49	65.79	68.54	65.54	60.59

Adv.	Mean ↑	Auditorium	Ballroom	Courtroom	Museum	Palace	Temple
Ours	43.58	30.33	46.76	42.90	56.31	37.28	47.88
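
The Mean column is the average of the per-scene scores. The snippet below recomputes it for both subsets from the values in the tables above; the dictionary layout is illustrative only.

```python
# Per-scene scores copied from the tables above, with the reported mean.
intermediate = {
    "Family": 82.38, "Francis": 72.89, "Horse": 62.80, "Lighthouse": 70.49,
    "M60": 65.79, "Panther": 68.54, "Playground": 65.54, "Train": 60.59,
}
advanced = {
    "Auditorium": 30.33, "Ballroom": 46.76, "Courtroom": 42.90,
    "Museum": 56.31, "Palace": 37.28, "Temple": 47.88,
}
reported = {"Inter.": 68.63, "Adv.": 43.58}


def mean_score(scores):
    """Average the per-scene scores of one benchmark subset."""
    return sum(scores.values()) / len(scores)


for label, scores in (("Inter.", intermediate), ("Adv.", advanced)):
    mean = mean_score(scores)
    # Reported means are rounded to two decimals, so compare with a tolerance.
    consistent = abs(mean - reported[label]) < 0.01
    print(label, f"{mean:.4f}", "consistent" if consistent else "mismatch")
```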

🤝Citation

If you find this work useful in your research, please consider citing the following:

@inproceedings{monomvsnet,
    author    = {Jiang, Jianfei and Liu, Qiankun and Yu, Haochen and Liu, Hongyuan and Wang, Liyong and Chen, Jiansheng and Ma, Huimin},
    title     = {MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {27806-27816}
}

🫶Acknowledgements

Our work is partially based on these open-source works: ET-MVSNet, TransMVSNet, MVSFormer++, Depth Anything V2, and TEED. We appreciate their contributions to the MVS community.