RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

February 4, 2025

Official implementation of the paper "RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions"

Accepted by NeurIPS 2024

Paper Link: https://arxiv.org/abs/2410.02924

Authors: Ziyao Zeng, Yangchao Wu, Hyoungseob Park, Daniel Wang, Fengyu Yang, Stefano Soatto, Dong Lao, Byung-Woo Hong, Alex Wong

Overview

[Figure: RSA overview]

Poster

[Figure: RSA poster]

Setup Environment

Create Virtual Environment:

cd RSA

virtualenv -p /usr/bin/python3.8 ~/venvs/rsa

vim ~/.bash_profile

Add the following line to ~/.bash_profile:

alias rsa="export CUDA_HOME=/usr/local/cuda-11.1 && source ~/venvs/rsa/bin/activate"

Then reload the profile, activate the environment with the new alias, and install all packages:

source ~/.bash_profile

rsa

pip install -r requirements.txt
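A quick sanity check after setup can catch a missing alias or the wrong interpreter before a long training run. The following is a minimal sketch, not part of the repository: the helper name check_environment is hypothetical, and the expected Python version (3.8) and the CUDA_HOME variable are taken from the commands above.

```python
import os
import sys


def check_environment(expected_python=(3, 8), env=None):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper: it only checks the interpreter version and that
    CUDA_HOME points at an existing directory.
    """
    env = os.environ if env is None else env
    problems = []

    # The virtualenv above is created with Python 3.8.
    if tuple(sys.version_info[:2]) != tuple(expected_python):
        problems.append(
            "Python %d.%d is active, expected %d.%d"
            % (sys.version_info[0], sys.version_info[1],
               expected_python[0], expected_python[1])
        )

    # The rsa alias exports CUDA_HOME before activating the environment.
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set; run the rsa alias first")
    elif not os.path.isdir(cuda_home):
        problems.append("CUDA_HOME does not exist: %s" % cuda_home)

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the script inside the activated environment prints one warning per problem and nothing when both checks pass.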

Run training for Depth Anything on NYU-Depth-V2 / KITTI / VOID

Specify the GPU number in sh/train_3d_da.sh, then run:

sh train_3d_da.sh

Before running a new experiment, set model_name to the same value in both train_3d_da.sh and config/arguments_train_3d_da.txt.
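Because model_name has to be kept in sync by hand, a small check can flag a mismatch before training starts. This is a sketch under a stated assumption: it presumes both files contain a token of the form `--model_name <value>` (or `--model_name=<value>`), which is not confirmed by this README. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the name is passed as "--model_name <value>" or "--model_name=<value>".
_MODEL_NAME = re.compile(r"--model_name[=\s]+(\S+)")


def read_model_name(text):
    """Return the first model_name value found in text, or None if absent."""
    match = _MODEL_NAME.search(text)
    return match.group(1) if match else None


def model_names_match(script_path, config_path):
    """True only if both files define model_name and the values are equal."""
    script_name = read_model_name(Path(script_path).read_text())
    config_name = read_model_name(Path(config_path).read_text())
    return script_name is not None and script_name == config_name
```

For example, `model_names_match("sh/train_3d_da.sh", "config/arguments_train_3d_da.txt")` would return False if either file lacks the token or the two values differ. The same check applies to the MiDaS and DPT script and config pairs.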

Run training for MiDaS on NYU-Depth-V2 / KITTI / VOID

Specify the GPU number in sh/train_3d_midas.sh, then run:

sh train_3d_midas.sh

Before running a new experiment, set model_name to the same value in both train_3d_midas.sh and config/arguments_train_3d_midas.txt.

Run training for DPT on NYU-Depth-V2 / KITTI / VOID

Specify the GPU number in sh/train_3d_dpt.sh, then run:

sh train_3d_dpt.sh

Before running a new experiment, set model_name to the same value in both train_3d_dpt.sh and config/arguments_train_3d_dpt.txt.

Setup Datasets

Prepare Datasets

Download NYU-Depth-v2, KITTI, and VOID. Alternatively, refer to KBNet and prepare the datasets with its download scripts.

For VOID, use VOID-1500. The number indicates how many points are provided for depth completion and does not affect this task. Then set data_path and gt_path in train_3d.py, and DATA_PATH_VOID in eval_void/data_utils.py, to your local paths.

The dataset directories should be structured as follows:

├── nyu_depth_v2
│   ├── official_splits             # path to nyu-depth-v2 data_path_eval and gt_path_eval
│   │   ├── test
│   │   │   ├── bathroom
│   │   │   │   ├── rgb_00045.jpg
│   │   │   │   ├── rgb_00046.jpg
│   │   │   │   ├── ...
│   │   ├── train                    # We don't use this part
│   │   │   ├── ...
│   ├── sync                           # path to nyu-depth-v2 data_path and gt_path
│   │   ├── basement_0001a
│   │   │   ├── rgb_00000.jpg
│   │   │   ├── rgb_00001.jpg
│   │   │   ├── ...
└── ...

├── kitti_raw_data                     # path to kitti data_path and data_path_eval
│   ├── 2011_09_26                     # name of dataset
│   │   ├── 2011_09_26_drive_0001_sync
│   │   │   ├── ...
└── ...


├── kitti_ground_truth                 # path to kitti gt_path and gt_path_eval
│   ├── 2011_09_26_drive_0001_sync
│   │   ├── ...
└── ...

├── XXX                               # DATA_PATH_VOID in eval_void/data_utils.py
│   ├── void_1500                     # path to void dataset
│   │   ├── data
│   │   │   ├── birthplace_of_internet
│   │   │   │   ├── ...
│   └── ...
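The layout above can be verified before training with a short script. This is a minimal sketch: the helper missing_subdirs and the EXPECTED_SUBDIRS table are hypothetical names, and the table lists only the subdirectories that appear in the trees above. The KITTI trees are omitted because their subdirectory names depend on the drives that were downloaded.

```python
from pathlib import Path

# Subdirectories taken from the trees above, relative to each dataset root.
EXPECTED_SUBDIRS = {
    "nyu_depth_v2": ["official_splits/test", "sync"],
    "void_root": ["void_1500/data"],  # "void_root" stands for DATA_PATH_VOID
}


def missing_subdirs(root, relative_dirs):
    """Return the entries of relative_dirs that are not directories under root."""
    root = Path(root)
    return [rel for rel in relative_dirs if not (root / rel).is_dir()]
```

For example, `missing_subdirs("/data/nyu_depth_v2", EXPECTED_SUBDIRS["nyu_depth_v2"])` returns an empty list when both folders are in place; the path /data/nyu_depth_v2 is a placeholder for your own location.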

Evaluate Checkpoints

Checkpoints are stored under ./checkpoints. To run evaluation only, set args.load_ckpt_path to the checkpoint to load and enable args.eval_before_train.
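To illustrate how these two arguments might be supplied, the sketch below declares them with argparse. Only the argument names come from this README. Their types (a string path and a boolean flag), the use of fromfile_prefix_chars for argument files, and the checkpoint file name example.pth are assumptions, so check the training script for the actual definitions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two evaluation arguments named above."""
    # fromfile_prefix_chars lets arguments be read from a file passed as @path.
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    # Assumed to be a path to a saved checkpoint.
    parser.add_argument("--load_ckpt_path", type=str, default=None)
    # Assumed to be a boolean switch that triggers evaluation before training.
    parser.add_argument("--eval_before_train", action="store_true")
    return parser


# Example invocation with a placeholder checkpoint name.
args = build_parser().parse_args(
    ["--load_ckpt_path", "./checkpoints/example.pth", "--eval_before_train"]
)
print(args.load_ckpt_path, args.eval_before_train)
```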

Acknowledgements

We would like to acknowledge the use of code snippets from various open-source libraries and contributions from the online coding community, which have been invaluable in the development of this project. Specifically, we would like to thank the authors and maintainers of the following resources:

CLIP

WorDepth

Depth Anything

MiDaS

DPT

MaskDINO

LLaVA

KBNet

VOID