DTTDNet: Robust Digital-Twin Localization via An RGBD-based Transformer Network and A Comprehensive Evaluation on a Mobile Dataset
This repository contains the implementation of the 2025 CVPR workshop (Mobile AI) paper "Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset".
Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR?
We introduce DTTD-Mobile, a new benchmark built on real-world data captured from mobile devices. We evaluate several popular methods, including BundleSDF, ES6D, MegaPose, and DenseFusion, and highlight their limitations in this challenging setting. To go a step further, we propose DTTD-Net with a Fourier-enhanced MLP and a two-stage attention-based fusion across RGB and depth, making 6DoF pose estimation more robust, even when the input is noisy, blurry, or partially occluded.
Dataset: Check out DTTD-Mobile and the Robotics Dataset Extension.
Updates:
- [05/01/25] The archival version of this work will be presented at 2025 CVPRW: Mobile AI.
- [11/05/24] We extended DTTD for specific grasping and insertion tasks using a FANUC robotic arm, released here. Feel free to contact zixun@berkeley.edu and xiang_zhang_98@berkeley.edu for details on this dataset extension.
- [11/05/24] The DTTD-Mobile dataset has been migrated to Hugging Face due to our Google Drive storage issues; check here.
- [09/10/24] Our MoCap data pipeline has been released; check here (iPhone-ARKit-based version) for your customized data collection and annotation. For the release of our data capture app for iPhone, check here. For our previously released Azure-based version, check here.
- [06/17/24] Our work has been accepted at 2024 ICML workshop: Data-centric Machine Learning Research. demo video, openreview
- [09/28/23] Our trained checkpoints for pose estimator are released here.
- [09/27/23] Our dataset: DTTD-Mobile is released.
Citation
If our work is useful or relevant to your research, please kindly recognize our contributions by citing our papers:
@InProceedings{dttd2,
author = {Huang, Zixun and Yao, Keling and Zhao, Zhihao and Pan, Chuanyu and Yang, Allen},
title = {DTTDNet: Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
month = {June},
year = {2025},
pages = {1848-1857}
}
@InProceedings{dttd1,
author = {Feng, Weiyu and Zhao, Seth Z. and Pan, Chuanyu and Chang, Adam and Chen, Yichen and Wang, Zekun and Yang, Allen Y.},
title = {Digital Twin Tracking Dataset (DTTD): A New RGB+Depth 3D Dataset for Longer-Range Object Tracking Applications},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2023},
pages = {3288-3297}
}
Dependencies:
Before running our pose estimation pipeline, create and activate a conda environment with Python >= 3.8:
conda create --name [YOUR ENVIR NAME] python=[PYTHON VERSION]
conda activate [YOUR ENVIR NAME]
Then install the required packages:
torch
torchvision
torchaudio
numpy
einops
pillow
scipy
opencv_python
tensorboard
tqdm
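After installation, a quick check that every package in the list above can be imported may save a failed run later. The snippet below is a minimal sketch and not part of this repository; the helper name `missing_modules` is made up for illustration. Note that two packages import under a different name than they install under: pillow imports as `PIL` and opencv_python imports as `cv2`.

```python
import importlib.util

# Import names for the packages listed above.
# pillow is imported as PIL, and opencv_python is imported as cv2.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "numpy", "einops",
    "PIL", "scipy", "cv2", "tensorboard", "tqdm",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

Run it inside the activated conda environment; any name it reports still needs to be installed.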
For the knn module used in the ADD-S loss, install KNN_CUDA: https://github.com/pptrick/KNN_CUDA. (Installing KNN_CUDA requires a CUDA environment; make sure your CUDA version is >= 10.2. It also only supports torch v1.0+.)
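To show why a nearest-neighbour search is needed here, the following is a rough, pure-Python sketch of the ADD-S distance as it is commonly defined: the mean distance from each model point under the ground-truth pose to its closest model point under the predicted pose. It is an illustration only, not the loss implemented in this repository; the function names are hypothetical, and the brute-force inner loop stands in for the GPU KNN lookup.

```python
import math


def transform(points, rotation, translation):
    """Apply x -> R x + t to each 3D point (rotation is a 3x3 nested list)."""
    return [
        tuple(
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_s(model_points, rot_gt, t_gt, rot_pred, t_pred):
    """Mean distance from each GT-posed point to its nearest predicted-posed point."""
    gt = transform(model_points, rot_gt, t_gt)
    pred = transform(model_points, rot_pred, t_pred)
    # Brute-force nearest neighbour, O(n^2); this is the step a KNN module accelerates.
    return sum(min(math.dist(g, p) for p in pred) for g in gt) / len(gt)
```

Because each point is matched to its closest counterpart rather than to itself, a pose that differs from the ground truth only by a symmetry of the object scores zero, which is the reason ADD-S is preferred for symmetric objects.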
Load Dataset and Checkpoints:
Download our checkpoints and datasets, then organize the file structure:
Robust-Digital-Twin-Tracking
├── checkpoints
│   ├── m8p4.pth
│   ├── m8p4_filter.pth
│   └── ...
│   ...
└── dataset
    └── dttd_iphone
        ├── dataset_config
        ├── dataset.py
        └── DTTD_IPhone_Dataset
            └── root
                ├── cameras
                │   ├── az_camera1 (if you want to train our algorithm with DTTD v1)
                │   ├── iphone14pro_camera1
                │   └── ZED2 (to be released...)
                ├── data
                │   ├── scene1
                │   │   ├── data
                │   │   │   ├── 00001_color.jpg
                │   │   │   ├── 00001_depth.png
                │   │   │   └── ...
                │   │   └── scene_meta.yaml
                │   ├── scene2
                │   │   ├── data
                │   │   └── scene_meta.yaml
                │   ...
                └── objects
                    ├── apple
                    │   ├── apple.mtl
                    │   ├── apple.obj
                    │   ├── front.xyz
                    │   ├── points.xyz
                    │   ├── ...
                    │   └── textured.obj.mtl
                    ├── black_expo_marker
                    └── ...
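Once the files are in place, it can help to confirm that each scene contains matching color and depth frames. The sketch below relies only on the naming pattern visible in the tree above (`00001_color.jpg` next to `00001_depth.png` inside a scene's `data` folder); the helper `list_frame_pairs` is hypothetical and is not the loader in `dataset.py`.

```python
from pathlib import Path


def list_frame_pairs(scene_dir):
    """Return (color, depth) path pairs for frames that have both files."""
    data_dir = Path(scene_dir) / "data"
    pairs = []
    for color in sorted(data_dir.glob("*_color.jpg")):
        # 00001_color.jpg -> 00001_depth.png
        depth = color.with_name(color.name.replace("_color.jpg", "_depth.png"))
        if depth.exists():
            pairs.append((color, depth))
    return pairs


# Example call, assuming the layout shown above:
# pairs = list_frame_pairs("./dataset/dttd_iphone/DTTD_IPhone_Dataset/root/data/scene1")
# print(len(pairs), "frames with both color and depth")
```

A scene that returns an empty list usually points to an incomplete download or a misplaced `root` folder.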
Run Estimation:
This repository contains scripts for 6DoF object pose estimation (end-to-end coarse estimation). Before running estimation, make sure you have installed all the dependencies.
To run DTTD-Net (either training or evaluation), first download the dataset. We recommend creating a soft link to the dataset/dttd_iphone/ folder:
ln -s <path to dataset>/DTTD_IPhone_Dataset ./dataset/dttd_iphone/
To run a trained estimator on the test set, move to ./eval/. For example, to evaluate on the DTTD v2 dataset:
cd eval/dttd_iphone/
bash eval.sh
You can customize your own eval command, for example:
python eval.py --dataset_root ./dataset/dttd_iphone/DTTD_IPhone_Dataset/root \
    --model ./checkpoints/m2p1.pth \
    --base_latent 256 --embed_dim 512 --fusion_block_num 1 --layer_num_m 2 --layer_num_p 1 \
    --visualize --output eval_results_m8p4_model_filtered_best
To load a model with the filter-enhanced MLP, add the flag --filter.
To visualize the attention map and/or the distribution of the reduced geometric embeddings, add the flag --debug.
Eval:
This is the ToolBox that we used to evaluate and compare experimental results.
Train:
To run training of our method, you can use:
python train.py --device 0 \
--dataset iphone --dataset_root ./dataset/dttd_iphone/DTTD_IPhone_Dataset/root --dataset_config ./dataset/dttd_iphone/dataset_config \
--output_dir ./result/result \
--base_latent 256 --embed_dim 512 --fusion_block_num 1 --layer_num_m 8 --layer_num_p 4 \
--recon_w 0.3 --recon_choice depth \
--loss adds --optim_batch 4 \
--start_epoch 0 \
--lr 1e-5 --min_lr 1e-6 --lr_rate 0.3 --decay_margin 0.033 --decay_rate 0.77 --nepoch 60 --warm_epoch 1 \
--filter_enhance
To train a smaller model, set the flags --layer_num_m 2 --layer_num_p 1.
To enable the depth-robustifying modules, add the flags --filter_enhance and/or --recon_choice model.
To adjust the weight of the Chamfer Distance Loss to 0.5, set the flag --recon_w 0.5.
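For readers unfamiliar with the Chamfer term, here is a small sketch of a symmetric Chamfer distance between two point sets and of how a weight such as --recon_w could scale it. Both the exact Chamfer variant (unsquared distances, averaged per direction) and the additive combination with the pose loss are assumptions made for illustration; see train.py for how the weight is actually applied.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a."""
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def total_loss(pose_loss, recon_points, target_points, recon_w=0.3):
    """Assumed combination: pose loss plus the weighted reconstruction term."""
    return pose_loss + recon_w * chamfer_distance(recon_points, target_points)
```

Under this assumed form, raising the weight from 0.3 to 0.5 makes the reconstruction error count more heavily relative to the pose term.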
Our model is also applicable to YCBV_Dataset and DTTD_v1. To train on these datasets, use the following commands (make sure you have downloaded the dataset you want to train on):
python train.py --dataset ycb --output_dir ./result/train_result --device 0 --batch_size 1 --lr 8e-5 --min_lr 8e-6 --warm_epoch 1
python train.py --dataset dttd --output_dir ./result/train_result --device 0 --batch_size 1 --lr 1e-5 --min_lr 1e-6 --warm_epoch 1