
March 28, 2025

MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li,
Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang

🔆 News

🔥🔥 (2025.03) Check out our other recent work on generative world models: UniScene, DiST-4D, HERMES.

🔥🔥 (2025.03) The data processing code is released!

🔥🔥 (2025.03) The training and inference code of Multi-modal Diffusion is available NOW!!!

🔥🔥 (2025.03) Paper is on arXiv: MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

📝 TODO List

Release data processing code.
Release the pretrained model.
Release training / inference code.

Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods degrade substantially when the viewpoint deviates significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, a framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, and its outputs provide comprehensive supervision signals that refine 3DGS representations and make rendering more robust under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.

🧰 Models

Model	Resolution	Checkpoint
MDM1024	576x1024	Hugging Face
MDM512	320x512	Hugging Face

⚙️ Setup

Install Environment via Anaconda (Recommended)

conda create -n mudg python=3.8.5
conda activate mudg
pip install -r requirements.txt

💫 Inference for Novel Viewpoints

1. Sparse Conditional Generation

We project the fused point clouds onto novel viewpoints to generate sparse color and depth maps.

Note: The detailed data processing steps can be found in the Data Processing section.

For your convenience, we have also provided pre-processed data. You can access it via this link.
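
The projection step described above can be illustrated with a minimal NumPy sketch. It is not the repository's implementation: the function name `project_points`, the pinhole camera model, and the 4x4 world-to-camera matrix convention are all assumptions made for illustration, and the actual data processing code may differ.

```python
import numpy as np


def project_points(points_world, colors, K, world_to_cam, height, width):
    """Render sparse color and depth maps from a colored point cloud.

    Hypothetical helper (not part of the MuDG code base).
    points_world: (N, 3) xyz in the world frame
    colors:       (N, 3) RGB values in [0, 1]
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    world_to_cam: (4, 4) extrinsic matrix (assumed convention)
    """
    # Transform points into the camera frame using homogeneous coordinates.
    ones = np.ones((points_world.shape[0], 1))
    pts_cam = (world_to_cam @ np.hstack([points_world, ones]).T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam, colors = pts_cam[in_front], colors[in_front]

    # Perspective projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: sort near-to-far, then keep the first (nearest) point per pixel.
    order = np.argsort(z, kind="stable")
    u, v, z, colors = u[order], v[order], z[order], colors[order]
    _, first = np.unique(v * width + u, return_index=True)

    # Pixels that receive no point stay at zero, which makes the maps sparse.
    depth = np.zeros((height, width), dtype=np.float32)
    color = np.zeros((height, width, 3), dtype=np.float32)
    depth[v[first], u[first]] = z[first]
    color[v[first], u[first]] = colors[first]
    return color, depth
```

Pixels with zero depth mark locations where no LiDAR point was projected; in this sketch they are simply left empty rather than inpainted.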

2. Generate item list

python virtual_render/generate_virtual_item.py

Download the pretrained models and save the model.ckpt for the required resolution as checkpoints/[1024|512]_mdm/[1024|512]-mdm-checkpoint.ckpt.
Run the following command in a terminal, adjusting it to your devices and needs:

  sh virtual_render/scripts/render.sh 15365

Here, 15365 is the item id; replace it with any id from the generated item list.
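
Before launching the render script, it can help to confirm that the checkpoint sits where the naming pattern above expects it. The following is a small sketch under assumptions: `checkpoint_path` and `RESOLUTIONS` are hypothetical names, not part of the repository, and the resolutions are copied from the Models table.

```python
from pathlib import Path

# (height, width) per model size, taken from the Models table.
RESOLUTIONS = {"1024": (576, 1024), "512": (320, 512)}


def checkpoint_path(model_size, root="checkpoints"):
    """Return the expected checkpoint location for MDM1024 or MDM512.

    Hypothetical helper; it only mirrors the path pattern given in the README.
    """
    size = str(model_size)
    if size not in RESOLUTIONS:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(RESOLUTIONS)}"
        )
    return Path(root) / f"{size}_mdm" / f"{size}-mdm-checkpoint.ckpt"


if __name__ == "__main__":
    for size in RESOLUTIONS:
        path = checkpoint_path(size)
        status = "found" if path.is_file() else "missing"
        print(f"{path}: {status}")
```

Run from the repository root, this prints whether each checkpoint file is present, so a misplaced or misnamed model.ckpt shows up before inference starts.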

💥 Training

Novel View Generation

Process the data and generate the item list.
Generate the train data list:

python data/create_data_infos.py

Download the pretrained DynamiCrafter512 model and save its model.ckpt as checkpoints/512_mdm/512-mdm-checkpoint.ckpt.
Train the 320x512 model with the following command:

  sh configs/stage1-512_mdm_waymo/run-512.sh

Then train the 576x1024 model with the following command:

  sh configs/stage2-1024_mdm_waymo/run-1024.sh

📜 License

This repository is released under the Apache 2.0 license.

😉 Citation

Please consider citing our paper if you find our code useful:

@article{zou2025mudg,
  title={MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction},
  author={Zou, Yingshuang and Ding, Yikang and Zhang, Chuanrui and Guo, Jiazhe and Li, Bohan and Lyu, Xiaoyang and Tan, Feiyang and Qi, Xiaojuan and Wang, Haoqian},
  journal={arXiv preprint arXiv:2503.10604},
  year={2025}
}

🙏 Acknowledgements

We would like to thank the contributors of the following repositories for their valuable contributions to the community: