October 21, 2025
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu
Tai Wang*
Wenwei Zhang
Jiangmiao Pang
Xihui Liu*
The University of Hong Kong, Shanghai AI Laboratory
🏠 Introducing LLaVA-3D
🔥 News
- [2025-07-11] :hearts: Our paper is accepted by ICCV 2025! See you in Hawaii! We release the full LLaVA-3D-Instruct-860K data on HuggingFace!
- [2024-11-29] We update the custom data instruction tuning tutorial; now you can train the model on your own dataset!
- [2024-10-19] We release the inference code with checkpoints, as well as the image and 3D scene demos. You can chat with LLaVA-3D on your own machine.
- [2024-09-28] We release the paper of LLaVA-3D. 🎉
📋 Contents
- 🔍 Model Architecture
- 🔨 Install
- 📦 Model Zoo
- 🤖 Demo
- 📝 TODO List
- 🔗 Citation
- 📄 License
- 👏 Acknowledgements
🔍 Model Architecture
🔨 Install
We tested our code under the following environment:
- Python 3.10
- PyTorch 2.1.0
- CUDA Version 11.8
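As a quick sanity check, the versions listed above can be compared against the active interpreter. The following is a minimal sketch, not part of the repository: the helper names `find_mismatches` and `check_environment` are hypothetical, and only the Python and PyTorch versions are compared. Once PyTorch is importable, the CUDA version it was built against can be read from `torch.version.cuda`.

```python
import sys
from importlib import metadata
from typing import List, Optional, Tuple

# Versions taken from the tested environment listed above.
EXPECTED_PYTHON = (3, 10)
EXPECTED_TORCH = "2.1.0"


def find_mismatches(python_version: Tuple[int, int],
                    torch_version: Optional[str]) -> List[str]:
    """Return human-readable differences from the tested environment."""
    problems = []
    if tuple(python_version) != EXPECTED_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, 3.10 tested"
        )
    if torch_version is None:
        problems.append("torch is not installed")
    # Wheels from the cu118 index carry a local suffix such as "+cu118";
    # drop it before comparing the release number.
    elif torch_version.split("+")[0] != EXPECTED_TORCH:
        problems.append(f"torch {torch_version} found, {EXPECTED_TORCH} tested")
    return problems


def check_environment() -> List[str]:
    """Inspect the running interpreter and the installed torch package."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    return find_mismatches(sys.version_info[:2], torch_version)


if __name__ == "__main__":
    for line in check_environment() or ["Python and torch match the tested versions"]:
        print(line)
```

A mismatch does not necessarily mean the code will fail; it only indicates that the setup differs from the one that was tested.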
To start:
- Clone this repository.
git clone https://github.com/ZCMax/LLaVA-3D.git
cd LLaVA-3D
- Install Packages
conda create -n llava-3d python=3.10 -y
conda activate llava-3d
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install -e .
- Download the Camera Parameters File and put the JSON file under ./playground/data/annotations.
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
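Before running anything, it can help to confirm that the camera parameters file landed in the right place. The sketch below is an assumption-laden helper rather than repository code: the README does not state the file name, so it simply looks for any parseable `.json` file under `./playground/data/annotations`, and the function name `readable_json_files` is hypothetical.

```python
import json
from pathlib import Path
from typing import List

# Directory named in the installation steps above.
ANNOTATION_DIR = Path("./playground/data/annotations")


def readable_json_files(directory: Path) -> List[Path]:
    """Return the .json files in `directory` that parse without error."""
    if not directory.is_dir():
        return []
    valid = []
    for path in sorted(directory.glob("*.json")):
        try:
            with path.open("r", encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError):
            # Unreadable or malformed file: skip it.
            continue
        valid.append(path)
    return valid


if __name__ == "__main__":
    found = readable_json_files(ANNOTATION_DIR)
    if found:
        print("Found JSON files:", ", ".join(p.name for p in found))
    else:
        print(f"No readable JSON file under {ANNOTATION_DIR}")
```

This only checks that a JSON file exists and parses; it does not validate the camera parameter contents.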
📦 Model Zoo
The trained model checkpoints are available here. Currently we only provide the 7B model, and we will continue to update the model zoo.
🤖 Demo
We currently support a single image as input for 2D tasks and posed RGB-D images as input for 3D tasks. You can run the demo with the script llava/eval/run_llava_3d.py. For 2D tasks, use the --image-file parameter; for 3D tasks, use the --video-path parameter to provide the corresponding data. Here, we provide some demos as examples:
2D Tasks
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--image-file https://llava-vl.github.io/static/images/view.jpg \
--query "What are the things I should be cautious about when I visit here?"
3D Tasks
We provide the demo scene here. Download the demo data and put it under ./demo.
- 3D Question Answering
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0356_00 \
--query "Tell me the only object that I could see from the other room and describe the object."
- 3D Dense Captioning
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0566_00 \
--query "The related object is located at [0.981, 1.606, 0.430]. Describe the object in detail."
- 3D Localization
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0382_01 \
--query "The related object is located at [-0.085,1.598,1.310]. Please output the 3D bounding box of the object and then describe the object."
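The demo invocations above can also be assembled from Python, which is convenient when looping over several scenes or queries. This is a minimal sketch under stated assumptions: the helper names `build_demo_command` and `located_object_query` are hypothetical, only the four flags shown in the commands above are used, and treating `--image-file` and `--video-path` as mutually exclusive is an assumption drawn from the 2D/3D split described earlier, not something the script is known to enforce.

```python
import shlex
import sys
from typing import List, Optional, Sequence

DEMO_SCRIPT = "llava/eval/run_llava_3d.py"
DEFAULT_MODEL = "ChaimZhu/LLaVA-3D-7B"


def build_demo_command(query: str,
                       image_file: Optional[str] = None,
                       video_path: Optional[str] = None,
                       model_path: str = DEFAULT_MODEL) -> List[str]:
    """Build the argument list for a 2D (image) or 3D (scene) demo run."""
    # Assumption: exactly one input kind is given per run.
    if (image_file is None) == (video_path is None):
        raise ValueError("pass exactly one of image_file (2D) or video_path (3D)")
    command = [sys.executable, DEMO_SCRIPT, "--model-path", model_path]
    if image_file is not None:
        command += ["--image-file", image_file]
    else:
        command += ["--video-path", video_path]
    command += ["--query", query]
    return command


def located_object_query(center: Sequence[float], instruction: str) -> str:
    """Format a query in the style of the captioning and localization demos."""
    x, y, z = center
    return (f"The related object is located at [{x:.3f}, {y:.3f}, {z:.3f}]. "
            f"{instruction}")


if __name__ == "__main__":
    query = located_object_query((0.981, 1.606, 0.430),
                                 "Describe the object in detail.")
    command = build_demo_command(query, video_path="./demo/scannet/scene0566_00")
    # Print the command; pass it to subprocess.run(command, check=True) to execute.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems with queries that contain brackets, commas, or question marks.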
📝 TODO List
- [x] Release the training and inference code.
- [x] Release the checkpoint, demo data and script.
- [x] Release the training datasets.
- [ ] Release the full code.
🔗 Citation
If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:
@article{zhu2024llava,
title={LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness},
author={Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui},
journal={arXiv preprint arXiv:2409.18125},
year={2024}
}
📄 License
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.