March 22, 2026

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

ICLR 2026

Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang

SceneCOT: We propose a Chain-of-Thought reasoning method for 3D scenes (SceneCOT) that decomposes a complex reasoning task into simpler, manageable sub-problems and builds the corresponding visual clues with multimodal expert modules. To our knowledge, this is the first successful application of the COT technique to human-like, step-by-step reasoning for 3D scene understanding, and it shows potential for extension to a wider range of 3D scene understanding scenarios.

SceneCOT Framework

SceneCOT achieves strong performance on MSQA and Beacon3D, demonstrating the effectiveness of our reasoning framework. In particular, our method significantly improves performance on counting, the most challenging task in MSQA, and outperforms previous methods by a large margin on Beacon3D.

🔥 News

[2026-3] Evaluation code, model checkpoints, and detailed installation instructions have been released
[2026-3] We released the training code
[2026-1] SceneCOT was accepted to ICLR 2026
[2025-6] We released the SceneCOT webpage

🚀 Get Started

Clone the repository.

git clone https://github.com/SceneCOT/scenecot
cd scenecot

Create a Python environment and install dependencies.

conda create -n scenecot python=3.9
conda activate scenecot

# PyTorch (example tested version)
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia

# project dependencies
pip install -r requirements.txt

Install point-cloud third-party modules.

pip install spconv-cu118

cd model/pointnetpp
python setup.py install
cd ../..

# sanity check
python -c 'from model.pointnetpp.pointnetpp import PointNetPP'

If PointNext build/import fails, either disable PointNext usage or place the compiled file from LEO_data under model/pointnext/cpp/pointnet2_batch/.
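The one-line sanity check above can be extended to report every compiled dependency at once. The following is a minimal sketch, not part of the repository: `check_imports` is a hypothetical helper, and the module names are taken from the install steps above (run it from the repository root so that `model.pointnetpp` is importable).

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map name -> None on success, else an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # build problems can surface as ImportError or OSError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Names come from the installation steps; adjust if your setup differs.
    report = check_imports(["torch", "spconv", "model.pointnetpp.pointnetpp"])
    for name, error in report.items():
        print(f"{name}: {'ok' if error is None else error}")
```

A failed entry prints the exception type and message, which usually indicates whether the module was never built or was built against a mismatched CUDA/PyTorch version.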

🔧 Reproducibility configuration

The configs were updated to avoid machine-specific absolute paths. We recommend setting the following environment variables:

Variable	Purpose	Default	Download / Source Link
`SCENECOT_EXP_ROOT`	experiment output root (`cfg.base_dir`)	`./outputs`	-
`SCENECOT_DATA_ROOT`	root directory for dataset/assets used by configs/data/default.yaml	`./data_assets`	SceneCOT dataset
`SCENECOT_COT_DATA_ROOT`	root directory for released COT annotations (`MSQA/`, `GQA3D/`)	`${SCENECOT_DATA_ROOT}/scenecot_cot_data`	SceneCOT dataset / scenecot_cot_data
`SCENECOT_MSR3D_ANNO_DIR`	MSQA annotation directory (contains `situated_qa_{train,val,test}_pure_txt.json`)	`${SCENECOT_COT_DATA_ROOT}/MSQA`	MSQA
`SCENECOT_GQA3D_ANNO_DIR`	GQA3D annotation directory (contains `gqa3d_{train,val,test}.json`)	`${SCENECOT_COT_DATA_ROOT}/GQA3D`	GQA3D
`HF_HOME`	Hugging Face cache root (`cfg.hf_home`)	`./.cache/huggingface`	Hugging Face Hub
`SCENECOT_MODEL_ROOT`	unified root directory for default model/checkpoint paths	`./model_assets`	SceneCOT models
`SCENECOT_LLM_PATH`	LLaVA model path (override)	`${SCENECOT_MODEL_ROOT}/llava-v1.5-7b`	LLaVA-1.5-7B
`SCENECOT_VISION_TOWER_PATH`	CLIP vision tower path (override)	`${SCENECOT_MODEL_ROOT}/clip-vit-large-patch14-336`	CLIP ViT-L/14-336
`SCENECOT_PQ3D_TOKENIZER_PATH`	PQ3D text tokenizer path (override, `data.pq3d_tokenizer_path`)	`${SCENECOT_MODEL_ROOT}/clip-vit-large-patch14`	SceneCOT models
`SCENECOT_POINTNET_TOKENIZER_PATH`	PQ3D PointNet++ tokenizer checkpoint (override)	`${SCENECOT_MODEL_ROOT}/pointnet_tokenizer.pth`	SceneCOT models
`SCENECOT_QUERY3D_PRETRAIN_PATH`	PQ3D/SceneVerse pretrain checkpoint (override)	`${SCENECOT_MODEL_ROOT}/query3d_pretrain.bin`	SceneCOT models
`SCENECOT_EXPERT1_PATH`	MOE expert-1 checkpoint directory (override)	`${SCENECOT_MODEL_ROOT}/expert1_checkpoint0`	SceneCOT model repo (checkpoint dirs)
`SCENECOT_EXPERT2_PATH`	MOE expert-2 checkpoint directory (override)	`${SCENECOT_MODEL_ROOT}/expert2_best.pth`	SceneCOT model repo (checkpoint dirs)

Example:

export SCENECOT_EXP_ROOT=/path/to/experiments
export SCENECOT_DATA_ROOT=/path/to/data_assets
export SCENECOT_COT_DATA_ROOT=/path/to/data_assets/scenecot_cot_data
export SCENECOT_MSR3D_ANNO_DIR=/path/to/data_assets/scenecot_cot_data/MSQA
export SCENECOT_GQA3D_ANNO_DIR=/path/to/data_assets/scenecot_cot_data/GQA3D
export HF_HOME=/path/to/hf_cache
export SCENECOT_MODEL_ROOT=/path/to/model_assets

# Optional explicit overrides when using non-default file names/locations
# export SCENECOT_LLM_PATH=/path/to/model_assets/llava-v1.5-7b
# export SCENECOT_VISION_TOWER_PATH=/path/to/model_assets/clip-vit-large-patch14-336
# export SCENECOT_PQ3D_TOKENIZER_PATH=/path/to/model_assets/clip-vit-large-patch14
# export SCENECOT_EXPERT1_PATH=/path/to/model_assets/expert1_checkpoint0
# export SCENECOT_EXPERT2_PATH=/path/to/model_assets/expert2_best.pth
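To see which paths a given shell environment would yield, the defaults in the table can be mirrored in a few lines of Python. This is a sketch under assumptions: `resolve_scenecot_paths` is a hypothetical helper that only imitates the documented defaults for a subset of the variables; the repository itself resolves them through its config files, which may differ in detail.

```python
import os


def resolve_scenecot_paths(env=None):
    """Apply the documented defaults to a subset of the SCENECOT_* variables."""
    env = os.environ if env is None else env
    data_root = env.get("SCENECOT_DATA_ROOT", "./data_assets")
    # The COT root falls back to a sub-directory of the data root.
    cot_root = env.get("SCENECOT_COT_DATA_ROOT", f"{data_root}/scenecot_cot_data")
    model_root = env.get("SCENECOT_MODEL_ROOT", "./model_assets")
    return {
        "SCENECOT_EXP_ROOT": env.get("SCENECOT_EXP_ROOT", "./outputs"),
        "SCENECOT_DATA_ROOT": data_root,
        "SCENECOT_COT_DATA_ROOT": cot_root,
        "SCENECOT_MSR3D_ANNO_DIR": env.get("SCENECOT_MSR3D_ANNO_DIR", f"{cot_root}/MSQA"),
        "SCENECOT_GQA3D_ANNO_DIR": env.get("SCENECOT_GQA3D_ANNO_DIR", f"{cot_root}/GQA3D"),
        "SCENECOT_MODEL_ROOT": model_root,
        "SCENECOT_LLM_PATH": env.get("SCENECOT_LLM_PATH", f"{model_root}/llava-v1.5-7b"),
        "SCENECOT_VISION_TOWER_PATH": env.get(
            "SCENECOT_VISION_TOWER_PATH", f"{model_root}/clip-vit-large-patch14-336"
        ),
        "SCENECOT_EXPERT1_PATH": env.get("SCENECOT_EXPERT1_PATH", f"{model_root}/expert1_checkpoint0"),
        "SCENECOT_EXPERT2_PATH": env.get("SCENECOT_EXPERT2_PATH", f"{model_root}/expert2_best.pth"),
    }


if __name__ == "__main__":
    for name, value in resolve_scenecot_paths().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of `os.environ` makes it easy to preview a configuration before exporting anything.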

📦 Pretrained weights

To reproduce paper-level performance, the following checkpoints are needed:

SceneCOT experts (released): SceneCOT model repo
PQ3D PointNet++ tokenizer (pointnet_tokenizer.pth) → set SCENECOT_POINTNET_TOKENIZER_PATH
Query3D/SceneVerse pretrain (pytorch_model.bin) → set SCENECOT_QUERY3D_PRETRAIN_PATH

For MOE evaluation, expert checkpoints are expected as directories under SCENECOT_MODEL_ROOT:

${SCENECOT_MODEL_ROOT}/
├── expert1_checkpoint0/
│   └── pytorch_model.bin (or model.safetensors)
└── expert2_best.pth/
    └── pytorch_model.bin (or model.safetensors)

These map to:

moe.expert1_path → ${SCENECOT_MODEL_ROOT}/expert1_checkpoint0 (or SCENECOT_EXPERT1_PATH)
moe.expert2_path → ${SCENECOT_MODEL_ROOT}/expert2_best.pth (or SCENECOT_EXPERT2_PATH)

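By default, the PQ3D PointNet++ tokenizer and the Query3D/SceneVerse pretrain checkpoint (the second and third items above) are resolved under SCENECOT_MODEL_ROOT. If these files are absent, the related modules are initialized without pretrained weights, which may significantly affect final metrics.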
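Because missing weights do not stop a run, it can help to verify the expert directories before launching MOE evaluation. The sketch below is an illustration only: `find_expert_weights` and `check_model_root` are hypothetical helpers that look for the two file names shown in the layout above, and they ignore the `SCENECOT_EXPERT1_PATH` / `SCENECOT_EXPERT2_PATH` overrides.

```python
import os
from pathlib import Path

# File names taken from the directory layout above.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")
EXPERT_DIRS = ("expert1_checkpoint0", "expert2_best.pth")


def find_expert_weights(expert_dir):
    """Return the first weight file found in expert_dir, or None if there is none."""
    expert_dir = Path(expert_dir)
    for filename in WEIGHT_FILES:
        candidate = expert_dir / filename
        if candidate.is_file():
            return candidate
    return None


def check_model_root(model_root):
    """Map each expected expert directory name to its weight file (or None)."""
    root = Path(model_root)
    return {name: find_expert_weights(root / name) for name in EXPERT_DIRS}


if __name__ == "__main__":
    root = os.environ.get("SCENECOT_MODEL_ROOT", "./model_assets")
    for name, weights in check_model_root(root).items():
        print(f"{name}: {weights if weights else 'MISSING'}")
```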

🌐 External services

Weights & Biases

Tracking is enabled by default. For evaluation-only/offline runs without login:

export WANDB_MODE=disabled

Hugging Face access

If direct access to huggingface.co is restricted, set a mirror endpoint and keep a local cache:

export HF_ENDPOINT=https://your-hf-mirror
export HF_HOME=/path/to/hf_cache
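When a run is started from a Python entry point or a notebook instead of a shell, the same variables can be set in-process. A minimal sketch, assuming a hypothetical helper `configure_offline_env`; these variables are typically read when the corresponding libraries are imported or initialized, so call it before importing `wandb`, `huggingface_hub`, or `transformers`.

```python
import os


def configure_offline_env(hf_cache, hf_endpoint=None, disable_wandb=True):
    """Set the variables described above unless the shell already defines them."""
    if disable_wandb:
        os.environ.setdefault("WANDB_MODE", "disabled")
    os.environ.setdefault("HF_HOME", hf_cache)
    if hf_endpoint:
        # Only needed when huggingface.co is not directly reachable.
        os.environ.setdefault("HF_ENDPOINT", hf_endpoint)
    return {key: os.environ.get(key) for key in ("WANDB_MODE", "HF_HOME", "HF_ENDPOINT")}
```

`setdefault` is used so that values already exported in the shell take precedence over the ones passed here.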

📁 Data preparation

Download released dataset assets from SceneCOT dataset.
Place all downloaded data under one root directory, for example:

/path/to/data_assets

Set:

export SCENECOT_DATA_ROOT=/path/to/data_assets

configs/data/default.yaml resolves paths from SCENECOT_DATA_ROOT as:

${SCENECOT_DATA_ROOT}/SceneVerse → data.sceneverse_base
${SCENECOT_DATA_ROOT}/leo2-cot → data.cot_annotation_base
${SCENECOT_DATA_ROOT}/scan_family → data.scan_family_base
${SCENECOT_DATA_ROOT}/LEO-2_feature/ScanNet → data.obj_feat_2d_base.ScanNet
${SCENECOT_DATA_ROOT}/scene-verse-pred-all/ScanNet → data.obj_feat_base.ScanNet
${SCENECOT_DATA_ROOT}/scenecot_imgs/imgs/scannet → data.obj_img_base.ScanNet

COT annotation paths are resolved explicitly as:

${SCENECOT_COT_DATA_ROOT}/MSQA (or SCENECOT_MSR3D_ANNO_DIR) → data.msr3d_anno_dir, data.cotqa.msr3d.anno_dir
${SCENECOT_COT_DATA_ROOT}/GQA3D (or SCENECOT_GQA3D_ANNO_DIR) → data.gqa3d_anno_dir, data.cotqa.gqa3d.anno_dir

Expected folder layout:

${SCENECOT_COT_DATA_ROOT}/
├── MSQA/
│   ├── situated_qa_train_pure_txt.json
│   ├── situated_qa_val_pure_txt.json
│   └── situated_qa_test_pure_txt.json
└── GQA3D/
    ├── gqa3d_train.json
    ├── gqa3d_val.json
    └── gqa3d_test.json
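The layout above can be checked before training with a short script. This is a sketch, not repository code: `missing_cot_files` is a hypothetical helper, and the file names are copied from the expected folder layout.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")

# Expected annotation files per sub-directory, as listed in the layout above.
EXPECTED_FILES = {
    "MSQA": [f"situated_qa_{split}_pure_txt.json" for split in SPLITS],
    "GQA3D": [f"gqa3d_{split}.json" for split in SPLITS],
}


def missing_cot_files(cot_data_root):
    """Return the relative paths of expected annotation files that are absent."""
    root = Path(cot_data_root)
    missing = []
    for subdir, filenames in EXPECTED_FILES.items():
        for filename in filenames:
            if not (root / subdir / filename).is_file():
                missing.append(f"{subdir}/{filename}")
    return missing
```

For example, `missing_cot_files("/path/to/data_assets/scenecot_cot_data")` returns an empty list when all six files are in place.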

Download the released checkpoints from SceneCOT models, and set the optional PQ3D checkpoint environment variables if those checkpoints are available.

🕹 Training and evaluation

Training:

sh scripts/train/full_training_msqa_gqa3d.sh

Evaluation (MOE test script):

sh scripts/test/full_training_msqa_beacon3d_test_moe.sh

📊 Offline evaluation

Download evaluation_assets from HF evaluation assets.
Set optional variables:

export SCENECOT_EVAL_ASSETS=/path/to/evaluation_assets
export SCENECOT_EVAL_ROOT=/path/to/experiments

Run:

python evaluator/msqa_evaluator_offline.py

Expected prediction files are read from:

{result_dir}/{model_name}/eval_results/{dataset_name}/results.json (or results.pt)

where result_dir defaults to SCENECOT_EVAL_ROOT.
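The lookup described above can be reproduced to confirm that a prediction file exists before running the evaluator. A minimal sketch under assumptions: `find_results_file` is a hypothetical helper, and preferring `results.json` over `results.pt` when both exist is a guess; the evaluator's actual order may differ.

```python
import os
from pathlib import Path


def find_results_file(model_name, dataset_name, result_dir=None):
    """Locate {result_dir}/{model_name}/eval_results/{dataset_name}/results.json (or .pt)."""
    if result_dir is None:
        result_dir = os.environ.get("SCENECOT_EVAL_ROOT")
    if result_dir is None:
        raise ValueError("Pass result_dir or set SCENECOT_EVAL_ROOT")
    base = Path(result_dir) / model_name / "eval_results" / dataset_name
    # Assumed preference order: JSON first, then the PyTorch file.
    for filename in ("results.json", "results.pt"):
        candidate = base / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No results.json or results.pt under {base}")
```

The `model_name` and `dataset_name` values depend on the experiment and are not fixed by this README, so they are left as arguments.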

BibTeX

If you find our work helpful, please consider citing us:

@inproceedings{linghu2026scenecot,
  title={SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes},
  author={Linghu, Xiongkun and Huang, Jiangyong and Zhu, Ziyu and Jia, Baoxiong and Huang, Siyuan},
  booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2026}
}