README.md

March 22, 2026

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

ICLR 2026

Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang
Paper arXiv Project Page Data Model
SceneCOT Teaser
SceneCOT: We propose SceneCOT, a Chain-of-Thought (COT) reasoning method for 3D scenes that decomposes a complex reasoning task into simpler, more manageable sub-problems and builds the corresponding visual clues with multimodal expert modules. To our knowledge, this is the first successful application of COT to human-like, step-by-step reasoning in 3D scene understanding, and it shows strong potential for extension to a wider range of 3D scene understanding scenarios.

SceneCOT Framework

LEO Teaser
SceneCOT achieves strong performance on MSQA and Beacon3D, demonstrating the effectiveness of our reasoning framework. In particular, our method significantly improves performance on counting, the most challenging task in MSQA, and outperforms previous methods by a large margin on Beacon3D.

🔥 News

  • [2026-3] Evaluation code, model checkpoints, and detailed installation instructions have been released.
  • [2026-3] We released the training code.
  • [2026-1] SceneCOT was accepted to ICLR 2026.
  • [2025-6] We released the SceneCOT project webpage.

🚀 Get Started

  1. Clone the repository.
git clone https://github.com/SceneCOT/scenecot
cd scenecot
  2. Create a Python environment and install dependencies.
conda create -n scenecot python=3.9
conda activate scenecot

# PyTorch (example tested version)
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia

# project dependencies
pip install -r requirements.txt
  3. Install point-cloud third-party modules.
pip install spconv-cu118

cd model/pointnetpp
python setup.py install
cd ../..

# sanity check
python -c 'from model.pointnetpp.pointnetpp import PointNetPP'

If PointNext build/import fails, either disable PointNext usage or place the compiled file from LEO_data under model/pointnext/cpp/pointnet2_batch/.
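Before running the PointNet++ import check, it can help to confirm that the packages installed in the steps above are visible to the interpreter at all. The following is a minimal sketch, not part of the repository: the helper name `find_missing_modules` and the `REQUIRED_MODULES` list are illustrative assumptions, and the check only locates modules without importing them, so it does not validate CUDA or compiled extensions.

```python
import importlib.util

# Top-level import names for the packages installed above (assumed list).
# "spconv" is the import name provided by the spconv-cu118 wheel.
REQUIRED_MODULES = ["torch", "torchvision", "torchaudio", "spconv"]


def find_missing_modules(names):
    """Return the subset of `names` that cannot be located on the import path."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised for broken parent packages or modules without a spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If a module is reported missing, re-run the corresponding install command inside the activated `scenecot` environment.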

🔧 Reproducibility configuration

The configs were updated to avoid machine-specific absolute paths. We recommend setting the following environment variables:

| Variable | Purpose | Default | Download / Source Link |
| --- | --- | --- | --- |
| SCENECOT_EXP_ROOT | experiment output root (cfg.base_dir) | ./outputs | - |
| SCENECOT_DATA_ROOT | root directory for dataset/assets used by configs/data/default.yaml | ./data_assets | SceneCOT dataset |
| SCENECOT_COT_DATA_ROOT | root directory for released COT annotations (MSQA/, GQA3D/) | ${SCENECOT_DATA_ROOT}/scenecot_cot_data | SceneCOT dataset / scenecot_cot_data |
| SCENECOT_MSR3D_ANNO_DIR | MSQA annotation directory (contains situated_qa_{train,val,test}_pure_txt.json) | ${SCENECOT_COT_DATA_ROOT}/MSQA | MSQA |
| SCENECOT_GQA3D_ANNO_DIR | GQA3D annotation directory (contains gqa3d_{train,val,test}.json) | ${SCENECOT_COT_DATA_ROOT}/GQA3D | GQA3D |
| HF_HOME | Hugging Face cache root (cfg.hf_home) | ./.cache/huggingface | Hugging Face Hub |
| SCENECOT_MODEL_ROOT | unified root directory for default model/checkpoint paths | ./model_assets | SceneCOT models |
| SCENECOT_LLM_PATH | LLaVA model path (override) | ${SCENECOT_MODEL_ROOT}/llava-v1.5-7b | LLaVA-1.5-7B |
| SCENECOT_VISION_TOWER_PATH | CLIP vision tower path (override) | ${SCENECOT_MODEL_ROOT}/clip-vit-large-patch14-336 | CLIP ViT-L/14-336 |
| SCENECOT_PQ3D_TOKENIZER_PATH | PQ3D text tokenizer path (override, data.pq3d_tokenizer_path) | ${SCENECOT_MODEL_ROOT}/clip-vit-large-patch14 | SceneCOT models |
| SCENECOT_POINTNET_TOKENIZER_PATH | PQ3D PointNet++ tokenizer checkpoint (override) | ${SCENECOT_MODEL_ROOT}/pointnet_tokenizer.pth | SceneCOT models |
| SCENECOT_QUERY3D_PRETRAIN_PATH | PQ3D/SceneVerse pretrain checkpoint (override) | ${SCENECOT_MODEL_ROOT}/query3d_pretrain.bin | SceneCOT models |
| SCENECOT_EXPERT1_PATH | MOE expert-1 checkpoint directory (override) | ${SCENECOT_MODEL_ROOT}/expert1_checkpoint0 | SceneCOT model repo (checkpoint dirs) |
| SCENECOT_EXPERT2_PATH | MOE expert-2 checkpoint directory (override) | ${SCENECOT_MODEL_ROOT}/expert2_best.pth | SceneCOT model repo (checkpoint dirs) |

Example:

export SCENECOT_EXP_ROOT=/path/to/experiments
export SCENECOT_DATA_ROOT=/path/to/data_assets
export SCENECOT_COT_DATA_ROOT=/path/to/data_assets/scenecot_cot_data
export SCENECOT_MSR3D_ANNO_DIR=/path/to/data_assets/scenecot_cot_data/MSQA
export SCENECOT_GQA3D_ANNO_DIR=/path/to/data_assets/scenecot_cot_data/GQA3D
export HF_HOME=/path/to/hf_cache
export SCENECOT_MODEL_ROOT=/path/to/model_assets

# Optional explicit overrides when using non-default file names/locations
# export SCENECOT_LLM_PATH=/path/to/model_assets/llava-v1.5-7b
# export SCENECOT_VISION_TOWER_PATH=/path/to/model_assets/clip-vit-large-patch14-336
# export SCENECOT_PQ3D_TOKENIZER_PATH=/path/to/model_assets/clip-vit-large-patch14
# export SCENECOT_EXPERT1_PATH=/path/to/model_assets/expert1_checkpoint0
# export SCENECOT_EXPERT2_PATH=/path/to/model_assets/expert2_best.pth
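To see what a given set of exports resolves to, the documented defaults can be mirrored in a few lines of Python. This is a sketch under stated assumptions: the function name `resolve_scenecot_paths` and the dictionary keys are hypothetical, and the repository's own configs perform the actual resolution; the code only restates the variable names and defaults listed above.

```python
import os


def resolve_scenecot_paths(env=None):
    """Resolve SceneCOT paths from environment variables using the documented defaults.

    `env` is any mapping of variable names to values; it defaults to os.environ.
    """
    env = os.environ if env is None else env

    # Root directories; the derived defaults below hang off these.
    data_root = env.get("SCENECOT_DATA_ROOT", "./data_assets")
    cot_root = env.get("SCENECOT_COT_DATA_ROOT", f"{data_root}/scenecot_cot_data")
    model_root = env.get("SCENECOT_MODEL_ROOT", "./model_assets")

    def model_path(var, default_name):
        # An explicit override wins; otherwise fall back under the model root.
        return env.get(var, f"{model_root}/{default_name}")

    return {
        "exp_root": env.get("SCENECOT_EXP_ROOT", "./outputs"),
        "data_root": data_root,
        "cot_data_root": cot_root,
        "msr3d_anno_dir": env.get("SCENECOT_MSR3D_ANNO_DIR", f"{cot_root}/MSQA"),
        "gqa3d_anno_dir": env.get("SCENECOT_GQA3D_ANNO_DIR", f"{cot_root}/GQA3D"),
        "hf_home": env.get("HF_HOME", "./.cache/huggingface"),
        "model_root": model_root,
        "llm_path": model_path("SCENECOT_LLM_PATH", "llava-v1.5-7b"),
        "vision_tower_path": model_path("SCENECOT_VISION_TOWER_PATH", "clip-vit-large-patch14-336"),
        "pq3d_tokenizer_path": model_path("SCENECOT_PQ3D_TOKENIZER_PATH", "clip-vit-large-patch14"),
        "pointnet_tokenizer_path": model_path("SCENECOT_POINTNET_TOKENIZER_PATH", "pointnet_tokenizer.pth"),
        "query3d_pretrain_path": model_path("SCENECOT_QUERY3D_PRETRAIN_PATH", "query3d_pretrain.bin"),
        "expert1_path": model_path("SCENECOT_EXPERT1_PATH", "expert1_checkpoint0"),
        "expert2_path": model_path("SCENECOT_EXPERT2_PATH", "expert2_best.pth"),
    }


if __name__ == "__main__":
    for key, value in resolve_scenecot_paths().items():
        print(f"{key}: {value}")
```

Passing a plain dictionary instead of `os.environ` makes it easy to preview a configuration without changing the current shell.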

📦 Pretrained weights

To reproduce paper-level performance, the following checkpoints are needed:

  1. SceneCOT experts (released): SceneCOT model repo
  2. PQ3D PointNet++ tokenizer (pointnet_tokenizer.pth) → set SCENECOT_POINTNET_TOKENIZER_PATH
  3. Query3D/SceneVerse pretrain (pytorch_model.bin) → set SCENECOT_QUERY3D_PRETRAIN_PATH

For MOE evaluation, expert checkpoints are expected as directories under SCENECOT_MODEL_ROOT:

${SCENECOT_MODEL_ROOT}/
├── expert1_checkpoint0/
│   └── pytorch_model.bin (or model.safetensors)
└── expert2_best.pth/
    └── pytorch_model.bin (or model.safetensors)

These map to:

  • moe.expert1_path → ${SCENECOT_MODEL_ROOT}/expert1_checkpoint0 (or SCENECOT_EXPERT1_PATH)
  • moe.expert2_path → ${SCENECOT_MODEL_ROOT}/expert2_best.pth (or SCENECOT_EXPERT2_PATH)

By default, items 2 and 3 are resolved under SCENECOT_MODEL_ROOT. If the files are absent, the related modules are initialized without those pretrained weights, which may significantly degrade final metrics.
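Because missing checkpoints are described here as degrading metrics rather than stopping the run, a quick pre-flight check of the expert directories can save a wasted evaluation. The sketch below is an illustration, not repository code: `find_expert_weights` and `check_experts` are hypothetical helpers, and the code only assumes the directory and file names listed in this section.

```python
import os
from pathlib import Path

# Weight file names accepted inside each expert directory (from the layout above).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")
# Default expert directory names under SCENECOT_MODEL_ROOT.
EXPERT_DIRS = ("expert1_checkpoint0", "expert2_best.pth")


def find_expert_weights(expert_dir):
    """Return the first weight file found in `expert_dir`, or None if there is none."""
    expert_dir = Path(expert_dir)
    for name in WEIGHT_FILES:
        candidate = expert_dir / name
        if candidate.is_file():
            return candidate
    return None


def check_experts(model_root):
    """Map each default expert directory name to its weight file (or None)."""
    model_root = Path(model_root)
    return {name: find_expert_weights(model_root / name) for name in EXPERT_DIRS}


if __name__ == "__main__":
    root = os.environ.get("SCENECOT_MODEL_ROOT", "./model_assets")
    for name, weights in check_experts(root).items():
        print(f"{name}: {weights if weights else 'MISSING'}")
```

This check ignores the SCENECOT_EXPERT1_PATH and SCENECOT_EXPERT2_PATH overrides; when those are set, pass the overridden directory to `find_expert_weights` directly.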

🌐 External services

Weights & Biases

Tracking is enabled by default. For evaluation-only/offline runs without login:

export WANDB_MODE=disabled

Hugging Face access

If direct access to huggingface.co is restricted, set a mirror endpoint and keep a local cache:

export HF_ENDPOINT=https://your-hf-mirror
export HF_HOME=/path/to/hf_cache

📁 Data preparation

  1. Download released dataset assets from SceneCOT dataset.
  2. Place all downloaded data under one root directory, for example:

/path/to/data_assets

  3. Set:
export SCENECOT_DATA_ROOT=/path/to/data_assets
  4. configs/data/default.yaml resolves paths from SCENECOT_DATA_ROOT as:
  • ${SCENECOT_DATA_ROOT}/SceneVerse → data.sceneverse_base
  • ${SCENECOT_DATA_ROOT}/leo2-cot → data.cot_annotation_base
  • ${SCENECOT_DATA_ROOT}/scan_family → data.scan_family_base
  • ${SCENECOT_DATA_ROOT}/LEO-2_feature/ScanNet → data.obj_feat_2d_base.ScanNet
  • ${SCENECOT_DATA_ROOT}/scene-verse-pred-all/ScanNet → data.obj_feat_base.ScanNet
  • ${SCENECOT_DATA_ROOT}/scenecot_imgs/imgs/scannet → data.obj_img_base.ScanNet
  5. COT annotation paths are resolved explicitly as:
  • ${SCENECOT_COT_DATA_ROOT}/MSQA (or SCENECOT_MSR3D_ANNO_DIR) → data.msr3d_anno_dir, data.cotqa.msr3d.anno_dir
  • ${SCENECOT_COT_DATA_ROOT}/GQA3D (or SCENECOT_GQA3D_ANNO_DIR) → data.gqa3d_anno_dir, data.cotqa.gqa3d.anno_dir

Expected folder layout:

${SCENECOT_COT_DATA_ROOT}/
├── MSQA/
│   ├── situated_qa_train_pure_txt.json
│   ├── situated_qa_val_pure_txt.json
│   └── situated_qa_test_pure_txt.json
└── GQA3D/
    ├── gqa3d_train.json
    ├── gqa3d_val.json
    └── gqa3d_test.json
  6. Download released checkpoints from SceneCOT models, and set the optional PQ3D checkpoint environment variables if those checkpoints are available.
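The expected annotation layout can be verified before training with a short script. This is a minimal sketch under stated assumptions: `missing_annotation_files` is a hypothetical helper, it checks only the six annotation files named in the layout above, and it assumes the default MSQA/ and GQA3D/ subdirectories rather than the SCENECOT_MSR3D_ANNO_DIR or SCENECOT_GQA3D_ANNO_DIR overrides.

```python
import os
from pathlib import Path

SPLITS = ("train", "val", "test")

# Annotation files expected under each subdirectory of SCENECOT_COT_DATA_ROOT.
EXPECTED_ANNOTATIONS = {
    "MSQA": [f"situated_qa_{split}_pure_txt.json" for split in SPLITS],
    "GQA3D": [f"gqa3d_{split}.json" for split in SPLITS],
}


def missing_annotation_files(cot_data_root):
    """Return the expected annotation files (relative paths) that are not present."""
    root = Path(cot_data_root)
    missing = []
    for subdir, filenames in EXPECTED_ANNOTATIONS.items():
        for filename in filenames:
            path = root / subdir / filename
            if not path.is_file():
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    data_root = os.environ.get("SCENECOT_DATA_ROOT", "./data_assets")
    cot_root = os.environ.get("SCENECOT_COT_DATA_ROOT", f"{data_root}/scenecot_cot_data")
    missing = missing_annotation_files(cot_root)
    if missing:
        print("Missing annotation files:")
        for rel_path in missing:
            print("  " + rel_path)
    else:
        print("All expected annotation files are present.")
```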

🕹 Training and evaluation

Training:

sh scripts/train/full_training_msqa_gqa3d.sh

Evaluation (MOE test script):

sh scripts/test/full_training_msqa_beacon3d_test_moe.sh

📊 Offline evaluation

  1. Download evaluation_assets from HF evaluation assets.
  2. Set optional variables:
export SCENECOT_EVAL_ASSETS=/path/to/evaluation_assets
export SCENECOT_EVAL_ROOT=/path/to/experiments
  3. Run:
python evaluator/msqa_evaluator_offline.py

Expected prediction files are read from:

{result_dir}/{model_name}/eval_results/{dataset_name}/results.json (or results.pt)

where result_dir defaults to SCENECOT_EVAL_ROOT.
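The path pattern above can be expressed as a small lookup helper, which is handy for confirming that a prediction file exists before launching the offline evaluator. This is a sketch, not the evaluator's own logic: `find_results_file` is a hypothetical name, checking `results.json` before `results.pt` is an assumed preference order, and the model and dataset names in the example are placeholders.

```python
import os
from pathlib import Path


def find_results_file(model_name, dataset_name, result_dir=None):
    """Locate the prediction file for one model/dataset pair, or return None.

    Follows {result_dir}/{model_name}/eval_results/{dataset_name}/results.json,
    falling back to results.pt. When `result_dir` is None, SCENECOT_EVAL_ROOT
    is used and a KeyError is raised if it is unset.
    """
    if result_dir is None:
        result_dir = os.environ["SCENECOT_EVAL_ROOT"]
    base = Path(result_dir) / model_name / "eval_results" / dataset_name
    # Assumed preference: JSON first, then the .pt file.
    for filename in ("results.json", "results.pt"):
        candidate = base / filename
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # "my_model" and "msqa" are placeholder names for illustration only.
    root = os.environ.get("SCENECOT_EVAL_ROOT", "./outputs")
    print(find_results_file("my_model", "msqa", result_dir=root))
```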

📝 TODO List

  • Arxiv paper
  • Evaluation code
  • Training code
  • Model weights
  • SceneCOT-185K dataset

BibTeX

If you find our work helpful, please consider citing us:

@inproceedings{linghu2026scenecot,
  title={SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes},
  author={Linghu, Xiongkun and Huang, Jiangyong and Zhu, Ziyu and Jia, Baoxiong and Huang, Siyuan},
  booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2026}
}