CoReS: Orchestrating the Dance of Reasoning and Segmentation
March 10, 2025
News
- [2025.3] Training code is released!
- [2024.7] Paper accepted by ECCV2024!
- [2024.4] Paper is released and GitHub repo is created.
Abstract
The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLM) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search, where each step is a progressive refinement of thought toward the final object. Thus we introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset. For more details, please refer to the paper.
Installation
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Training
Training Data Preparation
The training data consists of four types:
- Semantic segmentation datasets: ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, PASCAL-Part, COCO Images
  Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. For PACO, we use only the PACO-LVIS part. COCO Images should be put into the dataset/coco/ directory.
- Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12)
  Note: The original links for the refCOCO series are down, so we have replaced them with new ones. If the download is slow or unstable, a OneDrive link is also provided. You must also comply with the terms of the original datasets.
- Visual question answering dataset: LLaVA-Instruct-150k
- Reasoning segmentation dataset: ReasonSeg
Download them from the above links, and organize them as in LISA.
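Before launching training, it can help to confirm that the dataset folders are in place. The following is a minimal sketch, not part of the CoReS codebase: `missing_dataset_dirs` is a hypothetical helper, and of the names used here only `dataset/coco` comes from this README. The remaining subdirectory names should be taken from the LISA layout you follow.

```python
from pathlib import Path


def missing_dataset_dirs(root, expected):
    """Return the names in `expected` that are not directories under `root`."""
    root_path = Path(root)
    return [name for name in expected if not (root_path / name).is_dir()]


if __name__ == "__main__":
    # Only "coco" is confirmed by this README; extend the list with the
    # subdirectories required by the LISA layout you are following.
    expected = ["coco"]
    missing = missing_dataset_dirs("./dataset", expected)
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All expected dataset directories are present.")
```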
Pre-trained weights
LLaVA
To train CoReS-7B or CoReS-13B, follow the instructions to merge the LLaVA delta weights. We typically use the final weights LLaVA-Lightning-7B-v1-1 and LLaVA-13B-v1-1, merged from liuhaotian/LLaVA-Lightning-7B-delta-v1-1 and liuhaotian/LLaVA-13b-delta-v1-1, respectively. For Llama2, the LLaVA full weights liuhaotian/llava-llama-2-13b-chat-lightning-preview can be used directly.
SAM ViT-H weights
Download SAM ViT-H pre-trained weights from the link.
Training
deepspeed --master_port=24999 train_ds_best.py \
--version="PATH_TO_LLaVA" \
--dataset_dir='./dataset' \
--vision_pretrained="PATH_TO_SAM" \
--dataset="sem_seg||refer_seg||vqa||reason_seg" \
--sample_rates="9,3,3,1" \
--exp_name="cores-7b"
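The `--dataset` and `--sample_rates` values in the command above appear to pair up position by position. As a hedged sketch of how such rates are commonly interpreted (an assumption based on LISA-style hybrid datasets, not something this README confirms about `train_ds_best.py`), each rate can be normalized into a sampling probability. `sampling_probabilities` is a hypothetical helper written for illustration.

```python
def sampling_probabilities(datasets, sample_rates):
    """Map each dataset name to its normalized sampling probability.

    `datasets` uses "||" as separator and `sample_rates` uses ",",
    mirroring the flag values in the training command.
    """
    names = datasets.split("||")
    rates = [float(r) for r in sample_rates.split(",")]
    if len(names) != len(rates):
        raise ValueError("datasets and sample_rates must have the same length")
    total = sum(rates)
    # Assumption: rates are relative weights, so divide by their sum.
    return {name: rate / total for name, rate in zip(names, rates)}


if __name__ == "__main__":
    probs = sampling_probabilities("sem_seg||refer_seg||vqa||reason_seg", "9,3,3,1")
    for name, p in probs.items():
        print(f"{name}: {p:.4f}")
```

Under this assumption, the command above would draw about 56% of samples from the semantic segmentation data and about 6% from ReasonSeg.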
When training is finished, run the following to get the full model weights:
cd ./runs/cores-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
Merge LoRA Weight
Merge the LoRA weights of pytorch_model.bin and save the resulting model to your desired path in the Hugging Face format:
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
--version="PATH_TO_LLaVA" \
--weight="PATH_TO_pytorch_model.bin" \
--save_path="PATH_TO_SAVED_MODEL"
For example:
CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
--version="Your_Path/model/LLaVA-7B-v0" --lora_r=8 \
--weight="runs/cores-7b/ckpt_model/pytorch_model.bin" \
--save_path="YOUR_PATH_FOR_CKPT/ckpt/cores7bft"
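For intuition about what merging does, the sketch below applies the standard LoRA update, W' = W + (alpha / r) * B A, to tiny matrices in plain Python. It is an illustration under stated assumptions, not the code in merge_lora_weights_and_save_hf_model.py: `matmul` and `merge_lora` are hypothetical helpers, and the rank and alpha values are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Fold a low-rank update into a dense weight matrix.

    weight: out x in, lora_b: out x r, lora_a: r x in.
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> out x in
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1, 0, 0], [0, 1, 0]]   # frozen base weight (2 x 3)
    lora_b = [[1, 2], [0, 1]]         # out x r, with r = 2
    lora_a = [[1, 0, 1], [0, 1, 0]]   # r x in
    print(merge_lora(weight, lora_a, lora_b, alpha=4, r=2))
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved as a single checkpoint.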
Validation
deepspeed --master_port=10289 train_ds_best.py \
--version=YOUR_PATH_FOR_CKPT/ckpt/cores7bft \
--dataset_dir=./dataset \
--vision_pretrained=YOUR_PATH_FOR_CKPT/ckpt/sam_vit_h_4b8939.pth \
--dataset="reason_seg" \
--val_dataset="ReasonSeg|testshort" \
--exp_name="reasonsegtest" \
--sample_rates="1" \
--eval_only
Citation
If you find this project useful in your research, please consider citing:
@inproceedings{bao2024cores,
title={Cores: Orchestrating the dance of reasoning and segmentation},
author={Bao, Xiaoyi and Sun, Siyang and Ma, Shuailei and Zheng, Kecheng and Guo, Yuxin and Zhao, Guosheng and Zheng, Yun and Wang, Xingang},
booktitle={European Conference on Computer Vision},
pages={187--204},
year={2024},
organization={Springer}
}