CoReS: Orchestrating the Dance of Reasoning and Segmentation

March 10, 2025

Paper: https://arxiv.org/abs/2404.05673 | Project

News

[2025.3] Training code is released!
[2024.7] Paper accepted by ECCV 2024!
[2024.4] Paper is released and GitHub repo is created.

CoReS: Orchestrating the Dance of Reasoning and Segmentation

The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLMs) often struggle to accurately localize the objects described in complex reasoning contexts. We believe that reasoning segmentation should mirror the cognitive stages of human visual search, where each step progressively refines the search toward the final object. We therefore introduce the Chains of Reasoning and Segmenting (CoReS) and find that this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset. For more details, please refer to the paper.

Installation

pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Training

Training Data Preparation

The training data consists of four types of data:

1. Semantic segmentation datasets: ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, PASCAL-Part, COCO Images
   Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. From PACO, we use only the PACO-LVIS part. COCO Images should be placed in the dataset/coco/ directory.
2. Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12)
   Note: The original links for the refCOCO series are down, so we have replaced them with new ones. If the download is slow or unstable, a OneDrive link is also provided. You must also follow the usage rules of the original datasets.
3. Visual question answering dataset: LLaVA-Instruct-150k
4. Reasoning segmentation dataset: ReasonSeg

Download them from the links above and organize them as in LISA.
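
Before launching training, it can help to confirm that the dataset root is laid out as expected. The following is a minimal sketch and not part of this repository: the helper name `find_missing_dirs` is hypothetical, and the sub-directory names in `ASSUMED_SUBDIRS` are assumptions based on the layout described in the LISA README, so adjust them to match your own tree.

```python
from pathlib import Path

# Assumed sub-directories, following the layout described in the LISA README.
# These names are illustrative; edit the tuple to match your actual tree.
ASSUMED_SUBDIRS = (
    "ade20k",
    "coco",
    "cocostuff",
    "llava_dataset",
    "mapillary",
    "reason_seg",
    "refer_seg",
    "vlpart",
)


def find_missing_dirs(dataset_root, subdirs=ASSUMED_SUBDIRS):
    """Return the names in `subdirs` that are not directories under `dataset_root`."""
    root = Path(dataset_root)
    return [name for name in subdirs if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = find_missing_dirs("./dataset")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All expected dataset directories are present.")
```

The check only looks at top-level directory names; it does not validate the contents of each dataset.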

Pre-trained weights

LLaVA

To train CoReS-7B or CoReS-13B, first follow the LLaVA instructions to merge the LLaVA delta weights. We typically use the merged weights LLaVA-Lightning-7B-v1-1 and LLaVA-13B-v1-1, obtained from liuhaotian/LLaVA-Lightning-7B-delta-v1-1 and liuhaotian/LLaVA-13b-delta-v1-1, respectively. For Llama 2, the full LLaVA weights liuhaotian/llava-llama-2-13b-chat-lightning-preview can be used directly.

SAM ViT-H weights

Download SAM ViT-H pre-trained weights from the link.

Training

deepspeed --master_port=24999 train_ds_best.py \
  --version="PATH_TO_LLaVA" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="cores-7b"
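
The `--dataset` and `--sample_rates` flags are paired position by position. The sketch below shows how such a pairing could be turned into sampling probabilities, assuming the rates are normalized to sum to 1 as in LISA-style training scripts. The function `sampling_probabilities` is hypothetical and is not taken from `train_ds_best.py`.

```python
def sampling_probabilities(dataset_arg, sample_rates_arg):
    """Pair '||'-separated dataset names with ','-separated rates and normalize."""
    names = dataset_arg.split("||")
    rates = [float(r) for r in sample_rates_arg.split(",")]
    if len(names) != len(rates):
        raise ValueError("need exactly one sample rate per dataset type")
    total = sum(rates)
    return {name: rate / total for name, rate in zip(names, rates)}


# Values taken from the training command above.
probs = sampling_probabilities("sem_seg||refer_seg||vqa||reason_seg", "9,3,3,1")
for name, p in probs.items():
    print(f"{name}: {p:.4f}")
```

Under that assumption, rates of 9,3,3,1 correspond to probabilities of 0.5625, 0.1875, 0.1875, and 0.0625, so roughly one sample in sixteen would come from ReasonSeg.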

When training is finished, convert the DeepSpeed checkpoint into a single full-precision weight file:

cd ./runs/cores-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Merge LoRA Weight

Merge the LoRA weights in pytorch_model.bin and save the resulting model to your desired path in the Hugging Face format:

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="PATH_TO_LLaVA" \
  --weight="PATH_TO_pytorch_model.bin" \
  --save_path="PATH_TO_SAVED_MODEL"

For example:

CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
  --version="Your_Path/model/LLaVA-7B-v0" \
  --lora_r=8 \
  --weight="runs/cores-7b/ckpt_model/pytorch_model.bin" \
  --save_path="YOUR_PATH_FOR_CKPT/ckpt/cores7bft"
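
Conceptually, merging folds each low-rank update into the base weight it adapts, so the saved model no longer needs a separate adapter at inference time. The sketch below illustrates the standard LoRA merge rule W' = W + (alpha / r) * B * A in plain Python on a toy matrix. It is not the code of `merge_lora_weights_and_save_hf_model.py`; the function names and the `alpha` value are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a) for list-of-rows matrices."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 update (r = 1).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, r)
lora_a = [[0.5, 0.5]]     # shape (r, 2)
merged = merge_lora(base, lora_b, lora_a, alpha=2.0, r=1)
print(merged)
```

In a real model the same update would be applied to every weight matrix that carries a LoRA adapter, with the rank matching the `--lora_r` value used during training.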

Validation

deepspeed --master_port=10289 train_ds_best.py \
  --version=YOUR_PATH_FOR_CKPT/ckpt/cores7bft \
  --dataset_dir=./dataset \
  --vision_pretrained=YOUR_PATH_FOR_CKPT/ckpt/sam_vit_h_4b8939.pth \
  --dataset="reason_seg" \
  --val_dataset="ReasonSeg|testshort" \
  --exp_name="reasonsegtest" \
  --sample_rates="1" \
  --eval_only

Citation

If you find this project useful in your research, please consider citing:

@inproceedings{bao2024cores,
  title={Cores: Orchestrating the dance of reasoning and segmentation},
  author={Bao, Xiaoyi and Sun, Siyang and Ma, Shuailei and Zheng, Kecheng and Guo, Yuxin and Zhao, Guosheng and Zheng, Yun and Wang, Xingang},
  booktitle={European Conference on Computer Vision},
  pages={187--204},
  year={2024},
  organization={Springer}
}

Acknowledgement

This work is built upon LISA, LLaVA, and SAM.