README.md

May 12, 2026

Biased2Grounded

Authors: Na Min An* Inha Kang* Minhyun Lee Hyunjung Shim

B2G Overview


Overview

Vision-Language Encoders (VLEs) are widely adopted as the backbone of zero-shot referring image segmentation (RIS), enabling text-guided localization without task-specific training. Yet the conventionally used final-layer multimodal embeddings prioritize global semantic alignment, and the biases latent in mid-layer representations have remained largely unexplored.

Through layer-wise investigation, we surface two coupled consequences of this over-reliance on final-layer features:

  • Vision: embeddings exhibit weak sensitivity to positional cues, limiting spatial grounding.
  • Language: multilingual text embeddings form language-dependent geometric shifts within the shared representation space.
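
One simple way to picture the language-dependent shift is to compare the centroids of text embeddings grouped by language. The sketch below is only an illustration of that idea, not the authors' analysis: the helper names (`centroid`, `l2_normalize`, `centroid_shift`) are hypothetical, and the 2-D vectors are made-up stand-ins for real encoder outputs.

```python
import math

def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid_shift(emb_a, emb_b):
    """Euclidean distance between the centroids of two sets of
    unit-normalized embeddings (hypothetical shift measure)."""
    ca = centroid([l2_normalize(v) for v in emb_a])
    cb = centroid([l2_normalize(v) for v in emb_b])
    return math.dist(ca, cb)

# Toy, made-up embeddings for the same captions in two languages.
lang_a = [[1.0, 0.0], [1.0, 0.0]]
lang_b = [[0.0, 1.0], [0.0, 1.0]]

shift = centroid_shift(lang_a, lang_b)  # large when the languages sit apart
same = centroid_shift(lang_a, lang_a)   # zero for identical sets
print(shift, same)
```

With real embeddings, a shift that stays large for translations of the same sentences would be consistent with the language-dependent geometry described above.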

💡 Contributions

  • Layer-wise bias analysis: reveals that final-layer VLE representations suppress positional sensitivity and exhibit language-dependent geometric drift in multilingual settings.
  • Mid-layer spatial map: identifies a spatial grounding pathway in VLE mid-layers applicable for zero-shot RIS without task-specific training.
  • Mixed-language mid-layer grounding: leveraging cross-lingual representation diversity at mid-layers yields strong spatial grounding gains (+7–8 mIoU) and improved zero-shot retrieval.
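
For readers unfamiliar with the metric, the sketch below shows how a mean IoU score can be computed over predicted and ground-truth masks. It assumes mIoU is the mean of per-sample IoU reported in percentage points, which is a common convention in RIS; the repository's evaluation scripts may compute it differently. The function names and toy masks are hypothetical.

```python
def iou(pred, gt):
    """IoU of two same-shaped binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks score 0.0 (conventions differ).
    return inter / union if union else 0.0

def mean_iou(pairs):
    """Mean per-sample IoU over (pred, gt) pairs, in percentage points."""
    return 100.0 * sum(iou(p, g) for p, g in pairs) / len(pairs)

# Toy 2x2 masks (made up for illustration).
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # IoU = 1
]
score = mean_iou(pairs)
print(score)
```

Under this convention, a gain of +7–8 mIoU corresponds to the mean per-sample overlap rising by 7 to 8 percentage points.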

Installation

apt-get install -y libgl1

uv pip install spacy git+https://github.com/openai/CLIP.git h5py gem_torch open_clip_torch==2.24.0 pydantic==2.9.1 opencv-python-headless

uv pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

cd ./third_party/old_detectron2
uv pip install -e . --no-build-isolation

uv pip install pillow==9.5.0

uv pip install spacy==3.7.6 

python -m spacy download en_core_web_lg
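
After installing, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch using the standard-library `importlib.metadata`; the `PINS` table simply repeats the versions from the commands above, and `find_mismatches` is a hypothetical helper, not part of this repository. It assumes the installed distributions report their versions exactly as pinned (for example, with the `+cu118` suffix for torch).

```python
from importlib import metadata

# Versions copied from the install commands above.
PINS = {
    "torch": "2.6.0+cu118",
    "torchvision": "0.21.0+cu118",
    "open_clip_torch": "2.24.0",
    "pydantic": "2.9.1",
    "pillow": "9.5.0",
    "spacy": "3.7.6",
}

def find_mismatches(pins, get_version=metadata.version):
    """Return {package: found_version_or_None} for every pin not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != wanted:
            mismatches[name] = found
    return mismatches

# Demo with a fake version lookup so the example runs anywhere.
def fake_version(name):
    if name == "spacy":
        raise metadata.PackageNotFoundError(name)
    return "10.0.0" if name == "pillow" else PINS[name]

demo = find_mismatches(PINS, fake_version)
print(demo)
```

In a real environment, calling `find_mismatches(PINS)` with the default lookup returns an empty dict when every pin is satisfied.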

Checkpoint: Mask2Former


Dataset Structure

Biased2Grounded/
├── assets/
│   ├── multilingual_meta_data/
│   ├── refer_data/
│   │   ├── images/
│   │   ├── refcoco/
│   │   ├── refcoco+/
│   │   ├── refcocog/
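
Before running the experiments, a quick check that the expected folders exist can save a failed run. This is a small sketch, assuming that `images/`, `refcoco/`, `refcoco+/`, and `refcocog/` sit inside `assets/refer_data/` as the tree suggests; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, not part of the repository.

```python
import tempfile
from pathlib import Path

# Directory names taken from the tree above; nesting is an assumption.
EXPECTED_DIRS = [
    "assets/multilingual_meta_data",
    "assets/refer_data/images",
    "assets/refer_data/refcoco",
    "assets/refer_data/refcoco+",
    "assets/refer_data/refcocog",
]

def missing_dirs(root):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

# Demo on a temporary tree that lacks the last directory.
with tempfile.TemporaryDirectory() as tmp:
    for d in EXPECTED_DIRS[:-1]:
        (Path(tmp) / d).mkdir(parents=True)
    demo_missing = missing_dirs(tmp)

print(demo_missing)
```

From the repository root, `missing_dirs(".")` returns an empty list when the layout matches.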

Experiments

# Raw Spatial Maps for CLIP, SigLIP, and SigLIP-2
notebooks/*.ipynb

# run spatial grounding evaluation
bash infer.sh
bash infer_baseline.sh 
bash infer_pmaponly.sh # p-map only ablation

# run multilingual evaluation
bash run.sh
bash run_baseline.sh
bash run_comparison.sh # final-layer centroid

Citation

If you use this work in your research, please cite:

@misc{an2026blindpositionbiasedlanguage,
      title={Blind to Position, Biased in Language: Probing Mid-Layer Representational Bias in Vision-Language Encoders for Zero-Shot Language-Grounded Spatial Understanding}, 
      author={Na Min An and Inha Kang and Minhyun Lee and Hyunjung Shim},
      year={2026},
      eprint={2509.23098},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.23098}, 
}