README.md
May 12, 2026
Overview
Vision-Language Encoders (VLEs) are widely adopted as the backbone of zero-shot referring image segmentation (RIS), enabling text-guided localization without task-specific training. Yet the conventionally used final-layer multimodal embeddings prioritize global semantic alignment, and the biases latent in mid-layer representations have remained largely unexplored.
Through layer-wise investigation, we surface two coupled consequences of this over-reliance on final-layer features:
- Vision: embeddings exhibit weak sensitivity to positional cues, limiting spatial grounding.
- Language: multilingual text embeddings form language-dependent geometric shifts within the shared representation space.
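One way to make the language-side observation concrete is to compare per-language centroids of text embeddings computed for the same captions. The sketch below is illustrative only and is not the analysis code from this repository: the function names, the toy inputs, and the choice of cosine distance between centroids are all assumptions.

```python
import numpy as np


def language_centroids(embeddings_by_lang):
    """L2-normalise every embedding, then average them per language.

    `embeddings_by_lang` maps a language code to an (n, d) array-like of
    non-zero text embeddings (hypothetical input format).
    """
    centroids = {}
    for lang, emb in embeddings_by_lang.items():
        emb = np.asarray(emb, dtype=float)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        centroids[lang] = emb.mean(axis=0)
    return centroids


def centroid_shift(embeddings_by_lang):
    """Cosine distance between the centroids of every language pair."""
    centroids = language_centroids(embeddings_by_lang)
    langs = sorted(centroids)
    shifts = {}
    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            ca, cb = centroids[a], centroids[b]
            cos = float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
            shifts[(a, b)] = 1.0 - cos
    return shifts


if __name__ == "__main__":
    # Toy data: the "ko" embeddings are the "en" ones plus a fixed offset,
    # mimicking a language-dependent shift in a shared space.
    rng = np.random.default_rng(0)
    en = rng.normal(size=(50, 16))
    ko = en + 2.0
    for pair, dist in centroid_shift({"en": en, "ko": ko}).items():
        print(pair, round(dist, 3))
```

A distance near zero would mean the two languages occupy the same region of the space; larger values indicate a language-dependent offset. Running the same comparison on embeddings taken from different layers is one possible way to contrast final-layer and mid-layer behaviour.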
💡 Contributions
- Layer-wise bias analysis: reveals that final-layer VLE representations suppress positional sensitivity and exhibit language-dependent geometric drift in multilingual settings.
- Mid-layer spatial map: identifies a spatial grounding pathway in VLE mid-layers that can be used for zero-shot RIS without task-specific training.
- Mixed-language mid-layer grounding: leveraging cross-lingual representation diversity at mid-layers yields strong spatial grounding gains (+7–8 mIoU) and improved zero-shot retrieval.
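For readers unfamiliar with the metric behind the reported gains, mIoU in referring image segmentation is commonly computed as the mean of per-sample intersection-over-union between predicted and ground-truth masks. The following is a minimal sketch under that assumption. It is not taken from this repository's evaluation scripts, and the helper names `mask_iou` and `mean_iou` are hypothetical.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection-over-union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; counting this as a perfect match is an
        # assumption, and evaluation code may handle it differently.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds, targets):
    """Average of per-sample IoU values, reported on a 0-100 scale."""
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * float(np.mean(ious))


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
    print(f"mIoU: {mean_iou(preds, targets):.1f}")
```

On this scale, a gain of 7 to 8 points corresponds to the average per-sample overlap rising by 0.07 to 0.08 in raw IoU terms.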
Installation
apt-get install -y libgl1
uv pip install spacy git+https://github.com/openai/CLIP.git h5py gem_torch open_clip_torch==2.24.0 pydantic==2.9.1 opencv-python-headless
uv pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
cd ./third_party/old_detectron2
uv pip install -e . --no-build-isolation
uv pip install pillow==9.5.0
uv pip install spacy==3.7.6
python -m spacy download en_core_web_lg
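Because several packages above are pinned to exact versions, a quick check after installation can catch mismatches early. The snippet below is a small convenience sketch, not part of the repository: the `EXPECTED` table simply copies the pins from the commands above, and `check_pins` is a hypothetical helper built on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above. Local build tags such as
# "+cu118" are ignored when comparing.
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "open_clip_torch": "2.24.0",
    "pydantic": "2.9.1",
    "pillow": "9.5.0",
    "spacy": "3.7.6",
}


def check_pins(expected):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        # Drop a local version label like "+cu118" before comparing.
        report[name] = (installed, installed.split("+")[0] == pin)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_pins(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:<18} pinned={EXPECTED[name]:<8} installed={installed} [{status}]")
```

Run it with the same interpreter that `uv pip` installed into, so that the reported versions reflect the environment actually used for the experiments.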
Checkpoint: Mask2Former
Dataset Structure
Biased2Grounded/
├── assets/
│   ├── multilingual_meta_data/
│   └── refer_data/
│       ├── images/
│       ├── refcoco/
│       ├── refcoco+/
│       └── refcocog/
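Before launching the experiments, it can help to confirm that the expected directories are in place. The check below is a sketch that assumes `images/`, `refcoco/`, `refcoco+/`, and `refcocog/` sit directly under `assets/refer_data/`. The list `EXPECTED_DIRS` and the helper `missing_dirs` are hypothetical names, and the script does not validate the contents of each directory.

```python
from pathlib import Path

# Directories from the layout above, relative to the repository root.
# The nesting under assets/refer_data/ is an assumption.
EXPECTED_DIRS = [
    "assets/multilingual_meta_data",
    "assets/refer_data/images",
    "assets/refer_data/refcoco",
    "assets/refer_data/refcoco+",
    "assets/refer_data/refcocog",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Dataset layout looks complete.")
```

Run the script from the `Biased2Grounded/` root, or pass a different root path to `missing_dirs`.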
Experiments
# Raw Spatial Maps for CLIP, SigLIP, and SigLIP-2
notebooks/*.ipynb
# run spatial grounding evaluation
bash infer.sh
bash infer_baseline.sh
bash infer_pmaponly.sh # p-map only ablation
# run multilingual evaluation
bash run.sh
bash run_baseline.sh
bash run_comparison.sh # final-layer centroid
Citation
If you use this work in your research, please cite:
@misc{an2026blindpositionbiasedlanguage,
  title={Blind to Position, Biased in Language: Probing Mid-Layer Representational Bias in Vision-Language Encoders for Zero-Shot Language-Grounded Spatial Understanding},
  author={Na Min An and Inha Kang and Minhyun Lee and Hyunjung Shim},
  year={2026},
  eprint={2509.23098},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.23098},
}
