ObjEmbed: Towards Universal Multimodal Object Embeddings
May 18, 2026
This is the official PyTorch implementation of ObjEmbed. Our paper can be found here.
If you find our work helpful, please give us a star.
Here is the Chinese version of this guide.
ObjEmbed Overview
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases.
In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval.
ObjEmbed enjoys three key properties:
- Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
- Versatility: It seamlessly handles both region-level and image-level tasks.
- Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
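As a rough illustration of the object matching score described above, the sketch below combines a semantic similarity with a predicted IoU and ranks candidate objects by the result. The product rule, the function names, and the toy vectors are assumptions made for illustration; the text only states that the two signals are combined, not how.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matching_score(text_emb, object_emb, predicted_iou):
    """Assumed rule: semantic similarity weighted by the predicted IoU."""
    return cosine_similarity(text_emb, object_emb) * predicted_iou

def rank_objects(text_emb, objects):
    """objects: list of (object_embedding, predicted_iou). Returns indices, best first."""
    scores = [matching_score(text_emb, emb, iou) for emb, iou in objects]
    return sorted(range(len(objects)), key=lambda i: scores[i], reverse=True)

# Toy data (hypothetical): one query embedding and two candidate objects.
query = [1.0, 0.0]
objects = [
    ([1.0, 0.0], 0.3),  # perfect semantic match, poorly localized box
    ([0.8, 0.6], 0.9),  # slightly weaker match, well localized box
]
best_first = rank_objects(query, objects)
print(best_first)
```

Under this assumed rule, a well localized box can outrank a box with a higher semantic similarity but a low predicted IoU, which is the intuition behind using two complementary embeddings per region.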
Update
- [2026.5.18] Released the training data.
- [2026.5.1] Our paper was accepted to ICML 2026.
- [2026.2.3] Released the code and paper.
Experimental Results
Model Zoo
We use WeDetect-Base-Uni as the proposal network. You can download the checkpoint from Hugging Face.
Results
Install
Our environment:
pytorch==2.6.1+cu124
transformers==4.57.1
trl==0.17.0
accelerate==1.10.0
- Install the environment as follows.
pip install transformers==4.57.1 trl==0.17.0 accelerate==1.10.0 -i https://mirrors.cloud.tencent.com/pypi/simple
pip install pycocotools terminaltables jsonlines tabulate ddd-dataset torchmetrics lvis -i https://mirrors.cloud.tencent.com/pypi/simple
- Evaluating on LVIS requires numpy<=1.24.
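A small pre-flight check for the numpy constraint might look like the sketch below. The helper names are hypothetical, and comparing only the major.minor part (so that 1.24.x patch releases pass) is an assumption about the intent of the constraint; a strict pip-style comparison would reject 1.24.1.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_major_minor(version_string):
    """Return (major, minor) from a version string such as '1.24.4'."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])

def numpy_ok_for_lvis(version_string, limit=(1, 24)):
    """True if the major.minor part of the version does not exceed the limit."""
    return parse_major_minor(version_string) <= limit

def installed_numpy_version():
    """Return the installed numpy version string, or None if numpy is absent."""
    try:
        return version("numpy")
    except PackageNotFoundError:
        return None

found = installed_numpy_version()
if found is None:
    print("numpy is not installed")
elif numpy_ok_for_lvis(found):
    print(f"numpy {found} is within the stated limit")
else:
    print(f"numpy {found} is newer than 1.24; consider downgrading before LVIS evaluation")
```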
Demo
Referring Expression Comprehension
# Output the top-1 prediction
python infer_objembed.py --objembed_checkpoint /PATH/TO/OBJEMBED --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image assets/demo.jpg --query "The car's license plate in HAWAII" --task rec --visualize
Image Retrieval
python infer_objembed.py --objembed_checkpoint /PATH/TO/OBJEMBED --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI --image image1.jpg image2.jpg image3.jpg --query "YOUR_QUERY" --task retrieval_by_image
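When the demo is driven from another script, assembling the argument list in Python avoids shell quoting problems with queries that contain spaces or apostrophes. The sketch below uses only the flags shown in the two commands above; the helper name `build_infer_command` is hypothetical and not part of the repository.

```python
import shlex

def build_infer_command(objembed_ckpt, wedetect_ckpt, images, query, task, visualize=False):
    """Build the argv list for infer_objembed.py from the flags shown in the demo."""
    cmd = [
        "python", "infer_objembed.py",
        "--objembed_checkpoint", objembed_ckpt,
        "--wedetect_uni_checkpoint", wedetect_ckpt,
        "--image", *images,
        "--query", query,
        "--task", task,
    ]
    if visualize:
        cmd.append("--visualize")
    return cmd

cmd = build_infer_command(
    "/PATH/TO/OBJEMBED",
    "/PATH/TO/WEDETECT_UNI",
    ["image1.jpg", "image2.jpg", "image3.jpg"],
    "YOUR_QUERY",
    "retrieval_by_image",
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the placeholder paths are replaced.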
Evaluation
Evaluation Dataset Preparation
You can download the datasets from the following links:
- COCO: https://cocodataset.org/#home
- LVIS: https://www.lvisdataset.org/
- COCO-O: https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o
- odinw13: https://huggingface.co/GLIPModel/GLIP/tree/main/odinw_35
- fg-ovd: https://github.com/lorebianchi98/FG-OVD/tree/main/benchmarks
- d3: https://github.com/shikras/d-cube?tab=readme-ov-file#download
- refcoco/+/g: https://huggingface.co/datasets/fushh7/eval_refcoco
- sorce_1k: https://huggingface.co/datasets/lcxrocks/sorce-1k
- reircoco: https://huggingface.co/datasets/haoxiangzhao/REIRCOCO
- ilias: https://huggingface.co/datasets/vrg-prague/ilias
- sharegpt-4v: https://github.com/ShareGPT4Omni/ShareGPT4V/blob/master/docs/Data.md
- dci: https://github.com/facebookresearch/DCI
- coco_caption_2017: https://cocodataset.org/#home
- flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- coco_cn: provided by us.
- flickr30k_cn: https://github.com/li-xirong/cross-lingual-cap
Visual Grounding
cd eval_grounding
export PYTHONPATH=../
# coco / coco_o / lvis / FG-OVD / d3 / odinw13
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval.py --checkpoint /PATH/TO/OBJEMBED --dataset coco --nms --task_specific_visual_prompt
# refcoco
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval.py --checkpoint /PATH/TO/OBJEMBED --dataset refcoco --num_select 20 --task_specific_visual_prompt
- Please change the dataset paths in lines 47-417 of eval_grounding/eval.py.
- For each dataset, users should first extract proposals for each image and save them as JSON files. You can use generate_proposal.py as example code. We provide refcoco proposals here.
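The `--nms` flag in the command above presumably applies non-maximum suppression to overlapping proposals. The repository's implementation is not shown here, so the following is a generic greedy NMS sketch in plain Python; the 0.5 IoU threshold, the function names, and the toy boxes are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Toy proposals (hypothetical): the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```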
Image Retrieval
cd eval_retrieval
export PYTHONPATH=../
# sharegpt4v / dci / coco / coco_cn / d3 / flickr30k / flickr30k_cn
# sorce_1k / reircoco / ilias / ilias_i2i
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 eval.py --checkpoint /PATH/TO/OBJEMBED --dataset sorce_1k
- Please change the dataset paths in lines 19-90 of eval_retrieval/eval.py.
- For each dataset, users should first extract proposals for each image and save them as JSON files. You can use generate_proposal.py as example code. We provide refcoco proposals here.
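To make the "save proposals as JSON files" step concrete, here is a minimal round-trip sketch. The schema (`image_id`, `boxes`, `scores`), the file name, and the helper names are hypothetical; the format the evaluation scripts actually expect is defined by generate_proposal.py and should be taken from there.

```python
import json
import tempfile
from pathlib import Path

def save_proposals(path, image_id, boxes, scores):
    """Write one image's proposals to a JSON file (hypothetical schema)."""
    record = {"image_id": image_id, "boxes": boxes, "scores": scores}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)

def load_proposals(path):
    """Read a proposal file written by save_proposals."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round trip in a throwaway directory with one made-up proposal box.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "000000000139.json"
    save_proposals(out_path, 139, [[10.0, 20.0, 110.0, 220.0]], [0.87])
    loaded = load_proposals(out_path)

print(loaded)
```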
Training
|--datasets
| |--coco
| | |--annotations
| | |--train2017
| | |--val2017
| | |--train2014
| | |--val2014
| |--GQA
| | |--images
| |--V3Det
| | |--train
| | | |--images
| | | | |--a00013718
| |--OpenImagesV6
| | |--train_image
| |--Ref-CoT-45k
| | |--images
| |--Object365
| | |--train
| | | |--patch1
| | | |--patch2
| |--Ref-L4
| | |--images
| |--sa_1b
| | |--sa_000112
| | |--sa_000124
| |--ObjEmbed_training_data
| | |--annotations
| | |--proposals
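Before launching training, it can help to confirm that the layout above is in place. The sketch below lists expected sub-directories that are missing under a dataset root; the directory names come from the tree above, while the helper name `missing_dirs` and the choice of which levels to check are assumptions.

```python
import tempfile
from pathlib import Path

# Sub-directories taken from the tree above, relative to the datasets root.
EXPECTED_DIRS = [
    "coco/annotations", "coco/train2017", "coco/val2017",
    "coco/train2014", "coco/val2014",
    "GQA/images",
    "V3Det/train/images",
    "OpenImagesV6/train_image",
    "Ref-CoT-45k/images",
    "Object365/train",
    "Ref-L4/images",
    "sa_1b",
    "ObjEmbed_training_data/annotations",
    "ObjEmbed_training_data/proposals",
]

def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]

# Self-check on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "coco" / "annotations").mkdir(parents=True)
    missing = missing_dirs(tmp)

for rel in missing:
    print("missing:", rel)
```

For a real setup, call `missing_dirs("datasets")` (or whatever root path is used in the training script) and resolve anything it reports.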
- Please organize the datasets according to the directory structure outlined above. The datasets can be downloaded from the following links:
- COCO: https://cocodataset.org/#home
- GQA: https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
- V3Det: https://huggingface.co/datasets/yhcao/V3Det_Backup
- OpenImagesV6: https://huggingface.co/datasets/nvidia/describe-anything-dataset/tree/main/OpenImages/images (only a subset of the images can be downloaded from this link)
- Ref-CoT-45k: https://huggingface.co/datasets/IDEA-Research/HumanRef-CoT-45k
- Object365: https://huggingface.co/datasets/guozonghao96/objects365
- Ref-L4: https://huggingface.co/datasets/JierunChen/Ref-L4
- SA-1B: https://huggingface.co/datasets/Aber-r/SA-1B_backup (We only use images from the following tar files: 'sa_000112.tar', 'sa_000124.tar', 'sa_000130.tar', 'sa_000149.tar', 'sa_000155.tar', 'sa_000165.tar', 'sa_000166.tar', 'sa_000167.tar', 'sa_000172.tar', 'sa_000180.tar', 'sa_000189.tar', 'sa_000199.tar', 'sa_000221.tar', 'sa_000233.tar', 'sa_000288.tar', 'sa_000310.tar', 'sa_000316.tar', 'sa_000331.tar', 'sa_000339.tar', 'sa_000344.tar', 'sa_000348.tar', 'sa_000350.tar', 'sa_000352.tar', 'sa_000386.tar', 'sa_000410.tar', 'sa_000414.tar', 'sa_000417.tar', 'sa_000420.tar')
- ObjEmbed_training_data: https://huggingface.co/datasets/fushh7/ObjEmbed_training_data (This dataset contains our collected annotations and object proposals.)
- Due to licensing restrictions, we are unable to release the 200k self-crawled images. Instead, we provide 500k annotations on the Object365 dataset, which yield comparable performance.
- Our models are finetuned from the stage-2 checkpoints of WeDetect-Ref. Please download them from the following links:
- WeDetect-Ref-2B-stage2: https://huggingface.co/fushh7/WeDetect-Ref-2B-stage2
- WeDetect-Ref-4B-stage2: https://huggingface.co/fushh7/WeDetect-Ref-4B-stage2
- Training commands:
bash scripts/train.sh
- Please update the dataset and checkpoint paths in the script accordingly.
- Training is conducted on 16 or 32 GPUs. GPUs with at least 40GB memory are recommended.
Acknowledgement
- ObjEmbed builds on many outstanding open-source projects, including WeDetect, transformers, Qwen3-VL, and others. We thank the authors of these projects for open-sourcing their work!
Citation
If you find our work helpful for your research, please consider citing it.
@article{fu2026objembed,
title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
journal={arXiv preprint arXiv:2602.01753},
year={2026}
}
License
- Our models and code are under the Apache 2.0 License.