X-Decoder: Generalized Decoding for Pixel, Image, and Language

October 5, 2023

[Project Page] [Paper] [HuggingFace All-in-One Demo] [HuggingFace Instruct Demo] [Video]

by Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^ in CVPR 2023.

:hot_pepper: Getting Started

We release the following resources for both SEEM and X-Decoder:exclamation:

  • Demo Code
  • Model Checkpoint
  • Comprehensive User Guide
  • Training Code
  • Evaluation Code

:point_right: One-Line SEEM Demo with Linux:

git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && cd Segment-Everything-Everywhere-All-At-Once && sh aasets/scripts/run_demo.sh

:round_pushpin: [New] Getting Started:

:round_pushpin: [New] Latest Checkpoints and Numbers:

| Method | Checkpoint | Backbone | COCO PQ ↑ | COCO mAP ↑ | COCO mIoU ↑ | Ref-COCOg cIoU ↑ | Ref-COCOg mIoU ↑ | Ref-COCOg AP50 ↑ | VOC NoC85 ↓ | VOC NoC90 ↓ | SBD NoC85 ↓ | SBD NoC90 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X-Decoder | ckpt | Focal-T | 50.8 | 39.5 | 62.4 | 57.6 | 63.2 | 71.6 | - | - | - | - |
| X-Decoder-oq201 | ckpt | Focal-L | 56.5 | 46.7 | 67.2 | 62.8 | 67.5 | 76.3 | - | - | - | - |
| SEEM_v0 | ckpt | Focal-T | 50.6 | 39.4 | 60.9 | 58.5 | 63.5 | 71.6 | 3.54 | 4.59 | * | * |
| SEEM_v0 | - | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2 | 68.3 | 76.6 | 2.99 | 3.89 | 5.93 | 9.23 |
| SEEM_v0 | ckpt | Focal-L | 56.2 | 46.4 | 65.5 | 62.8 | 67.7 | 76.2 | 3.04 | 3.85 | * | * |
| SEEM_v1 | ckpt | Focal-T | 50.8 | 39.4 | 60.7 | 58.5 | 63.7 | 72.0 | 3.19 | 4.13 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-B | 52.0 | 43.5 | 60.2 | 54.1 | 62.2 | 69.3 | 2.53 | 3.23 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-L | 49.0 | 41.6 | 58.2 | 53.8 | 62.2 | 69.5 | 2.40 | 2.96 | * | * |

SEEM_v0: supports training and inference with a single interactive object
SEEM_v1: supports training and inference with multiple interactive objects

:fire: News

  • [2023.10.04] We are excited to release :white_check_mark: training/evaluation/demo code, :white_check_mark: new checkpoints, and :white_check_mark: comprehensive readmes for both X-Decoder and SEEM!
  • [2023.09.24] We provide a new demo command and inference code (DEMO.md)!
  • [2023.07.19] :roller_coaster: We are excited to release the X-Decoder training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md)!
  • [2023.07.10] We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoints are available!
  • [2023.04.14] We are releasing SEEM, a new universal interactive interface for image segmentation! You can use it for any segmentation tasks, way beyond what X-Decoder can do!

  • [2023.03.20] Building on the vision behind X-Decoder, we developed OpenSeeD ([Paper][Code]) to enable open-vocabulary segmentation and detection with a single model. Check it out!
  • [2023.03.14] We release X-GPT, a conversational version of X-Decoder built with GPT-3 and LangChain!
  • [2023.03.01] The Segmentation in the Wild challenge has been launched and is ready for submissions!
  • [2023.02.28] We released the SGinW benchmark for our challenge. You are welcome to build your own models on the benchmark!
  • [2023.02.27] Our X-Decoder has been accepted by CVPR 2023!
  • [2023.02.07] We combine X-Decoder (strong image understanding), GPT-3 (strong language understanding), and Stable Diffusion (strong image generation) into an instructional image editing demo. Check it out!
  • [2022.12.21] We release inference code of X-Decoder.
  • [2022.12.21] We release Focal-T pretrained checkpoint.
  • [2022.12.21] We release open-vocabulary segmentation benchmark.

:paintbrush: DEMO

:blueberries: [X-GPT]   :strawberry:[Instruct X-Decoder]

demo

:notes: Introduction

github_figure

X-Decoder is a generalized decoding model that can generate pixel-level segmentation and token-level texts seamlessly!

It achieves:

  • State-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets;
  • Finetuned performance that is better than or competitive with both generalist and specialist models on segmentation and VL tasks;
  • Friendly for efficient finetuning and flexible for novel task composition.

It supports:

  • One suite of parameters pretrained for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, and Image-Text Retrieval;
  • One model architecture finetuned for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, Image-Text Retrieval and Visual Question Answering (with an extra cls head);
  • Zero-shot task composition for Region Retrieval, Referring Captioning, Image Editing.
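Zero-shot task composition means outputs of one pretrained head are routed into another without any new training: for region retrieval, for example, segmentation proposes candidate regions and image-text similarity ranks them against a query. The sketch below only demonstrates this composition logic with random features; `region_feats`, `text_feat`, and `retrieve_region` are hypothetical stand-ins, not the actual X-Decoder API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the model's outputs: one embedding per segmented region
# and one embedding for the text query (hypothetical shapes and names).
region_feats = rng.normal(size=(5, 16))                    # 5 candidate regions
text_feat = region_feats[3] + 0.01 * rng.normal(size=16)   # query near region 3

def retrieve_region(region_feats, text_feat):
    """Rank regions by cosine similarity to the query and return the
    index of the best match -- the 'composition' step itself."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return int(np.argmax(r @ t))

print(retrieve_region(region_feats, text_feat))  # 3
```

The same pattern underlies referring captioning (crop to the referred mask, then caption) and image editing (segment, then inpaint): each task is a pipeline over heads the model already has.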

Acknowledgement

  • We appreciate the constructive discussion with Haotian Zhang
  • We build our work on top of Mask2Former
  • We build our demos on HuggingFace :hugs: with sponsored GPUs
  • We appreciate the discussion with Xiaoyu Xiang during rebuttal

Citation

@article{zou2022xdecoder,
  author      = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee*, Yong Jae and Gao*, Jianfeng},
  title       = {Generalized Decoding for Pixel, Image and Language},
  publisher   = {arXiv},
  year        = {2022},
}