The first GPT-style generalist vision model, unifying various vision tasks with only a vanilla ViT. No negative transfer.

October 7, 2024


This repo is the official implementation of the ECCV 2024 Oral paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang†, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang†

📣 News

  • [24-8-12] 🤗 Our GiT was accepted by ECCV 2024 as an Oral presentation.
  • [24-7-01] 🤗 Our GiT was accepted by ECCV 2024.
  • [24-3-15] 🚀 Training and inference code is released.
  • [24-3-15] 👀 GiT is released on arXiv.

💫 What we want to do

The Model Architectures across various AI domains are converging towards Multi-Layer Plain Transformers.

  • Language Modeling (GPT)
  • 2D Image Modeling (ViT)
  • 3D Point Cloud Modeling (DSVT)
  • 2D Image and 3D Point Cloud Joint Modeling (UniTR)
  • Graph Modeling (Graphormer)
  • ...

Reducing Human Bias in Model Architecture Design

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like point clouds and graphs.

🤔 What we achieve

Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

  • 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
  • 🚀 Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
  • 🤗 Achieving multi-task ability through a unified language interface: Similar to LLMs, GiT exhibits a task-synergy effect in multi-task training. It fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training, with no negative transfer (a minimal sketch of this interface follows the list below).
  • 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
  • 👍 Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style used by current LLMs.
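To make the unified language interface more concrete, here is a minimal, illustrative sketch of how an object-detection target could be serialized into a flat token sequence (quantized box coordinates plus a class word), so that detection, like captioning, reduces to next-token prediction. Everything below (function names, bin count, token format) is an assumption for illustration only and is not the repo's actual API.

# Illustrative sketch only: serializing a detection target into a token
# sequence in the spirit of GiT's universal language interface. Names,
# bin count, and token format are assumptions, not the repo's actual API.
from typing import List, Tuple

NUM_BINS = 1000  # assumed number of discrete coordinate bins

def quantize(value: float, low: float, high: float) -> int:
    """Map a continuous coordinate onto one of NUM_BINS discrete bins."""
    ratio = (value - low) / (high - low)
    return min(NUM_BINS - 1, max(0, int(ratio * NUM_BINS)))

def box_to_tokens(box: Tuple[float, float, float, float],
                  label: str, img_w: int, img_h: int) -> List[str]:
    """Turn one (x1, y1, x2, y2) box plus its class name into text-like tokens."""
    x1, y1, x2, y2 = box
    return [
        f"<bin_{quantize(x1, 0, img_w)}>", f"<bin_{quantize(y1, 0, img_h)}>",
        f"<bin_{quantize(x2, 0, img_w)}>", f"<bin_{quantize(y2, 0, img_h)}>",
        label,
    ]

# Example: one "person" box in a 640x480 image becomes a short token sequence.
print(box_to_tokens((32.0, 48.0, 320.0, 400.0), "person", 640, 480))

Pixel-level and vision-language targets are serialized in the same spirit, which is what allows a single plain transformer with one next-token objective to cover all the tasks above.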

Overview

🚀 Main Results

Single-Task Benchmark

Model           | Params | Metric  | Performance | ckpt | log | config
GiT-B_detection | 131M   | mAP     | 45.1        | ckpt | log | config
GiT-B_insseg    | 131M   | mAP     | 31.4        | ckpt | log | config
GiT-B_semseg    | 131M   | mIoU    | 47.7        | ckpt | log | config
GiT-B_caption   | 131M   | BLEU-4  | 33.7        | ckpt | log | config
GiT-B_grounding | 131M   | Acc@0.5 | 83.3        | ckpt | log | config

Multi-Tasking Benchmark

Model            | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config
GiT-B_multi-task | 131M   | 46.7      | 31.9    | 47.8    | 35.3    | 85.8      | ckpt | log | config
GiT-L_multi-task | 387M   | 51.3      | 35.1    | 50.6    | 35.7    | 88.4      | ckpt | log | config
GiT-H_multi-task | 756M   | 52.9      | 35.8    | 52.4    | 36.2    | 89.2      | ckpt | log | config

Task Synergy in Multi-Tasking Training

Model             | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding
GiT-B_single-task | 131M   | 45.1      | 31.4    | 47.7    | 33.7    | 83.3
GiT-B_multi-task  | 131M   | 46.7      | 31.9    | 47.8    | 35.3    | 85.8
Improvement       |        | +1.6      | +0.5    | +0.1    | +1.6    | +2.5

Zero-shot benchmark

Model            | Params | Cityscapes (Det) | Cityscapes (Ins Seg) | Cityscapes (Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config
GiT-B_multi-task | 131M   | 21.8             | 14.3                 | 34.4                 | 30.9      | 9.2    | ckpt | log | config
GiT-B_universal  | 131M   | 29.1             | 17.9                 | 56.2                 | 37.5      | 10.6   | ckpt | log | config
GiT-L_universal  | 387M   | 32.3             | 20.3                 | 58.0                 | 39.9      | 11.6   | ckpt | log | config
GiT-H_universal  | 756M   | 34.1             | 18.7                 | 61.8                 | 42.5      | 12.6   | ckpt | log | config

Few-shot benchmark

Model            | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config
GiT-B_multi-task | 131M   | 34.3  | 24.9   | 19.1    | 17.4      | 23.0        | config
GiT-B_universal  | 131M   | 51.1  | 30.8   | 30.6    | 31.2      | 38.3        | config
GiT-L_universal  | 387M   | 55.4  | 34.1   | 37.2    | 33.4      | 49.3        | config
GiT-H_universal  | 756M   | 57.9  | 35.1   | 43.4    | 34.0      | 52.2        | config

🛠️ Quick Start

Installation

conda create -n GiT python=3.8

conda activate GiT

# We only tested with torch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"

git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt

# if you face ChildFailedError, please update yapf
pip install yapf==0.40.1
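
As an optional sanity check (not part of the official instructions), the following Python snippet confirms that the pinned dependencies import cleanly and reports their versions:

# Optional sanity check: confirm the pinned dependencies import and print versions.
import torch, mmengine, mmcv, transformers

print("torch:", torch.__version__)                # 1.9.1+cu111 expected
print("mmengine:", mmengine.__version__)          # 0.8.3 expected
print("mmcv:", mmcv.__version__)                  # 2.0.1 expected
print("transformers:", transformers.__version__)  # 4.31.0 expected
print("CUDA available:", torch.cuda.is_available())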
  • Please download the pretrained text embeddings from huggingface and organize the downloaded files as follows:
GiT
|──bert_embed.pt
|──bert_embed_large.pt
|──bert_embed_huge.pt
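
To quickly confirm the downloads are intact, you can run the snippet below from the repo root (the exact contents of these files are not documented here, so this only checks that they load):

# Quick check that each downloaded embedding file loads; contents are not inspected.
import torch

for name in ["bert_embed.pt", "bert_embed_large.pt", "bert_embed_huge.pt"]:
    obj = torch.load(name, map_location="cpu")
    print(name, "->", type(obj))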
  • (Optional) Install Java manually for image caption evaluation. Without Java, you can still train image captioning models normally, but caption evaluation will fail.
  • (Optional) Install the lvis api for the LVIS dataset.
# current path is ./GiT
cd ..
pip install git+https://github.com/lvis-dataset/lvis-api.git

Dataset Preparation

Multi-Tasking Dataset

The multi-tasking benchmark contains COCO 2017 for object detection and instance segmentation, ADE20K for semantic segmentation, COCO Caption for image captioning, and the RefCOCO series for visual grounding.

GiT
|──data
|  |──ade
|  |  |──ADEChallengeData2016
|  |  |  |──annotations
|  |  |  |  |──training & validation
|  |  |  |──images
|  |  |  |  |──training & validation
|  |  |  |──objectInfo150.txt
|  |  |  |──sceneCategories.txt
|  |──coco
|  |  |──annotations
|  |  |  |──*.json
|  |  |──train2017
|  |  |  |──*.jpg
|  |  |──val2017
|  |  |  |──*.jpg
|  |──coco_2014
|  |  |──annotations
|  |  |  |──*.json
|  |  |  |──coco_karpathy_test.json
|  |  |  |──coco_karpathy_train.json
|  |  |  |──coco_karpathy_val_gt.json
|  |  |  |──coco_karpathy_val.json
|  |  |──train2014
|  |  |  |──*.jpg
|  |  |──val2014
|  |  |  |──*.jpg
|  |  |──refcoco
|  |  |  |──*.p
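
Before launching multi-task training, you can optionally verify that the key directories from the layout above are in place. This helper is only a convenience suggestion; the paths mirror the tree exactly and are relative to the GiT repo root.

# Optional helper: check that the expected multi-task dataset layout exists.
# Run from the GiT repo root (paths are relative to it).
import os

expected_dirs = [
    "data/ade/ADEChallengeData2016/images/training",
    "data/ade/ADEChallengeData2016/annotations/training",
    "data/coco/annotations",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco_2014/annotations",
    "data/coco_2014/train2014",
    "data/coco_2014/val2014",
    "data/coco_2014/refcoco",
]

for path in expected_dirs:
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{status:7s} {path}")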

Universal Dataset

We use 27 datasets in universal training. For more details about dataset preparation, please refer to here.


🚨 We only list a subset of the commands (GiT-B) below. For more detailed commands, please refer to here.

Training

Single Task

Detection

bash tools/dist_train.sh configs/GiT/single_detection_base.py  ${GPU_NUM} --work-dir ${work_dir}

Multi Task

GiT-B

bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py  ${GPU_NUM} --work-dir ${work_dir}

Universal Training

GiT-B

bash tools/dist_train.sh configs/GiT/universal_base.py  ${GPU_NUM} --work-dir ${work_dir}

Testing

Single Task

Detection

bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Multi Task

GiT-B

bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Zero-shot and few-shot

Please download the universal pretraining weights from huggingface and organize the files as follows:

GiT
|──universal_base.pth
|──universal_large.pth
|──universal_huge.pth

Zero-shot

bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Few-shot

bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}

Customize Dataset

If you want to use GiT on your own dataset, please refer here for more details.

🚀 Lightweight Version

If your GPU memory is insufficient, you can reduce the input resolution as shown here, where we lower the detection resolution to 672. This setting requires ~20 hours of training and reaches ~41.5 mAP.

👀 Todo

  • Release the arXiv version.
  • SOTA performance of generalist model on multi-tasking benchmark.
  • SOTA performance of generalist model on zero- and few-shot benchmark.
  • Clean up and release the inference code.
  • Clean up and release the training code.
  • Engineering Optimization (faster).
  • Joint Training including Language (stronger).
  • Code Refactoring (the current code is still a bit messy, sorry about that).

👍 Acknowledgement

  • MMDetection: the codebase we built upon. Thanks for providing such a convenient framework.
  • BLIP: we extract text embeddings from BLIP pretrained models and use the web captions filtered by BLIP. Thanks for their efforts in open-sourcing and cleaning the dataset.

📘 Citation

Please consider citing our work as follows if it is helpful.

@inproceedings{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  booktitle={ECCV},
  year={2024}
}

✨ Star History

Star History Chart