
July 17, 2024



u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Multi-modal, multi-task LLM
Documentation | 中文文档

Paper · Report Bug · Request Feature

🎉 News

Table of Contents
  1. About The Project
  2. Results
  3. Getting Started
  4. License
  5. Citation
  6. Acknowledgments

About The Project

Structure:

Examples

(back to top)

Demo is coming soon.

Features

Code

  • Epoch Quantitative Evaluation

    • Compute metrics
  • Mixed Datasets

    • Dataset scale specification (portion)
    • Text, Image-Text, Video-Text
  • DeepSpeed

  • LoRA

Task

  • Visual Understanding
    • Image Captioning
    • Video Captioning
    • Visual Question Answering (VQA)
  • Visual Segmentation
    • Referring Expression Segmentation (RES)
    • Salient Object Segmentation
    • Semantic Segmentation
  • Visual Grounding
    • Referring Expression Comprehension (REC)


Model Release

| Models | Images/Videos |
| --- | --- |
| u-LLaVA | uLLaVA Stage 2 |

Results

RES

REC

SALIENT

General MLLM

| Fine-tune | ScienceQA | MM-Bench | Seed-Bench |
| --- | --- | --- | --- |
| u-LLaVA-7B | 87.74 | soon | soon |

Video QA

| zero-shot | Accuracy (Type 3) |
| --- | --- |
| Activity-QA | 51.70% |

Getting Started

Requirements

Run the following commands in terminal:

pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..

Why are these steps needed?

  1. Install the Python requirements: pip install -r ./shells/requirements.txt
  2. Build the CUDA ops for GroundingDINO: cd ./models/GroundingDINO && ./install.sh && cd ../..
     If this step is skipped, GroundingDINO may fall back to CPU mode with the warning: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
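
To confirm that the build step succeeded before launching training, a quick check along the following lines may help. This is only a sketch: the extension name `groundingdino._C` is an assumption taken from upstream GroundingDINO and may differ for the copy under ./models/GroundingDINO, and `module_available` is a hypothetical helper, not part of this repository.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if module `name` can be located.

    Parent packages are imported to resolve dotted names.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the module has no spec.
        return False


if __name__ == "__main__":
    # Assumed name of the compiled ops module (from upstream GroundingDINO).
    ext = "groundingdino._C"
    if module_available(ext):
        print(f"{ext} found: the custom C++/CUDA ops should load.")
    else:
        print(f"{ext} not found: re-run ./install.sh in ./models/GroundingDINO.")
```

If the module is reported as missing, rebuilding with ./install.sh in an environment where the CUDA toolkit is visible is the first thing to try.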

Datasets

Annotation download links: ullava modified annotations, LLaVA pretrain annotations, and LLaVA fine-tuning annotations

Image storage (download link can be found in the table):

image_root
├─ade20k
│  ├─annotations
│  └─images
├─coco2014
│  ├─test2014
│  ├─train2014
│  └─val2014
├─coco2017
│  ├─annotations
│  ├─train2017
│  └─val2017
├─cocostuff
│  ├─train2017
│  └─val2017
├─LLaVA-CC3M-Pretrain-595K
│  └─images
├─saiapr_tc-12
│  ├─00
│  └─01
└─vlpart
    ├─paco
    │  └─annotations
    └─pascal-part
        ├─Annotations_Part
        ├─examples
        └─VOCdevkit

Here, ade20k is extracted from ADEChallengeData2016.zip and cocostuff is extracted from stuffthingmaps_trainval2017.zip.
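
Before training, it can be useful to verify that the image root matches the layout above. The following is a minimal sketch using only the directory names from the tree; `missing_dirs` and `EXPECTED_DIRS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Sub-directories copied from the tree above, relative to image_root.
EXPECTED_DIRS = [
    "ade20k/annotations", "ade20k/images",
    "coco2014/test2014", "coco2014/train2014", "coco2014/val2014",
    "coco2017/annotations", "coco2017/train2017", "coco2017/val2017",
    "cocostuff/train2017", "cocostuff/val2017",
    "LLaVA-CC3M-Pretrain-595K/images",
    "saiapr_tc-12/00", "saiapr_tc-12/01",
    "vlpart/paco/annotations",
    "vlpart/pascal-part/Annotations_Part",
    "vlpart/pascal-part/examples",
    "vlpart/pascal-part/VOCdevkit",
]


def missing_dirs(image_root):
    """Return the expected sub-directories that do not exist under image_root."""
    root = Path(image_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Replace "image_root" with the actual path to your image storage.
    for d in missing_dirs("image_root"):
        print(f"missing: {d}")
```

An empty result means every directory in the tree is present; it does not check that the images inside are complete.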

Stage I: Pre-training

| Dataset | Images/Videos | Annotations |
| --- | --- | --- |
| LLaVA CC3M | LLaVA-CC3M-Pretrain-595K/image.zip | chat.json |
| TGIF | TGIF - Quark Drive | tgif.json |

Note: We have renamed the TGIF dataset and removed invalid samples to facilitate training, but please follow the original LICENSE.

Stage II: Fine-tuning

| Dataset | Images | Annotations |
| --- | --- | --- |
| LLaVA Instruction 150K | coco2017 | llava_instruct_150k.json |
| RefCOCO | coco2014 | refcoco_train.json |
| RefCOCOg | coco2014 | refcocog_train.json |
| RefCOCO+ | coco2014 | refcoco+_train.json |
| RefCLEF | saiapr_tc-12 | refclef_train.json |
| ADE20K | ade20k | ade20k.json |
| COCO Stuff | cocostuff | cocostuff.json |
| VOC2010 | voc2010 | pascal_part.json |
| PACO LVIS | paco | paco_lvis.json |
| Salient 15K | msra | ullava_salinet_15k.json |

Note: Please download the images of MSRA-10K and MSRA-B from the official site. Thanks to the authors for sharing.

Dataset config example

dataset:
  llava:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'

  refcoco+:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'
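
The `portion` key sets the scale of each dataset in the mix. How u-LLaVA applies it internally is not shown here, so the sketch below rests on an assumption: `portion` is read as the fraction of a dataset's annotations to keep. `apply_portion` and the toy data are hypothetical.

```python
import random


def apply_portion(samples, portion, seed=0):
    """Keep a `portion` fraction of `samples`, chosen reproducibly at random."""
    if not 0.0 < portion <= 1.0:
        raise ValueError("portion must be in (0, 1]")
    keep = max(1, int(len(samples) * portion)) if samples else 0
    rng = random.Random(seed)  # local RNG, global random state is untouched
    return rng.sample(samples, keep)


# Hypothetical per-dataset settings mirroring the `portion` keys above.
portions = {"llava": 1.0, "refcoco+": 0.5}
annotations = {"llava": list(range(10)), "refcoco+": list(range(8))}

mixed = []
for name, items in annotations.items():
    mixed.extend((name, item) for item in apply_portion(items, portions[name]))

print(len(mixed))  # 10 + 4 = 14 samples
```

With `portion: 1.0`, as in the config above, every annotation of the dataset is kept.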

Note:

  1. We reorganized most of the dataset annotations for easier training, but all users must still follow the rules of the original datasets.

Training

Stage I: Pre-training

  1. Prepare Open-Source LLaMA models

| Foundation model | Version | Path |
| --- | --- | --- |
| Vicuna 7B HF | V1.1 | vicuna_7b_v1.1 |
| LLaMA2 7B HF | - | meta-llama/Llama-2-7b-hf |
| SAM | ViT-H | sam_vit_h_4b8939.pth |
| GroundingDINO | swint_ogc | groundingdino_swint_ogc.pth |

Note:

- LLaMA2 is trained with bf16; convergence errors may occur if Stage I training uses fp16.

- The default tokenizer.legacy of Llama-2 is False, which may raise a tokenization mismatch error with some conversation templates.

- Errata: The base LLM used in the paper is Vicuna-v1.1, not LLaMA2. Sorry about the mistake.

  2. Prepare datasets
  3. Set config in
configs/train/ullava_core_stage1.yaml

Note: set all dataset paths and the output path according to your experiments.

  4. Train Stage I with multiple GPUs

./shells/pretrain.sh

or python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml' for 1 GPU.

On 4 A100 80G GPUs with bf16, Stage I takes about 6 hours per epoch. The trained model is then saved to output_dir, for example './exp/ullava_core_7b'.
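
For scripted experiments, the single-GPU command above can be wrapped in a small launcher. This is a sketch under stated assumptions: `stage1_command` and `launch` are hypothetical helpers, and the script is assumed to be run from the repository root so that train_ullava_core.py and the config path resolve.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def stage1_command(cfg_path="./configs/train/ullava_core_stage1.yaml"):
    """Build the single-GPU Stage I command as an argument list."""
    return [sys.executable, "train_ullava_core.py", "--cfg_path", cfg_path]


def launch(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if dry_run:
        return 0
    cfg = Path(cmd[-1])
    if not cfg.is_file():
        raise FileNotFoundError(f"config not found: {cfg}")
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    launch(stage1_command())  # dry run: only prints the command
```

Calling `launch(stage1_command(), dry_run=False)` starts training. For multi-GPU runs, ./shells/pretrain.sh remains the documented entry point.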

Stage II: Fine-tuning

After Stage I training has finished, proceed to the next step, fine-tuning.

  1. Prepare datasets
  2. Set config in
configs/train/ullava_stage2_lora.yaml (for lora)
configs/train/ullava_stage2.yaml (for non lora)
  3. Train Stage II with multiple GPUs
./shells/finetune.sh

or python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml' for 1 GPU.

Common Questions

Q1: Which conv_type is used in training?

A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'

Q2: When is LoRA used?

A2: Stage I: LoRA is not used in this stage. Stage II: optional, depending on your devices.


Evaluation

Batch evaluation

  1. Set config
configs/eval/eval_res.yaml (for RES task)
configs/eval/eval_rec.yaml (for REC task)
configs/eval/eval_salient.yaml (for salient segmentation task)
  2. Run
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml' (for RES)
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml' (for REC)
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml' (for salient)
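
To run all three evaluations in sequence, the script and config pairs listed above can be collected in one table. The sketch below only builds and prints the commands; `EVAL_JOBS` and `eval_command` are hypothetical names, and the paths assume the current directory is the repository root.

```python
import sys

# Script/config pairs copied from the commands above.
EVAL_JOBS = {
    "res": ("evaluation/eval_ullava.py", "./configs/eval/eval_res.yaml"),
    "rec": ("evaluation/eval_ullava_grounding.py", "./configs/eval/eval_rec.yaml"),
    "salient": ("evaluation/eval_ullava.py", "./configs/eval/eval_salient.yaml"),
}


def eval_command(task):
    """Return the argument list that evaluates `task` ('res', 'rec', 'salient')."""
    script, cfg = EVAL_JOBS[task]
    return [sys.executable, script, "--cfg_path", cfg]


if __name__ == "__main__":
    for task in EVAL_JOBS:
        # Pass the list to subprocess.run(cmd, check=True) to execute it.
        print(task, "->", " ".join(eval_command(task)))
```

Note that REC uses a different script (eval_ullava_grounding.py) from RES and salient segmentation, which share eval_ullava.py.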


Qualitative inference

Modify the parser in evaluation/inference_ullava_core.py for Stage I and in evaluation/inference_ullava.py for Stage II.

python evaluation/inference_ullava_core.py
python evaluation/inference_ullava.py


License

Distributed under the Apache License. See LICENSE for more information.


Citation

@inproceedings{xu2024ullava,
  title={u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model},
  author={Xu, Jinjin and Xu, Liwu and Yang, Yuzhe and Li, Xiang and Wang, Fanyi and Xie, Yanchun and Huang, Yi-Jie and Li, Yaqian},
  booktitle={Proceedings of the 27th European Conference on Artificial Intelligence},
  year={2024}
}


TODO

  • Visual Segmentation
    • Instance Segmentation


Acknowledgments

We sincerely thank the open-source community for their contributions. This work is sponsored by the Shanghai Pujiang Program (23PJ1421800).


See the open issues for a full list of proposed features (and known issues).
