July 17, 2024

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

A multi-modal, multi-task LLM
Documentation | Chinese documentation

Paper · Report a bug · Request a feature

🎉 News

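[2024/07] We will release the [grounding & segmentation u-LLaVA weights][] soon.
[2024/07] Added support for 336 resolution; updated MM-Bench, TextVQA, SQA, and GQA results.
[2024/07] The [Salient-15k][salient_15k] annotation files have been updated.
[2024/07] This work has been accepted to ECAI 2024. Our thanks to all the prior work it builds on!
[2024/01] The [code](https://github.com/OPPOMKLab/u-LLaVA/) and the segmentation weights have been open-sourced.
[2023/10] The [paper](https://arxiv.org/abs/2311.05348) has been submitted.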

About the Project
- Features
Getting Started
- Requirements
- Datasets
- Training
- Evaluation
License
Citation
Acknowledgments

About the Project

Model architecture:

Examples:

(back to top)

A demo is coming soon.

Features

Code

Tasks

(back to top)

Released Models

Models	Weights
u-LLaVA	uLLaVA Stage 2

Getting Started

Requirements

Run the following commands in a terminal:

pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..

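What each command does:

Install the libraries u-LLaVA needs: pip install -r ./shells/requirements.txt
Build the GroundingDINO CUDA extension: cd ./models/GroundingDINO && ./install.sh && cd ../.. If the extension has not been built, you may see the following warning: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only! warnings.warn("Failed to load custom C++ ops. Running on CPU mode Only!")
If GroundingDINO causes problems, you can disable all GroundingDINO-related modules in the code and run in segmentation-only mode.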
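
A quick way to tell whether the GroundingDINO build succeeded is to try importing its compiled extension. The sketch below assumes the extension is exposed as `groundingdino._C`, which is the module upstream GroundingDINO attempts to import before emitting the warning above; if your checkout names it differently, adjust the string. The helper name `can_import` is hypothetical and not part of this repository.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if `module_name` imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        # Covers a missing package as well as a compiled extension
        # that exists but fails to load (for example, a CUDA mismatch).
        return False


if __name__ == "__main__":
    # Assumption: the compiled ops live in `groundingdino._C`.
    for name in ("torch", "groundingdino", "groundingdino._C"):
        status = "ok" if can_import(name) else "NOT available"
        print(f"{name}: {status}")
```

If only the last entry is reported as unavailable, the Python package is installed but the custom ops were not built, which matches the CPU-only warning.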

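Datasets

We restructured the annotation files of the datasets we use so that they are easier to train on and to understand. Please download our restructured annotation files.

Download links: ullava modified annotations, LLaVA pretrain annotations and LLaVA finetuning annotations

Example layout for the training images (download links for the image files are in the tables below):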

image_root
├─ade20k
│  ├─annotations
│  └─images
├─coco2014
│  ├─test2014
│  ├─train2014
│  └─val2014
├─coco2017
│  ├─annotations
│  ├─train2017
│  └─val2017
├─cocostuff
│  ├─train2017
│  └─val2017
├─LLaVA-CC3M-Pretrain-595K
│  └─images
├─saiapr_tc-12
│  ├─00
│  └─01
└─vlpart
    ├─paco
    │  └─annotations
    └─pascal-part
        ├─Annotations_Part
        ├─examples
        └─VOCdevkit

Here, ade20k is ADEChallengeData2016.zip extracted and renamed, and cocostuff is stuffthingmaps_trainval2017.zip extracted and renamed.
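
Before training, it can help to confirm that the image root matches the layout above. The following is a minimal sketch, not part of the repository: `EXPECTED_DIRS` is a hand-picked subset of the example tree, and `missing_dirs` is a hypothetical helper name.

```python
from pathlib import Path

# Sub-directories taken from the example tree above (a subset, not exhaustive).
EXPECTED_DIRS = [
    "ade20k/annotations",
    "ade20k/images",
    "coco2014/train2014",
    "coco2017/train2017",
    "cocostuff/train2017",
    "LLaVA-CC3M-Pretrain-595K/images",
    "saiapr_tc-12",
    "vlpart/paco/annotations",
    "vlpart/pascal-part/VOCdevkit",
]


def missing_dirs(image_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under image_root."""
    root = Path(image_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "image_root"
    missing = missing_dirs(root)
    if missing:
        print("Missing directories:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected directories found.")
```

Pass the image root as the first argument; with no argument the script checks a directory named `image_root` in the current working directory.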

Stage I: Pre-training

Dataset	Images/Videos	Annotations
LLaVA CC3M	LLaVA-CC3M-Pretrain-595K/image.zip	chat.json
TGIF	TGIF - Quark Drive	tgif.json

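Note: we renamed the TGIF dataset and removed invalid samples to make training easier, but please comply with the LICENSE of the original TGIF dataset.

Stage II: Fine-tuning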

Dataset	Images	Annotations
LLaVA Instruction 150K	coco2017	llava_instruct_150k.json
RefCOCO	coco2014	refcoco_train.json
RefCOCOg	coco2014	refcocog_train.json
RefCOCO+	coco2014	refcoco+_train.json
RefCLEF	saiapr_tc-12	refclef_train.json
ADE20K	ade20k	ade20k.json
COCO Stuff	cocostuff	cocostuff.json
VOC2010	voc2010	pascal_part.json
PACO LVIS	paco	paco_lvis.json
Salient 15K	coming soon	coming soon

Example dataset configuration

dataset:
  llava:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'

  refcoco+:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'
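
The placeholder paths in the example (`/path_to_annotations/...`, `/path_to_image_root/...`) must be replaced before training. The sketch below shows one way to catch entries that were left unfilled. It works on a plain Python dict with the same structure as the YAML (which you could obtain with PyYAML's `yaml.safe_load`). The function name `validate_dataset_cfg` is hypothetical, and the check on `portion` assumes it is the fraction of the dataset to use, which the README does not state explicitly.

```python
REQUIRED_KEYS = ("data_type", "image_token_len", "build_info", "vis_processor")
REQUIRED_BUILD_KEYS = ("anno_dir", "image_dir")


def validate_dataset_cfg(name, cfg):
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append(f"{name}: missing '{key}'")
    build_info = cfg.get("build_info", {})
    for key in REQUIRED_BUILD_KEYS:
        value = build_info.get(key)
        if value is None:
            problems.append(f"{name}: missing 'build_info.{key}'")
        elif str(value).startswith("/path_to_"):
            problems.append(f"{name}: 'build_info.{key}' is still a placeholder")
    # Assumption: `portion` is the fraction of the dataset to use, in (0, 1].
    portion = build_info.get("portion", 1.0)
    if not 0.0 < portion <= 1.0:
        problems.append(f"{name}: 'portion' should be in (0, 1], got {portion}")
    return problems


if __name__ == "__main__":
    # Same structure as the YAML example, written as a Python dict.
    datasets = {
        "llava": {
            "data_type": "image",
            "image_token_len": 256,
            "build_info": {
                "anno_dir": "/path_to_annotations/llava_instruct_150k.json",
                "image_dir": "/path_to_image_root/coco2017/train2017",
                "portion": 1.0,
            },
            "vis_processor": "clip_image",
        },
    }
    for name, cfg in datasets.items():
        for problem in validate_dataset_cfg(name, cfg):
            print(problem)
```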

Training

Stage I: Pre-training

Prepare the open-source models

Foundation model	Version	Path
Vicuna 7B HF	V1.1	vicuna_7b_v1.1
LLaMA2 7B HF	-	meta-llama/Llama-2-7b-hf
SAM	ViT-H	sam_vit_h_4b8939.pth
GroundingDINO	swint_ogc	groundingdino_swint_ogc.pth

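Note:

- LLaMA2 was trained in bf16; running Stage I in fp16 may cause convergence problems.

- LLaMA2's tokenizer.legacy defaults to False, so encoding/decoding errors may occur with some conversation templates.

- Correction: the base model used in the paper is Vicuna-v1.1, not LLaMA2. We apologize for the mistake.

Prepare the datasets
Set up the configuration file

configs/train/ullava_core_stage1.yaml

Make sure all image paths and model paths are configured correctly.

Multi-GPU training for Stage I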

./shells/pretrain.sh

Or, on a single GPU: python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml'

Stage I takes about 6 hours per epoch on 4 A100 80G GPUs with bf16. The trained model is then saved under output_dir, for example './exp/ullava_core_7b'.

Stage II: Fine-tuning

Once Stage I is complete, you can move on to the next stage of training.

Prepare the datasets
Set up the configuration file

configs/train/ullava_stage2_lora.yaml (for LoRA)
configs/train/ullava_stage2.yaml (for non-LoRA)

Multi-GPU training

./shells/finetune.sh

Or, for single-GPU LoRA fine-tuning: python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml'
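
To keep the single-GPU commands for both stages in one place, a small lookup table is enough. This is a sketch rather than a repository script: `TRAIN_JOBS` and `build_train_command` are hypothetical names, and pairing `train_ullava.py` with the non-LoRA config `ullava_stage2.yaml` is an assumption, since the README only shows the LoRA command explicitly.

```python
import shlex

# Script/config pairs as listed in this README.
TRAIN_JOBS = {
    "stage1": ("train_ullava_core.py", "./configs/train/ullava_core_stage1.yaml"),
    "stage2_lora": ("train_ullava.py", "./configs/train/ullava_stage2_lora.yaml"),
    # Assumption: the non-LoRA config is run with the same Stage II script.
    "stage2": ("train_ullava.py", "./configs/train/ullava_stage2.yaml"),
}


def build_train_command(job):
    """Return the single-GPU training command for `job` as an argument list."""
    script, cfg_path = TRAIN_JOBS[job]
    return ["python", script, "--cfg_path", cfg_path]


if __name__ == "__main__":
    for job in TRAIN_JOBS:
        print(f"{job}: {shlex.join(build_train_command(job))}")
```

The argument list can be passed directly to `subprocess.run` from the repository root if you prefer to launch training from Python.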

FAQ

Q1: Which conversation template is used?

A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'

Q2: When should LoRA be used?

A2: Stage I: we did not use it. Stage II: it depends on your hardware.

(back to top)

Evaluation

Batch quantitative evaluation

Configuration files

configs/eval/eval_res.yaml (for the RES task)
configs/eval/eval_rec.yaml (for the REC task)
configs/eval/eval_salient.yaml (for the Salient segmentation task)

Run

python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml'  # RES
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml'  # REC
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml'  # Salient
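
The three evaluation commands can also be chained from one script so that a failure in one task stops the run. The sketch below is an illustration under assumptions: `EVAL_JOBS` and `run_evals` are hypothetical names, the script is expected to be started from the repository root, and `dry_run=True` (the default) only returns the commands without executing them.

```python
import subprocess
import sys

# Task -> (script, config), as listed in this README.
EVAL_JOBS = {
    "res": ("evaluation/eval_ullava.py", "./configs/eval/eval_res.yaml"),
    "rec": ("evaluation/eval_ullava_grounding.py", "./configs/eval/eval_rec.yaml"),
    "salient": ("evaluation/eval_ullava.py", "./configs/eval/eval_salient.yaml"),
}


def run_evals(tasks, dry_run=True):
    """Run (or, with dry_run, only collect) the evaluation commands in order."""
    commands = []
    for task in tasks:
        script, cfg_path = EVAL_JOBS[task]
        cmd = [sys.executable, script, "--cfg_path", cfg_path]
        commands.append(cmd)
        if not dry_run:
            # check=True raises on the first failing evaluation.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in run_evals(["res", "rec", "salient"]):
        print(" ".join(cmd))
```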

(back to top)

Qualitative evaluation

Adjust the argparser settings in evaluation/inference_ullava_core.py and evaluation/inference_ullava.py to run qualitative tests for Stage I and Stage II, respectively.

python evaluation/inference_ullava_core.py
python evaluation/inference_ullava.py

(back to top)

License

Distributed under the Apache License. See LICENSE for more information.

(back to top)

Citation

@inproceedings{xu2024ullava,
  title={u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model},
  author={Xu, Jinjin and Xu, Liwu and Yang, Yuzhe and Li, Xiang and Wang, Fanyi and Xie, Yanchun and Huang, Yi-Jie and Li, Yaqian},
  booktitle={Proceedings of the 27th European Conference on Artificial Intelligence},
  year={2024}
}

(back to top)

To-Do

Visual Segmentation
- Instance Segmentation

(back to top)

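Acknowledgments

We sincerely thank the open-source projects that this work builds on. This work is supported by the Shanghai Baiyulan Pujiang Talent Program (Project No. 23PJ1421800).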

(back to top)

See the open issues for a full list of proposed features (and known issues).

(back to top)