$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models
August 15, 2025
A large vision-and-language model with linear computational complexity in the vision modality and stronger vision-language (VL) capability.
$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models [Paper]
Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
Vidi: Large Multimodal Models for Video Understanding and Editing [Webpage] [Paper] [Code]
Intelligent Editing Team, ByteDance Inc.
Install
- Clone this repository and navigate to the DecomposedAttention folder
git clone https://github.com/bytedance/DecomposedAttention
cd DecomposedAttention
- Install packages
conda create -n dattn python=3.11 -y
conda activate dattn
bash run/install.sh
Model Weights
Coming soon.
Training
Data preparation
Download the JSON annotation files here: blip_laion_cc_sbu_558k.json for alignment, shrcap_filtered.json for pre-training, and llava_gpt4v_filtered.json for instruction tuning (SFT).
Download images following ShareGPT4V, including LAION-CC-SBU-558K, COCO, WebData, SAM, GQA, OCR-VQA, TextVQA, VisualGenome.
Organize downloaded data as follows:
DecomposedAttention
├── ...
├── data
│ ├── blip_laion_cc_sbu_558k.json
│ ├── shrcap_filtered.json
│ ├── llava_gpt4v_filtered.json
│ ├── train
│ │ ├── llava
│ │ │ ├── llava_pretrain
│ │ │ │ ├── images
│ │ ├── coco
│ │ │ ├── train2017
│ │ ├── sam
│ │ │ ├── images
│ │ ├── gqa
│ │ │ ├── images
│ │ ├── ocr_vqa
│ │ │ ├── images
│ │ ├── textvqa
│ │ │ ├── train_images
│ │ ├── vg
│ │ │ ├── VG_100K
│ │ │ ├── VG_100K_2
│ │ ├── share_textvqa
│ │ │ ├── images
│ │ ├── web-celebrity
│ │ │ ├── images
│ │ ├── web-landmark
│ │ │ ├── images
│ │ ├── wikiart
│ │ │ ├── images
├── ...
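Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the list `EXPECTED_TRAIN_PATHS` are hypothetical, and the paths are simply copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (assumed complete).
EXPECTED_TRAIN_PATHS = [
    "data/blip_laion_cc_sbu_558k.json",
    "data/shrcap_filtered.json",
    "data/llava_gpt4v_filtered.json",
    "data/train/llava/llava_pretrain/images",
    "data/train/coco/train2017",
    "data/train/sam/images",
    "data/train/gqa/images",
    "data/train/ocr_vqa/images",
    "data/train/textvqa/train_images",
    "data/train/vg/VG_100K",
    "data/train/vg/VG_100K_2",
    "data/train/share_textvqa/images",
    "data/train/web-celebrity/images",
    "data/train/web-landmark/images",
    "data/train/wikiart/images",
]


def find_missing(repo_root, expected=EXPECTED_TRAIN_PATHS):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the DecomposedAttention folder; prints one line per missing path.
    for rel in find_missing("."):
        print(f"missing: {rel}")
```

The check only tests that each file or folder exists; it does not verify that the downloads are complete.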
Mistral 7B v0.3
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints
# multimodal alignment stage
bash run/mistral_aln.sh
# multimodal pre-training stage
bash run/mistral_pt.sh
# instruction tuning stage
bash run/mistral_it.sh
Gemma 2 9B
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints
# multimodal alignment stage
bash run/gemma_aln.sh
# multimodal pre-training stage
bash run/gemma_pt.sh
# instruction tuning stage
bash run/gemma_it.sh
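Both models follow the same three-stage order (alignment, pre-training, instruction tuning), and the script names differ only in the model prefix. As a rough sketch, the stages could be chained from Python so that a failed stage stops the run. The helpers `stage_scripts` and `run_stages` are hypothetical and not part of the repository; the script paths are the ones listed above.

```python
import subprocess

# Stage suffixes in training order: alignment, pre-training, instruction tuning.
STAGES = ("aln", "pt", "it")


def stage_scripts(model):
    """Return the three stage scripts for a model prefix, in training order."""
    return [f"run/{model}_{stage}.sh" for stage in STAGES]


def run_stages(model, run=subprocess.run):
    """Run each stage in order; check=True raises on the first failing script."""
    for script in stage_scripts(model):
        run(["bash", script], check=True)


if __name__ == "__main__":
    # Scripts in training order for the Mistral 7B v0.3 recipe.
    print(stage_scripts("mistral"))
```

Calling `run_stages("mistral")` or `run_stages("gemma")` from the repository root would launch the corresponding scripts one after another. The `run` parameter exists only so the launcher can be swapped out for a dry run.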
Evaluation
Data preparation
Follow LLaVA to download ScienceQA, MME, GQA, POPE, TextVQA, SEED-Bench, LLaVA-Bench-in-the-Wild, MM-Vet, VQAv2, MMBench, and VizWiz.
Follow MMStar to download the MMStar benchmark.
Organize downloaded benchmarks as follows:
DecomposedAttention
├── ...
├── data
│ ├── val
│ │ ├── scienceqa
│ │ ├── MME
│ │ ├── gqa
│ │ ├── pope
│ │ ├── textvqa
│ │ ├── seed_bench
│ │ ├── llava-bench-in-the-wild
│ │ ├── mm-vet
│ │ ├── vqav2
│ │ ├── mmbench
│ │ ├── vizwiz
├── ...
Evaluate all
# suppose we have 8 GPUs on a machine
# evaluate Mistral 7B v0.3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/mistral_eval.sh
# evaluate Gemma 2 9B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/gemma_eval.sh
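The commands above assume 8 GPUs. On a machine with a different GPU count, the device list can be built programmatically, as in the sketch below. The helpers `visible_devices` and `eval_command` are hypothetical, and whether the evaluation scripts adapt to fewer than 8 visible devices is an assumption this README does not confirm.

```python
import os
import subprocess


def visible_devices(num_gpus):
    """Build a CUDA_VISIBLE_DEVICES value such as '0,1,2,3' for num_gpus devices."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ",".join(str(i) for i in range(num_gpus))


def eval_command(model, num_gpus):
    """Return (argv, env) for running run/<model>_eval.sh on the first num_gpus GPUs."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices(num_gpus))
    return ["bash", f"run/{model}_eval.sh"], env


if __name__ == "__main__":
    argv, env = eval_command("mistral", 8)
    print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```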
Citation
If you find $\mathcal{D}$-Attn useful for your research and applications, please cite our work:
@article{kuo2025rethinking,
title={D-Attn: Decomposed Attention for Large Vision-and-Language Models},
author={Kuo, Chia-Wen and Zhu, Sijie and Chen, Fan and Shen, Xiaohui and Wen, Longyin},
journal={arXiv preprint arXiv:2502.01906},
year={2025}
}
@article{team2025vidi,
title={Vidi: Large Multimodal Models for Video Understanding and Editing},
author={Vidi Team, and Liu, Celong and Kuo, Chia-Wen and Du, Dawei and Chen, Fan and Chen, Guang and Yuan, Jiamin and Zhang, Lingxi and Guo, Lu and Li, Lusha and others},
journal={arXiv preprint arXiv:2504.15681},
year={2025}
}