November 14, 2024
[TMLR] SOLO: A Single Transformer for Scalable Vision-Language Modeling
Paper • Model (SOLO-7B)
We present SOLO, a single Transformer architecture for unified vision-language modeling.
SOLO accepts both raw image patches (in pixels) and texts as inputs, without using a separate pre-trained vision encoder.
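To make "raw image patches (in pixels)" concrete, here is a minimal sketch of splitting an image into flattened, non-overlapping patches. It is an illustration only, not the preprocessing code from this repository: the function name `patchify`, the patch size of 32, and the requirement that the image dimensions be multiples of the patch size are all assumptions made for the example.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) pixel array into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row (left to right, top to bottom).
    NOTE: hypothetical helper for illustration, not part of the SOLO codebase.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Hypothetical input: a 64x96 RGB image with an assumed patch size of 32.
image = np.arange(64 * 96 * 3, dtype=np.float32).reshape(64, 96, 3)
patches = patchify(image, patch_size=32)
print(patches.shape)  # (6, 3072)
```

Each row of `patches` holds the raw pixel values of one patch. In a single-Transformer setup, such vectors would typically be mapped into the model's embedding space and interleaved with text tokens; see the paper and the pre-training code for how SOLO actually does this.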
TODO Roadmap
- Release the instruction tuning data mixture
- Release the code for instruction tuning
- Release the pre-training code
- Release the SOLO model: Model (SOLO-7B)
- Paper on arXiv: Paper
Setup
Clone Repo
git clone https://github.com/Yangyi-Chen/SOLO
cd SOLO
git submodule update --init --recursive
Setup Environment for Data Processing
conda env create -f environment.yml
conda activate solo
Or simply install the dependencies with pip:
pip install -r requirements.txt
SOLO Inference with Huggingface
See scripts/notebook/demo.ipynb for an example of running inference with the model.
Pre-Training
See PRETRAIN_GUIDE.md for details on how to run pre-training. The following table documents the data statistics in pre-training:

Instruction Fine-Tuning
See SFT_GUIDE.md for details on how to run instruction fine-tuning. The following table documents the data statistics in instruction fine-tuning:

Citation
If you use or extend our work, please consider citing our paper.
@article{chen2024single,
title={A Single Transformer for Scalable Vision-Language Modeling},
author={Chen, Yangyi and Wang, Xingyao and Peng, Hao and Ji, Heng},
journal={arXiv preprint arXiv:2407.06438},
year={2024}
}