Model Zoo

December 14, 2023

If you are interested in including any other details in Model Zoo, please open an issue :)

Use of the ShareGPT4V checkpoints should comply with the model license of the base LLM: Llama 2.

ShareGPT4V models

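| Name | LLM | Checkpoint | LLaVA-Bench-Wild | MME-perception | MME-cognition | MMBench | MMBench-CN | SEED-image | MM-Vet | QBench | SQA-image | VQA-v2 | VizWiz | GQA | TextVQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ShareGPT4V-7B | Vicuna-7B | ShareGPT4V-7B | 72.6 | 1567.4 | 376.4 | 68.8 | 62.2 | 69.7 | 37.6 | 63.4 | 68.4 | 80.6 | 57.2 | 63.3 | 60.4 |
| ShareGPT4V-13B | Vicuna-13B | ShareGPT4V-13B | 79.9 | 1618.7 | 303.2 | 68.5 | 63.7 | 70.8 | 43.1 | 65.2 | 71.2 | 81.0 | 55.6 | 64.8 | 62.2 |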
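
When choosing between the two checkpoints, it can help to look at per-benchmark differences rather than assume the larger model wins everywhere. The snippet below is a minimal sketch that does this with the scores listed above; the names `BENCHMARKS`, `SCORES`, `score_deltas`, and `regressions` are hypothetical helpers made up for illustration and are not part of the ShareGPT4V codebase.

```python
# Benchmark names and scores copied from the ShareGPT4V models table above.
# All identifiers here are illustrative, not part of the ShareGPT4V code.
BENCHMARKS = [
    "LLaVA-Bench-Wild", "MME-perception", "MME-cognition", "MMBench",
    "MMBench-CN", "SEED-image", "MM-Vet", "QBench", "SQA-image",
    "VQA-v2", "VizWiz", "GQA", "TextVQA",
]

SCORES = {
    "ShareGPT4V-7B": [72.6, 1567.4, 376.4, 68.8, 62.2, 69.7, 37.6,
                      63.4, 68.4, 80.6, 57.2, 63.3, 60.4],
    "ShareGPT4V-13B": [79.9, 1618.7, 303.2, 68.5, 63.7, 70.8, 43.1,
                       65.2, 71.2, 81.0, 55.6, 64.8, 62.2],
}


def score_deltas(small, large):
    """Return {benchmark: score(large) - score(small)}, rounded to 0.1."""
    return {
        name: round(hi - lo, 1)
        for name, lo, hi in zip(BENCHMARKS, SCORES[small], SCORES[large])
    }


def regressions(small, large):
    """List the benchmarks on which the larger model scores lower."""
    return [name for name, d in score_deltas(small, large).items() if d < 0]


if __name__ == "__main__":
    for name, delta in score_deltas("ShareGPT4V-7B", "ShareGPT4V-13B").items():
        print(f"{name:18s} {delta:+.1f}")
    print("13B lower on:", regressions("ShareGPT4V-7B", "ShareGPT4V-13B"))
```

On these numbers the 13B model is ahead on most benchmarks but behind on MME-cognition, MMBench, and VizWiz, so the better choice may depend on the target task.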

Pretrained Vision Encoders

These are the vision encoder weights we have pretrained. You can use them for our visual instruction tuning or for your own. They are pretrained only on ShareGPT4V-PT image-text pairs and are NOT instruction-tuned, so they do NOT follow instructions as well as our official models and may produce repetitive, lengthy, or garbled outputs. If you want to have nice conversations with ShareGPT4V models, use the checkpoints above (under ShareGPT4V models).

| Base LLM | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|---|
| Vicuna-13B-v1.5 | CLIP-L-336px-ft-l12 | MLP-2x | ShareGPT4V-PT-1.2M | 1e | vision encoder |
| Vicuna-7B-v1.5 | CLIP-L-336px-ft-l12 | MLP-2x | ShareGPT4V-PT-1.2M | 1e | vision encoder |

Pretrained Projector and LLM

These are the projector and LLM weights we have pretrained. You can use them for our visual instruction tuning or for your own. They are pretrained only on ShareGPT4V-PT image-text pairs and are NOT instruction-tuned, so they do NOT follow instructions as well as our official models and may produce repetitive, lengthy, or garbled outputs. If you want to have nice conversations with ShareGPT4V models, use the checkpoints above (under ShareGPT4V models).

| Base LLM | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|---|
| Vicuna-13B-v1.5 | CLIP-L-336px-ft-l12 | MLP-2x | ShareGPT4V-PT-1.2M | 1e | projector and LLM |
| Vicuna-7B-v1.5 | CLIP-L-336px-ft-l12 | MLP-2x | ShareGPT4V-PT-1.2M | 1e | projector and LLM |