Study on LLM Backbones

April 27, 2024

Overview

This study investigates the impact of Large Language Model (LLM) backbones on both whole image-level and region-level vision-language benchmarks. We use Vicuna-1.5-7B, Vicuna-1.5-13B, Llama-3-8B, and Phi-3-mini-3.8B as the language model backbone for both LLaVA-1.5 and ViP-LLaVA, keeping all other configurations and hyper-parameters the same.

Figure 1. Following the LLaVA-1.5 setting, we change the LLM backbone to Llama-3-8B and Phi-3-mini-3.8B. Both show better language reasoning capability, for example on MMBench and ScienceQA.

Benchmarking Datasets

We leverage diverse whole image-level and region-level tasks to benchmark different LLM backbones.

Whole Image-Level Benchmarks

These benchmarks are derived from the official LLaVA-1.5 pipeline:

Region-Level Benchmarks

Region-level benchmarks include:

Results

Results for the two LMMs, LLaVA-1.5 and ViP-LLaVA, are shown in the tables and radar plots below. The Hugging Face checkpoints are provided in the hyperlink.

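| Model | MMBench | MMBench_cn | LLaVA_W | POPE | ScienceQA | MMVet | VizWiz | MME | TextVQA | VQAv2 | GQA | SEED-IMG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-1.5-7B | 64.6 | 57.5 | 72.2 | 87.3 | 69.5 | 31.5 | 50.0 | 1506.5 | 58.2 | 78.3 | 63.2 | 66.2 |
| Vicuna-1.5-13B | 67.7 | 63.6 | 74.8 | 87.4 | 71.6 | 35.4 | 53.6 | 1531.0 | 61.3 | 80.0 | 63.3 | 68.2 |
| Llama-3-8B | 72.3 | 65.8 | 74.0 | 87.1 | 75.2 | 35.9 | 50.0 | 1496.1 | 58.0 | 79.6 | 63.2 | 69.0 |
| Phi-3-mini-3.8B | 69.0 | 59.6 | 73.1 | 87.3 | 72.3 | 33.6 | 35.3 | 1424.5 | 55.0 | 77.6 | 61.1 | 67.3 |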

Table 1. Empirical results of LLaVA-1.5 with different LLM backbones.

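| Model | MMBench | MMBench_cn | LLaVA_W | POPE | ScienceQA | MMVet | VizWiz | MME | TextVQA | VQAv2 | GQA | SEED-IMG | V7W | PointQA | ViP-Bench (BBox) | ViP-Bench (Human) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-1.5-7B | 68.0 | 59.3 | 69.8 | 87.1 | 69.5 | 33.1 | 55.7 | 1453.5 | 57.8 | 79.2 | 62.2 | 69.0 | 86.6 | 71.3 | 48.4 | 48.3 |
| Vicuna-1.5-13B | 70.3 | 60.7 | 75.3 | 87.4 | 70.0 | 34.5 | 57.4 | 1564.0 | 59.6 | 80.1 | 62.9 | 70.7 | 87.9 | 71.8 | 48.3 | 48.2 |
| Llama-3-8B | 71.0 | 64.7 | 69.7 | 87.5 | 72.8 | 31.1 | 53.9 | 1492.7 | 56.1 | 78.9 | 62.0 | 69.7 | 84.3 | 70.2 | 45.4 | 45.0 |
| Phi-3-mini-3.8B | 70.4 | 60.5 | 71.5 | 88.1 | 72.4 | 29.8 | 34.7 | 1416.2 | 55.2 | 78.4 | 61.2 | 69.6 | 85.3 | 69.6 | 49.0 | 48.2 |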

Table 2. Empirical results of ViP-LLaVA with different LLM backbones.
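Tables 1 and 2 share twelve whole image-level benchmarks, so the effect of moving from LLaVA-1.5 to ViP-LLaVA under a fixed backbone can be read off as a per-benchmark difference. The snippet below is a minimal sketch of that comparison, not part of the released code: the scores are copied from the two tables, while the names `SHARED`, `LLAVA_1_5`, `VIP_LLAVA`, and `deltas` are illustrative.

```python
# Benchmarks that appear in both Table 1 and Table 2.
SHARED = ["MMBench", "MMBench_cn", "LLaVA_W", "POPE", "ScienceQA", "MMVet",
          "VizWiz", "MME", "TextVQA", "VQAv2", "GQA", "SEED-IMG"]

# Scores copied from Table 1 (LLaVA-1.5).
LLAVA_1_5 = {
    "Vicuna-1.5-7B":   [64.6, 57.5, 72.2, 87.3, 69.5, 31.5, 50.0, 1506.5, 58.2, 78.3, 63.2, 66.2],
    "Vicuna-1.5-13B":  [67.7, 63.6, 74.8, 87.4, 71.6, 35.4, 53.6, 1531.0, 61.3, 80.0, 63.3, 68.2],
    "Llama-3-8B":      [72.3, 65.8, 74.0, 87.1, 75.2, 35.9, 50.0, 1496.1, 58.0, 79.6, 63.2, 69.0],
    "Phi-3-mini-3.8B": [69.0, 59.6, 73.1, 87.3, 72.3, 33.6, 35.3, 1424.5, 55.0, 77.6, 61.1, 67.3],
}

# Scores copied from the first twelve columns of Table 2 (ViP-LLaVA).
VIP_LLAVA = {
    "Vicuna-1.5-7B":   [68.0, 59.3, 69.8, 87.1, 69.5, 33.1, 55.7, 1453.5, 57.8, 79.2, 62.2, 69.0],
    "Vicuna-1.5-13B":  [70.3, 60.7, 75.3, 87.4, 70.0, 34.5, 57.4, 1564.0, 59.6, 80.1, 62.9, 70.7],
    "Llama-3-8B":      [71.0, 64.7, 69.7, 87.5, 72.8, 31.1, 53.9, 1492.7, 56.1, 78.9, 62.0, 69.7],
    "Phi-3-mini-3.8B": [70.4, 60.5, 71.5, 88.1, 72.4, 29.8, 34.7, 1416.2, 55.2, 78.4, 61.2, 69.6],
}


def deltas(backbone):
    """Return {benchmark: ViP-LLaVA score minus LLaVA-1.5 score} for one backbone."""
    return {
        name: round(vip - base, 1)  # one decimal, matching the tables
        for name, vip, base in zip(SHARED, VIP_LLAVA[backbone], LLAVA_1_5[backbone])
    }


for backbone in LLAVA_1_5:
    print(backbone, deltas(backbone))
```

MME is reported on a much larger scale than the other benchmarks, so its differences should be read separately from the rest.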

The overall charts are here:

Figure 2. Following the LLaVA-1.5 and ViP-LLaVA settings, we change the LLM backbone to Llama-3-8B and Phi-3-mini-3.8B. Better language reasoning capability is observed, yet tasks that require core visual understanding show similar performance.

Key Findings

  • Language and Commonsense Reasoning: Recent LLMs, Llama-3 and Phi-3, excel in tasks requiring language and commonsense reasoning. Llama-3-8B and Phi-3-mini-3.8B both score higher than Vicuna-1.5-13B on MMBench and ScienceQA, with the largest margins for Llama-3-8B under LLaVA-1.5 (72.3 vs. 67.7 on MMBench, 75.2 vs. 71.6 on ScienceQA).
  • Visual vs. Language Capabilities: Llama-3-8B and Phi-3-mini-3.8B do not enhance performance significantly in tasks primarily requiring visual understanding. Vicuna-1.5-13B still leads in benchmarks like MME, TextVQA, and GQA.
  • Zero-shot Vision-Language Tasks: Phi-3-mini-3.8B shows limited effectiveness in zero-shot vision-language tasks like VizWiz, while performing comparably to Vicuna-1.5-7B in most other tasks.
  • Overall Performance: Llama-3-8B generally performs better than Vicuna-1.5-7B and is on par with Vicuna-1.5-13B. Phi-3-mini-3.8B, however, underperforms Vicuna-1.5-13B on average.
  • Consistency in ViP-LLaVA: ViP-LLaVA maintains performance consistency across various whole image understanding benchmarks when compared to LLaVA-1.5.
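The pairwise statements in the findings above can be checked against Table 1 by counting, for each pair of backbones, the benchmarks on which one scores higher. The following is a minimal sketch under two assumptions: a higher score is better on every benchmark, and any difference counts regardless of its size. The scores are copied from Table 1, and the names `BENCHMARKS`, `TABLE_1`, and `compare` are illustrative.

```python
BENCHMARKS = ["MMBench", "MMBench_cn", "LLaVA_W", "POPE", "ScienceQA", "MMVet",
              "VizWiz", "MME", "TextVQA", "VQAv2", "GQA", "SEED-IMG"]

# Scores copied from Table 1 (LLaVA-1.5 with different LLM backbones).
TABLE_1 = {
    "Vicuna-1.5-7B":   [64.6, 57.5, 72.2, 87.3, 69.5, 31.5, 50.0, 1506.5, 58.2, 78.3, 63.2, 66.2],
    "Vicuna-1.5-13B":  [67.7, 63.6, 74.8, 87.4, 71.6, 35.4, 53.6, 1531.0, 61.3, 80.0, 63.3, 68.2],
    "Llama-3-8B":      [72.3, 65.8, 74.0, 87.1, 75.2, 35.9, 50.0, 1496.1, 58.0, 79.6, 63.2, 69.0],
    "Phi-3-mini-3.8B": [69.0, 59.6, 73.1, 87.3, 72.3, 33.6, 35.3, 1424.5, 55.0, 77.6, 61.1, 67.3],
}


def compare(model_a, model_b, table=TABLE_1):
    """Split the benchmarks into those where model_a is higher, tied, or lower."""
    wins, ties, losses = [], [], []
    for name, a, b in zip(BENCHMARKS, table[model_a], table[model_b]):
        if a > b:
            wins.append(name)
        elif a < b:
            losses.append(name)
        else:
            ties.append(name)
    return wins, ties, losses


for a, b in [("Llama-3-8B", "Vicuna-1.5-7B"),
             ("Llama-3-8B", "Vicuna-1.5-13B"),
             ("Phi-3-mini-3.8B", "Vicuna-1.5-13B")]:
    wins, ties, losses = compare(a, b)
    print(f"{a} vs {b}: {len(wins)} higher, {len(ties)} tied, {len(losses)} lower")
```

On the Table 1 numbers this counts Llama-3-8B as higher on 7 benchmarks, tied on 2, and lower on 3 against Vicuna-1.5-7B; higher on 5 and lower on 7 against Vicuna-1.5-13B; and Phi-3-mini-3.8B as higher on 2 and lower on 10 against Vicuna-1.5-13B. Several of those gaps are only 0.1 to 0.4 points, so the counts complement the per-benchmark margins rather than replace them.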

For detailed tables and radar charts, please refer to our paper, where they will appear in Supplementary Materials A.5.

Training Scripts

LLaVA-1.5

## Llama-3

### pretrain
bash ./scripts/pretrain_llava_1_5_llama3.sh
### finetune
bash ./scripts/finetune_llava_1_5_llama3.sh

## Phi-3

### pretrain
bash ./scripts/pretrain_llava_1_5_phi3.sh
### finetune
bash ./scripts/finetune_llava_1_5_phi3.sh

ViP-LLaVA

## Llama-3

### pretrain
bash ./scripts/pretrain_vip_llava_llama3.sh
### finetune
bash ./scripts/finetune_vip_llava_llama3_stage2.sh
bash ./scripts/finetune_vip_llava_llama3_stage3.sh

## Phi-3

### pretrain
bash ./scripts/pretrain_vip_llava_phi3.sh
### finetune
bash ./scripts/finetune_vip_llava_phi3_stage2.sh
bash ./scripts/finetune_vip_llava_phi3_stage3.sh
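The scripts above are listed in the order pretrain, then finetune (stage 2 and stage 3 for ViP-LLaVA). Assuming each stage depends on the one before it, a small driver can run them in sequence and stop at the first failure. This is a hedged sketch, not part of the repository: the script paths are the ones listed above, while `run_stages` and `VIP_LLAVA_LLAMA3_STAGES` are hypothetical names.

```python
import subprocess

# Script paths as listed above for ViP-LLaVA with Llama-3, in the listed order.
VIP_LLAVA_LLAMA3_STAGES = [
    "./scripts/pretrain_vip_llava_llama3.sh",
    "./scripts/finetune_vip_llava_llama3_stage2.sh",
    "./scripts/finetune_vip_llava_llama3_stage3.sh",
]


def run_stages(scripts, run=subprocess.run):
    """Run each script with bash, in order; raise on the first non-zero exit code.

    `run` is injectable so the control flow can be exercised without launching training.
    """
    completed = []
    for script in scripts:
        result = run(["bash", script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed


# Usage, from the repository root:
#     run_stages(VIP_LLAVA_LLAMA3_STAGES)
```

Stopping on the first failure avoids starting a finetuning stage when the stage before it did not finish.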

Evaluation Scripts

Please follow the instructions for image-level benchmarks and object-level benchmarks.

Citation

If you find our study on LLM backbones useful for your research and applications, please cite using this BibTeX:


@inproceedings{cai2024vipllava,
  author      = {Cai, Mu and Liu, Haotian and Mustikovela, Siva Karthik and Meyer, Gregory P. and Chai, Yuning and Park, Dennis and Lee, Yong Jae},
  title       = {Making Large Multimodal Models Understand Arbitrary Visual Prompts},
  booktitle   = {IEEE Conference on Computer Vision and Pattern Recognition},
  year        = {2024}
}