Vision Language Model Fine-Tuning and Inference

February 24, 2026

🚀 Toward Inherently Robust VLMs Against Visual Perception Attacks

This repository contains shell scripts and Python configurations for fine-tuning and running inference with the following vision-language models:

  • LLaVA-13B-LoRA
  • LLaVA-7B
  • MoE-LLaVA
  • MobileVLM
  • Qwen-VL
  • NVILA

This work was accepted at the 2026 IEEE Intelligent Vehicles Symposium (IV 2026).


๐Ÿ› ๏ธ Installation and Setup

  1. Clone the repository:

    git clone MODEL-REPO
    
  2. Install required dependencies:

    pip install -r requirements.txt
    
  3. Set up DeepSpeed by following its official installation guide: DeepSpeed Documentation.
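Before launching any of the scripts, it can help to confirm that the tools from the steps above are actually on the PATH. The following is a minimal sketch, not part of the repository: `have_tool` is a hypothetical helper, and `ds_report` is the environment report command that ships with DeepSpeed.

```shell
# Hypothetical helper (not part of the repository): succeeds if a command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report which prerequisites from the installation steps are available.
for tool in git pip deepspeed; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# ds_report prints the CUDA/PyTorch setup that DeepSpeed detected.
if have_tool ds_report; then
    ds_report
fi
```

A "missing: deepspeed" line here usually means step 3 has not been completed in the active Python environment.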


๐Ÿ“ Scripts Overview

1๏ธโƒฃ LLaVA 13B Fine-Tuning (LLaVA_13B_FINETUNE.sh)

This script fine-tunes the LLaVA 13B model using DeepSpeed.

Arguments Overview:

  • --model_name_or_path: Path to the pre-trained LLaVA 13B model.
  • --image_folder: Directory containing the training images.
  • --data_path: Directory containing the dataset in JSON format.
  • --output_dir: Directory where the fine-tuned model checkpoints will be saved.

Command Example:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$WORKSPACE_DIR python3 $WORKSPACE_DIR/llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./checkpoints/llava-v1.5-13B/ \
    --image_folder IMAGE_DIRECTORY \
    --data_path JSON_FILE_DIRECTORY \
    --output_dir OUTPUT_FINE_TUNED
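Because the placeholders in the command (IMAGE_DIRECTORY, JSON_FILE_DIRECTORY) must be replaced with real paths, a small pre-flight check can catch typos before DeepSpeed starts loading the model. This is a sketch under stated assumptions: `check_inputs` is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical pre-flight check (not part of LLaVA_13B_FINETUNE.sh).
# Usage: check_inputs MODEL_DIR IMAGE_DIR DATA_PATH
check_inputs() {
    model_dir=$1
    image_dir=$2
    data_path=$3
    rc=0
    [ -d "$model_dir" ] || { echo "missing model directory: $model_dir" >&2; rc=1; }
    [ -d "$image_dir" ] || { echo "missing image directory: $image_dir" >&2; rc=1; }
    # --data_path is described as a JSON dataset location; accept a file or a directory.
    [ -e "$data_path" ] || { echo "missing dataset path: $data_path" >&2; rc=1; }
    return "$rc"
}
```

One way to use it is to chain it in front of the training command, for example `check_inputs ./checkpoints/llava-v1.5-13B/ IMAGE_DIRECTORY JSON_FILE_DIRECTORY && <training command>`, so that training only starts when all three paths exist.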

2๏ธโƒฃ LLaVA 7B Fine-Tuning (LLaVA_7B_FINETUNE.sh)

This script fine-tunes the LLaVA 7B model using DeepSpeed.

Arguments Overview:

  • --model_name_or_path: Path to the pre-trained LLaVA 7B model.
  • --image_folder: Directory containing the training images.
  • --data_path: Directory containing the dataset in JSON format.
  • --output_dir: Directory where the fine-tuned model checkpoints will be saved.

Command Example:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$WORKSPACE_DIR python3 $WORKSPACE_DIR/llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/llava-v1.5-7B/ \
    --image_folder IMAGE_DIRECTORY \
    --data_path JSON_FILE_DIRECTORY \
    --output_dir OUTPUT_FINE_TUNED
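The 7B and 13B command examples differ only in the checkpoint path and the DeepSpeed config: the 13B example uses zero3.json and the 7B example uses zero2.json. In DeepSpeed, ZeRO stage 2 partitions optimizer states and gradients, while stage 3 also partitions the model parameters, which lowers per-device memory use for larger models. A minimal sketch of that mapping follows; `ds_config_for` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: map a LLaVA model size to the DeepSpeed config
# used in the command examples (13B -> ZeRO-3, 7B -> ZeRO-2).
ds_config_for() {
    case "$1" in
        13B|13b) echo "./scripts/zero3.json" ;;
        7B|7b)   echo "./scripts/zero2.json" ;;
        *)       echo "unknown model size: $1" >&2; return 1 ;;
    esac
}
```

The result can be passed straight to the training command, for example `--deepspeed "$(ds_config_for 7B)"`.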

3๏ธโƒฃ MoE-LLaVA Fine-Tuning (MoE_LLaVA_FINETUNE.sh)

This script fine-tunes the MoE-LLaVA model using DeepSpeed with the Mixture of Experts (MoE) method.

Arguments Overview:

  • --model_name_or_path: Path to the fine-tuned LLaVA model.
  • --image_folder: Directory containing the training images.
  • --data_path: Directory containing the dataset in JSON format.
  • --output_dir: Directory where the fine-tuned model checkpoints will be saved.

Command Example:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$WORKSPACE_DIR python3 $WORKSPACE_DIR/moellava/train/train_mem.py \
    --moe_enable True \
    --model_name_or_path ./checkpoints/MoE-v1.5-7B/ \
    --image_folder IMAGE_DIRECTORY \
    --data_path JSON_FILE_DIRECTORY \
    --output_dir OUTPUT_FINE_TUNED

4๏ธโƒฃ MobileVLM Fine-Tuning (MobileVLM_FINETUNE.sh)

This script fine-tunes the MobileVLM model using DeepSpeed.

Arguments Overview:

  • --model_name_or_path: Path to the pre-trained MobileVLM model.
  • --image_folder: Directory containing the training images.
  • --data_path: Directory containing the dataset in JSON format.
  • --output_dir: Directory where the fine-tuned model checkpoints will be saved.

5๏ธโƒฃ Qwen-VL Fine-Tuning (Qwen-VL.sh)

This script fine-tunes the Qwen-VL model and evaluates it on a test set.

Arguments Overview:

  • --model_name_or_path: Path to the pre-trained Qwen-VL model.
  • --data_path: Directory containing the dataset in JSON format.
  • --output_dir: Directory where the fine-tuned model checkpoints will be saved.
  • --num_train_epochs: Number of training epochs.
  • --learning_rate: Learning rate for optimization.
  • --save_steps: Frequency of checkpoint saving.
  • --evaluation_strategy: Strategy for model evaluation.

Command Example:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 $WORKSPACE_DIR/finetune.py \
    --model_name_or_path ./checkpoints/Qwen/Qwen-VL-Chat \
    --data_path ./data/MY_DATASET/train.json \
    --output_dir ./checkpoints/Qwen-VL-finetuned \
    --num_train_epochs 1 \
    --learning_rate 1e-5 \
    --save_steps 1000 \
    --evaluation_strategy "no" \
    --logging_steps 1 \
    --deepspeed ./finetune/ds_config_zero3.json
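`torchrun` starts `--nproc_per_node` worker processes on the node, and each worker normally needs its own visible GPU, so that value should agree with the number of devices listed in `CUDA_VISIBLE_DEVICES`. One way to keep the two in sync is to derive the process count from the device list. This is a sketch, assuming a comma-separated device list; `count_visible_gpus` and `DEVICES` are hypothetical names.

```shell
# Hypothetical helper: count the devices in a CUDA_VISIBLE_DEVICES-style string.
count_visible_gpus() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    # Split on commas by temporarily changing IFS, then count the fields.
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    echo "$#"
}

# Fall back to two devices when CUDA_VISIBLE_DEVICES is not set (assumption).
DEVICES=${CUDA_VISIBLE_DEVICES:-0,1}
NPROC=$(count_visible_gpus "$DEVICES")
echo "torchrun would start $NPROC process(es) on devices $DEVICES"
```

The computed value can then replace the hard-coded number, for example `--nproc_per_node "$NPROC"`.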

6๏ธโƒฃ NVILA-Lite-8B Fine-Tuning and Evaluation (NVILA.sh)

This script performs end-to-end fine-tuning and evaluation of the NVILA-Lite-8B model using DeepSpeed and vila-infer. It supports both training and inference in one unified workflow.

Arguments Overview:

  • STAGE_PATH: Path to the pre-trained NVILA-Lite-8B model (default: Efficient-Large-Model/NVILA-Lite-8B).
  • DATA_MIXTURE: Name of the training dataset or mixture.
  • OUTPUT_DIR: Directory where the fine-tuned model and logs will be saved.
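The three settings above could be resolved as follows. This is only a sketch: this README does not show how NVILA.sh receives these values, so reading them from environment variables is an assumption, and `resolve_nvila_config` is a hypothetical helper. Only the STAGE_PATH default comes from the list above.

```shell
# Hypothetical wrapper: resolve the three NVILA.sh settings from the environment.
resolve_nvila_config() {
    # Default taken from the argument list above.
    stage_path=${STAGE_PATH:-Efficient-Large-Model/NVILA-Lite-8B}
    # No defaults are documented for these two, so require them.
    if [ -z "${DATA_MIXTURE:-}" ] || [ -z "${OUTPUT_DIR:-}" ]; then
        echo "DATA_MIXTURE and OUTPUT_DIR must be set" >&2
        return 1
    fi
    echo "STAGE_PATH=$stage_path"
    echo "DATA_MIXTURE=$DATA_MIXTURE"
    echo "OUTPUT_DIR=$OUTPUT_DIR"
}
```

Printing the resolved values before training makes it easy to see which checkpoint and dataset mixture a run will actually use.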

Command Example:

📚 File Structure

├── LLaVA-13B-LoRA
│   ├── LICENSE
│   ├── llava
│   ├── LLaVA-13-LoRA.sh
│   └── scripts
├── LLaVA-7B
│   ├── LICENSE
│   ├── llava
│   ├── LLaVA-7B.sh
│   └── scripts
├── MobileVLM
│   ├── LICENSE
│   ├── mobilevlm
│   ├── MobileVLM.sh
│   └── scripts
├── MoE-LLaVA
│   ├── LICENSE
│   ├── moellava
│   ├── MoE-LLaVA.sh
│   └── scripts
├── Qwen-VL
│   ├── LICENSE
│   ├── Qwen-VL.sh
│   └── finetune
├── NVILA
│   ├── LICENSE
│   ├── NVILA.sh
│   ├── scripts
│   └── llava
├── Sample
│   ├── DRP-Attack
│   ├── RAUCA
│   └── Shadow-Attack


📊 Training and Evaluation Metrics

Training scripts log progress to TensorBoard and Weights & Biases (W&B) for visualization and debugging. Adjust the logging steps and evaluation strategy as needed.


🔧 Customization

  • Batch size: 32 (training), 4 (evaluation)
  • Checkpoints saved every 50,000 steps
  • DeepSpeed configurations adjustable in zero2.json or zero3.json
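When changing the batch size, the usual relation is: global batch size = per-device batch size × gradient accumulation steps × number of GPUs. The sketch below computes the accumulation steps needed to reach a target global batch. It is an illustration only: the list above does not say whether 32 is a global or a per-device value, the per-device value of 8 in the example is made up, and `grad_accum_steps` is a hypothetical helper.

```shell
# Hypothetical helper: gradient-accumulation steps needed to reach a target
# global batch size, given a per-device batch size and a GPU count.
# Usage: grad_accum_steps TARGET_BATCH PER_DEVICE_BATCH NUM_GPUS
grad_accum_steps() {
    target=$1
    per_device=$2
    gpus=$3
    per_step=$((per_device * gpus))
    if [ "$per_step" -le 0 ] || [ $((target % per_step)) -ne 0 ]; then
        echo "target batch $target is not a multiple of $per_step" >&2
        return 1
    fi
    echo $((target / per_step))
}

# Example: a global batch of 32 on one GPU that fits 8 samples per step.
grad_accum_steps 32 8 1   # prints 4
```

If the per-device batch does not fit in GPU memory, lowering it and raising the accumulation steps keeps the global batch size unchanged.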

By following this guide, you can fine-tune and run inference with the LLaVA, MoE-LLaVA, MobileVLM, Qwen-VL, and NVILA models.

Note: Ensure you have access to GPUs with adequate memory for fine-tuning large models.

Note: The models were fine-tuned on a single A100 40GB GPU, except for Qwen-VL (2×A100 80GB GPUs) and NVILA (4×A100 40GB GPUs).