Multimodal Speculative Decoding (MSD)

July 21, 2025

📄 Paper on arXiv: Speculative Decoding Reimagined for Multimodal Large Language Models

🧠 MSD Models

You can directly use the Multimodal Speculative Decoding (MSD) models available on Hugging Face:

MSD-LLaVA1.5-7B: lucylyn/MSD-LLaVA1.5-7B
MSD-LLaVA1.5-13B: lucylyn/MSD-LLaVA1.5-13B
MSD-Qwen2VL-7B-Instruct: lucylyn/MSD-Qwen2VL-7B-Instruct
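
One possible way to fetch a checkpoint locally is the `huggingface_hub` package. The sketch below is not part of this repository: the short aliases and the `download_msd` helper are hypothetical names introduced for illustration, and only the repository IDs come from the list above. It assumes `huggingface_hub` is installed.

```python
# Repository IDs copied from the model list above; the dictionary keys are
# hypothetical aliases chosen for this example.
MSD_REPOS = {
    "llava1.5-7b": "lucylyn/MSD-LLaVA1.5-7B",
    "llava1.5-13b": "lucylyn/MSD-LLaVA1.5-13B",
    "qwen2vl-7b-instruct": "lucylyn/MSD-Qwen2VL-7B-Instruct",
}


def msd_repo_id(name: str) -> str:
    """Map a short alias (case-insensitive) to its Hugging Face repository ID."""
    try:
        return MSD_REPOS[name.lower()]
    except KeyError:
        raise ValueError(
            f"unknown MSD model {name!r}; choose from {sorted(MSD_REPOS)}"
        ) from None


def download_msd(name: str, local_dir: str) -> str:
    """Download one MSD checkpoint and return the local directory path."""
    # Imported lazily so msd_repo_id works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=msd_repo_id(name), local_dir=local_dir)
```

For example, `download_msd("llava1.5-7b", "./checkpoints/MSD-LLaVA1.5-7B")` would place the files in that directory, and the returned path can then be passed wherever an MSD model path is expected.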

🧱 1. Setup & Installation

conda create -n msd python=3.10 -y
conda activate msd
# Ensure CUDA 12.1 is installed and configured

cd LLaVA
pip install -e .
cd ../EAGLE
pip install -e .
cd ../lmms-eval
pip install -e .

📥 2. Download Datasets

Download the annotations used for instruction tuning:

ShareGPT_V3_unfiltered_cleaned_split.json
llava_v1_5_mix665k.json

⚠️ Before use, process llava_v1_5_mix665k.json with EAGLE/eagle/ge_data/convert.py to fix formatting issues.

Then download the image data from the following datasets:

COCO: train2017
GQA: images
OCR-VQA: Download script (Google Drive)
TextVQA: train_val_images
Visual Genome: part1, part2

💡 Make sure all OCR-VQA images are saved as .jpg
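
To check the OCR-VQA requirement that every image is saved as .jpg, a small helper like the one below can list the files that still need converting. This is a minimal sketch and not a script shipped with the repository; the function name is hypothetical, and the match is deliberately strict (lowercase `.jpg` only) on the assumption that the annotations reference that exact extension.

```python
from pathlib import Path


def non_jpg_files(image_dir: str) -> list[str]:
    """Return the sorted names of files in image_dir that do not end in .jpg."""
    return sorted(
        p.name
        for p in Path(image_dir).iterdir()
        # Strict, case-sensitive check: ".JPG" or ".jpeg" are reported too.
        if p.is_file() and p.suffix != ".jpg"
    )
```

An empty result for the OCR-VQA image directory means nothing is left to convert; any names it returns can be re-saved as JPEG with an image library of your choice.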

After downloading, organize the data under ./image_data in the following structure:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
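
Before running data processing, it can help to confirm that the layout above is in place. The following sketch is not part of the repository: the helper name is hypothetical, and the directory list is simply copied from the tree above.

```python
from pathlib import Path

# Expected sub-directories, copied from the tree above.
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_image_dirs(root: str = "./image_data") -> list[str]:
    """Return the expected sub-directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

Calling `missing_image_dirs()` from the project root returns an empty list when every directory is present, and otherwise names the ones that still need to be created or populated.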

⚙️ 3. Data Processing

Use the following script to generate training data. You can control the target model by setting the --model_type argument (e.g., llava_v15_t/v or qwen2_vl_t/v):

cd EAGLE/eagle/ge_data

CUDA_VISIBLE_DEVICES=0 python -m eagle.ge_data.allocation \
    --outdir <output_data_dir> \
    --model_type <model_type> \
    --model <base_model_path> \
    --image_data_path <image_data_dir> \
    --json_data_path <annotation_file>

🏋️ 4. Train the Model

Use DeepSpeed to train the speculative decoding model. Modify the following paths according to your setup:

cd EAGLE/eagle/train

deepspeed --master_port 29504 --include localhost:0 main_deepspeed.py \
    --deepspeed_config ds_config.json \
    --tmpdir_v <visual_data_path> \
    --tmpdir_t <text_data_path> \
    --basepath <base_llm_path> \
    --cpdir <checkpoint_output_dir> \
    --config <training_config_path>

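Parameters:

<visual_data_path>: directory containing preprocessed visual data
<text_data_path>: directory containing preprocessed text data
<base_llm_path>: path to the base LLM
<checkpoint_output_dir>: directory where training checkpoints are written
<training_config_path>: training configuration file, e.g., llava_v15_7B_config.json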

📊 5. Evaluate the Model

Run evaluation with lmms-eval. The following example evaluates on the ChartQA task:

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port=29506 -m lmms_eval \
    --model <model_name> \
    --model_args pretrained="<base_model_path>" \
    --msd_model_path <msd_model_path> \
    --tasks chartqa \
    --batch_size 1 \
    --gen_kwargs temperature=0 \
    --use_msd

Parameters:

<model_name>: short name identifier of your model, e.g., llava_msd or qwen2_vl_msd
<base_model_path>: path to the base pretrained model
<msd_model_path>: path to the MSD model
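
To evaluate on several tasks in a row, the command above could be wrapped in a small driver. The sketch below only reassembles the flags shown in this section; the helper names `build_eval_cmd` and `run_eval` are hypothetical, and any task name other than `chartqa` is your own choice and not confirmed by this README.

```python
import os
import subprocess


def build_eval_cmd(model_name, base_model_path, msd_model_path, task):
    """Build the lmms-eval command from this section as an argument list."""
    return [
        "accelerate", "launch", "--num_processes=1", "--main_process_port=29506",
        "-m", "lmms_eval",
        "--model", model_name,
        "--model_args", f"pretrained={base_model_path}",
        "--msd_model_path", msd_model_path,
        "--tasks", task,
        "--batch_size", "1",
        "--gen_kwargs", "temperature=0",
        "--use_msd",
    ]


def run_eval(model_name, base_model_path, msd_model_path, tasks, gpu="0"):
    """Run the evaluation once per task on one GPU, stopping at the first failure."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu}
    for task in tasks:
        cmd = build_eval_cmd(model_name, base_model_path, msd_model_path, task)
        subprocess.run(cmd, env=env, check=True)
```

For instance, `run_eval("llava_msd", "<base_model_path>", "<msd_model_path>", ["chartqa"])` reproduces the single-task command above. Passing the command as a list avoids shell quoting issues with paths that contain spaces.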