Multimodal Speculative Decoding (MSD)

July 21, 2025

📄 Paper on arXiv: Speculative Decoding Reimagined for Multimodal Large Language Models

🧠 MSD Models

You can directly use the Multimodal Speculative Decoding (MSD) models available on Hugging Face:


🧱 1. Setup & Installation

conda create -n msd python=3.10 -y
conda activate msd
# Ensure CUDA 12.1 is installed and configured

cd LLaVA
pip install -e .
cd ../EAGLE
pip install -e .
cd ../lmms-eval
pip install -e .
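
After the three editable installs, a quick import check can confirm that the environment is usable. The sketch below assumes the packages are importable as `llava`, `eagle`, and `lmms_eval`. The last two match the `python -m` invocations used later in this README, while `llava` is an assumption based on the directory name. The helper name `missing_packages` is hypothetical.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The setup above creates a Python 3.10 environment.
    print("Python", sys.version.split()[0])
    # Assumed import names; adjust if your checkout exposes different ones.
    missing = missing_packages(["llava", "eagle", "lmms_eval"])
    print("missing:", missing if missing else "none")
```

If any name is reported as missing, re-run `pip install -e .` in the corresponding directory with the `msd` environment active.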

📥 2. Download Datasets

Download the annotations used for instruction tuning:

Then download the image data from the following datasets:

After downloading, organize the data under ./image_data in the following structure:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
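
Before running data processing, it can help to verify that the layout matches the tree above. The following is a minimal sketch, not part of the repository: `EXPECTED_DIRS` and `missing_image_dirs` are hypothetical names, and the check only confirms that the directories exist, not that the downloads are complete.

```python
from pathlib import Path

# Sub-directories listed in the tree above, relative to the image data root.
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_image_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs("./image_data")
    print("layout OK" if not missing else f"missing: {missing}")
```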

⚙️ 3. Data Processing

Use the following script to generate training data. You can control the target model by setting the --model_type argument (e.g., llava_v15_t/v or qwen2_vl_t/v):

cd EAGLE/eagle/ge_data

CUDA_VISIBLE_DEVICES=0 python -m eagle.ge_data.allocation \
    --outdir <output_data_dir> \
    --model_type <model_type> \
    --model <base_model_path> \
    --image_data_path <image_data_dir> \
    --json_data_path <annotation_file>

🏋️ 4. Train the Model

Use DeepSpeed to train the speculative decoding model. Modify the following paths according to your setup:

cd EAGLE/eagle/train

deepspeed --master_port 29504 --include localhost:0 main_deepspeed.py \
    --deepspeed_config ds_config.json \
    --tmpdir_v <visual_data_path> \
    --tmpdir_t <text_data_path> \
    --basepath <base_llm_path> \
    --cpdir <checkpoint_output_dir> \
    --config <training_config_path>

Parameters:

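  • <visual_data_path>: directory containing preprocessed visual data
  • <text_data_path>: directory containing preprocessed text data
  • <base_llm_path>: path to the base LLM
  • <checkpoint_output_dir>: directory where training checkpoints are written
  • <training_config_path>: training configuration file, e.g., llava_v15_7B_config.json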
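
Since a DeepSpeed launch can take a while to fail on a bad path, a small preflight check on the inputs may save time. This is a sketch under stated assumptions: the function name `preflight` is hypothetical, and the only thing assumed about the training configuration is that it is a JSON file, as the example name llava_v15_7B_config.json suggests. No configuration keys are inspected.

```python
import json
from pathlib import Path


def preflight(visual_dir, text_dir, config_path):
    """Collect human-readable problems with the training inputs, if any."""
    problems = []
    for label, path in (("visual data", visual_dir), ("text data", text_dir)):
        p = Path(path)
        if not p.is_dir():
            problems.append(f"{label} directory not found: {p}")
        elif not any(p.iterdir()):
            problems.append(f"{label} directory is empty: {p}")
    try:
        # Only checks that the file parses as JSON; no keys are assumed.
        json.loads(Path(config_path).read_text())
    except (OSError, ValueError) as exc:
        problems.append(f"config not readable as JSON: {exc}")
    return problems
```

An empty list means the three inputs passed these basic checks; anything else is worth fixing before invoking `deepspeed`.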

📊 5. Evaluate the Model

Run evaluation with lmms-eval. The following example evaluates on the ChartQA task:

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port=29506 -m lmms_eval \
    --model <model_name> \
    --model_args pretrained="<base_model_path>" \
    --msd_model_path <msd_model_path> \
    --tasks chartqa \
    --batch_size 1 \
    --gen_kwargs temperature=0 \
    --use_msd

Parameters:

  • <model_name>: short name identifier of your model, e.g., llava_msd or qwen2_vl_msd
  • <base_model_path>: path to the base pretrained model
  • <msd_model_path>: path to the MSD model
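
When scripting several evaluation runs, assembling the command as an argument list avoids shell line-continuation mistakes. The sketch below mirrors the flags from the command above; `build_eval_command` is a hypothetical helper, and the paths in the usage section are placeholders. `CUDA_VISIBLE_DEVICES` is an environment variable rather than an argument, so it would be passed through `env=` if the list is handed to `subprocess.run`.

```python
import shlex


def build_eval_command(model_name, base_model_path, msd_model_path, task="chartqa"):
    """Assemble the lmms_eval invocation shown above as an argument list."""
    return [
        "accelerate", "launch", "--num_processes=1", "--main_process_port=29506",
        "-m", "lmms_eval",
        "--model", model_name,
        "--model_args", f"pretrained={base_model_path}",
        "--msd_model_path", msd_model_path,
        "--tasks", task,
        "--batch_size", "1",
        "--gen_kwargs", "temperature=0",
        "--use_msd",
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute your own.
    cmd = build_eval_command("llava_msd", "/path/to/base", "/path/to/msd")
    print(shlex.join(cmd))
```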