MineDraft: A Framework for Batch Parallel Speculative Decoding

June 8, 2026


License: MIT


MineDraft accelerates large language model inference by overlapping the drafting and verification stages of speculative decoding, hiding latency and unlocking substantial throughput gains in batch settings.


| Metric | Improvement over Standard SD |
| --- | --- |
| 🚀 Throughput | up to +75% |
| ⚡ End-to-end latency | up to −39% |


Overview

Speculative decoding (SD) uses a small draft model to propose candidate tokens that a larger target model then verifies, reducing the number of expensive target-model forward passes. MineDraft extends this paradigm to parallel execution by overlapping the drafting and verification stages, so that drafting latency is effectively hidden behind verification compute.

Experiments across Qwen3, Llama-3.3, and EAGLE models validate MineDraft's gains on ShareGPT, LMSYS Arena, and Spec-Bench benchmarks.
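
The benefit of overlapping the two stages can be illustrated with an idealized per-step latency model. This is only a back-of-the-envelope sketch, not MineDraft's scheduler: the function names and the timings are hypothetical, and the model assumes perfect overlap with no wasted draft tokens and no synchronization overhead.

```python
def sequential_step_latency(t_draft: float, t_verify: float) -> float:
    """Standard SD: drafting and verification run back to back."""
    return t_draft + t_verify


def overlapped_step_latency(t_draft: float, t_verify: float) -> float:
    """Idealized overlap: the slower of the two stages bounds each step."""
    return max(t_draft, t_verify)


def idealized_speedup(t_draft: float, t_verify: float) -> float:
    """Upper bound on per-step speedup under perfect overlap."""
    return (sequential_step_latency(t_draft, t_verify)
            / overlapped_step_latency(t_draft, t_verify))


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    print(idealized_speedup(t_draft=10.0, t_verify=30.0))
```

Under this simplified model, drafting is fully hidden whenever it takes no longer than verification; real gains depend on acceptance rates and batch size.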



Requirements

System

| Component | Requirement |
| --- | --- |
| OS | Linux (tested: Ubuntu 22.04) |
| Python | 3.9–3.12 (tested: 3.12) |
| CUDA | ≥ 11.8 (tested: 12.8) |
| GPUs | 5× NVIDIA with sufficient VRAM (A100 80GB / H100 / L40 recommended) |

Core Dependencies

| Package | Version |
| --- | --- |
| vLLM | 0.9.2 |
| PyTorch | 2.7.0 |
| torch-scatter | 2.1.2 |

Installation

Step 1: Create a virtual environment (pick one):

venv:
python -m venv venv
source venv/bin/activate

uv (creates the environment in .venv by default):
uv venv --python 3.12 --seed
source .venv/bin/activate

conda:
conda create -n minedraft python=3.12 -y
conda activate minedraft

Step 2: Install vLLM:

pip install vllm==0.9.2 --extra-index-url https://download.pytorch.org/whl/cu128

Step 3: Install MineDraft:

pip install -e ".[benchmark]"

This installs:

  • Core: torch-scatter==2.1.2
  • Benchmark: datasets, nvitop, pandas, numpy, matplotlib, IPython, tqdm

Dataset Preparation

mkdir -p benchmarks/datasets
python scripts/convert_datasets.py

| Output File | Source |
| --- | --- |
| ShareGPT.json | ShareGPT_V3_unfiltered_cleaned_split |
| arena.json | LMSYS Chatbot Arena Conversations |
| spec_bench.json | Spec-Bench |
| tough.json | Domain-specific tough questions |

Configuration

Experiments use various speculative decoding configurations set via --speculative-config:

{
    "method": null,                    // null = standard SD | "eagle" = EAGLE
    "model": "<draft_model>",          // HuggingFace model ID for draft model
    "draft_tensor_parallel_size": 1,   // TP size for draft model (always 1)
    "num_speculative_tokens": 5,       // Number of draft tokens (k)
    "is_parallel": true,               // Enable PSD (and MineDraft)
    "force_pearl": false,              // Enable PEARL if is_parallel is true (disables MineDraft)
    "tetris": true,                    // Enable Tetris
    "tetris_turn_on_batch_size": 1,    // Batch size threshold to activate Tetris
    "tetris_capacity": 0,              // Tetris capacity, 0 → auto-calculated from k × max_num_seqs
    "tetris_extra_proposals": 3        // Extra draft tokens for Tetris
}
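
The annotated configuration above is not valid JSON as written, because JSON has no comment syntax. A minimal sketch of building the same settings in Python and serializing them to a comment-free JSON string follows. The helper names `build_speculative_config` and `to_cli_argument` are hypothetical, the keys and default values are copied from the listing above, and passing the result as a single quoted JSON string after `--speculative-config` is an assumption about how the flag is consumed.

```python
import json
import shlex


def build_speculative_config(draft_model: str,
                             num_speculative_tokens: int = 5,
                             is_parallel: bool = True,
                             tetris: bool = True) -> dict:
    """Assemble the config keys shown above (hypothetical helper)."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens (k) must be at least 1")
    return {
        "method": None,                   # serialized as JSON null (standard SD)
        "model": draft_model,             # draft model ID
        "draft_tensor_parallel_size": 1,  # documented as always 1
        "num_speculative_tokens": num_speculative_tokens,
        "is_parallel": is_parallel,
        "force_pearl": False,
        "tetris": tetris,
        "tetris_turn_on_batch_size": 1,
        "tetris_capacity": 0,             # 0 requests the automatic capacity
        "tetris_extra_proposals": 3,
    }


def to_cli_argument(config: dict) -> str:
    """Render the config as a shell-safe flag (assumed usage of the flag)."""
    return "--speculative-config " + shlex.quote(json.dumps(config))


if __name__ == "__main__":
    # "<draft_model>" is the placeholder from the listing; substitute a real ID.
    print(to_cli_argument(build_speculative_config("<draft_model>")))
```

Generating the string programmatically avoids hand-editing quoted JSON in shell scripts when sweeping over values such as k.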

Hardware Layout

| Mode | GPUs | Layout |
| --- | --- | --- |
| Parallel | 5 | 4 for target model TP + 1 for draft model |
| Sequential | 4 | All 4 for target model TP; drafter shares resources |

Running Experiments

Experiment Index

| Script | Model Setup |
| --- | --- |
| experiment_1_*.sh | Qwen3-32B with draft models (0.6B, 1.7B, 4B) |
| experiment_2_eagle_*.sh | EAGLE: Vicuna-33B, Vicuna-13B |
| experiment_2_llama_*.sh | Llama-3.3-70B-AWQ with Llama-3.1-8B |
| experiment_3_n_*.sh | Multi-sample ablation |
| experiment_4_bs_*.sh | Batch size ablation (8, 16, 32, 64) |
| experiment_5_tetris_*.sh | Tetris VSR analysis |
| experiment_6_qwen8b.sh | Qwen3-32B with Qwen3-8B |
| experiment_7_qwen235b.sh | Qwen3-235B-A22B-FP8 with Qwen3-14B |
| experiment_8_nsys.sh | NVIDIA Nsight Systems profiling |

Each experiment ships with two variants: *_parallel.sh (5 GPUs) and *_sequential.sh (4 GPUs).

Run All

cd scripts
bash run_all.sh        # parallel + sequential
bash run_parallel.sh   # parallel only
bash run_sequential.sh # sequential only

Run Individual

cd scripts
bash experiment_1_parallel.sh          # Qwen3-32B parallel
bash experiment_2_eagle_sequential.sh  # EAGLE sequential

GPU Bootstrap (optional)

Useful on shared clusters: waits for GPUs to become free before launching.

# First, comment out the `export CUDA_VISIBLE_DEVICES=` line in the target script, then:
python scripts/bootstrap.py bash scripts/experiment_1_parallel.sh

The bootstrap script monitors GPU availability, waits until 5 GPUs are free (memory usage and utilization both below 1%), then sets CUDA_VISIBLE_DEVICES and launches the given command. You can adjust the required GPU count and thresholds in the script's main function.
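
The selection rule described above can be sketched as a pure function. This is an illustrative reconstruction, not the code in scripts/bootstrap.py: the function names, the shape of the `stats` records, and the choice to take the lowest-indexed free GPUs are all assumptions, and a real script would refresh the readings in a polling loop.

```python
from typing import Dict, List, Optional


def pick_free_gpus(stats: List[Dict[str, float]],
                   required: int = 5,
                   threshold_pct: float = 1.0) -> Optional[List[int]]:
    """Return `required` GPU indices that look idle, or None if too few are free.

    Each record is assumed to hold "index", "memory_pct", and "util_pct".
    """
    free = [
        int(gpu["index"])
        for gpu in stats
        if gpu["memory_pct"] < threshold_pct and gpu["util_pct"] < threshold_pct
    ]
    if len(free) < required:
        return None  # caller should wait and poll again
    return free[:required]


def cuda_visible_devices(indices: List[int]) -> str:
    """Format the chosen indices for the CUDA_VISIBLE_DEVICES variable."""
    return ",".join(str(i) for i in indices)


if __name__ == "__main__":
    # Hypothetical readings for an 8-GPU node; GPUs 1 and 4 are busy.
    readings = [
        {"index": i, "memory_pct": 0.2, "util_pct": 0.0} for i in range(8)
    ]
    readings[1]["util_pct"] = 97.0
    readings[4]["memory_pct"] = 64.0
    chosen = pick_free_gpus(readings)
    if chosen is not None:
        print(cuda_visible_devices(chosen))
```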


Results & Analysis

| Artifact | Location |
| --- | --- |
| Benchmark traces | benchmarks/trace/*.jsonl |
| Nsight Systems profiling reports | *.nsys-rep (project root) |
| Trace analysis notebook | benchmarks/trace/analyze_plots.ipynb |
| Trace analysis utilities | benchmarks/trace/analyze_traces.py |
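
For a quick look at the benchmark traces outside the notebook, a generic JSON Lines reader is enough. The sketch below is not taken from analyze_traces.py: the helper names are hypothetical, and it makes no assumption about which fields each trace record contains.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_traces(trace_dir: Path) -> Dict[str, List[Dict[str, Any]]]:
    """Map each trace file name to its list of records."""
    return {
        path.name: list(iter_jsonl(path))
        for path in sorted(trace_dir.glob("*.jsonl"))
    }


if __name__ == "__main__":
    traces = load_traces(Path("benchmarks/trace"))
    for name, records in traces.items():
        print(f"{name}: {len(records)} records")
```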

Troubleshooting

Out of Memory (OOM)

  • Reduce --gpu-memory-utilization (default: 0.65)
  • Reduce --max-num-seqs (batch size)
  • Switch to a smaller draft or target model

CUDA Version Mismatch

Verify your CUDA installation:

nvcc --version
nvidia-smi

The tested configuration uses CUDA 12.8; the Requirements table lists CUDA ≥ 11.8 as the minimum.

Model Download Issues

Models are automatically downloaded from HuggingFace. Ensure you have:

  • Sufficient disk space or quota
  • HuggingFace access tokens for gated models (e.g., Llama)

For downloading gated models, run:

huggingface-cli login

NVIDIA Nsight Systems: Wrong event order error

If you see:

Wrong event order has been detected when adding events to the collection

Upgrade to Nsight Systems ≥ 2024.2 from the NVIDIA developer portal.


Citation

If you find MineDraft useful in your research, please cite:

@inproceedings{tang2026minedraft,
  title     = {MineDraft: A Framework for Batch Parallel Speculative Decoding},
  author    = {Tang, Zhenwei and Verma, Arun and Zhou, Zijian and Wu, Zhaoxuan
                and Prakash, Alok and Rus, Daniela and Low, Bryan Kian Hsiang},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}