OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
May 24, 2025
OPT-Tree has been accepted by TACL.
Introduction
OPT-Tree
We propose an adaptive and scalable draft tree structure for speculative decoding that supports any autoregressive draft model. With OPT-Tree, more than 10 tokens can be generated in a single decoding step.
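A minimal Python sketch of choosing a tree shape by expected acceptance length. It assumes that the chance of a drafted token being accepted is approximated by the product of draft-model probabilities along its path, so the expected number of accepted tokens is the sum of those path probabilities over the tree. The `Node` class and `select_tree` function are hypothetical illustrations, not part of this repository.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    prob: float  # draft-model probability of this token given its parent path
    children: list["Node"] = field(default_factory=list)


def select_tree(roots: list[Node], n: int) -> tuple[list[str], float]:
    """Keep the n candidate nodes with the highest cumulative path probability.

    Returns the selected paths (tokens joined by "/") and the estimated expected
    acceptance length, i.e. the sum of the selected path probabilities.
    """
    tie = itertools.count()  # tie-breaker so heapq never has to compare Node objects
    heap = [(-r.prob, next(tie), r, r.token) for r in roots]
    heapq.heapify(heap)
    paths, expected = [], 0.0
    while heap and len(paths) < n:
        neg_p, _, node, path = heapq.heappop(heap)
        paths.append(path)
        expected += -neg_p
        # Children become candidates only once their parent is kept, so the result is a
        # connected tree; path probability never grows with depth, so these are the top n.
        for child in node.children:
            heapq.heappush(heap, (neg_p * child.prob, next(tie), child, f"{path}/{child.token}"))
    return paths, expected


roots = [
    Node("The", 0.6, [Node("cat", 0.5), Node("dog", 0.3)]),
    Node("A", 0.35, [Node("cat", 0.9)]),
]
paths, expected = select_tree(roots, n=3)
print(paths, round(expected, 3))  # ['The', 'A', 'A/cat'] 1.265
```

In this toy example the deeper node "A/cat" (0.35 × 0.9 = 0.315) beats the shallower "The/cat" (0.3), which is why an adaptive tree can be deeper along confident branches than a fixed one.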
An example is shown below:
Blue tokens are drafted by llama-2-chat-7b and verified by llama-2-chat-70b in a single decoding step. Red tokens are generated by llama-2-chat-70b.
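The blue/red split follows from how a draft tree is checked. The sketch below assumes greedy decoding (`--temperature 0`) and replaces the base model with a hypothetical `target_next` callable that returns the base model's next token for a given prefix; since verification happens in a single decoding step, a real implementation would obtain all of these predictions from one forward pass over the tree. Accepted drafted tokens play the role of the blue tokens, and the extra token returned at the end plays the role of a red one.

```python
from typing import Callable

Prefix = tuple[str, ...]


def verify(
    draft_tree: dict[Prefix, list[str]],
    target_next: Callable[[Prefix], str],
) -> tuple[list[str], str]:
    """Greedy verification of a draft tree (hypothetical sketch, temperature 0).

    draft_tree maps a prefix of drafted tokens to the child tokens drafted after it.
    target_next returns the base model's greedy next token after a prefix.
    Returns the accepted draft tokens and the extra token produced by the base model.
    """
    accepted: list[str] = []
    prefix: Prefix = ()
    while True:
        predicted = target_next(prefix)
        if predicted in draft_tree.get(prefix, []):
            accepted.append(predicted)  # drafted token confirmed (a "blue" token)
            prefix += (predicted,)
        else:
            return accepted, predicted  # base model's own token (the "red" token)


tree = {(): ["I", "We"], ("I",): ["am", "think"], ("I", "am"): ["fine"]}
base_greedy = {(): "I", ("I",): "am", ("I", "am"): "happy"}
accepted, extra = verify(tree, lambda prefix: base_greedy[prefix])
print(accepted, extra)  # ['I', 'am'] happy
```

Here one decoding step emits three tokens: two drafted tokens accepted from the tree plus one token from the base model. Sampling at temperature > 0 would need a probabilistic acceptance rule instead of this exact-match check.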
Installation
pip install -r requirements.txt
Demo
With independent draft models
export CUDA_VISIBLE_DEVICES=0  # multiple GPUs are also supported (e.g. 0,1,2,3)
python demo_opt_classic.py
With EAGLE draft models
export CUDA_VISIBLE_DEVICES=0  # multiple GPUs are also supported (e.g. 0,1,2,3)
python demo_opt_eagle.py
Evaluation on datasets
With independent draft models
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m evaluation.eval_opt_classic \
--draft-model-path JackFram/llama-68m \
--base-model-path meta-llama/Llama-2-7b-chat-hf \
--bench-name mt_bench \
--answer-file ./mt_classic_opt.jsonl \
--temperature 0 \
--nodes 60 \
--threshold 0.5 \
--max_depth 10
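The `--nodes`, `--threshold`, and `--max_depth` flags are not documented here. One plausible reading, stated as an assumption rather than a description of `evaluation.eval_opt_classic`, is: keep at most `nodes` draft tokens, draft for at most `max_depth` steps, and stop early once an extra depth raises the estimated expected acceptance length by less than `threshold`. The sketch below implements that reading; `build_tree` and the `draft_probs`/`toy_draft` callables standing in for the draft model are hypothetical.

```python
import heapq
from typing import Callable

Prefix = tuple[str, ...]


def build_tree(
    draft_probs: Callable[[Prefix], dict[str, float]],
    nodes: int = 60,
    threshold: float = 0.5,
    max_depth: int = 10,
) -> tuple[list[Prefix], float, int]:
    """Grow a draft tree depth by depth (hypothetical reading of the CLI flags).

    Returns the selected paths, their estimated expected acceptance length
    (sum of cumulative path probabilities), and the number of depths drafted.
    """
    pool: dict[Prefix, float] = {}  # every drafted path -> cumulative probability
    frontier: list[Prefix] = [()]   # paths the draft model extends next
    tree: list[Prefix] = []
    expected = 0.0
    depth = 0
    for depth in range(1, max_depth + 1):
        new_paths = []
        for prefix in frontier:
            parent_p = pool.get(prefix, 1.0)  # the empty root prefix has probability 1
            for token, p in draft_probs(prefix).items():
                path = prefix + (token,)
                pool[path] = parent_p * p
                new_paths.append(path)
        # Keep the `nodes` most probable paths; ties keep insertion order, so a
        # parent is never dropped in favour of an equally probable child.
        chosen = heapq.nlargest(nodes, pool, key=pool.get)
        new_expected = sum(pool[path] for path in chosen)
        gain = new_expected - expected
        tree, expected = chosen, new_expected
        if gain < threshold:
            break  # another depth would add too little; keep the current tree
        chosen_set = set(chosen)
        frontier = [path for path in new_paths if path in chosen_set]
    return tree, expected, depth


def toy_draft(prefix: Prefix) -> dict[str, float]:
    """Hypothetical draft model: confident for three tokens, then uncertain."""
    return {"a": 0.8, "b": 0.1} if len(prefix) < 3 else {"a": 0.2, "b": 0.1}


tree, expected, depth = build_tree(toy_draft, nodes=4, threshold=0.1)
print(depth, round(expected, 4), tree[-1])  # 4 2.0544 ('a', 'a', 'a', 'a')
```

Under this reading, a larger `threshold` stops drafting sooner (cheaper drafting, shallower trees), while `nodes` bounds how many tokens the base model has to verify per step.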
With EAGLE draft models
EAGLE draft models can be downloaded from https://github.com/SafeAILab/EAGLE.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m evaluation.eval_opt_eagle \
--ea-model-path yuhuili/EAGLE-llama2-chat-7B \
--base-model-path meta-llama/Llama-2-7b-chat-hf \
--bench-name mt_bench \
--answer-file ./mt_eagle_opt.jsonl \
--temperature 0 \
--nodes 60 \
--threshold 0.5 \
--max_depth 10
Citation
@article{wang2025opt,
title={Opt-tree: Speculative decoding with adaptive draft tree structure},
author={Wang, Jikai and Su, Yi and Li, Juntao and Xia, Qingrong and Ye, Zi and Duan, Xinyu and Wang, Zhefeng and Zhang, Min},
journal={Transactions of the Association for Computational Linguistics},
volume={13},
pages={188--199},
year={2025},
publisher={MIT Press}
}