January 29, 2026

This is the PyTorch implementation of our ICLR'26 paper: Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling.

Abstract

Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.

Framework

Currently, SpecTTS-Bench supports evaluating the following open-source speculative decoding methods: EAGLE-3, Speculative Sampling, Prompt Lookup Decoding, TokenRecycling, REST, Lookahead Decoding, PIA, SAM-Decoding, SAM[EAGLE-3].
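For readers new to the n-gram-based family (e.g., Prompt Lookup Decoding), the sketch below illustrates the general idea: reuse tokens that followed an earlier occurrence of the current suffix as a draft, then keep only the part the target model agrees with. This is a simplified illustration, not the code in ./code/model/. The function names, the toy `next_token` callable, and the default sizes are assumptions made for this example. Real implementations verify all draft tokens in a single batched forward pass, whereas this sketch calls the target once per token for clarity.

```python
from typing import Callable, List, Sequence


def propose_ngram_draft(tokens: Sequence[int], ngram_size: int = 3,
                        max_draft: int = 5) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    if len(tokens) <= ngram_size:
        return []
    suffix = list(tokens[-ngram_size:])
    # Scan backwards from the most recent position, excluding the suffix itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == suffix:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []


def verify_greedy(prefix: Sequence[int], draft: Sequence[int],
                  next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Accept the longest draft prefix matching the target's greedy choice.

    Always returns at least one token: either the target's correction at the
    first mismatch, or one extra target token if the whole draft is accepted.
    """
    accepted: List[int] = []
    for tok in draft:
        target_tok = next_token(list(prefix) + accepted)
        if target_tok != tok:
            accepted.append(target_tok)  # replace the rejected draft token
            return accepted
        accepted.append(tok)
    accepted.append(next_token(list(prefix) + accepted))  # extra token after full acceptance
    return accepted


if __name__ == "__main__":
    # Hypothetical stand-in for a target model: deterministically cycles 1..5.
    def toy_target(ctx: Sequence[int]) -> int:
        return (ctx[-1] % 5) + 1

    context = [1, 2, 3, 4, 5, 1, 2, 3]
    draft = propose_ngram_draft(context)
    print("draft:", draft)
    print("accepted:", verify_greedy(context, draft, toy_target))
```

Because repetitive reasoning traces make such suffix matches frequent, a long draft can often be accepted per target step, which is the effect the benchmark measures for this family.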

Requirement

Install the necessary packages.

conda create -n specTTS python=3.10
torch==2.1.1+cu121
transformers==4.43.1 # for DeepSeek-R1-Distill-Llama-8B
transformers==4.53.1 # for Qwen3 series

More details about the environment are provided in ./code/environment_deepseek.txt and ./code/environment_qwen3.txt.
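Since the two model families pin different transformers versions, a quick check before launching a run can catch a mismatched environment. The snippet below is a minimal sketch: the version numbers are the pins listed above, while the helper names and the substring-based matching on the model name are assumptions made for this example.

```python
from importlib import metadata

# Pins taken from the requirement list above; the keys are illustrative labels.
TRANSFORMERS_PINS = {"deepseek": "4.43.1", "qwen3": "4.53.1"}


def expected_transformers_version(model_name: str) -> str:
    """Return the pinned transformers version for a model name (hypothetical helper)."""
    name = model_name.lower()
    if "qwen3" in name:
        return TRANSFORMERS_PINS["qwen3"]
    if "deepseek" in name:
        return TRANSFORMERS_PINS["deepseek"]
    raise ValueError(f"No pinned transformers version known for {model_name!r}")


def check_transformers(model_name: str) -> bool:
    """Report whether the installed transformers version matches the pin."""
    expected = expected_transformers_version(model_name)
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
        return False
    ok = installed == expected
    status = "OK" if ok else "MISMATCH"
    print(f"{model_name}: installed {installed}, expected {expected} [{status}]")
    return ok


if __name__ == "__main__":
    check_transformers("DeepSeek-R1-Distill-Llama-8B")
    check_transformers("Qwen3-8B")
```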

Code Structure

SpecTTS-Bench/
├── code/                        # Core implementation directory
│   ├── scripts/                 # Shell scripts to execute the benchmarks
│   │   ├── deepseek.sh          # 🚀 Run here: Script for DeepSeek-R1-Distill-Llama-8B
│   │   └── qwen3.sh             # 🚀 Run here: Script for Qwen3 series
│   ├── model/                   # Speculative decoding methods
│   ├── evaluation/              # Launching inference with speculative decoding
│   ├── data/                    # Reasoning Dataset
│   ├── environment_deepseek.txt # Python dependency requirements for DeepSeek-R1-Distill-Llama-8B
│   └── environment_qwen3.txt    # Python dependency requirements for Qwen3 series
├── fig/                         # Figures and images for the README/Paper
├── LICENSE                      # MIT License
└── README.md                    # Main project documentation

Run

cd code
bash scripts/deepseek.sh # or: bash scripts/qwen3.sh for the Qwen3 series

We provide the checkpoints for REST here.

Model Weight

Download the corresponding model weights (if required) and update the checkpoint paths in code/scripts/*.

| Model | Type | URL |
|---|---|---|
| DeepSeek-R1-Distill-Llama-8B | Target Model | Link |
| Qwen3-4B | Target Model | Link |
| Qwen3-8B | Target Model | Link |
| Qwen3-14B | Target Model | Link |
| Qwen3-0.6B | Draft Model | Link |
| EAGLE3-DeepSeek-R1-Distill-LLaMA-8B | Draft Model | Link |
| EAGLE3-Qwen3-4B | Draft Model | Link |
| EAGLE3-Qwen3-8B | Draft Model | Link |
| EAGLE3-Qwen3-14B | Draft Model | Link |
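Before editing the scripts, it can help to confirm that each downloaded checkpoint is where you expect it. The sketch below assumes Hugging Face-style checkpoint directories that contain a config.json file. The helper name and the example paths are hypothetical and should be replaced with your own locations.

```python
from pathlib import Path
from typing import Dict, List


def missing_checkpoints(paths: Dict[str, str]) -> List[str]:
    """Return the names whose directory is absent or lacks a config.json file."""
    missing = []
    for name, location in paths.items():
        # Assumption: a usable checkpoint directory holds a config.json file.
        if not (Path(location) / "config.json").is_file():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical local paths; adjust to wherever the weights were downloaded.
    checkpoints = {
        "Qwen3-8B": "/path/to/Qwen3-8B",
        "EAGLE3-Qwen3-8B": "/path/to/EAGLE3-Qwen3-8B",
    }
    for name in missing_checkpoints(checkpoints):
        print(f"Checkpoint not found or incomplete: {name}")
```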

Acknowledgment of Open-Source Code Contributions

The code is based on the open-source repositories Spec-Bench, EAGLE, and Medusa. Many thanks to the authors!

You are welcome to cite our paper:

@inproceedings{SunLi25,
  title={Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling},
  author={Shengyin Sun and Yiming Li and Xing Li and Yingzhao Lian and Weizhe Lin and Hui-Ling Zhen and Zhiyuan Yang and Chen Chen and Xianzhi Yu and Mingxuan Yuan and Chen Ma},
  booktitle={arXiv:2509.04474},
  year={2025}
}