Substitute Speculative Decoding (SubSpec)

January 16, 2026 · View on GitHub

This repository is the official implementation of "Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding".

fig1

Table of Contents

Requirements

First, create and activate a conda environment with the following command:

conda create -n subspec python=3.11
conda activate subspec

Then, install PyTorch from the official website.

Install the rest of the base requirements:

pip install "smolagents[toolkit]"
pip install -r requirements.txt

If you want to use the Gradio UI:

pip install gradio

You will need to install the additional libraries for quantization:

  • HQQ (Default)
pip install hqq
pip install gemlite==0.5.1.post1
  • HIGGS (optional)
pip install flute-kernel

# Install the fast-hadamard-transform library
git clone https://github.com/Dao-AILab/fast-hadamard-transform.git
cd fast-hadamard-transform
pip install -e .
cd ..

Usage

All entrypoints go through run.main by setting a YAML config (--config).

  • Method templates: configs/methods/
  • Migrated offloading experiment configs: configs/exp_offloading/

Precedence: method defaults < YAML < CLI.

# Run from a method YAML
python -m run.main --config configs/methods/classic_sd.yaml run-test

# Run from an exp_offloading YAML (includes a recipe + per-model offload settings)
python -m run.main --config configs/exp_offloading/vanilla_qwen_7b.yaml run-test

# Override YAML values from the CLI
python -m run.main --config configs/methods/classic_sd.yaml --device cuda:1 --warmup-iter 0 run-test

Verification is configured via generator_kwargs.verify_method and generator_kwargs.verify_kwargs.

Example (lossy tree verification). threshold_method selects the gate: entropy (paper) uses normalized entropy hj<θh_j < \theta; prob uses a probability gate on the target distribution. window_size requires at least that many exact-match tokens after a lossy-accepted child.

generator_kwargs:
  verify_method: lossy
  verify_kwargs:
		threshold_method: entropy
		threshold: 0.3
    window_size: 6

CLI equivalents: --verify-method lossy --threshold-method entropy --threshold <threshold> --window-size <window_size>.

Offloading YAML configs parameterize “how many layers remain on GPU” via:

recipe:
  class_path: specdecodes.helpers.recipes.offload.layer_offload:LayerOffloadRecipe
  kwargs:
    keep_first_n_layers_on_gpu: <N>

Profiling (Nsight Systems + NVTX)

This repo already contains NVTX ranges in the code. To capture them with Nsight Systems, enable profiling via YAML (or CLI):

nvtx_profiling: true
nsys_output: nsight_report

Then run as usual:

python -m run.main --config configs/methods/<method_name>.yaml run-test

You can also override via CLI:

python -m run.main --config configs/methods/<method_name>.yaml --nvtx-profiling --nsys-output my_report run-test

Detailed analysis

Enable detailed analysis logging via YAML (or --detailed-analysis/--no-detailed-analysis):

detailed_analysis: true

When enabled, extra per-step diagnostic data is stored in the JSONL output via wandb_logger.log_data["detailed_analysis"].


### Available Methods
The following methods are available (registered in `run/core/presets.py`):
- `subspec_sd`: Substitute Speculative Decoding (Offloading + HQQ Quantization)
- `classic_sd`: Standard Speculative Decoding
- `vanilla`: Base LLM inference (no speculative decoding)
- `eagle_sd`: EAGLE Speculative Decoding
- ...and others (`subspec_sd_v2`, `subspec_sd_no_offload`, etc.)

### Common Arguments
- `--config`: Path to a YAML config (required). CLI args override YAML.
- `--method`: Optional override for the YAML `method`.
- `--device`: Target device (e.g., `cuda:0`, `cuda:1`). Defaults to `cuda:0`.
- `--warmup-iter`: Number of warmup iterations. Default varies by method (typically 1).
- `--compile-mode`: Torch compile mode (e.g., `reduce-overhead`, `max-autotune`, or `none`). Defaults to `none`.

## Evaluation

```bash
# Quick sanity check
python -m run.main --config configs/methods/<method_name>.yaml run-test

# Detailed benchmark run
python -m run.main --config configs/methods/<method_name>.yaml run-benchmark --benchmarks <benchmarks> --max-samples 20

Examples

1. Evaluate SubSpec on MT-Bench with specific GPU:

python -m run.main --config configs/methods/subspec_sd.yaml --device "cuda:0" run-benchmark --benchmarks mt-bench --max-samples 20

2. Run a quick test with Classic SD on a different GPU:

python -m run.main --config configs/methods/classic_sd.yaml --device "cuda:1" --warmup-iter 0 run-test

Selectable benchmarks: "mt-bench", "human-eval", "gsm8k", "alpaca", "cnn-dm", "aime", "gpqa", "math-500", and "livecodebench".

The datasets and pretrained models will be downloaded automatically from Hugging Face.

Results

SubSpec achieves superior performance on various benchmarks.

Below is the result for accelerating Qwen2.5 7B with tree-based speculative decoding using different draft models, running 20 samples on MT-Bench:

Draft Modeltokens/secτ
EAGLE-27.563.90
Qwen2.5 1.5B15.1411.91
SubSpec24.2928.35

τ represents average acceptance length, which is the the mean number of the accepted draft tokens per iteration.

For EAGLE's draft model, you will need to download the pretrained model manually, then convert it with the 'convert_eagle_weights.ipynb' script before use.

Interfaces

Gradio Demo

Launch an interactive chat UI:

python -m run.main --config configs/methods/<method_name>.yaml run-gradio --host 127.0.0.1 --port 7860

To expose it on your network (or use a public share link):

python -m run.main --config configs/methods/<method_name>.yaml run-gradio --host 0.0.0.0 --port 7860
python -m run.main --config configs/methods/<method_name>.yaml run-gradio --share

OpenAI-Compatible API Server

This repo includes an OpenAI-compatible HTTP server implemented in run/pipelines/run_api.py.

Run the server

  1. Activate your environment:
conda activate ~/envs/subspec
  1. Start the server via the unified entry point:
python -m run.main --config configs/methods/<method_name>.yaml run-api --host 0.0.0.0 --port 8000

Example: eagle_sd

eagle_sd requires both a target model (--llm-path) and a draft model (--draft-model-path). The default draft path in the preset is:

~/checkpoints/eagle/official/EAGLE-Llama-3.1-8B-Instruct

Run:

python -m run.main \
	--config configs/methods/eagle_sd.yaml \
	--llm-path meta-llama/Llama-3.1-8B-Instruct \
	--draft-model-path ~/checkpoints/eagle/official/EAGLE-Llama-3.1-8B-Instruct \
	--device cuda:0 \
	--warmup-iter 0 \
	run-api --host 0.0.0.0 --port 8000

Quick API checks

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Chat completions (non-stream):

curl -s http://127.0.0.1:8000/v1/chat/completions \
	-H 'Content-Type: application/json' \
	-d '{"messages":[{"role":"user","content":"Say OK"}],"max_tokens":16,"temperature":0}' | jq .

Chat completions (stream):

curl -N http://127.0.0.1:8000/v1/chat/completions \
	-H 'Content-Type: application/json' \
	-d '{"messages":[{"role":"user","content":"Count to three"}],"max_tokens":32,"temperature":0,"stream":true}'

Testing

Run the unit tests with:

pytest -q

Some tests that run real model workloads are skipped by default. To enable them:

SUBSPEC_RUN_REAL_MODEL_TESTS=1 pytest -q