Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
March 8, 2026
This repository contains the reference implementation for the paper:
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Paper link: https://arxiv.org/abs/2603.05210
Citation
If you find this paper useful or use this code, please cite:
@article{shoham2026balancing,
  title={Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding},
  author={Shoham, Ofir Ben},
  journal={arXiv preprint arXiv:2603.05210},
  year={2026}
}
Overview
Speculative decoding accelerates LLM inference by using a lightweight draft model to propose candidate tokens, which the target model then verifies in parallel. The draft model's language modeling (LM) head, which projects hidden states to vocabulary logits, often dominates draft latency, accounting for over 60% of FLOPs for models like LLaMA-3-8B.
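To see where a figure of that size can come from, here is a rough back-of-the-envelope sketch. It assumes a draft model with a single LLaMA-style decoder layer (an EAGLE-style assumption, not something this README states) and uses LLaMA-3-8B shapes. It counts only matmul weights, which is proportional to per-token matmul FLOPs, and ignores embeddings, normalization, attention-score computation, and any extra fusion layers. The function names are illustrative and are not part of this repository.

```python
# Rough weight-count estimate of the LM head's share of a single-layer draft model.
# Assumed shapes; this is not a latency measurement.

def lm_head_params(hidden_size: int, vocab_size: int) -> int:
    """Weights in the hidden -> vocabulary projection."""
    return hidden_size * vocab_size


def decoder_layer_params(hidden_size: int, intermediate_size: int,
                         num_heads: int, num_kv_heads: int) -> int:
    """Weights in one LLaMA-style decoder layer (GQA attention + gated MLP)."""
    head_dim = hidden_size // num_heads
    kv_dim = head_dim * num_kv_heads
    # q and o projections are hidden x hidden; k and v are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    return attention + mlp


def lm_head_share(vocab_size: int, hidden_size: int = 4096,
                  intermediate_size: int = 14336,
                  num_heads: int = 32, num_kv_heads: int = 8) -> float:
    """Fraction of matmul weights that belong to the LM head."""
    head = lm_head_params(hidden_size, vocab_size)
    body = decoder_layer_params(hidden_size, intermediate_size,
                                num_heads, num_kv_heads)
    return head / (head + body)


full_share = lm_head_share(128256)     # full target vocabulary
trimmed_share = lm_head_share(13264)   # trimmed draft vocabulary
print(f"LM head share, full vocabulary:    {full_share:.1%}")
print(f"LM head share, trimmed vocabulary: {trimmed_share:.1%}")
```

Under these assumptions the LM head holds well over half of the draft model's matmul weights at the full vocabulary size, which is why shrinking the vocabulary has a large effect on draft cost.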
We formulate draft vocabulary selection as a constrained optimization problem that balances token coverage against draft model latency. The objective is a utility function that combines coverage and latency reduction, weighted by --coverage-weight and optionally constrained by --min-coverage.
Token coverage is computed over assistant response tokens only (matching training loss masks), and latency reduction is estimated using architecture-aware FLOPs. The optimization is performed using Optuna's Tree-structured Parzen Estimator (TPE).
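The following is a minimal sketch of the idea, not the code in optimize_vocab_size.py. It makes several assumptions: the utility is a plain weighted sum of coverage and latency reduction (the paper's exact form may differ), latency reduction is approximated by the relative drop in matmul weights, and a simple grid search stands in for Optuna's TPE sampler so the example needs only the standard library. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple


def assistant_token_counts(token_ids: Iterable[Sequence[int]],
                           loss_masks: Iterable[Sequence[int]]) -> Counter:
    """Count token frequencies only at positions where the loss mask is 1."""
    counts: Counter = Counter()
    for ids, mask in zip(token_ids, loss_masks):
        counts.update(t for t, m in zip(ids, mask) if m)
    return counts


def coverage_at(counts: Counter, k: int) -> float:
    """Fraction of counted tokens covered by the k most frequent token ids."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


def latency_reduction(k: int, full_vocab: int, hidden: int, body_params: int) -> float:
    """Assumed proxy: relative drop in matmul weights when the LM head shrinks."""
    full = hidden * full_vocab + body_params
    trimmed = hidden * k + body_params
    return 1.0 - trimmed / full


def utility(cov: float, lat: float, coverage_weight: float = 0.5) -> float:
    """Assumed weighted sum of coverage and latency reduction."""
    return coverage_weight * cov + (1.0 - coverage_weight) * lat


def best_vocab_size(counts: Counter, candidates: Iterable[int], full_vocab: int,
                    hidden: int, body_params: int, coverage_weight: float = 0.5,
                    min_coverage: float = 0.0) -> Optional[Tuple[int, float]]:
    """Grid search (stand-in for TPE) that skips sizes below the coverage floor."""
    best: Optional[Tuple[int, float]] = None
    for k in candidates:
        cov = coverage_at(counts, k)
        if cov < min_coverage:
            continue
        lat = latency_reduction(k, full_vocab, hidden, body_params)
        score = utility(cov, lat, coverage_weight)
        if best is None or score > best[1]:
            best = (k, score)
    return best


# Toy data: two tokenized sequences; a mask value of 1 marks assistant positions.
ids = [[1, 2, 3, 3, 4], [1, 3, 3, 5, 3]]
masks = [[0, 0, 1, 1, 1], [0, 1, 1, 1, 1]]
counts = assistant_token_counts(ids, masks)
result = best_vocab_size(counts, candidates=range(1, 6), full_vocab=10,
                         hidden=4, body_params=8, min_coverage=0.8)
print(result)
```

Counting only masked positions keeps prompt and template tokens from inflating coverage, which mirrors the loss mask used when training the draft model.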
Installation
git clone https://github.com/Ofir408/Balancing-Coverage-and-Draft-Latency-in-Vocabulary-Trimming-for-Faster-Speculative-Decoding.git
cd Balancing-Coverage-and-Draft-Latency-in-Vocabulary-Trimming-for-Faster-Speculative-Decoding
pip install -r requirements.txt
Usage
Basic Usage
python optimize_vocab_size.py \
--dataset-path ./data/train.jsonl \
--tokenizer-path meta-llama/Llama-3.1-8B-Instruct \
--chat-template llama3 \
--n-calls 100
Arguments
| Argument | Default | Description |
|---|---|---|
| --dataset-path | required | Path to training dataset (JSONL with conversations field) |
| --tokenizer-path | required | HuggingFace tokenizer path or name |
| --chat-template | llama3 | Chat template (llama3, llama4, qwen, deepseek, phi3, phi4, gemma) |
| --is-preformatted | False | Set if dataset already contains formatted text |
| --coverage-weight | 0.5 | Weight for coverage in utility function |
| --min-coverage | 0.0 | Minimum acceptable token coverage (0.0 = no constraint) |
| --min-vocab-size | 50 | Lower bound of search space |
| --max-vocab-size | 50000 | Upper bound of search space |
| --n-calls | 30 | Number of TPE optimization trials |
| --sample-size | all | Subset of dataset to use |
| --config-path | None | Path to draft model config JSON to auto-update draft_vocab_size |
| --output-dir | ./cache/vocab_optimization | Directory for results and plots |
| --random-state | 42 | Seed for reproducibility |
Dataset Format
The script expects JSONL files with a conversations field:
{"conversations": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Or preformatted text (use --is-preformatted):
{"text": "<|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi!<|eot_id|>"}
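As a quick sanity check before running the optimizer, a dataset file can be inspected with a few lines of standard-library Python. This is a sketch under the assumption that each line carries exactly one of the two fields described above; the helper names are hypothetical and are not part of this repository.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse non-empty JSONL lines into dicts."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def record_kind(record: Dict) -> str:
    """Classify a record by which of the two documented fields it carries."""
    if "conversations" in record:
        return "conversations"
    if "text" in record:
        return "preformatted"
    raise ValueError("record has neither 'conversations' nor 'text'")


def assistant_texts(record: Dict) -> List[str]:
    """Return assistant message contents from a conversations-style record."""
    return [turn["content"] for turn in record.get("conversations", [])
            if turn.get("role") == "assistant"]


# In practice, pass an open file handle instead of this in-memory sample.
sample = [
    '{"conversations": [{"role": "user", "content": "Hello"}, '
    '{"role": "assistant", "content": "Hi!"}]}',
    '{"text": "<|start_header_id|>user<|end_header_id|>\\n\\nHello<|eot_id|>"}',
]
records = list(iter_records(sample))
kinds = [record_kind(r) for r in records]
replies = assistant_texts(records[0])
print(kinds, replies)
```

A file that mixes both kinds is likely a mistake, since --is-preformatted applies to the whole dataset.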
Output
The script produces:
- summary.json: Optimal vocabulary size, coverage, latency reduction, and utility score.
- full_results.json: All evaluated configurations with detailed metrics.
- optimization_results.png: Four-panel visualization:
  - Coverage vs. vocabulary size
  - Latency reduction vs. vocabulary size
  - Utility score vs. vocabulary size
  - Pareto front (coverage vs. latency)
Using the Result
After optimization, set draft_vocab_size in your EAGLE-3 draft model training config to the selected vocabulary size. For example:
{
"hidden_size": 4096,
"vocab_size": 128256,
"draft_vocab_size": 13264
}
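The --config-path option is described above as updating this field automatically. If you prefer to edit the config yourself, a stand-alone helper along these lines would do it. This is a hypothetical sketch, not the script's own code; the file name and the check against vocab_size are assumptions added for illustration.

```python
import json
import os
import tempfile


def set_draft_vocab_size(config_path: str, draft_vocab_size: int) -> dict:
    """Write draft_vocab_size into a JSON config, keeping all other keys."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Assumed sanity check: the draft vocabulary is a subset of the full one.
    if draft_vocab_size > config.get("vocab_size", draft_vocab_size):
        raise ValueError("draft_vocab_size cannot exceed vocab_size")
    config["draft_vocab_size"] = draft_vocab_size
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config


# Demo against a throwaway file so no real config is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "draft_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"hidden_size": 4096, "vocab_size": 128256}, f)
    updated = set_draft_vocab_size(path, 13264)
    print(updated)
```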
For training the draft model with SpecForge:
python scripts/train_eagle3.py --config configs/your_config.json
Reproducing Paper Results
Open-PerfectBlend
To reproduce the vocabulary optimization that yields a 13,264-token draft vocabulary:
python optimize_vocab_size.py \
--dataset-path /path/to/open-perfectblend.jsonl \
--tokenizer-path meta-llama/Llama-3.1-8B-Instruct \
--chat-template llama3 \
--sample-size 50000 \
--n-calls 100 \
--random-state 42 \
--output-dir ./results/open-perfectblend