June 26, 2025
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
| Paper | Slide | Poster | Video |
News 🔥
- [2025/6] Token Recycling is selected for an oral presentation at ACL 2025! See you in Vienna! 🇦🇹
- [2025/5] Token Recycling is accepted by ACL 2025!
Introduction
We introduce Token Recycling:
- A plug-and-play speculative decoding method to accelerate LLM inference!
- Only requires minimal storage space (<2MB), without the need for a draft model or a data store!
- Achieves approximately 2x speedup on all sizes of LLMs!
Background: Retrieval-Based Speculative Decoding
Traditional speculative decoding methods face significant limitations:
- Model-based methods require training and maintaining additional draft models, increasing complexity and computational overhead.
- Retrieval-based methods need large-scale datastores and time-consuming retrieval, or suffer from a retrieval space that lacks flexibility and adaptability.
Token Recycling: Turning Trash into Treasure!
LLMs generate a probability distribution over the vocabulary at each step, yet existing methods keep only the sampled token, treating all others as 'trash'. We observe that these other high-probability candidate tokens frequently appear in subsequent generations, making them a valuable but overlooked 'treasure'.
Based on this observation, we propose Token Recycling (TR). It uses an adjacency matrix to store candidate tokens for every token in the vocabulary. At each generation step, a breadth-first search (BFS) is performed on this matrix, starting from the last token of the current input, to build a draft tree. After the model forward pass with tree attention and verification, the newly generated candidate tokens are used to update the adjacency matrix. In this way, 'trash' tokens are turned into 'treasure'!
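The loop described above can be sketched as follows. This is a minimal illustration under assumed names and shapes, not the repository's actual implementation (see `evaluation/inference_recycling.py` for that):

```python
import torch

class RecyclingMatrix:
    """Minimal sketch of Token Recycling's adjacency matrix.
    Class and method names are illustrative, not the repository's API."""

    def __init__(self, vocab_size, topk=8):
        # matrix[t] holds the top-k candidate successors last seen for token t
        self.matrix = torch.zeros(vocab_size, topk, dtype=torch.long)
        self.topk = topk

    def draft_tree(self, root_token, tree_template):
        """BFS-style expansion: each template path [i, j, ...] selects the
        i-th candidate of the root, then the j-th candidate of that token,
        and so on, producing one draft token per path."""
        draft = []
        for path in tree_template:
            tok = root_token
            for idx in path:
                tok = int(self.matrix[tok, idx])
            draft.append(tok)
        return draft

    def update(self, tokens, topk_candidates):
        """After verification, recycle the new top-k candidates of every
        token that went through the forward pass."""
        self.matrix[tokens] = topk_candidates
```

Each decoding step then drafts a tree from the last accepted token, verifies it in one forward pass with tree attention, and feeds the resulting top-k candidates back via `update`.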

Experimental Results

On Spec-Bench, TR achieves more than a 2x speedup on the 7B model, nearly 30% higher than previous training-free methods. Even compared with Medusa, which requires tuning, it shows an improvement of almost 25%. For the 13B and 33B models, it consistently provides nearly 2x speedup, maintaining the 30% acceleration advantage.
| Method | Memory (MB) | Tokens/s | Speedup |
|---|---|---|---|
| Eagle1 | 500 | 106.94 | 2.08 |
| Eagle2 | 500 | 116.95 | 2.28 |
| TR | 1.95 | 107.52 | 2.09 |
TR is also compared with the SOTA speculative decoding methods Eagle1/2 on Spec-Bench. Surprisingly, despite being completely training-free and self-drafting, TR outperforms Eagle1. Moreover, TR achieves 91.23% of the acceleration performance of Eagle2 while requiring only 0.39% of its memory. It is worth noting that TR still employs a static tree, rather than the dynamic tree used in Eagle2. This highlights the remarkable effectiveness and efficiency of TR and shows potential for further improvement when combining TR with dynamic trees.
Usage
Install
This codebase is built on Spec-Bench; no additional packages are needed beyond those used by Spec-Bench.
```shell
git clone https://github.com/Luowaterbi/TokenRecycling.git
cd TokenRecycling
pip install -r requirements.txt
```
Note: Make sure you have the required models downloaded. The evaluation script expects models in the `../models/` directory relative to the project root.
Project Structure
```
TokenRecycling/
├── README.md                    # Project documentation
├── requirements.txt             # Python dependencies
├── eval.sh                      # Quick evaluation script
├── data/                        # Dataset files
├── evaluation/
│   ├── inference_recycling.py   # Main inference implementation
│   └── eval.py                  # Evaluation utilities
└── model/
    └── recycling/
        ├── tree_template_.py    # Tree structure definitions
        ├── kv_cache.py          # Key-value cache management
        └── modeling_llama_kv.py # LLaMA model modifications
```
Inference
Quick Start
```shell
bash eval.sh
```
Advanced Usage
You can customize the inference parameters:
```shell
python -m evaluation.inference_recycling \
    --model-path /path/to/your/model \
    --model-id your-model-id \
    --bench-name spec_bench \
    --dtype float16 \
    --output-id-topk 8 \
    --tree-version "2.2.2" \
    --temperature 0.0
```
Candidate Storage Configuration 🛠️
You can customize the number of candidate tokens stored for each token in the vocabulary using the --output-id-topk argument. This parameter controls how many high-probability "treasure" tokens are collected from the output distribution to build and update the adjacency matrix.
For implementation details, you can refer to evaluation/inference_recycling.py.
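As a sketch of what this parameter does, collecting the candidates amounts to a top-k over each position's output logits. The helper name and shapes here are assumptions, not the repository's API:

```python
import torch

def collect_candidates(logits, output_id_topk=8):
    """Keep the top-k token ids from each position's output distribution.

    `logits` has shape (num_verified_tokens, vocab_size); the returned
    tensor of shape (num_verified_tokens, output_id_topk) is what updates
    the adjacency matrix. (Illustrative helper, not the repository's API.)
    """
    return torch.topk(logits, k=output_id_topk, dim=-1).indices
```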
Tree Structure Configuration 🌳
Token Recycling uses a fixed tree template (version "2.2.2" by default) to organize candidate tokens in a tree structure.

Tree Representation:
- Root: Represents the current input's last token.
- Index Meaning: Numbers on the tree represent the node's index position in its parent's candidate list (0-7 for top-k=8).
```
Root (current input's last token)
├── [0] → 0th candidate token
│   ├── [0,0] → 0th candidate after [0]
│   │   └── [0,0,0] → 0th candidate after [0,0]
│   └── [0,1] → 1st candidate after [0]
├── [1] → 1st candidate token
│   └── [1,0] → 0th candidate after [1]
└── [2] → 2nd candidate token
```
Tree Template Implementation:
The tree structure is defined in model/recycling/tree_template_.py.
- Each node is represented by a path from the root: `[depth_0_index, depth_1_index, ...]`
  - `[0]`: the 0th child of the root
  - `[0,1]`: the 1st child of the root's 0th child
  - `[0,1,2]`: the 2nd child of the node at path `[0,1]`
- The current default template "2.2.2" has 80 nodes (excluding the root) organized in 6 levels:
  - Level 1: 8 nodes: `[0]`, `[1]`, ..., `[7]`
  - Level 2: 21 nodes like `[0,0]`, `[0,1]`, ..., `[7,0]`
  - Level 3: 25 nodes like `[0,0,0]`, `[0,0,1]`, ..., `[5,0,0]`
  - Level 4: 15 nodes like `[0,0,0,0]`, `[0,0,0,1]`, ..., `[3,0,0,0]`
  - Level 5: 8 nodes like `[0,0,0,0,0]`, `[0,0,0,0,1]`, ..., `[1,0,0,0,0]`
  - Level 6: 3 nodes: `[0,0,0,0,0,0]`, `[0,0,0,0,0,1]`, `[0,0,0,1,0,0]`
- Path `[0,1,2]` means: "from the root, take the 0th candidate; then from that token take the 1st candidate; then from that token take the 2nd candidate."
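The path semantics can be expressed in a few lines. This is an illustrative sketch, not the repository's actual code; `matrix` stands for any mapping from a token id to its stored top-k candidates:

```python
def resolve_path(matrix, root_token, path):
    """Walk a template path like [0, 1, 2] through the adjacency matrix:
    at each depth, pick the idx-th stored candidate of the current token.
    (Illustrative helper, not the repository's API.)"""
    tok = root_token
    for idx in path:
        tok = matrix[tok][idx]
    return tok
```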
How to Customize Tree Structure:
- Add a new tree version with different branching factors:

  ```python
  if version == "mine":
      # More candidates at each level
      tree_template = [
          [0], [1], [2], [3], [4],     # 5 candidates at depth 1
          [0,0], [0,1], [1,0], [1,1],  # 4 candidates at depth 2
          # ... continue the pattern
      ]
  ```

- Use your custom tree:

  ```shell
  python -m evaluation.inference_recycling \
      --tree-version "mine" \
      # ... other parameters
  ```
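When defining a custom template, each non-root node's parent path must itself appear in the template, and every index must fit within `--output-id-topk`. A small sanity check (an illustrative helper, not part of the repository) might look like:

```python
def validate_tree_template(template, topk=8):
    """Sanity-check a custom tree template: every node's parent path must
    also appear in the template, and every index must fit within the
    top-k candidates stored per token."""
    paths = {tuple(p) for p in template}
    for path in template:
        if any(i < 0 or i >= topk for i in path):
            return False
        if len(path) > 1 and tuple(path[:-1]) not in paths:
            return False
    return True
```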
Todo
We are actively working to expand Token Recycling's compatibility and ease of use. Contributions are welcome!
- Adapt to newer models (currently based on Vicuna-1.3)
- Transformers `model.generate()` support
- vLLM & SGLang integration
Seeking help: If you have experience with these integrations and would like to contribute, please contact us! Your help would be greatly appreciated!
Citation
```bibtex
@article{luo2024turning,
  title={Turning trash into treasure: Accelerating inference of large language models with token recycling},
  author={Luo, Xianzhen and Wang, Yixuan and Zhu, Qingfu and Zhang, Zhiming and Zhang, Xuanyu and Yang, Qing and Xu, Dongliang},
  journal={arXiv preprint arXiv:2408.08696},
  year={2024}
}
```