README.md
May 19, 2026
Updates
(04/13/2026) New Features:
- More standardized evaluation pipeline under lm-evaluation-harness. The old customized evaluation pipeline following Trace-RL is currently deprecated.
- Two draft modes for FreeDave:
- batch_expanding: copies both the cache and the current block along the batch dimension; more memory overhead but less computation.
- tree_attention: copies only the current block along the sequence dimension and shares the cache through a block tree attention mask; much less memory overhead (usually ~20MB) but more computation.
- In our observations, tree_attention mode is usually a bit faster than batch_expanding mode despite the extra computation, because it avoids frequent cache copy and reduction operations.
- Threshold-based parallel decoding
- Previously, we followed Trace-RL and used its configuration, which sets the temperature to 0.1 for threshold-based parallel decoding. Under that setting, threshold-based parallel decoding is not robust and usually suffers a significant performance drop.
- We currently set the temperature to 0 for threshold-based parallel decoding by default. Under this configuration, threshold-based parallel decoding is more stable and closer to the results reported in Fast-dLLM.
- Additionally, FreeDave can now build on threshold-based parallel decoding as well, using a subset-based verification rule. Threshold-based FreeDave inherits the generation quality of threshold-based parallel decoding and further improves decoding efficiency.
- Add support for LLaDA
- The modeling_llada.py file on Hugging Face is not directly compatible with the cache implementation in our generation pipeline, so we made some modifications.
- The
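The trade-off between the two draft modes listed above can be illustrated with a small sketch. It is only one plausible reading of the description, not the code in generation/attn_utils.py: the function names (`block_tree_mask`, `kv_entries`, `attn_scores`) and the layout (cache keys first, then the draft copies of the current block) are assumptions made for illustration.

```python
from typing import List


def block_tree_mask(cache_len: int, block_len: int, num_drafts: int) -> List[List[bool]]:
    """Boolean attention mask for the assumed tree_attention layout.

    Rows are queries (all tokens of all draft copies); columns are keys
    (shared cache first, then every draft copy of the current block).
    Each draft copy sees the shared cache and itself, never another copy.
    """
    total_keys = cache_len + num_drafts * block_len
    mask = []
    for d in range(num_drafts):
        lo = cache_len + d * block_len
        hi = lo + block_len
        row = [k < cache_len or lo <= k < hi for k in range(total_keys)]
        mask.extend(list(row) for _ in range(block_len))
    return mask


def kv_entries(cache_len: int, block_len: int, num_drafts: int, mode: str) -> int:
    """Number of key/value positions held in memory per layer (toy count)."""
    if mode == "batch_expanding":  # cache is copied once per draft
        return num_drafts * (cache_len + block_len)
    if mode == "tree_attention":  # cache is shared, only the block is copied
        return cache_len + num_drafts * block_len
    raise ValueError(mode)


def attn_scores(cache_len: int, block_len: int, num_drafts: int, mode: str) -> int:
    """Number of query-key scores computed per layer (toy count)."""
    if mode == "batch_expanding":
        return num_drafts * block_len * (cache_len + block_len)
    if mode == "tree_attention":  # masked-out pairs are still computed
        return (num_drafts * block_len) * (cache_len + num_drafts * block_len)
    raise ValueError(mode)


if __name__ == "__main__":
    for mode in ("batch_expanding", "tree_attention"):
        print(mode, kv_entries(100, 4, 3, mode), attn_scores(100, 4, 3, mode))
```

With a cache of 100 tokens, a block of 4 tokens, and 3 drafts, this toy count gives fewer stored key/value positions but more attention scores for tree_attention, which matches the memory and computation trade-off described above.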
(09/23/2025) Current Supported Models: TraDo, SDAR, Dream
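As a rough illustration of the threshold-based parallel decoding mentioned in the 04/13/2026 update, the sketch below unmasks every masked position whose greedy (temperature 0) prediction is confident enough. The function name `threshold_step`, the default threshold of 0.9, the use of the plain softmax probability as confidence, and the fallback to the single most confident position are all assumptions for illustration; the actual logic lives in generation/core.py and may differ.

```python
import math
from typing import Dict, List, Tuple


def threshold_step(
    logits_per_pos: Dict[int, List[float]], threshold: float = 0.9
) -> Dict[int, int]:
    """Pick the tokens to unmask in one decoding step.

    logits_per_pos maps each still-masked position to its vocabulary logits.
    Returns a mapping from position to the token chosen for it.
    """
    picks: Dict[int, Tuple[int, float]] = {}
    for pos, logits in logits_per_pos.items():
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
        tok = exps.index(max(exps))  # temperature 0: greedy argmax
        picks[pos] = (tok, exps[tok] / sum(exps))

    # Unmask every position whose confidence reaches the threshold.
    accepted = {p: t for p, (t, c) in picks.items() if c >= threshold}
    if not accepted:
        # Assumed fallback: always unmask the most confident position so
        # that decoding makes progress.
        best = max(picks, key=lambda p: picks[p][1])
        accepted = {best: picks[best][0]}
    return accepted


if __name__ == "__main__":
    step = threshold_step(
        {3: [0.0, 6.0, 0.0], 4: [1.0, 1.2, 0.9], 5: [5.0, 0.0, 0.0]}
    )
    print(step)  # positions 3 and 5 are confident, position 4 stays masked
```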
Overview
We propose FreeDave (Free Draft-and-Verification), a training-free and model-free fast decoding algorithm for diffusion language models. It achieves lossless parallel decoding, with a theoretical guarantee of reproducing the same sequence generated by one-token-per-step decoding.
Specifically, FreeDave uses a pipeline of parallel-decoded candidate generation and verification, in which the draft tokens decoded in parallel at the current step are verified by the model itself at the next step, without extra model forward passes or external modules.
The generation and verification of the draft candidates can be understood as byproducts of one-token-per-step decoding: they introduce no extra cost except for a slight memory overhead from the batched forward pass. Empirically, this yields an appreciable inference speedup while maintaining generation quality.
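The draft-and-verify idea described above can be sketched on a toy problem. This is a simplified reading of the overview, not the repository's implementation: `toy_model`, `TARGET`, `BASE`, and the draft count `k` are hypothetical stand-ins, and the batched forward pass is simulated with a Python loop. A draft that unmasks the i+1 most confident positions is accepted only if one-token-per-step decoding, continued from the previous accepted draft, would have made the same choice.

```python
from typing import Dict, List, Optional, Tuple

MASK = None
Seq = List[Optional[int]]
Preds = Dict[int, Tuple[int, int]]  # masked position -> (token, confidence score)

# Hypothetical stand-in for a diffusion LM: a fixed token per position, and a
# confidence score that rises when neighbouring positions are already decoded.
TARGET = [5, 1, 4, 2, 3, 0]
BASE = [90, 20, 80, 30, 70, 10]


def toy_model(seq: Seq) -> Preds:
    preds: Preds = {}
    for p, tok in enumerate(seq):
        if tok is MASK:
            neighbours = sum(
                1 for q in (p - 1, p + 1) if 0 <= q < len(seq) and seq[q] is not MASK
            )
            preds[p] = (TARGET[p], BASE[p] + 10 * neighbours)
    return preds


def ranked(preds: Preds) -> List[int]:
    # Most confident first; ties are broken by the lower position.
    return sorted(preds, key=lambda p: (-preds[p][1], p))


def decode_sequential(n: int) -> Tuple[Seq, int]:
    """Baseline: unmask exactly one token per forward pass."""
    seq: Seq = [MASK] * n
    calls = 0
    while MASK in seq:
        preds = toy_model(seq)
        calls += 1
        p = ranked(preds)[0]
        seq[p] = preds[p][0]
    return seq, calls


def decode_draft_and_verify(n: int, k: int = 3) -> Tuple[Seq, int]:
    """Draft up to k steps ahead, then verify the drafts with one batched pass."""
    seq: Seq = [MASK] * n
    preds, calls = toy_model(seq), 1
    while MASK in seq:
        order = ranked(preds)[:k]
        drafts: List[Seq] = []
        cur = list(seq)
        for p in order:  # draft i unmasks the i+1 most confident positions
            cur = list(cur)
            cur[p] = preds[p][0]
            drafts.append(cur)
        draft_preds = [toy_model(d) for d in drafts]  # one batched forward pass
        calls += 1
        accepted = 0  # drafts[0] is exactly the one-token step
        for i in range(1, len(drafts)):
            step = draft_preds[i - 1]
            nxt = ranked(step)[0]
            if nxt == order[i] and step[nxt][0] == preds[order[i]][0]:
                accepted = i
            else:
                break
        # The predictions for the accepted draft are reused as the next step.
        seq, preds = drafts[accepted], draft_preds[accepted]
    return seq, calls


if __name__ == "__main__":
    print(decode_sequential(6))
    print(decode_draft_and_verify(6, k=3))
```

In this toy setting both decoders return the same sequence, while the draft-and-verify loop needs fewer (batched) forward passes. How many drafts are accepted per step in practice depends on the model and the input.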
Experimental Results
Repo Structure
FreeDave/
├── chat/ # Chat demos
│ ├── full_attn_dlm.py # Full-attention DLMs (Dream/LLaDA)
│ └── block_attn_dlm.py # Block-attention DLMs (TraDo/SDAR)
├── generation/ # Core generation module
│ ├── core.py # Core decoding algorithms implementation
│ ├── attn_utils.py # Attention mask utilities
│ ├── cache_utils.py # KV cache utilities
│ ├── monitor_utils.py # Model forward monitor + debug utilities
│ └── sampling_utils.py # Sampling utilities
└── eval/ # evaluation under lm_eval framework
    ├── README.md
    ├── eval_dream.py # Dream wrapper with inference stats integrated
    ├── eval_llada.py # LLaDA wrapper with inference stats integrated
    ├── eval_trado.py # TraDo/SDAR wrapper with inference stats integrated
    ├── scripts/ # eval scripts
    └── tasks/ # Customized lm-eval tasks
Quick Start
Environment Setup
conda create --name freedave python=3.10
conda activate freedave
pip install torch==2.6.0
pip install --no-cache-dir \
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt
Chat Examples
We provide some examples of multi-turn chat for a quick start.
python -m chat.full_attn_dlm --model_name <hf_model_path>
python -m chat.block_attn_dlm --model_name <hf_model_path>
Inference & Evaluations
Please see eval/README.md for more details. We have built a more standardized evaluation pipeline under lm-evaluation-harness. The old customized evaluation pipeline following Trace-RL is currently deprecated.
Citation
@article{wu2025free,
title={Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models},
author={Wu, Shutong and Zhang, Jiawei},
journal={arXiv preprint arXiv:2510.00294},
year={2025}
}
Acknowledgement
This repository is heavily built on Trace-RL, and we have also referenced the following open-source projects:
and theoretical foundations: