README.md

May 19, 2026

Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

Updates

(04/13/2026) New Features:

More standardized evaluation pipeline under lm-evaluation-harness. The old customized evaluation pipeline following Trace-RL is currently deprecated.
Two draft modes for FreeDave:
- batch_expanding: copy both cache and current block along the batch dimension; more memory overhead but less computation.
- tree_attention: copy only current block along the sequence dimension and share cache using a block tree attention mask; much less memory overhead (usually ~20MB) but more computation.
- In our observations, tree_attention mode is usually a bit faster than batch_expanding mode despite the extra computation, because it avoids frequent cache copy and reduction operations.
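
One way to picture the tree_attention mode is as a boolean mask over the shared cache followed by several draft copies of the current block laid out along the sequence dimension. The sketch below is an illustration under assumptions, not the repository's implementation: the function name `block_tree_mask`, its parameters, and the rule that each draft copy attends to the whole cache and only to its own copy are all hypothetical.

```python
def block_tree_mask(cache_len, block_len, num_drafts):
    """Build a boolean attention mask (True = may attend).

    Rows are query positions: num_drafts copies of the current block,
    concatenated along the sequence dimension.
    Columns are key positions: the shared cache, then the same draft copies.
    Assumption for this sketch: every draft copy sees the full cache and
    itself, but never another draft copy.
    """
    num_queries = num_drafts * block_len
    mask = []
    for q in range(num_queries):
        branch = q // block_len  # which draft copy this query belongs to
        cache_part = [True] * cache_len
        draft_part = [(k // block_len) == branch for k in range(num_queries)]
        mask.append(cache_part + draft_part)
    return mask


# Example: a cache of 2 tokens shared by 2 draft copies of a 2-token block.
mask = block_tree_mask(cache_len=2, block_len=2, num_drafts=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the cache columns are shared by all rows, only the current block is duplicated, which is consistent with the small memory overhead described above.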
Threshold-based parallel decoding
- Previously, we followed Trace-RL and used their configuration, which sets the temperature to 0.1 for threshold-based parallel decoding. This setting is not robust and usually causes a significant performance drop.
- We now set the temperature to 0 for threshold-based parallel decoding by default. Under this configuration, threshold-based parallel decoding is more stable and closer to the results reported in Fast-dLLM.
- Additionally, FreeDave can now build on threshold-based parallel decoding as well, using a subset-based verification rule. Threshold-based FreeDave inherits the generation quality of threshold-based parallel decoding and unlocks further decoding efficiency.
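
As a rough illustration of the threshold-based rule at temperature 0, the sketch below greedily picks the most likely token at each masked position and unmasks every position whose probability exceeds a threshold, falling back to the single most confident position so that each step makes progress. The function names, the input layout, and the 0.9 default are assumptions made for this example, not the repository's API, and the subset-based verification rule is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_unmask(masked_logits, threshold=0.9):
    """Choose which masked positions to fill in at this step.

    masked_logits: {position: [logit for each vocabulary id]} for the
    positions that are still masked (hypothetical input layout).
    Returns {position: token_id}.
    """
    greedy = {}
    for pos, logits in masked_logits.items():
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)  # temperature 0
        greedy[pos] = (token, probs[token])

    # Unmask everything that clears the confidence threshold.
    chosen = {p: tok for p, (tok, conf) in greedy.items() if conf > threshold}
    if not chosen:
        # Fallback: always unmask the single most confident position.
        best = max(greedy, key=lambda p: greedy[p][1])
        chosen = {best: greedy[best][0]}
    return chosen


logits = {0: [5.0, 0.0, 0.0], 1: [1.0, 0.9, 0.8], 2: [0.0, 6.0, 0.0]}
print(threshold_unmask(logits, threshold=0.9))
```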
Add support for LLaDA
- The modeling_llada.py file on Hugging Face is not directly compatible with the cache implementation in our generation pipeline, so we modified it.

(09/23/2025) Current Supported Models: TraDo, SDAR, Dream

We propose FreeDave (Free Draft-and-Verification), a training-free and model-free fast decoding algorithm for diffusion language models. It achieves lossless parallel decoding with a theoretical guarantee of reproducing the same sequence generated by one-token-per-step decoding.

Specifically, FreeDave uses a pipeline of parallel-decoded candidate generation and verification, in which the draft tokens decoded in parallel at the current step are verified by the model itself at the next step, without extra model forward passes or external modules.

The generation and verification of the draft candidates can be understood as byproducts of one-token-per-step decoding that introduce no extra cost beyond a slight memory overhead from the batched forward pass. Empirically, this has high potential to speed up inference appreciably while maintaining generation quality.
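
The draft-and-verify idea can be sketched on a toy problem. The code below is a simplified illustration under assumptions, not the implementation in generation/core.py: `toy_model` is a deterministic stand-in for a model forward pass, and `ranked`, `decode_sequential`, `decode_draft_verify`, and `max_draft` are hypothetical names. Each draft candidate unmasks one more position than the previous one, all candidates are forwarded together, and a candidate is accepted only if the one-token-per-step decoder would have made the same choice.

```python
MASK = -1  # placeholder id for a still-masked position


def toy_model(seq):
    """Deterministic stand-in for a forward pass (purely illustrative).

    Returns {position: (token, confidence)} for every masked position.
    """
    preds = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        left_known = i == 0 or seq[i - 1] != MASK
        if i % 2 == 0:
            token = (3 * i + 1) % 5       # does not depend on context
        elif left_known:
            token = (seq[i - 1] + i) % 5  # depends on the left neighbor
        else:
            token = 0
        preds[i] = (token, (0.9 if left_known else 0.5) - 0.01 * i)
    return preds


def ranked(preds):
    """Masked positions by confidence, highest first (ties: leftmost)."""
    return sorted(preds, key=lambda p: (-preds[p][1], p))


def decode_sequential(seq):
    """Baseline: unmask exactly one token per forward pass."""
    seq, forwards = list(seq), 0
    while MASK in seq:
        preds = toy_model(seq)
        forwards += 1
        pos = ranked(preds)[0]
        seq[pos] = preds[pos][0]
    return seq, forwards


def decode_draft_verify(seq, max_draft=4):
    """Draft several steps ahead, then verify them with one batched forward."""
    seq, forwards = list(seq), 0
    if MASK not in seq:
        return seq, forwards
    preds = toy_model(seq)
    forwards += 1
    while MASK in seq:
        # Candidate k unmasks the k+1 most confident positions.
        order = ranked(preds)[:max_draft]
        candidates, cur = [], list(seq)
        for pos in order:
            cur = list(cur)
            cur[pos] = preds[pos][0]
            candidates.append(cur)
        if MASK not in candidates[0]:  # last token: nothing left to verify
            seq = candidates[0]
            break
        # Stands in for one batched forward pass over all candidates.
        cand_preds = [toy_model(c) for c in candidates]
        forwards += 1
        accepted = 0  # candidate 0 is exactly the one-token step
        for k in range(1, len(candidates)):
            prev = cand_preds[k - 1]
            nxt = ranked(prev)[0]  # what the baseline would unmask next
            if nxt == order[k] and prev[nxt][0] == candidates[k][nxt]:
                accepted = k
            else:
                break
        # The accepted candidate's predictions seed the next round of drafts.
        seq, preds = candidates[accepted], cand_preds[accepted]
    return seq, forwards


prompt = [1, 2] + [MASK] * 6
baseline, n_seq = decode_sequential(prompt)
fast, n_fast = decode_draft_verify(prompt)
print(baseline == fast, n_seq, n_fast)
```

On this toy input both decoders return the same sequence, and the draft-and-verify loop needs fewer forward passes because some drafts are accepted. Rejected drafts cost nothing beyond the larger batch, since the forward pass that verifies them is the one the baseline would run anyway.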

Experimental Results

Repo Structure

FreeDave/
├── chat/                   # Chat demos
│   ├── full_attn_dlm.py     # Full-attention DLMs (Dream/LLaDA)
│   └── block_attn_dlm.py    # Block-attention DLMs (TraDo/SDAR)
├── generation/             # Core generation module
│   ├── core.py              # Core decoding algorithms implementation
│   ├── attn_utils.py        # Attention mask utilities
│   ├── cache_utils.py       # KV cache utilities
│   ├── monitor_utils.py     # Model forward monitor + debug utilities
│   └── sampling_utils.py    # Sampling utilities
└── eval/                   # evaluation under lm_eval framework
    ├── README.md
    ├── eval_dream.py        # Dream wrapper with inference stats integrated
    ├── eval_llada.py        # LLaDA wrapper with inference stats integrated
    ├── eval_trado.py        # TraDo/SDAR wrapper with inference stats integrated
    ├── scripts/             # eval scripts
    └── tasks/               # Customized lm-eval tasks

Quick Start

Environment Setup

conda create --name freedave python=3.10
conda activate freedave
pip install torch==2.6.0
pip install --no-cache-dir \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt
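
After installation, a quick sanity check can confirm that the key packages are discoverable in the active environment. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and the packages checked (`torch`, `flash_attn`) are taken from the commands above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torch", "flash_attn"])
    print("missing:", missing if missing else "none")
```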

Chat Examples

We provide some examples of multi-turn chat for a quick start.

For full-attention models like LLaDA and Dream:

python -m chat.full_attn_dlm --model_name <hf_model_path>

For block-attention models like TraDo and SDAR:

python -m chat.block_attn_dlm --model_name <hf_model_path>

Inference & Evaluations

Please see eval/README.md for details. The evaluation pipeline is now built on lm-evaluation-harness; the old customized pipeline following Trace-RL is deprecated.

Citation

@article{wu2025free,
  title={Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models}, 
  author={Wu, Shutong and Zhang, Jiawei},
  journal={arXiv preprint arXiv:2510.00294},
  year={2025}
}

Acknowledgement

This repository is heavily built on Trace-RL, and we have also referenced the following open-source projects:

and theoretical foundations: