Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
Yuerong Song¹,², Xiaoran Liu¹,², Ruixiao Li¹,², Zhigeng Liu¹,², Zengfeng Huang¹,², Qipeng Guo²,³, Ziwei He²,†, Xipeng Qiu¹,²,†
¹Fudan University, ²Shanghai Innovation Institute, ³Shanghai AI Laboratory
Introduction
In this work, we present Sparse-dLLM, a training-free framework that tackles the core bottleneck of diffusion large language models (dLLMs): quadratic-time computational complexity. While prior caching methods accelerate dLLMs by reusing full-layer KV states, they incur substantial memory overhead that constrains long-context applications. Our analysis reveals a distinctive property of dLLM attention—persistent cross-layer sparsity with stable token saliency over decoding steps—suggesting that many cached entries are low-relevance and can be safely discarded.
Building on these observations, we integrate dynamic cache eviction with sparse attention via a delayed bidirectional sparse caching strategy. Sparse-dLLM retains pivotal tokens and dynamically evicts unimportant prefix and suffix entries using an attention-guided strategy, while delaying cache updates by one step to stabilize selection. This plug-and-play design prunes redundant cache states without retraining, accelerates dLLM decoding, and preserves a near-identical peak memory footprint compared with vanilla dLLMs, enabling practical long-context inference.
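To make the eviction step concrete, here is a minimal, hypothetical PyTorch sketch of attention-guided eviction with the one-step delay; it is not the released implementation, and the function name, tensor shapes, and `prev_attn` argument are our own illustration. The `keep_ratio` argument mirrors the `--keep_ratio` flag used by the speed-evaluation scripts later in this README (saliency pooling over a `--kernel_size` window is omitted for brevity).

```python
import torch

def evict_cache(kv_cache: torch.Tensor, prev_attn: torch.Tensor, keep_ratio: float = 0.5):
    """Illustrative sketch of attention-guided cache eviction (not the released code).

    kv_cache:  cached states, shape (num_tokens, ...).
    prev_attn: attention weights from the *previous* decoding step,
               shape (num_heads, num_queries, num_tokens). Scoring with the
               previous step's attention is the one-step delay that
               stabilizes token selection.
    """
    # Saliency of a cached token = mean attention it received last step.
    saliency = prev_attn.mean(dim=(0, 1))          # (num_tokens,)
    k = max(1, int(keep_ratio * saliency.numel()))
    keep = saliency.topk(k).indices.sort().values  # preserve original token order
    return kv_cache[keep], keep
```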
On LLaDA and Dream series, Sparse-dLLM delivers up to 10× higher throughput than vanilla dLLMs, maintaining comparable performance and outperforming recent dLLM caching methods in efficiency–effectiveness trade-off. Our study thus establishes the first method that combines dynamic cache eviction with sparse attention for dLLMs, and provides empirical evidence and analysis that chart a path toward scalable, fast, and memory-efficient dLLM decoding.
Installation
Prepare Your OpenCompass
We run our downstream evaluations with OpenCompass.
git clone https://github.com/open-compass/opencompass
cd opencompass
pip install -e .
The required Python packages and the versions we use:
opencompass==0.4.2
torch==2.6.0
transformers==4.46.3
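If OpenCompass itself comes from the editable install above, one way to pin the remaining packages (assuming a standard pip setup) is:

```bash
pip install torch==2.6.0 transformers==4.46.3
```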
Prepare Your Model and Benchmarks
Copy the directory Sparse-dLLM/opencompass/ to your OpenCompass directory and add the following lines to the end of opencompass/models/__init__.py:
from .sparse_dllm.llada_wrapper import Sparse_dLLM_LLaDACausalLM
from .sparse_dllm.dream_wrapper import Sparse_dLLM_DreamCausalLM
from .sparse_dllm.dream_wrapper_instruct import Sparse_dLLM_DreamCausalLMInstruct
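As a quick sanity check that the wrappers are registered (run from your OpenCompass directory, assuming the copy above succeeded):

```bash
python -c "from opencompass.models import Sparse_dLLM_LLaDACausalLM, Sparse_dLLM_DreamCausalLM, Sparse_dLLM_DreamCausalLMInstruct"
```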
Evaluation
Copy the directory Sparse-dLLM/myeval/ to your OpenCompass directory; you can then run the following evaluations.
Performance Evaluation
Go to your OpenCompass directory and run performance evaluation:
python run.py myeval/eval_performance/eval_sparse_dllm_***.py
Replace *** with the corresponding model name (e.g., dream_base, dream_chat, llada_chat, llada_1.5).
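For example, to evaluate the Dream base model:

```bash
python run.py myeval/eval_performance/eval_sparse_dllm_dream_base.py
```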
Speed Evaluation
Go to your OpenCompass directory and run the corresponding script. For example:
bash myeval/eval_speed/eval_speed_dream_example.sh
bash myeval/eval_speed/eval_speed_llada_example.sh
Or run the Python code directly (with parameters):
python myeval/eval_speed/dream_sparse_dllm.py \
    --model_path <MODEL_PATH> \
    --model_type <MODEL_TYPE> \
    --data_path <DATA_PATH> \
    --data_type <DATA_TYPE> \
    --output_dir <OUTPUT_DIR> \
    --kernel_size 3 \
    --keep_ratio 0.5 \
    --block_length 32 \
    --apply_chat_template True
See the code for more details.
Results
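Full benchmark tables and throughput comparisons are reported in the paper (arXiv:2508.02558).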
Citation
@article{song2025sparse,
  title={Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction},
  author={Song, Yuerong and Liu, Xiaoran and Li, Ruixiao and Liu, Zhigeng and Huang, Zengfeng and Guo, Qipeng and He, Ziwei and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2508.02558},
  year={2025}
}