SparseD: Sparse Attention for Diffusion Language Models

February 22, 2026

Zeqing Wang¹, Gongfan Fang¹, Xinyin Ma¹, Xingyi Yang², Xinchao Wang¹
¹ xML Lab, National University of Singapore
² The Hong Kong Polytechnic University

📚 Introduction

SparseD is a sparse attention method for diffusion language models (DLMs) that delivers near-lossless acceleration. It uses full attention and computes sparse patterns during the early denoising steps, then reuses these patterns in later steps to restrict computation and improve efficiency. Extensive experiments show that SparseD largely preserves accuracy on the evaluated benchmarks while achieving up to $1.50\times$ speedup at a 64k context length with 1,024 steps.
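
The schedule described above can be sketched as a small Python loop. This is a minimal illustration under stated assumptions, not the repository's implementation: `run_denoising`, `full_step`, and `sparse_step` are hypothetical names, and the way the number of full-attention steps is rounded (`int(total_steps * skip)`, with at least one full step) is an assumption.

```python
from typing import Any, Callable, List


def run_denoising(
    total_steps: int,
    skip: float,
    full_step: Callable[[int], Any],
    sparse_step: Callable[[int, Any], None],
) -> List[str]:
    """Run a denoising loop that switches from full to sparse attention.

    full_step(step) runs one step with full attention and returns a sparse
    pattern; sparse_step(step, pattern) runs one step restricted to it.
    Both callables are placeholders for real model code.
    """
    # Assumption: the first int(total_steps * skip) steps use full attention,
    # with at least one such step so that a pattern always exists.
    n_full = max(1, int(total_steps * skip))
    pattern = None
    modes = []
    for step in range(total_steps):
        if step < n_full:
            pattern = full_step(step)  # the latest pattern replaces earlier ones
            modes.append("full")
        else:
            sparse_step(step, pattern)  # the pattern is reused, not recomputed
            modes.append("sparse")
    return modes


if __name__ == "__main__":
    modes = run_denoising(
        total_steps=10,
        skip=0.2,
        full_step=lambda step: f"pattern@{step}",
        sparse_step=lambda step, pattern: None,
    )
    print(modes.count("full"), modes.count("sparse"))  # 2 8
```

The point of the sketch is only the control flow: the cost of computing patterns is paid in the early steps, and every later step reuses the stored pattern.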

The overview of SparseD

🛠️ Setup

conda create -n SparseD python=3.10
conda activate SparseD
pip install -r requirements.txt

🚀 Usage

# For Dream Model
python dream_generation.py --origin   
python dream_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python dream_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context

# For LLaDA Model
python llada_generation.py --origin 
python llada_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python llada_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context

Arguments:

--model_path: The model path, e.g., Dream-org/Dream-v0-Instruct-7B or GSAI-ML/LLaDA-1.5.
--seq_len, --steps, --block_length, --sampling-alg: The inference configuration for diffusion generation. block_length applies only to the LLaDA model.
--origin: Run inference with the original model.
--skip, --select, --block_size: The inference configuration for SparseD. skip is the fraction of all denoising steps, counted from the start, that use full attention; select is the selection ratio for sparse attention; and block_size is the block size used when selecting important query-key pairs.
--prompt: The prompt used for a simple test. Available length versions: ["short_context", "4k", "8k", "16k", "32k", "64k"].
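
To make `--select` and `--block_size` more concrete, here is a minimal pure-Python sketch of block-level selection. It is an illustration only, not the code in this repository: `num_blocks` and `select_key_blocks` are hypothetical helpers, the block scores are taken as given (how they are computed is not covered here), and keeping `ceil(select * number_of_key_blocks)` blocks per query block is an assumed rounding rule.

```python
import math
from typing import List


def num_blocks(seq_len: int, block_size: int) -> int:
    """Number of blocks needed to cover seq_len tokens."""
    return math.ceil(seq_len / block_size)


def select_key_blocks(block_scores: List[List[float]], select: float) -> List[List[int]]:
    """For each query block, keep the indices of the highest-scoring key blocks.

    block_scores[i][j] is an importance score for (query block i, key block j).
    """
    kept = []
    for row in block_scores:
        # Assumption: keep ceil(select * number_of_key_blocks), at least one.
        k = max(1, math.ceil(len(row) * select))
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        kept.append(sorted(top))  # ascending key-block indices
    return kept


if __name__ == "__main__":
    scores = [
        [0.1, 0.9, 0.3, 0.5],
        [0.7, 0.2, 0.6, 0.1],
    ]
    print(select_key_blocks(scores, select=0.5))  # [[1, 3], [0, 2]]
    # Illustrative numbers: a 4096-token sequence with block_size 128.
    print(num_blocks(4096, 128))  # 32
```

Under these assumptions, a smaller `select` keeps fewer key blocks per query block, and a larger `block_size` makes the selection coarser.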

📑 Results

1. Accuracy

2. Latency

☀️ Note

Since our sparse attention is implemented with FlexAttention, we recommend running a warm-up inference first; subsequent inferences will be faster.
To better demonstrate the acceleration achieved by SparseD, we recommend evaluating it with long-context prompts, such as the 16k, 32k, and 64k versions. A short-context prompt is also provided for simple evaluation.
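
The warm-up recommendation can be followed with a small timing helper like the one below. This is a generic sketch, not part of the repository: `time_after_warmup` is a hypothetical name, and `fn` stands for whatever callable runs one full inference. On a GPU, one would typically pass a synchronization function such as `torch.cuda.synchronize` as `sync` so that asynchronous kernels are included in the measurement.

```python
import time
from typing import Callable, List, Optional


def time_after_warmup(
    fn: Callable[[], object],
    warmup: int = 1,
    runs: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Call fn `warmup` times untimed, then return wall-clock times of `runs` calls."""
    for _ in range(warmup):
        fn()  # warm-up call; its duration is discarded
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for pending device work before stopping the clock
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs one generation.
    timings = time_after_warmup(lambda: sum(range(100_000)), warmup=1, runs=3)
    print(f"mean latency: {sum(timings) / len(timings):.6f} s")
```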

🤓 Acknowledgments

Our sparse attention is accelerated by FlexAttention and implemented on Dream and LLaDA. We extend our gratitude to the community for their valuable contributions!

🔗 Citation

@inproceedings{
  wang2026sparsed,
  title={SparseD: Sparse Attention for Diffusion Language Models},
  author={Zeqing Wang and Gongfan Fang and Xinyin Ma and Xingyi Yang and Xinchao Wang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=dwbrZtYP04}
}