ACL2026: One Token Is Enough: Improving Diffusion Language Models with a Sink Token

April 7, 2026 · View on GitHub

Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.

News

Our paper is available!

Code for training DLM form scratch(based on SMDM) now released!

We thank Attention-Floating (Dai et al., 2026) for proposing a theory on the relationship between attention sink and in-context learning in diffusion language models. We conducted experiments related to in-context learning and found that the sink does indeed affect the model’s in-context learning ability, which support their theory. The relevant details will be added to the revision.

Our paper is accepted to acl2026 (findings)!

# e.g., 472M(0.5B) non-embedding parameters MDM and 300e18 training FLOPs, 8 GPUs
lightning run model \
    --node-rank=0  \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=1 \
    pretrain/train_mdm.py --model 472 --flops 300.

Pretrain DLMs with element-wise Gated Attention

# e.g., 472M original non-embedding parameters + extra param for Gated Attention, same training tokens
lightning run model \
    --node-rank=0  \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=1 \
    pretrain/train_mdm_gate.py --model 472 --flops 300.

Pretrain DLMs with extra token

lightning run model \
    --node-rank=0  \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=1 \
    pretrain/train_mdm_extratoken.py --model 472 --flops 300.

Evaluate DLMs after Pretraining

Commonsense reasoning and reading comprehension

We use the famous lm-evaluation-harness framework for evaluation.

Please make sure the modeling file in the evaluation code has been switched to the corresponding mode (diffmodel.py, diffmodel_extratoken.py, diffmodel_gate.py) before using evaluation commands!!!

We provide the running commands in eval_mdm.sh, eval_mdm_gate.sh and eval_mdm_extratoken.sh.

Supervised fine-tuning

Math reasoning

Please download the augmented training data and put the train.txt file in ./data/gsm8k.

lightning run model \
    --node-rank=0  \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=1 \
    sft/finetune_mdm_gsm8k.py --model 472 --pretrain_path yourpathtosavemodel

lightning run model \
    --node-rank=0  \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=1 \
    sft/finetune_mdm_gsm8k_gate.py --model 472 --pretrain_path yourpathtosavemodel

lightning run model \
    --node-rank=0  \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=1 \
    sft/finetune_mdm_gsm8k_extratoken.py --model 472 --pretrain_path yourpathtosavemodel

Math reasoning

Please download the GSM8K test data and put the test.jsonl into ./data/gsm8k

python evaluate_gsm8k.py --ckpt_path "your_path_to_sftmodel"

python evaluate_gsm8k_gate.py --ckpt_path "your_path_to_sftmodel"

python evaluate_gsm8k_extratoken.py --ckpt_path "your_path_to_sftmodel"

Citation

If you find our work useful in your research, please consider citing and star our repository:

@article{zhang2026one,
  title={One Token Is Enough: Improving Diffusion Language Models with a Sink Token},
  author={Zhang, Zihou and Xie, Zheyong and Zhong, Li and Liu, Haifeng and Cao, Shaosheng},
  journal={arXiv preprint arXiv:2601.19657},
  year={2026}
}