
March 4, 2026

Why Diffusion Language Models Struggle with Truly Parallel Decoding?

🤗 Hugging Face Dataset

Pengxiang Li*1    Dilxat Muhtar*2,3,4    Tianlong Chen6   Lu Yin*5   Shiwei Liu*2,3,4

1 The Hong Kong Polytechnic University, 2 ELLIS Institute Tübingen,
3 Max Planck Institute for Intelligent Systems, 4 Tübingen AI Center,
5 University of Surrey, 6 The University of North Carolina at Chapel Hill

[Paper] | [Blog]

Demo

Overview

Diffusion Language Models (DLMs) are often described as naturally parallel decoders. In practice, many decoding trajectories still collapse into autoregressive (AR-like) left-to-right behavior.

This project presents NAP (Non-Autoregressive Parallel DLMs), a data-decoding co-design framework:

  1. Parallel data curation: each training sample contains multiple independent reasoning trajectories, not a single privileged chain.
  2. Parallel-forced decoding: generation updates are enforced across multiple reasoning streams at every decoding step.

The goal is to reduce AR-like decoding bias while preserving (or improving) reasoning accuracy.

Key Diagnosis

  • Common corpora (FineWeb, OpenR1-Math) are strongly sequential.
  • AO (confidence-based arbitrary-order) decoding in standard DLMs still shows high ARness.
  • Long-CoT SFT further increases ARness.
  • Existing fast DLM decoding methods often improve speed by following an AR-like critical path.

TODOs

We will do our best to release the following:

  • [✅] Training code of NAP
  • [🚀] Datasets and Model weights
  • [🚀] Inference and evaluation code

Method

NAP uses a structured output canvas:

[<think #1>, R^(1), <think #2>, R^(2), ..., <think #m>, R^(m), <summary>, S]
  • R^(j) are independent reasoning paths.
  • <summary> aggregates evidence from all paths and outputs the final answer.
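
The canvas layout above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `MASK` placeholder, the `Block` record, the `build_canvas` helper, and the fixed per-block lengths are all assumptions made for the example.

```python
from dataclasses import dataclass

MASK = "<mask>"  # hypothetical placeholder for a still-masked position


@dataclass
class Block:
    name: str   # "R^(1)", ..., "R^(m)", or "S"
    start: int  # index of the first content slot (just after the tag)
    end: int    # exclusive end index


def build_canvas(num_paths: int, path_len: int, summary_len: int):
    """Lay out [<think #1>, R^(1), ..., <think #m>, R^(m), <summary>, S].

    Every content slot starts masked; only the structural tags are fixed.
    Equal-length reasoning blocks are an assumption of this sketch.
    """
    tokens, blocks = [], []
    for j in range(1, num_paths + 1):
        tokens.append(f"<think #{j}>")
        start = len(tokens)
        tokens.extend([MASK] * path_len)
        blocks.append(Block(f"R^({j})", start, len(tokens)))
    tokens.append("<summary>")
    start = len(tokens)
    tokens.extend([MASK] * summary_len)
    blocks.append(Block("S", start, len(tokens)))
    return tokens, blocks


tokens, blocks = build_canvas(num_paths=3, path_len=4, summary_len=2)
```

Keeping explicit `(start, end)` ranges per block makes it straightforward for a decoder to treat each reasoning path as its own region of the canvas.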

Decoding policy:

  • Macro-parallel: distribute unmasking budget across all reasoning blocks each step.
  • Micro-confidence: within each block, commit tokens by confidence (not strict left-to-right order).
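
One way to picture a single step of this policy is the sketch below. It assumes per-position confidence scores are already available from the model, splits the budget round-robin across blocks (the exact allocation rule is not specified here, so this is an assumption), and covers only the reasoning blocks. The names `split_budget` and `select_positions` are hypothetical.

```python
def split_budget(budget, masked_counts):
    """Macro-parallel: spread `budget` unmask slots as evenly as possible
    over blocks, never giving a block more slots than it has masked tokens."""
    alloc = [0] * len(masked_counts)
    remaining = budget
    while remaining > 0:
        progressed = False
        for i, count in enumerate(masked_counts):
            if remaining == 0:
                break
            if alloc[i] < count:
                alloc[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every block is already saturated
            break
    return alloc


def select_positions(confidence, is_masked, blocks, budget):
    """Micro-confidence: inside each block, commit the most confident
    masked positions, regardless of their left-to-right order.

    `blocks` is a list of (start, end) index ranges, end exclusive.
    """
    masked_counts = [sum(is_masked[s:e]) for s, e in blocks]
    alloc = split_budget(budget, masked_counts)
    chosen = []
    for (s, e), k in zip(blocks, alloc):
        candidates = [p for p in range(s, e) if is_masked[p]]
        candidates.sort(key=lambda p: confidence[p], reverse=True)
        chosen.extend(candidates[:k])
    return sorted(chosen)


confidence = [0.1, 0.9, 0.5, 0.2, 0.3, 0.8, 0.4, 0.7]
is_masked = [True] * 8
picked = select_positions(confidence, is_masked, [(0, 4), (4, 8)], budget=4)
```

In this toy run each block receives two slots, so both reasoning streams advance in the same step even though the four highest-confidence positions overall are not contiguous.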

Main Quantitative Results

GSM8K (Accuracy, %)

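| Steps | Tok/Step | LLaDA Long-CoT | NAP-LLaDA | Gain (LLaDA) | Dream Long-CoT | NAP-Dream | Gain (Dream) |
|---|---|---|---|---|---|---|---|
| 256 | 4 | 54.1 | 56.1 | +2.0 | 46.5 | 60.9 | +14.4 |
| 336 | 3 | 60.9 | 63.3 | +2.4 | 56.9 | 70.9 | +14.0 |
| 512 | 2 | 82.0 | 82.6 | +0.6 | 66.8 | 79.2 | +12.4 |
| 1024 | 1 | 83.5 | 84.1 | +0.6 | 78.0 | 83.6 | +5.6 |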

MATH-500 (Accuracy, %)

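| Steps | Tok/Step | LLaDA Long-CoT | NAP-LLaDA | Gain (LLaDA) | Dream Long-CoT | NAP-Dream | Gain (Dream) |
|---|---|---|---|---|---|---|---|
| 256 | 4 | 21.4 | 26.6 | +5.2 | 16.2 | 23.8 | +7.6 |
| 336 | 3 | 26.6 | 35.4 | +8.8 | 25.6 | 31.4 | +5.8 |
| 512 | 2 | 41.2 | 43.0 | +1.8 | 40.0 | 43.0 | +3.0 |
| 1024 | 1 | 45.0 | 47.0 | +2.0 | 47.4 | 49.6 | +2.2 |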

GPQA (Accuracy, %)

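| Steps | Tok/Step | LLaDA Long-CoT | NAP-LLaDA | Gain (LLaDA) | Dream Long-CoT | NAP-Dream | Gain (Dream) |
|---|---|---|---|---|---|---|---|
| 336 | 3 | 15.4 | 19.0 | +3.6 | 7.3 | 10.5 | +3.2 |
| 512 | 2 | 21.2 | 25.9 | +4.7 | 19.4 | 22.5 | +3.1 |
| 1024 | 1 | 23.0 | 28.6 | +5.6 | 28.6 | 29.5 | +0.9 |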

ARness Findings

  • AO decoding ARness (Global-ARness@1):
    • LLaDA-8B: 0.73
    • Dream-7B: 0.92
  • After Long-CoT SFT:
    • LLaDA-8B: 0.73 -> 0.81 (+0.08)
    • Dream-7B: 0.92 -> 0.93 (+0.01)

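Global-ARness@1 rises for both models after Long-CoT SFT (noticeably for LLaDA-8B, marginally for Dream-7B), which suggests that standard sequential supervision pushes decoding further toward AR-like behavior.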
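
This README does not spell out how Global-ARness@1 is computed. The sketch below assumes one common reading: the fraction of decoding steps in which the committed token is among the `k` leftmost still-masked positions, with one token committed per step. Both the definition and the `global_arness_at_k` helper are assumptions for illustration and may differ from the metric used in the paper.

```python
def global_arness_at_k(unmask_order, k=1):
    """Fraction of steps whose committed position is among the k leftmost
    positions still masked at that step (assumed definition).

    `unmask_order` lists positions in the order they were committed,
    one position per decoding step, each position appearing once.
    """
    if not unmask_order:
        raise ValueError("unmask_order must contain at least one position")
    remaining = sorted(unmask_order)  # positions still masked, left to right
    hits = 0
    for pos in unmask_order:
        if pos in remaining[:k]:
            hits += 1
        remaining.remove(pos)
    return hits / len(unmask_order)


left_to_right = global_arness_at_k([0, 1, 2, 3])
right_to_left = global_arness_at_k([3, 2, 1, 0])
mixed = global_arness_at_k([0, 2, 1, 3])
```

Under this reading, a strictly left-to-right trajectory scores 1.0, and any trajectory scores at least `1 / n` because the final remaining position is always the leftmost one.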

Citation