Prophet: Fast Decoding for Diffusion Language Models

April 28, 2026 · View on GitHub

Official implementation of "Diffusion Language Models Know the Answer Before Decoding"

Overview

Prophet is a training-free early-exit decoding paradigm for Diffusion Language Models (DLMs) that leverages the observation that correct answers often emerge early in the decoding process, well before the final step.

Key Features

Training-free: No additional training required, works directly with existing DLMs
Dynamic early-exit: Uses confidence gap between top-2 predictions as stopping criterion
Significant speedup: Up to 2.67× faster on planning tasks, 2.34× on general tasks
Quality preservation: Maintains or improves generation quality compared to full decoding
Model-agnostic: Compatible with different DLMs (tested on LLaDA-8B and Dream-7B)

Installation

Requirements

pip install torch>=2.0.0 transformers==4.38.2 accelerate datasets tqdm

Quick Start

git clone https://github.com/pixeli99/Prophet.git
cd Prophet

Usage

Basic Generation with Prophet Early Exit

import torch
from transformers import AutoModel, AutoTokenizer
from generate_earlyexit import generate

# Load model
model = AutoModel.from_pretrained(
    'GSAI-ML/LLaDA-8B-Instruct', 
    trust_remote_code=True, 
    torch_dtype=torch.bfloat16
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    'GSAI-ML/LLaDA-8B-Instruct', 
    trust_remote_code=True
)

# Prepare prompt
prompt = "What is the capital of France?"
messages = [{"role": "user", "content": prompt}]
prompt_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_ids = tokenizer(prompt_text, return_tensors="pt")['input_ids'].cuda()

# Generate with Prophet early exit
output, gap_data = generate(
    model, 
    input_ids,
    steps=256,
    gen_length=256,
    block_length=32,
    analyze_gap=True,
    answer_start_pos=input_ids.shape[1] + 200,  # Estimated answer position
    early_exit_thresholds={'early': 7.5, 'mid': 5.0, 'late': 2.5}
)

# Decode output
generated_text = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
print(f"Generated: {generated_text}")
print(f"Early exit: {gap_data['exit_info']['early_exit_triggered']} at step {gap_data['exit_info']['exit_decision_step']}")

Evaluation

Running Benchmarks

We provide evaluation scripts compatible with lm-evaluation-harness.

GSM8K Evaluation with Prophet

export HF_ENDPOINT=https://hf-mirror.com
export CUDA_VISIBLE_DEVICES=0,1

accelerate launch eval_llada.py \
  --tasks gsm8k_cot_zeroshot \
  --model llada_dist \
  --model_args model_path='/path/to/LLaDA-8B-Instruct',enable_early_exit=true,constraints_text="200:The|201:answer|202:is",gen_length=256,steps=256,block_length=32

Configuration Parameters

Prophet Early Exit Parameters

enable_early_exit: Enable Prophet early exit mechanism (default: false)
early_threshold: Gap threshold for early phase (0-33% progress, default: 7.5)
mid_threshold: Gap threshold for middle phase (33-67% progress, default: 5.0)
late_threshold: Gap threshold for late phase (67-100% progress, default: 2.5)
constraints_text: Constraint tokens to force generation (e.g., "200:The|201:answer|202:is")
answer_length: Number of answer tokens to monitor (default: 5)

Generation Parameters

steps: Total diffusion steps (default: 256)
gen_length: Maximum generation length (default: 256)
block_length: Block size for semi-autoregressive generation (default: 32)
temperature: Sampling temperature (default: 0.0)
cfg_scale: Classifier-free guidance scale (default: 0.0)
remasking: Remasking strategy ('low_confidence' or 'random', default: 'low_confidence')

Core Components

`generate_earlyexit.py`

Main generation function with Prophet early exit mechanism:

generate(): LLaDA generation with logits gap monitoring
should_early_exit(): Phase-aware exit decision logic

`generate.py`

Baseline LLaDA generation without early exit for comparison.

`eval_llada.py`

Evaluation harness integration with Prophet support.

`analysis/`

Trajectory-collection scripts (collect_decoding_traj_gsm8k.py, collect_decoding_traj.sh) and the figure-generation notebook (visualize.ipynb) for the answer-emergence analyses.

Analysis: Decoding Trajectory Visualization

The analysis/ folder reproduces the answer-emergence analyses and figures from the paper (e.g., the histograms showing what fraction of samples reach the correct answer by 25%/50% of the decoding steps, and the per-position change heatmaps). There are two ways to obtain the per-step decoding trajectories the notebook consumes:

Option 1: Collect trajectories yourself

cd analysis
bash collect_decoding_traj.sh

This runs collect_decoding_traj_gsm8k.py across all four {decode_policy} × {constraint_policy} configurations on the full GSM8K test split (1,319 questions) and writes per-question .pt files to question_histories_*/ folders. Each file stores:

x0_history — the model's denoised prediction (token IDs) at every decoding step, organised by block.
true_indices_history — which positions get committed at each step.
Answer metadata (pred_text, pred_ans, gt_text, ans_posidx, prompt_token_len, gen_ids, …).

Option 2: Download our precomputed trajectories from Hugging Face

We release the trajectories used in the paper for both GSM8K (1,319 questions × 4 settings) and MMLU-STEM (3,153 questions × 4 settings) on the Hugging Face Hub: YefanZhou98/DLM-Decoding-Analysis.

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="YefanZhou98/DLM-Decoding-Analysis",
    repo_type="dataset",
)

See the dataset card for the full schema, per-folder breakdown, and decoding configurations.

Reproduce the figures

Open analysis/visualize.ipynb and run the cells. The notebook expects the question_histories_*/ folders to be reachable from the current working directory (either generated locally via Option 1 or downloaded from Hugging Face via Option 2).

Citation

If you use Prophet in your research, please cite:

@article{li2025diffusion,
  title={Diffusion Language Models Know the Answer Before Decoding},
  author={Li, Pengxiang and Zhou, Yefan and Muhtar, Dilxat and Yin, Lu and Yan, Shilin and Shen, Li and Liang, Yi and Vosoughi, Soroush and Liu, Shiwei},
  journal={arXiv preprint arXiv:2508.19982},
  year={2025}
}

Acknowledgments

This implementation is based on the LLaDA model architecture. We thank the LLaDA team for open-sourcing their models and code.