Decoding Methods

May 26, 2026

Decoding methods accelerate diffusion large language model (DLLM) inference by changing the denoising schedule, the token selection rule, or the early-exit policy.

Supported Methods

| Method | Paper | Config |
| --- | --- | --- |
| Top-k | / | `generation=vanilla generation.steps=128` |
| Confidence Threshold | / | `generation=vanilla generation.threshold=0.9` |
| UNCODE | Paper | `generation=pc_sampler` |
| EB-Sampler | Paper | `generation=eb_sampler` |
| KLASS | Paper | `generation=klass` |
| WINO | Paper | `generation=wino` |
| dParallel | Paper | `model=dparallel_llada-inst` |
| DAEDAL | Paper | `generation=daedal` |

Quick Start

Run all decoding-method examples on LLaDA and GSM8K:

bash scripts/run_decoding_methods.sh

Unless noted otherwise, the examples below use model=llada-inst and dataset.name=gsm8k. The exception is a method that requires its own model config, such as dParallel. To run the standard-model examples with Dream, use model=dream-inst and set DREAM_INST_PATH. To switch benchmarks, replace dataset.name=gsm8k with any task available through lm-eval or the local tasks/ directory, for example humaneval_instruct, math-500, mbpp_instruct, ifeval, gpqa_main_generative_n_shot, or longbench.

Top-k

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=vanilla \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=128 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-topk-gsm8k

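Note: Top-k reduces the number of denoising steps by setting generation.steps smaller than generation.gen_length. With generation.gen_length=256 and generation.steps=128, each step has to commit two tokens on average (256 / 128).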
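
The selection rule behind this setting can be sketched in a few lines of plain Python. This is a minimal illustration and not the repository's implementation: the function name topk_unmask_step and the toy confidence values are hypothetical, and it assumes the per-step budget is simply gen_length divided by steps.

```python
def topk_unmask_step(confidences, masked, k):
    """Return the k still-masked positions with the highest confidence."""
    ranked = sorted(masked, key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:k])


# Assumed budget rule: tokens committed per step = gen_length / steps.
gen_length, steps = 256, 128
tokens_per_step = gen_length // steps

# Toy per-position confidences for five masked positions (made-up numbers).
confidences = [0.10, 0.95, 0.40, 0.80, 0.60]
masked = [0, 1, 2, 3, 4]
chosen = topk_unmask_step(confidences, masked, tokens_per_step)
print(chosen)
```

Lowering generation.steps raises the per-step budget, which is where the speedup comes from; the trade-off is that lower-confidence tokens get committed earlier.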

Confidence Threshold

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=vanilla \
    generation.threshold=0.9 \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-confidence-threshold-gsm8k

Note: The default is generation.threshold=0.9; every masked token whose confidence is greater than or equal to the threshold is accepted in the current denoising step.
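
A rough sketch of the acceptance rule described in the note, in plain Python. The function name threshold_unmask_step is hypothetical, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption added so the toy loop always makes progress; the repository may handle that case differently.

```python
def threshold_unmask_step(confidences, masked, threshold=0.9):
    """Accept every masked position whose confidence is >= threshold."""
    accepted = [i for i in masked if confidences[i] >= threshold]
    if not accepted and masked:
        # Assumed fallback (not from the source): commit the best candidate
        # so that each step unmasks at least one token.
        accepted = [max(masked, key=lambda i: confidences[i])]
    return accepted


# Toy confidences (made-up numbers): positions 0 and 2 clear the threshold.
print(threshold_unmask_step([0.95, 0.50, 0.90, 0.30], [0, 1, 2, 3]))
# Nothing clears the threshold here, so the fallback picks position 1.
print(threshold_unmask_step([0.20, 0.50], [0, 1]))
```

Unlike Top-k, the number of tokens accepted per step is not fixed: easy spans finish in a few steps, while uncertain spans fall back to slower decoding.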

UNCODE

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=pc_sampler \
    generation.debias=true \
    generation.clip_alpha=10 \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-uncode-gsm8k

Note: Defaults are generation.debias=true and generation.clip_alpha=10; debias enables token-frequency correction, and clip_alpha caps the correction strength.
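
To make the roles of debias and clip_alpha concrete, here is one way a capped frequency correction could look. The functional form is an assumption chosen for illustration (confidence scaled by a clipped negative log frequency); it is not taken from the repository or confirmed against the paper, and debiased_score and the frequency values are hypothetical.

```python
import math


def debiased_score(confidence, token_frequency, clip_alpha=10.0, debias=True):
    """Rescale a token's confidence by a capped frequency correction.

    Assumed form: score = confidence * min(-log(frequency), clip_alpha).
    """
    if not debias:
        return confidence
    correction = min(-math.log(token_frequency), clip_alpha)
    return confidence * correction


# A very common token at high confidence versus a rarer token at moderate
# confidence (frequencies are made-up corpus statistics).
common = debiased_score(0.9, 0.05)
rare = debiased_score(0.6, 0.0001)
print(common < rare)

# For extremely rare tokens the correction is capped at clip_alpha.
print(debiased_score(0.5, 1e-10))
```

Under this assumed form, the cap keeps extremely rare tokens from dominating the ranking, which matches the note's description of clip_alpha as a limit on correction strength.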

EB-Sampler

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=eb_sampler \
    generation.gamma=0.001 \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-eb-sampler-gsm8k

Note: Default is generation.gamma=0.001; gamma controls the EB-Sampler token-selection threshold used to decide how many tokens to accept.
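
The sketch below shows one reading of an entropy-bounded acceptance rule: rank masked positions by predictive entropy and keep adding positions while the summed entropy of the set, excluding its largest member, stays within gamma. That criterion is recalled from the EB-Sampler paper and should be treated as an assumption here; eb_select and the toy distributions are hypothetical names and values.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def eb_select(distributions, gamma=0.001):
    """Pick masked positions to accept, lowest entropy first.

    distributions maps position -> probability list over the vocabulary.
    """
    ranked = sorted(distributions, key=lambda i: entropy(distributions[i]))
    selected, total = [], 0.0
    for pos in ranked:
        # In ascending order the new position holds the set's max entropy,
        # so "sum minus max" equals the entropy already accumulated.
        if selected and total > gamma:
            break
        selected.append(pos)
        total += entropy(distributions[pos])
    return selected


toy = {0: [1.0, 0.0], 1: [0.99, 0.01], 2: [0.9, 0.1], 3: [0.5, 0.5]}
print(eb_select(toy, gamma=0.001))
print(eb_select(toy, gamma=0.1))
```

A larger gamma admits more tokens per step, and at least one token is always accepted, so decoding cannot stall.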

KLASS

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=klass \
    generation.kl_threshold=0.01 \
    generation.kl_history_length=2 \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-klass-gsm8k

Note: Defaults are generation.kl_threshold=0.01 and generation.kl_history_length=2; kl_threshold marks tokens as stable when their KL change stays below the threshold, and kl_history_length sets the history window.
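
The stability test described in the note can be sketched as follows. This is an illustration under stated assumptions: the direction of the KL divergence (current step against previous step), the reading of kl_history_length as the number of consecutive KL values that must stay below the threshold, and the names kl_divergence and is_stable are all hypothetical. The full method may combine this test with other criteria that are not shown.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def is_stable(history, kl_threshold=0.01, kl_history_length=2):
    """history: per-step distributions for one position, oldest first."""
    if len(history) < kl_history_length + 1:
        return False  # not enough steps to fill the history window
    recent = history[-(kl_history_length + 1):]
    return all(
        kl_divergence(cur, prev) < kl_threshold
        for prev, cur in zip(recent, recent[1:])
    )


settled = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
shifting = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
print(is_stable(settled), is_stable(shifting))
```

A longer history window makes the test more conservative, because a prediction has to stay put for more consecutive steps before it counts as stable.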

WINO

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=wino \
    generation.wide_in_thres=0.7 \
    generation.narrow_out_thres=0.9 \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-wino-gsm8k

Note: Defaults are generation.wide_in_thres=0.7 and generation.narrow_out_thres=0.9; wide_in_thres controls the wide-in candidate update set, and narrow_out_thres controls which low-confidence tokens are remasked.
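
A simplified sketch of the two thresholds working together: a lenient threshold drafts candidate tokens, and a stricter one decides which drafted tokens are remasked after re-scoring. The function names, the comparison directions, and the toy confidences are assumptions for illustration; how the repository re-scores drafted tokens is not shown here.

```python
def wide_in(draft_conf, masked, wide_in_thres=0.7):
    """Draft every masked position whose confidence clears the lenient bar."""
    return [i for i in masked if draft_conf[i] >= wide_in_thres]


def narrow_out(verify_conf, drafted, narrow_out_thres=0.9):
    """Remask drafted positions whose re-scored confidence is too low."""
    return [i for i in drafted if verify_conf[i] < narrow_out_thres]


# Toy confidences (made-up numbers) keyed by position.
draft_conf = {0: 0.95, 1: 0.75, 2: 0.40}
drafted = wide_in(draft_conf, masked=[0, 1, 2])

# Hypothetical re-scored confidences once the drafts are visible as context.
verify_conf = {0: 0.97, 1: 0.60}
remasked = narrow_out(verify_conf, drafted)
kept = [i for i in drafted if i not in remasked]
print(drafted, remasked, kept)
```

Keeping wide_in_thres below narrow_out_thres lets the decoder draft aggressively while still revoking tokens that do not hold up under verification.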

dParallel

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=vanilla \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    model=dparallel_llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/dparallel-llada-inst-gsm8k

Note: dParallel is selected through the model override model=dparallel_llada-inst rather than through a generation config; it uses the dParallel model wrapper and keeps the same vanilla generation config.

DAEDAL

accelerate launch \
    --num_machines 1 \
    --num_processes 1 \
    eval.py \
    dataset.name=gsm8k \
    batch_size=1 \
    seed=1234 \
    attn_implementation=sdpa \
    flash_attention=true \
    generation=daedal \
    generation.initial_gen_length=64 \
    generation.block_length=32 \
    generation.gen_length=256 \
    generation.steps=256 \
    generation.max_gen_length=2048 \
    model=llada-inst \
    hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-daedal-gsm8k

Note: Defaults are generation.initial_gen_length=64 and generation.max_gen_length=2048; initial_gen_length sets the initial decoding budget, and max_gen_length caps adaptive expansion.
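
The relationship between the two length settings can be illustrated with a toy expansion rule. Only the starting value and the cap come from the note above; the expand_by increment, the needs_more_room signal, and the function name next_gen_length are hypothetical stand-ins for whatever criterion the method actually uses to decide that the current budget is too short.

```python
def next_gen_length(current_length, needs_more_room, expand_by=64, max_gen_length=2048):
    """Grow the generation budget when asked, never past max_gen_length."""
    if not needs_more_room:
        return current_length
    return min(current_length + expand_by, max_gen_length)


# Start from initial_gen_length=64 and replay a made-up sequence of signals.
length = 64
schedule = [length]
for needs_more_room in [True, True, False]:
    length = next_gen_length(length, needs_more_room)
    schedule.append(length)
print(schedule)

# Near the cap, expansion is clipped to max_gen_length.
print(next_gen_length(2000, True))
```

Starting small keeps short answers cheap, while the cap bounds the worst-case cost for prompts that keep requesting more room.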