Decoding Methods
May 26, 2026
Decoding methods accelerate diffusion LLM (DLLM) inference by changing the denoising schedule, the token selection rule, or the early-exit policy.
Supported Methods
| Method | Paper | Config |
|---|---|---|
| Top-k | / | generation=vanilla generation.steps=128 |
| Confidence Threshold | / | generation=vanilla generation.threshold=0.9 |
| UNCODE | Paper | generation=pc_sampler |
| EB-Sampler | Paper | generation=eb_sampler |
| KLASS | Paper | generation=klass |
| WINO | Paper | generation=wino |
| dParallel | Paper | model=dparallel_llada-inst |
| DAEDAL | Paper | generation=daedal |
Quick Start
Run all decoding-method examples on LLaDA and GSM8K:
bash scripts/run_decoding_methods.sh
Method Examples
The examples below use model=llada-inst and dataset.name=gsm8k, except where a method requires its own model config (dParallel, for example). To run the standard-model examples with Dream instead, use model=dream-inst and set DREAM_INST_PATH. To switch benchmarks, replace dataset.name=gsm8k with any task available through lm-eval or the local tasks/ directory, such as humaneval_instruct, math-500, mbpp_instruct, ifeval, gpqa_main_generative_n_shot, or longbench.
Top-k
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=vanilla \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=128 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-topk-gsm8k
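Note: Top-k reduces the number of denoising steps by setting generation.steps smaller than generation.gen_length, so each step unmasks more than one token. In this example, 128 steps for 256 tokens averages two tokens per step.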
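The relationship between generation.steps, generation.block_length, and generation.gen_length can be made concrete with a small sketch. It assumes that steps are divided evenly across blocks and that each step in a block unmasks roughly the same number of tokens; that assumption and the helper name `tokens_per_step` are illustrative and not taken from the repository.

```python
def tokens_per_step(gen_length: int, block_length: int, steps: int) -> list[int]:
    """Return how many tokens are unmasked at each step within one block.

    Assumption (not confirmed by this page): steps are split evenly across
    blocks, and leftover tokens are assigned to the earliest steps.
    """
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    base, remainder = divmod(block_length, steps_per_block)
    return [base + (1 if i < remainder else 0) for i in range(steps_per_block)]


# Settings from the Top-k example: 8 blocks, 16 steps per block.
schedule = tokens_per_step(gen_length=256, block_length=32, steps=128)
print(len(schedule), schedule[0])
```

Under these assumptions, steps=256 gives one token per step, and halving the steps doubles the tokens unmasked per step.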
Confidence Threshold
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=vanilla \
generation.threshold=0.9 \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-confidence-threshold-gsm8k
Note: Default is generation.threshold=0.9; all tokens with confidence greater than or equal to the threshold are accepted in the current denoising step.
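The acceptance rule in the note can be sketched in a few lines. The function name `accept_by_threshold` and the fallback to the single most confident position when nothing clears the threshold are assumptions made for illustration; the repository's sampler may handle that case differently.

```python
def accept_by_threshold(confidences, masked, threshold=0.9):
    """Return the masked positions to unmask in the current step.

    confidences: per-position confidence of the predicted token.
    masked: positions that are still masked.
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen and masked:
        # Assumption: always unmask at least one token so decoding progresses.
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


confidences = [0.95, 0.40, 0.90, 0.70]
first = accept_by_threshold(confidences, masked=[0, 1, 2, 3])
second = accept_by_threshold(confidences, masked=[1, 3])
print(first, second)
```

A lower threshold accepts more tokens per step and finishes sooner, at the cost of committing to less certain predictions.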
UNCODE
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=pc_sampler \
generation.debias=true \
generation.clip_alpha=10 \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-uncode-gsm8k
Note: Defaults are generation.debias=true and generation.clip_alpha=10; debias enables token-frequency correction, and clip_alpha caps the correction strength.
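To illustrate how a token-frequency correction with a cap might interact with confidence, here is a rough sketch. The scoring formula (confidence multiplied by a clipped negative log frequency), the `token_freq` input, and the function name `debiased_score` are all assumptions for illustration; they are not confirmed by this page or taken from the pc_sampler implementation.

```python
import math


def debiased_score(confidence, token_freq, debias=True, clip_alpha=10.0):
    """Score a candidate token for unmasking.

    Assumed form: confidence * min(-log(token_freq), clip_alpha), so that
    very frequent tokens are down-weighted and the boost for rare tokens
    is capped at clip_alpha.
    """
    if not debias:
        return confidence
    # Guard against log(0) for tokens missing from the frequency table.
    correction = min(-math.log(max(token_freq, 1e-12)), clip_alpha)
    return confidence * correction


# A very common token with high confidence vs. a rarer token with lower confidence.
common = debiased_score(confidence=0.95, token_freq=0.05)
rare = debiased_score(confidence=0.60, token_freq=1e-4)
print(rare > common)
```

With debias=false the score reduces to the raw confidence, which matches plain confidence-based selection.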
EB-Sampler
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=eb_sampler \
generation.gamma=0.001 \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-eb-sampler-gsm8k
Note: Default is generation.gamma=0.001; gamma controls the EB-Sampler token-selection threshold used to decide how many tokens to accept.
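One way to picture how gamma decides the number of accepted tokens is an entropy budget. The sketch below is a simplification under a stated assumption: positions are accepted in order of increasing predictive entropy until the cumulative entropy would exceed gamma, with at least one token always accepted. The sampler's actual criterion may differ, and `eb_select` is a hypothetical name.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def eb_select(distributions, gamma=0.001):
    """Pick positions to unmask under an entropy budget.

    distributions: dict mapping masked position -> list of token probabilities.
    Assumption: lowest-entropy positions are accepted first, and acceptance
    stops once the running entropy total would exceed gamma.
    """
    ranked = sorted(distributions, key=lambda pos: entropy(distributions[pos]))
    chosen, total = [], 0.0
    for pos in ranked:
        h = entropy(distributions[pos])
        if chosen and total + h > gamma:
            break
        chosen.append(pos)
        total += h
    return chosen


dists = {
    0: [1.0, 0.0],          # fully certain
    1: [0.99999, 0.00001],  # almost certain
    2: [0.5, 0.5],          # maximally uncertain over two tokens
}
print(eb_select(dists))
```

Raising gamma widens the budget, so more tokens are accepted per step.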
KLASS
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=klass \
generation.kl_threshold=0.01 \
generation.kl_history_length=2 \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-klass-gsm8k
Note: Defaults are generation.kl_threshold=0.01 and generation.kl_history_length=2; kl_threshold marks tokens as stable when their KL change stays below the threshold, and kl_history_length sets the history window.
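The stability test described in the note can be sketched as follows. The sketch assumes that a history window of length n means the last n step-to-step KL divergences must all fall below the threshold, and that the divergence is measured from the previous prediction to the current one; both choices, and the names `kl_divergence` and `is_stable`, are illustrative rather than taken from the repository.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def is_stable(history, kl_threshold=0.01, kl_history_length=2):
    """Decide whether one position's prediction has stabilized.

    history: distributions predicted for this position, oldest first.
    Assumption: stability requires the last kl_history_length consecutive
    KL changes to stay below kl_threshold.
    """
    if len(history) < kl_history_length + 1:
        return False
    recent = history[-(kl_history_length + 1):]
    return all(
        kl_divergence(cur, prev) < kl_threshold
        for prev, cur in zip(recent, recent[1:])
    )


settled = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
shifting = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
print(is_stable(settled), is_stable(shifting))
```

A longer history window demands more consecutive quiet steps before a token is treated as stable, which trades speed for caution.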
WINO
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=wino \
generation.wide_in_thres=0.7 \
generation.narrow_out_thres=0.9 \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-wino-gsm8k
Note: Defaults are generation.wide_in_thres=0.7 and generation.narrow_out_thres=0.9; wide_in_thres controls the wide-in candidate update set, and narrow_out_thres controls which low-confidence tokens are remasked.
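The two thresholds can be read as a draft-then-verify step, sketched below. The split into `draft_conf` (confidence for still-masked positions) and `verify_conf` (re-checked confidence for already unmasked positions), as well as the function name `wino_step`, are assumptions for illustration and not the repository's interface.

```python
def wino_step(draft_conf, verify_conf, masked, unmasked,
              wide_in_thres=0.7, narrow_out_thres=0.9):
    """Return (accepted, remasked) positions for one decoding step.

    Wide-in: masked positions whose draft confidence clears the lenient
    threshold are unmasked.
    Narrow-out: previously unmasked positions whose re-checked confidence
    falls below the strict threshold are masked again.
    """
    accepted = [i for i in masked if draft_conf[i] >= wide_in_thres]
    remasked = [i for i in unmasked if verify_conf[i] < narrow_out_thres]
    return accepted, remasked


draft_conf = {2: 0.75, 3: 0.40}
verify_conf = {0: 0.95, 1: 0.80}
accepted, remasked = wino_step(draft_conf, verify_conf, masked=[2, 3], unmasked=[0, 1])
print(accepted, remasked)
```

Keeping wide_in_thres below narrow_out_thres lets the sampler draft aggressively while still revoking tokens that do not hold up on verification.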
dParallel
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=vanilla \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
model=dparallel_llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/dparallel-llada-inst-gsm8k
Note: dParallel is selected through the model override model=dparallel_llada-inst rather than a separate generation config; it uses the dParallel model wrapper while keeping the same vanilla generation config.
DAEDAL
accelerate launch \
--num_machines 1 \
--num_processes 1 \
eval.py \
dataset.name=gsm8k \
batch_size=1 \
seed=1234 \
attn_implementation=sdpa \
flash_attention=true \
generation=daedal \
generation.initial_gen_length=64 \
generation.block_length=32 \
generation.gen_length=256 \
generation.steps=256 \
generation.max_gen_length=2048 \
model=llada-inst \
hydra.run.dir=./outputs/examples/decoding_methods/llada-inst-daedal-gsm8k
Note: Defaults are generation.initial_gen_length=64 and generation.max_gen_length=2048; initial_gen_length sets the initial decoding budget, and max_gen_length caps adaptive expansion.
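How the two length settings bound the decoding budget can be sketched as a capped growth loop. The expansion signal `needs_more_room`, the fixed increment `expand_by`, and the function name `plan_length` are hypothetical stand-ins; this page does not describe how DAEDAL decides when or by how much to expand.

```python
from typing import Callable


def plan_length(needs_more_room: Callable[[int], bool],
                initial_gen_length: int = 64,
                max_gen_length: int = 2048,
                expand_by: int = 64) -> int:
    """Grow the generation length until the signal is satisfied or the cap is hit.

    needs_more_room: hypothetical callback that reports whether the current
    length is too short for the response.
    """
    length = min(initial_gen_length, max_gen_length)
    while length < max_gen_length and needs_more_room(length):
        length = min(length + expand_by, max_gen_length)
    return length


# A response that needs about 200 tokens settles on the first length that fits.
print(plan_length(lambda n: n < 200))
```

Whatever the expansion policy, the result never drops below initial_gen_length and never exceeds max_gen_length.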