April 30, 2026
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
The first cross-architecture distillation framework for diffusion LLMs: 8B dense and 16B MoE teachers into a 0.6B student
Gongbo Zhang1 · Wen Wang2 · Ye Tian1 · Li Yuan1,*
1 Peking University · 2 Zhejiang University (* corresponding author)
Highlights
- +1.53 average gain over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
- +16.48 on HumanEval over the equivalent-size AR baseline (48.78 vs. 32.30): distilled dLLMs especially excel at code generation.
- 22× peak-memory reduction vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and 5.2× faster inference (6.25 s vs. 32.55 s for 256 tokens on H100), enabling commodity-hardware deployment.
All numbers are as reported in the paper; see arxiv.org/abs/2604.26951 for the full setup and ablations.
The TIDE Framework
| Component | Paper | Role | One-line description |
|---|---|---|---|
| TIDAL | §2.1 | Scheduling: when to learn | Dual-axis interpolation along training-progress AND diffusion-timestep axes; deweights the teacher at high masking ratios where it is unreliable. Generalizes prior single-axis interpolation to the diffusion setting. |
| CompDemo | §2.2 | Contextual: what to enrich | Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context, raising teacher signal quality at high noise. |
| Reverse CALM | §2.3 | Output: how to project | Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient coefficient (depends only on the fixed teacher) and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective. |
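To make the CompDemo row concrete, here is a minimal sketch of a complementary mask split. It assumes the student's masked positions are divided at random into two groups, and that the teacher runs once per group with the other group revealed as context. The names `complementary_split`, `two_pass_teacher_inputs`, and the `<mask>` placeholder are illustrative only and are not taken from the TIDE codebase.

```python
import random

MASK = "<mask>"  # placeholder symbol, assumed for illustration only


def complementary_split(masked_positions, demo_ratio=0.5, seed=0):
    """Randomly split masked positions into two disjoint groups."""
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * demo_ratio)
    return set(shuffled[:cut]), set(shuffled[cut:])


def two_pass_teacher_inputs(tokens, masked_positions, demo_ratio=0.5, seed=0):
    """Build the two teacher inputs and the positions each pass predicts."""
    group_a, group_b = complementary_split(masked_positions, demo_ratio, seed)
    # Pass 1: group_a is revealed as context, group_b stays masked and is predicted.
    pass_1 = [MASK if i in group_b else tok for i, tok in enumerate(tokens)]
    # Pass 2: the roles swap, so group_a is predicted with group_b revealed.
    pass_2 = [MASK if i in group_a else tok for i, tok in enumerate(tokens)]
    return (pass_1, group_b), (pass_2, group_a)


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
masked = [1, 3, 4, 5, 6, 7]  # positions hidden from the student
(pass_1, predict_1), (pass_2, predict_2) = two_pass_teacher_inputs(tokens, masked)
print(pass_1, sorted(predict_1))
print(pass_2, sorted(predict_2))
```

With `demo_ratio=0.5`, each pass reveals about half of the positions the student cannot see, and the two passes together yield a teacher prediction for every masked position.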
Two Pipelines × Two Strategies
Headline finding (§3.2): each pipeline favors its native strategy.
- Cross-Tokenizer (LLaDA2 → BD3LM): native = TIDE-Cross = Reverse CALM. Bounded-gradient mode-seeking tolerates the alignment noise from chunk-level cross-tokenizer matching. Beats the swapped TIDE-Shared by avg +0.37.
- Shared-Tokenizer (WeDLM → BD3LM): native = TIDE-Shared = TIDAL + CompDemo (over forward KL). Progressive scheduling and enriched signals work best when token-level alignment is exact. Beats the swapped TIDE-Cross by avg +2.76.
| Pipeline | Teacher | Student | Tokenizer | Native strategy | Paper avg |
|---|---|---|---|---|---|
| A: Cross-Tokenizer | LLaDA2.0-mini (16B MoE) | Qwen3-0.6B-BD3LM | Cross (chunk align via tokenkit) | TIDE-Cross = Reverse CALM | 34.20 |
| B: Shared-Tokenizer | WeDLM-8B-Instruct (8B dense) | Qwen3-0.6B-BD3LM | Shared (vocab 151646) | TIDE-Shared = TIDAL + CompDemo | 33.55 |
Main Results
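Reverse CALM is described above as a reverse-direction binary cross-entropy whose gradient coefficient depends only on the fixed teacher. The sketch below illustrates that property for a single aligned chunk, under stated assumptions: `student_p` and `teacher_p` are the probabilities each model assigns to the chunk, and the teacher probability is clipped at both ends with a hypothetical `eps`. The paper's exact loss and its noise-filtering rule are not given in this README and may differ.

```python
import math


def _clip(p, eps):
    """Keep a probability away from 0 and 1 so the logs stay finite."""
    return min(max(p, eps), 1.0 - eps)


def reverse_bce(student_p, teacher_p, eps=1e-4):
    """Binary cross-entropy with the student probability as the weight
    and the (clipped) teacher probability inside the logarithm."""
    p = _clip(teacher_p, eps)
    return -(student_p * math.log(p) + (1.0 - student_p) * math.log(1.0 - p))


def grad_coefficient(teacher_p, eps=1e-4):
    """d(reverse_bce)/d(student_p) = log((1 - p) / p): a function of the
    teacher alone, bounded in magnitude by log((1 - eps) / eps)."""
    p = _clip(teacher_p, eps)
    return math.log((1.0 - p) / p)


for teacher_p in (0.0, 0.2, 0.8, 1.0):
    print(teacher_p, round(grad_coefficient(teacher_p), 3))
```

Because the coefficient never involves the student probability, a student that is badly wrong on a noisy chunk cannot produce an unbounded gradient, which is one way to read the "bounded-gradient" claim.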
Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.
| Benchmark | Qwen3-0.6B AR | Qwen3-0.6B BD3LM | Shared-Tok. KL | Shared-Tok. TIDE-Cross | Shared-Tok. TIDE-Shared | Cross-Tok. CALM | Cross-Tok. TIDE-Shared | Cross-Tok. TIDE-Cross |
|---|---|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | *49.89* | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | *13.14* | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | *26.85* | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | *14.48* | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | *40.42* | 39.57 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | *39.59* |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | *48.78* | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | *38.40* | **38.60** |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | *33.83* | **34.20** |
See the paper (§3.2) at arxiv.org/abs/2604.26951 for the full discussion.
Paper Variants ↔ Code Modes
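As a quick consistency check, the averages and headline gains quoted in this README can be recomputed from the per-benchmark scores. The snippet below copies five columns from the main results table; the dictionary keys are labels chosen here for readability, not identifiers from the codebase.

```python
# Row order: GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP
SCORES = {
    "BD3LM (no distillation)": [45.56, 13.08, 26.32, 13.80, 39.28, 39.15, 46.34, 37.80],
    "Shared-Tok TIDE-Cross": [45.03, 9.76, 26.00, 12.88, 39.50, 39.09, 42.68, 31.40],
    "Shared-Tok TIDE-Shared": [48.98, 11.16, 26.79, 14.48, 40.50, 39.92, 48.78, 37.80],
    "Cross-Tok TIDE-Shared": [49.89, 12.98, 26.85, 14.02, 39.57, 39.54, 49.39, 38.40],
    "Cross-Tok TIDE-Cross": [52.24, 13.20, 27.37, 14.52, 39.88, 39.59, 48.17, 38.60],
}


def average(values):
    """Unweighted mean over the eight benchmarks."""
    return sum(values) / len(values)


avg = {name: average(values) for name, values in SCORES.items()}

# Gains quoted in the Highlights and in the pipeline comparison.
gain_over_baseline = avg["Cross-Tok TIDE-Cross"] - avg["BD3LM (no distillation)"]
cross_native_vs_swapped = avg["Cross-Tok TIDE-Cross"] - avg["Cross-Tok TIDE-Shared"]
shared_native_vs_swapped = avg["Shared-Tok TIDE-Shared"] - avg["Shared-Tok TIDE-Cross"]

for name, value in avg.items():
    print(f"{name:26s} {value:.2f}")
print(f"gain over BD3LM baseline   {gain_over_baseline:+.2f}")
print(f"Cross-Tok native vs swap   {cross_native_vs_swapped:+.2f}")
print(f"Shared-Tok native vs swap  {shared_native_vs_swapped:+.2f}")
```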
The --distill_mode flag values still use the legacy CLI strings alm / taid; the table below maps each paper variant to the flag value that selects it.
| Paper variant | Pipeline | Command | Notes |
|---|---|---|---|
| CALM (baseline, Cross-Tok) | A | distill_llada2.sh --distill_mode alm | - |
| TIDE-Cross (native, Cross-Tok) | A | distill_llada2.sh --distill_mode reverse_alm | - |
| TIDE-Shared (in Cross-Tok pipeline) | A | distill_llada2.sh --distill_mode alm_taid --use_comp_demo True | TIDAL + CompDemo |
| KL (baseline, Shared-Tok) | B | distill_wedlm.sh --distill_mode kl_aligned | - |
| TIDE-Shared (native, Shared-Tok) | B | distill_wedlm.sh --distill_mode taid_aligned --use_comp_demo True | TIDAL + CompDemo |
| TIDE-Cross (in Shared-Tok pipeline) | B | distill_wedlm.sh --distill_mode reverse_kl_aligned | - |
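If you launch several variants from a script, the table above can be encoded as a lookup. The sketch below is a convenience wrapper written for this README, not part of the repository: `VARIANTS` and `build_command` are hypothetical names, while the script names and flag values are copied from the table and the `scripts/` prefix from the Quick Start commands.

```python
import shlex

# (pipeline, paper variant) -> (script, extra flags), copied from the table above.
VARIANTS = {
    ("A", "CALM"): ("distill_llada2.sh", ["--distill_mode", "alm"]),
    ("A", "TIDE-Cross"): ("distill_llada2.sh", ["--distill_mode", "reverse_alm"]),
    ("A", "TIDE-Shared"): (
        "distill_llada2.sh",
        ["--distill_mode", "alm_taid", "--use_comp_demo", "True"],
    ),
    ("B", "KL"): ("distill_wedlm.sh", ["--distill_mode", "kl_aligned"]),
    ("B", "TIDE-Shared"): (
        "distill_wedlm.sh",
        ["--distill_mode", "taid_aligned", "--use_comp_demo", "True"],
    ),
    ("B", "TIDE-Cross"): ("distill_wedlm.sh", ["--distill_mode", "reverse_kl_aligned"]),
}


def build_command(pipeline, variant, data_path, num_gpus=8):
    """Return the shell command line for one paper variant."""
    script, flags = VARIANTS[(pipeline, variant)]
    argv = ["bash", f"scripts/{script}", "--data_path", data_path, *flags,
            "--num_gpus", str(num_gpus)]
    return shlex.join(argv)


print(build_command("A", "TIDE-Cross", "data/distill_llada2_preprocessed"))
print(build_command("B", "TIDE-Shared", "data/distill_wedlm_preprocessed"))
```

An unknown combination raises `KeyError`, which is a cheap guard against typos in variant names.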
Note on combinations. TIDAL is applied only to forward objectives. As discussed in the paper's gradient-analysis appendix, combining TIDAL with reverse objectives is counterproductive: the late-training factor suppresses the self-selection mechanism of Reverse CALM.
Setup
# Create environment
conda create -n dllm python=3.10 -y && conda activate dllm
# Install PyTorch (CUDA 12.4)
conda install cuda=12.4 -c nvidia
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
--index-url https://download.pytorch.org/whl/cu124
# Install dllm
pip install -e .
# Initialize submodules (lm-evaluation-harness + tokenkit)
git submodule update --init --recursive
# Install eval harness
pip install -e "lm-evaluation-harness[ifeval,math]"
# Install tokenkit (required for Pipeline A cross-tokenizer distillation)
pip install -e "tokenkit[full]"
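After installation, a short script can confirm that the pinned versions are the ones actually present. This is a sketch under assumptions: the distribution names `dllm`, `lm_eval`, and `tokenkit` are guesses at what `pip list` reports for the editable installs, so adjust them if your environment shows different names.

```python
from importlib import metadata

# Versions pinned in the setup commands above.
EXPECTED = {"torch": "2.6.0", "torchvision": "0.21.0", "torchaudio": "2.6.0"}
# Assumed distribution names for the editable installs; verify with `pip list`.
OPTIONAL = ["dllm", "lm_eval", "tokenkit"]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Map each package to (found_version, looks_ok)."""
    report = {}
    for name, wanted in EXPECTED.items():
        found = installed_version(name)
        # PyTorch wheels often carry a local suffix such as "+cu124".
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    for name in OPTIONAL:
        found = installed_version(name)
        report[name] = (found, found is not None)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_environment().items():
        print(f"{name:12s} {found or 'missing':16s} {'ok' if ok else 'CHECK'}")
```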
Released Models & Data
Six distilled student checkpoints (3 per pipeline) are released under TIDE-dllm Models on Hugging Face, and two preprocessed SFT datasets are released under TIDE-dllm Datasets.
Distilled student checkpoints
| Pipeline | Variant | Hugging Face repo |
|---|---|---|
| A: Cross-Tokenizer (LLaDA2 teacher) | TIDE-Cross (native) | distill-LLaDA2-TIDE_Cross |
| A: Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A: Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B: Shared-Tokenizer (WeDLM teacher) | TIDE-Shared (native) | distill-WeDLM-TIDE_Shared |
| B: Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B: Shared-Tokenizer (WeDLM teacher) | KL baseline | distill-WeDLM-KL |
Preprocessed SFT datasets
Both datasets share the same composition as dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2), but they are tokenized for each teacher in advance to avoid NCCL timeouts during distillation.
| Pipeline | Hugging Face repo |
|---|---|
| A: for the LLaDA2 teacher | distill_llada2_sft |
| B: for the WeDLM teacher | distill_wedlm_sft |
Download
pip install "huggingface_hub[cli]"
# Distilled checkpoint (example: native TIDE-Cross from Pipeline A)
huggingface-cli download TIDE-dllm/distill-LLaDA2-TIDE_Cross \
--local-dir ckpts/distill-LLaDA2-TIDE_Cross
# Preprocessed datasets
huggingface-cli download TIDE-dllm/distill_llada2_sft \
--repo-type dataset --local-dir data/distill_llada2_sft
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
--repo-type dataset --local-dir data/distill_wedlm_sft
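The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`, which may be convenient inside a setup script. The repo IDs and target directories below mirror the CLI commands above; the `DOWNLOADS` list and `download_all` helper are illustrative names introduced here.

```python
# Each entry is passed as keyword arguments to huggingface_hub.snapshot_download.
DOWNLOADS = [
    {
        "repo_id": "TIDE-dllm/distill-LLaDA2-TIDE_Cross",
        "repo_type": "model",
        "local_dir": "ckpts/distill-LLaDA2-TIDE_Cross",
    },
    {
        "repo_id": "TIDE-dllm/distill_llada2_sft",
        "repo_type": "dataset",
        "local_dir": "data/distill_llada2_sft",
    },
    {
        "repo_id": "TIDE-dllm/distill_wedlm_sft",
        "repo_type": "dataset",
        "local_dir": "data/distill_wedlm_sft",
    },
]


def download_all(jobs=DOWNLOADS):
    """Fetch every repo in `jobs` and return the local paths."""
    # Imported lazily so the job list can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    return [snapshot_download(**job) for job in jobs]


# Call download_all() to start the transfers (requires network access).
```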
Project page: pku-yuangroup.github.io/TIDE-Page.
Quick Start
1. Data Preprocessing
Distillation requires offline-preprocessed data to avoid NCCL timeouts during tokenization. The fastest path is to download our preprocessed datasets from TIDE-dllm (see Released Models & Data above):
huggingface-cli download TIDE-dllm/distill_llada2_sft \
--repo-type dataset --local-dir data/distill_llada2_preprocessed
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
--repo-type dataset --local-dir data/distill_wedlm_preprocessed
If you'd rather preprocess from scratch, the examples below use tatsu-lab/alpaca for a quick smoke test. To reproduce the paper, replace the --dataset value with:
allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]
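The `--dataset` value packs several sources into one string: entries are joined with `+`, and an entry may carry a bracketed `key:value` option such as `[lang:python]`. The parser below only shows how that string decomposes. It is a sketch written for this README, and the repository's own parsing logic may behave differently.

```python
import re

SPEC = (
    "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+"
    "OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]"
)

# name, optionally followed by one [...] group of comma-separated key:value pairs
_ENTRY = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<opts>[^\[\]]*)\])?$")


def parse_dataset_spec(spec):
    """Split a '+'-joined spec into (dataset_name, options) pairs."""
    parsed = []
    for part in spec.split("+"):
        match = _ENTRY.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized dataset entry: {part!r}")
        options = {}
        if match.group("opts"):
            for item in match.group("opts").split(","):
                key, _, value = item.partition(":")
                options[key.strip()] = value.strip()
        parsed.append((match.group("name"), options))
    return parsed


for name, options in parse_dataset_spec(SPEC):
    print(name, options)
```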
Pipeline A (LLaDA2, cross-tokenizer):
bash scripts/preprocess_llada2_data.sh \
--dataset tatsu-lab/alpaca \
--output_dir data/distill_llada2_preprocessed
Pipeline B (WeDLM, shared-tokenizer):
bash scripts/preprocess_wedlm_data.sh \
--dataset tatsu-lab/alpaca \
--output_dir data/distill_wedlm_preprocessed
2. Distillation Training
The recommended command for each pipeline runs the native strategy (paper-best per §3.2).
Pipeline A (LLaDA2 teacher, TIDE-Cross = Reverse CALM):
bash scripts/distill_llada2.sh \
--data_path data/distill_llada2_preprocessed \
--distill_mode reverse_alm \
--num_gpus 8
Pipeline B (WeDLM teacher, TIDE-Shared = TIDAL + CompDemo):
bash scripts/distill_wedlm.sh \
--data_path data/distill_wedlm_preprocessed \
--distill_mode taid_aligned \
--use_comp_demo True \
--num_gpus 8
All training script parameters
Both distill_llada2.sh and distill_wedlm.sh support:
| Parameter | Default | Description |
|---|---|---|
| --data_path | required | Preprocessed data directory or HF dataset name |
| --output_dir | output/distill_* | Checkpoint output directory |
| --num_gpus | 8 | Number of GPUs |
| --distill_mode | alm / taid_aligned | Distillation mode (see the Paper Variants ↔ Code Modes table above) |
| --use_comp_demo | False | Enable CompDemo (complementary demonstration) |
| --epochs | 2 / 3 | Number of training epochs |
| --lr | 5e-5 | Learning rate |
| --batch_size | 8 / 10 | Per-device batch size |
| --student_model | dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1 | Student model |
| --teacher_model | inclusionAI/LLaDA2.0-mini / tencent/WeDLM-8B-Instruct | Teacher model |
WeDLM-specific (TIDAL controls):
| Parameter | Default | Description |
|---|---|---|
| --taid_axis_mode | both | TIDAL axis: both, training_only, timestep_only |
| --taid_timestep_weight | midrange | Timestep weighting: uniform, midrange |
| --shared_vocab_size | 151646 | Shared vocabulary size |
| --teacher_mask_token_id | 151665 | Teacher mask token ID |
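To give a feel for how the two TIDAL axes might combine, here is a sketch of a teacher-weight schedule. It rests on several assumptions not confirmed by this README: the training axis is a cosine ramp from 0.1 to 0.9 (the range listed under Training Hyperparameters), the midrange timestep weighting is modelled as the bump `4 * t * (1 - t)`, and the two axes are multiplied when `axis_mode` is `both`. The actual formulas live in the paper (§2.1) and in `losses/taid.py`.

```python
import math


def training_axis(progress, start=0.1, end=0.9):
    """Cosine ramp from `start` to `end` as training progress goes 0 -> 1."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * progress))


def timestep_axis(t, weighting="midrange"):
    """Weight in [0, 1] over the diffusion timestep (masking ratio) t in [0, 1]."""
    if weighting == "uniform":
        return 1.0
    if weighting == "midrange":
        # Assumed shape: 1 at t = 0.5, falling to 0 at both ends.
        return 4.0 * t * (1.0 - t)
    raise ValueError(f"unknown weighting: {weighting}")


def teacher_weight(progress, t, axis_mode="both", weighting="midrange"):
    """Interpolation weight placed on the teacher for one training sample."""
    if axis_mode == "training_only":
        return training_axis(progress)
    if axis_mode == "timestep_only":
        return timestep_axis(t, weighting)
    if axis_mode == "both":
        return training_axis(progress) * timestep_axis(t, weighting)
    raise ValueError(f"unknown axis_mode: {axis_mode}")


for t in (0.1, 0.5, 0.95):
    print(t, round(teacher_weight(progress=1.0, t=t), 3))
```

Under these assumptions the teacher contributes least at heavily masked timesteps, matching the stated goal of deweighting the teacher where it is unreliable.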
3. Evaluation
Run all 8 benchmarks on a trained checkpoint:
bash scripts/eval_all.sh --model_path /path/to/checkpoint --num_gpus 8
Benchmarks: mmlu_generative_dream, mmlu_pro, hellaswag_gen, gsm8k_cot, bbh, minerva_math, humaneval_instruct, mbpp_instruct.
Evaluation protocol: block size 32, CFG scale 0.0, sampling steps from 3 (HellaSwag/MMLU) up to 256 (everything else). Results are saved to eval_results/ by default (override with --output_dir).
Training Hyperparameters
Training settings used for the paper experiments.
| Parameter | Cross-Tokenizer (Pipeline A) | Shared-Tokenizer (Pipeline B) |
|---|---|---|
| Teacher | LLaDA2.0-mini (16B MoE) | WeDLM-8B-Instruct (8B) |
| Student init | Qwen3-0.6B-BD3LM SFT v0.1 | Qwen3-0.6B-BD3LM SFT v0.1 |
| Native method | Reverse CALM | TIDAL + CompDemo |
| Learning rate | 5e-5 | 5e-5 |
| Epochs | 10 | 10 |
| Student / teacher seq length | 512 / 1024 | 512 / 768 |
| Block size | 32 | 32 |
| Precision | bfloat16 | bfloat16 |
| TIDAL | - | $0.1 \to 0.9$, cosine, midrange weighting |
| CompDemo demo_ratio | - | 0.5 |
| Temperature | - | 2.0 |
| Dataset | Tulu-3 SFT + SmolTalk + OpenCoder-SFT-1/2 (Python) | (same) |
Troubleshooting
ValueError: Sequence length N exceeds pad_to_length M during training
For *_aligned modes (Pipeline B), the preprocessing script does not truncate samples to --max_length; it only filters out samples whose prompt alone exceeds it. The training --max_length (and --teacher_max_length) must therefore be at least as large as the value used during preprocessing. The simplest rule: pass the same --max_length to both preprocess_wedlm_data.sh and distill_wedlm.sh.
Pipeline B taid_aligned requires aligned preprocessed data
The default --align_mode of preprocess_wedlm_data.sh is kl_aligned, which produces the dual-tokenizer fields (teacher_input_ids, align_student, align_teacher) needed by *_aligned training modes. If you preprocessed with --align_mode none, training in any *_aligned mode will crash with KeyError: 'teacher_input_ids'. Re-run preprocessing without overriding --align_mode.
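Both failures can be caught before launching a multi-GPU job by scanning a few preprocessed samples. The check below is a sketch: the three alignment field names come from the description above, while the student field name `input_ids` and the helper `preflight` are assumptions made for illustration.

```python
REQUIRED_FIELDS = ("teacher_input_ids", "align_student", "align_teacher")


def preflight(samples, max_length, teacher_max_length, student_key="input_ids"):
    """Return a list of problems; an empty list means the samples look usable."""
    problems = []
    for idx, sample in enumerate(samples):
        missing = [name for name in REQUIRED_FIELDS if name not in sample]
        if missing:
            # Typical symptom of preprocessing with --align_mode none.
            problems.append(f"sample {idx}: missing fields {missing}")
            continue
        student_len = len(sample.get(student_key, []))
        teacher_len = len(sample["teacher_input_ids"])
        if student_len > max_length:
            problems.append(
                f"sample {idx}: student length {student_len} > max_length {max_length}"
            )
        if teacher_len > teacher_max_length:
            problems.append(
                f"sample {idx}: teacher length {teacher_len} > teacher_max_length {teacher_max_length}"
            )
    return problems


# Toy records standing in for rows of a preprocessed dataset.
toy = [
    {"input_ids": [0] * 400, "teacher_input_ids": [0] * 600, "align_student": [], "align_teacher": []},
    {"input_ids": [0] * 600, "teacher_input_ids": [0] * 700, "align_student": [], "align_teacher": []},
    {"input_ids": [0] * 100},
]
problems = preflight(toy, max_length=512, teacher_max_length=768)
for line in problems:
    print(line)
```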
File Structure
dllm/core/trainers/
├── distill_bd3lm.py # DistillBD3LMTrainer: all distillation modes (TIDAL, CompDemo, CALM, Reverse CALM, plus baselines)
├── distill_collator.py # DistillCollator: chunk-level CALM alignment via tokenkit (paper §2.3)
├── bd3lm.py # BD3LMTrainer (base block diffusion trainer)
├── mdlm.py # MDLMTrainer (base masked diffusion trainer)
└── losses/
    └── taid.py # TIDAL loss implementation (paper §2.1)
examples/a2d/bd3lm/
├── distill.py # Pipeline A entry: LLaDA2 cross-tokenizer distillation
├── distill_wedlm.py # Pipeline B entry: WeDLM shared-tokenizer distillation
├── distill_utils.py # Shared utilities (alignment, tokenization)
├── preprocess_distill_data.py # Data preprocessing for Pipeline A
└── preprocess_distill_wedlm_data.py # Data preprocessing for Pipeline B
scripts/
├── distill_llada2.sh # One-click training: Pipeline A
├── distill_wedlm.sh # One-click training: Pipeline B
├── eval_all.sh # One-click evaluation (8 benchmarks)
├── preprocess_llada2_data.sh # One-click preprocessing: Pipeline A
└── preprocess_wedlm_data.sh # One-click preprocessing: Pipeline B
Citation
If you find TIDE useful for your research, please consider citing:
@misc{zhang2026turningtidecrossarchitecturedistillation,
title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
year={2026},
eprint={2604.26951},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.26951},
}
Acknowledgements
Built on the dLLM library; cross-tokenizer alignment via tokenkit; evaluation through lm-evaluation-harness.