April 30, 2026

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

🌊 The first cross-architecture distillation framework for diffusion LLMs — 8B dense and 16B MoE teachers into a 0.6B student 🌊

Gongbo Zhang¹ · Wen Wang² · Ye Tian¹ · Li Yuan¹,*

¹ Peking University · ² Zhejiang University (* corresponding author)

This repository is the official implementation of TIDE, the first framework for cross-architecture dLLM distillation. While prior work focuses on step compression within a single architecture, TIDE bridges teachers and students that differ in architecture, attention mechanism, and tokenizer, via three modular components — TIDAL, CompDemo, and Reverse CALM.

TIDE: cross-architecture distillation overview

✨ Highlights

+1.53 average gain over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
+16.48 on HumanEval over the equivalent-size AR baseline (48.78 vs. 32.30) — distilled dLLMs especially excel at code generation.
22× peak-memory reduction vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and 5.2× faster inference (6.25 s vs. 32.55 s for 256 tokens on H100), enabling commodity-hardware deployment.
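The gains and ratios quoted above follow directly from the raw figures in parentheses. A short check, using only numbers stated in the highlights (the variable names are illustrative):

```python
# Figures quoted in the highlights above.
tide_avg, bd3lm_avg = 34.20, 32.67            # 8-benchmark averages
humaneval_dllm, humaneval_ar = 48.78, 32.30   # HumanEval scores
teacher_mem_gb, student_mem_gb = 31.3, 1.4    # peak memory in GB
teacher_sec, student_sec = 32.55, 6.25        # seconds for 256 tokens on H100

avg_gain = round(tide_avg - bd3lm_avg, 2)
humaneval_gain = round(humaneval_dllm - humaneval_ar, 2)
memory_ratio = teacher_mem_gb / student_mem_gb
speedup = teacher_sec / student_sec

print(f"average gain:   +{avg_gain:.2f}")
print(f"HumanEval gain: +{humaneval_gain:.2f}")
print(f"memory ratio:   {memory_ratio:.1f}x")
print(f"speedup:        {speedup:.1f}x")
```

The memory ratio comes out slightly above 22, which the highlight rounds down.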

All numbers are taken from the paper; see arxiv.org/abs/2604.26951 for the full setup and ablations.

🌊 The TIDE Framework

TIDE framework: TIDAL + CompDemo + Reverse CALM

Component	Paper	Role	One-line description
TIDAL	§2.1	Scheduling — when to learn	Dual-axis interpolation along training-progress AND diffusion-timestep axes; deweights the teacher at high masking ratios where it is unreliable. Generalizes prior single-axis interpolation to the diffusion setting.
CompDemo	§2.2	Contextual — what to enrich	Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context, raising teacher signal quality at high noise.
Reverse CALM	§2.3	Output — how to project	Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient coefficient (depends only on the fixed teacher) and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective.
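The complementary mask split described for CompDemo can be pictured with a small sketch. This is one reading of the table row, not the repository's implementation: the function name `complementary_splits`, the random partitioning, and the idea that "revealed" positions are shown to the teacher as context are all assumptions made for illustration. Only the 0.5 ratio comes from the README.

```python
import random


def complementary_splits(masked_positions, demo_ratio=0.5, seed=0):
    """Partition masked positions into two groups that swap roles.

    Pass 1 reveals group A as extra context and reads teacher predictions
    for group B; pass 2 does the opposite. Hypothetical helper.
    """
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * demo_ratio))
    group_a, group_b = sorted(shuffled[:cut]), sorted(shuffled[cut:])
    return [
        {"reveal": group_a, "predict": group_b},
        {"reveal": group_b, "predict": group_a},
    ]


passes = complementary_splits(masked_positions=[2, 3, 5, 8, 9, 11])
for i, p in enumerate(passes, start=1):
    print(f"pass {i}: reveal {p['reveal']} -> predict {p['predict']}")
```

Across the two passes every masked position is predicted exactly once, and each prediction is made with roughly half of the other masked positions visible.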

🔄 Two Pipelines × Two Strategies

Headline finding (§3.2): each pipeline favors its native strategy.

Cross-Tokenizer (LLaDA2 → BD3LM): native = TIDE-Cross = Reverse CALM. Bounded-gradient mode-seeking tolerates the alignment noise from chunk-level cross-tokenizer matching. Beats the swapped TIDE-Shared by avg +0.37.

Shared-Tokenizer (WeDLM → BD3LM): native = TIDE-Shared = TIDAL + CompDemo (over forward KL). Progressive scheduling and enriched signals work best when token-level alignment is exact. Beats the swapped TIDE-Cross by avg +2.76.

Pipeline	Teacher	Student	Tokenizer	Native strategy	Paper avg
A — Cross-Tokenizer	LLaDA2.0-mini (16B MoE)	Qwen3-0.6B-BD3LM	Cross (chunk align via tokenkit)	TIDE-Cross = Reverse CALM	34.20
B — Shared-Tokenizer	WeDLM-8B-Instruct (8B dense)	Qwen3-0.6B-BD3LM	Shared (vocab 151646)	TIDE-Shared = TIDAL + CompDemo	33.55
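The bounded-gradient property attributed to Reverse CALM can be illustrated numerically. The sketch below is an assumption-laden toy, not the paper's loss: it treats the teacher and student as each assigning a single probability to one aligned chunk, writes the forward binary cross-entropy with the teacher as target and the reverse one with the student as target, and compares their gradients with respect to the student probability. The clamp `eps` is a hypothetical numerical guard.

```python
import math


def forward_bce(p_teacher, q_student):
    # Teacher probability is the target, student probability sits inside the log.
    return -(p_teacher * math.log(q_student)
             + (1 - p_teacher) * math.log(1 - q_student))


def reverse_bce(p_teacher, q_student, eps=1e-3):
    # Student probability is the target, clamped teacher probability sits inside the log.
    p = min(max(p_teacher, eps), 1 - eps)
    return -(q_student * math.log(p) + (1 - q_student) * math.log(1 - p))


def grad_wrt_student(loss_fn, p, q, h=1e-6):
    # Central finite difference with respect to the student probability q.
    return (loss_fn(p, q + h) - loss_fn(p, q - h)) / (2 * h)


p = 0.9
teacher_log_odds = math.log((1 - p) / p)
forward_grads, reverse_grads = [], []
for q in (0.5, 0.05, 0.001):
    forward_grads.append(grad_wrt_student(forward_bce, p, q))
    reverse_grads.append(grad_wrt_student(reverse_bce, p, q))
    print(f"q={q:<6} forward={forward_grads[-1]:10.2f} reverse={reverse_grads[-1]:7.3f}")
```

In this toy setting the reverse gradient is the constant `log((1 - p) / p)`, which depends only on the teacher, while the forward gradient grows without bound as the student probability approaches zero.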

📊 Main Results

Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.

Benchmark	Qwen3-0.6B AR	Qwen3-0.6B BD3LM	Shared-Tok KL	Shared-Tok TIDE-Cross	Shared-Tok TIDE-Shared	Cross-Tok CALM	Cross-Tok TIDE-Shared	Cross-Tok TIDE-Cross
GSM8K	59.60	45.56	43.97	45.03	48.98	48.60	49.89	52.24
MATH	32.40	13.08	9.40	9.76	11.16	13.14	12.98	13.20
BBH	41.50	26.32	25.79	26.00	26.79	24.21	26.85	27.37
MMLU-Pro	24.70	13.80	13.19	12.88	14.48	13.47	14.02	14.52
HellaSwag	47.40	39.28	39.78	39.50	40.50	40.42	39.57	39.88
MMLU	52.80	39.15	39.57	39.09	39.92	39.42	39.54	39.59
HumanEval	32.30	46.34	41.46	42.68	48.78	43.90	49.39	48.17
MBPP	36.60	37.80	31.20	31.40	37.80	34.80	38.40	38.60
Avg	40.91	32.67	30.55	30.79	33.55	32.25	33.83	34.20

See the paper (§3.2) at arxiv.org/abs/2604.26951 for the full discussion.
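The Avg row and the gaps quoted in the pipeline comparison can be recomputed from the per-benchmark scores. The snippet below copies the table values as-is; the dictionary layout is just a convenience for this check.

```python
BENCHMARKS = ["GSM8K", "MATH", "BBH", "MMLU-Pro", "HellaSwag", "MMLU", "HumanEval", "MBPP"]

# (group, method) -> (scores in BENCHMARKS order, reported Avg)
TABLE = {
    ("Qwen3-0.6B", "AR"): ([59.60, 32.40, 41.50, 24.70, 47.40, 52.80, 32.30, 36.60], 40.91),
    ("Qwen3-0.6B", "BD3LM"): ([45.56, 13.08, 26.32, 13.80, 39.28, 39.15, 46.34, 37.80], 32.67),
    ("Shared-Tok", "KL"): ([43.97, 9.40, 25.79, 13.19, 39.78, 39.57, 41.46, 31.20], 30.55),
    ("Shared-Tok", "TIDE-Cross"): ([45.03, 9.76, 26.00, 12.88, 39.50, 39.09, 42.68, 31.40], 30.79),
    ("Shared-Tok", "TIDE-Shared"): ([48.98, 11.16, 26.79, 14.48, 40.50, 39.92, 48.78, 37.80], 33.55),
    ("Cross-Tok", "CALM"): ([48.60, 13.14, 24.21, 13.47, 40.42, 39.42, 43.90, 34.80], 32.25),
    ("Cross-Tok", "TIDE-Shared"): ([49.89, 12.98, 26.85, 14.02, 39.57, 39.54, 49.39, 38.40], 33.83),
    ("Cross-Tok", "TIDE-Cross"): ([52.24, 13.20, 27.37, 14.52, 39.88, 39.59, 48.17, 38.60], 34.20),
}

max_gap = 0.0
for (group, method), (scores, reported) in TABLE.items():
    mean = sum(scores) / len(BENCHMARKS)
    max_gap = max(max_gap, abs(mean - reported))
    print(f"{group:11s} {method:12s} mean={mean:.4f} reported={reported:.2f}")

# Native strategy minus swapped strategy, per pipeline.
cross_gain = TABLE[("Cross-Tok", "TIDE-Cross")][1] - TABLE[("Cross-Tok", "TIDE-Shared")][1]
shared_gain = TABLE[("Shared-Tok", "TIDE-Shared")][1] - TABLE[("Shared-Tok", "TIDE-Cross")][1]
print(f"cross-tokenizer native gain:  +{cross_gain:.2f}")
print(f"shared-tokenizer native gain: +{shared_gain:.2f}")
```

Every recomputed mean agrees with the reported Avg to within rounding at two decimals.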

🧭 Paper Variants ↔ Code Modes

The --distill_mode flag values keep the legacy names alm (for CALM) and taid (for TIDAL). The table below maps each paper variant to its command.

Paper variant	Pipeline	Command	Notes
CALM (baseline, Cross-Tok)	A	`distill_llada2.sh --distill_mode alm`	—
TIDE-Cross (native, Cross-Tok)	A	`distill_llada2.sh --distill_mode reverse_alm`	—
TIDE-Shared (in Cross-Tok pipeline)	A	`distill_llada2.sh --distill_mode alm_taid --use_comp_demo True`	TIDAL + CompDemo
KL (baseline, Shared-Tok)	B	`distill_wedlm.sh --distill_mode kl_aligned`	—
TIDE-Shared (native, Shared-Tok)	B	`distill_wedlm.sh --distill_mode taid_aligned --use_comp_demo True`	TIDAL + CompDemo
TIDE-Cross (in Shared-Tok pipeline)	B	`distill_wedlm.sh --distill_mode reverse_kl_aligned`	—
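When launching several variants from a sweep script, the mapping in the table can be kept in one place. The helper below is a hypothetical convenience wrapper, not part of the repository: it only assembles argument lists from the scripts and flags listed above, and the `scripts/` prefix is taken from the Quick Start commands.

```python
import shlex

# (pipeline, paper variant) -> (script, --distill_mode value, needs --use_comp_demo)
VARIANTS = {
    ("A", "CALM"): ("scripts/distill_llada2.sh", "alm", False),
    ("A", "TIDE-Cross"): ("scripts/distill_llada2.sh", "reverse_alm", False),
    ("A", "TIDE-Shared"): ("scripts/distill_llada2.sh", "alm_taid", True),
    ("B", "KL"): ("scripts/distill_wedlm.sh", "kl_aligned", False),
    ("B", "TIDE-Shared"): ("scripts/distill_wedlm.sh", "taid_aligned", True),
    ("B", "TIDE-Cross"): ("scripts/distill_wedlm.sh", "reverse_kl_aligned", False),
}


def build_command(pipeline, variant, data_path, num_gpus=8):
    """Return the argv list for one paper variant (hypothetical helper)."""
    script, mode, comp_demo = VARIANTS[(pipeline, variant)]
    cmd = ["bash", script, "--data_path", data_path, "--distill_mode", mode]
    if comp_demo:
        cmd += ["--use_comp_demo", "True"]
    cmd += ["--num_gpus", str(num_gpus)]
    return cmd


command = build_command("B", "TIDE-Shared", "data/distill_wedlm_preprocessed")
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root.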

💡 Note on combinations. TIDAL is applied only to forward objectives. As discussed in the paper's gradient-analysis appendix, combining TIDAL with reverse objectives is counterproductive — the late-training $(1-\lambda_t)$ factor suppresses the self-selection mechanism of Reverse CALM.

⚙️ Setup

# Create environment
conda create -n dllm python=3.10 -y && conda activate dllm

# Install PyTorch (CUDA 12.4)
conda install cuda=12.4 -c nvidia
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu124

# Install dllm
pip install -e .

# Initialize submodules (lm-evaluation-harness + tokenkit)
git submodule update --init --recursive

# Install eval harness
pip install -e "lm-evaluation-harness[ifeval,math]"

# Install tokenkit (required for Pipeline A cross-tokenizer distillation)
pip install -e "tokenkit[full]"
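After the steps above, a quick import check can confirm that the editable installs are visible to the active environment. This is a minimal sketch: `lm_eval` is the import name of lm-evaluation-harness, while `dllm` and `tokenkit` are assumed to be the import names of the other two packages.

```python
import importlib.util
import sys

# Import names are assumptions except lm_eval (lm-evaluation-harness).
REQUIRED = ("torch", "dllm", "lm_eval", "tokenkit")


def check_environment(modules=REQUIRED):
    """Map each module name to True if it can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_environment()
print(f"python {sys.version_info.major}.{sys.version_info.minor}")
for name, found in status.items():
    print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

Any entry reported as missing points back to the corresponding install step.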

📦 Released Models & Data

Six distilled student checkpoints (3 per pipeline) are released under 🤗 TIDE-dllm Models, and two preprocessed SFT datasets are released under 🤗 TIDE-dllm Datasets.

Distilled student checkpoints

Pipeline	Variant	🤗 Repo
A — Cross-Tokenizer (LLaDA2 teacher)	TIDE-Cross (native)	`distill-LLaDA2-TIDE_Cross`
A — Cross-Tokenizer (LLaDA2 teacher)	TIDE-Shared variant	`distill-LLaDA2-TIDE_Shared`
A — Cross-Tokenizer (LLaDA2 teacher)	CALM baseline	`distill-LLaDA2-CALM`
B — Shared-Tokenizer (WeDLM teacher)	TIDE-Shared (native)	`distill-WeDLM-TIDE_Shared`
B — Shared-Tokenizer (WeDLM teacher)	TIDE-Cross variant	`distill-WeDLM-TIDE_Cross`
B — Shared-Tokenizer (WeDLM teacher)	KL baseline	`distill-WeDLM-KL`

Preprocessed SFT datasets

Both datasets share the same composition as dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 — tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2 — but tokenized for each teacher in advance to avoid NCCL timeouts during distillation.

Pipeline	🤗 Repo
A — for the LLaDA2 teacher	`distill_llada2_sft`
B — for the WeDLM teacher	`distill_wedlm_sft`

Download

pip install "huggingface_hub[cli]"

# Distilled checkpoint (example: native TIDE-Cross from Pipeline A)
huggingface-cli download TIDE-dllm/distill-LLaDA2-TIDE_Cross \
    --local-dir ckpts/distill-LLaDA2-TIDE_Cross

# Preprocessed datasets
huggingface-cli download TIDE-dllm/distill_llada2_sft \
    --repo-type dataset --local-dir data/distill_llada2_sft
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
    --repo-type dataset --local-dir data/distill_wedlm_sft
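The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`, which is convenient inside a job launcher. The sketch below mirrors the three CLI calls above; the `download_plan` and `fetch` helpers are hypothetical, and the actual download is left commented out so the snippet has no network side effects.

```python
def download_plan():
    """Keyword arguments for snapshot_download, one dict per repository."""
    return [
        {"repo_id": "TIDE-dllm/distill-LLaDA2-TIDE_Cross", "repo_type": "model",
         "local_dir": "ckpts/distill-LLaDA2-TIDE_Cross"},
        {"repo_id": "TIDE-dllm/distill_llada2_sft", "repo_type": "dataset",
         "local_dir": "data/distill_llada2_sft"},
        {"repo_id": "TIDE-dllm/distill_wedlm_sft", "repo_type": "dataset",
         "local_dir": "data/distill_wedlm_sft"},
    ]


def fetch(plan):
    """Download every repository in the plan and return the local paths."""
    from huggingface_hub import snapshot_download  # imported lazily

    return [snapshot_download(**item) for item in plan]


plan = download_plan()
for item in plan:
    print(f"{item['repo_type']:8s} {item['repo_id']} -> {item['local_dir']}")
# fetch(plan)  # uncomment to download
```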

Project page: pku-yuangroup.github.io/TIDE-Page.

🚀 Quick Start

1. Data Preprocessing

Distillation expects data that has been preprocessed offline, because tokenizing during training can trigger NCCL timeouts. The fastest path is to download our preprocessed datasets from TIDE-dllm (see 📦 Released Models & Data above):

huggingface-cli download TIDE-dllm/distill_llada2_sft \
    --repo-type dataset --local-dir data/distill_llada2_preprocessed
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
    --repo-type dataset --local-dir data/distill_wedlm_preprocessed

If you'd rather preprocess from scratch, the examples below use tatsu-lab/alpaca for a quick smoke test. To reproduce the paper, replace the --dataset value with:

allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]
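The --dataset value above packs several sources into one string: names are joined with `+`, and an optional bracketed `key:value` suffix narrows a source. The parser below only illustrates how that string reads; it is not the repository's parser, and support for several comma-separated filters inside one bracket is an assumption.

```python
import re

SPEC = (
    "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk"
    "+OpenCoder-LLM/opc-sft-stage1[lang:python]"
    "+OpenCoder-LLM/opc-sft-stage2[lang:python]"
)

# A dataset name, optionally followed by one [ ... ] filter group.
_PART = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<filters>[^\[\]]*)\])?$")


def parse_dataset_spec(spec):
    """Split a '+'-joined spec into (dataset_name, filters) pairs."""
    parsed = []
    for part in spec.split("+"):
        match = _PART.match(part.strip())
        if match is None:
            raise ValueError(f"cannot parse dataset entry: {part!r}")
        filters = {}
        if match.group("filters"):
            for item in match.group("filters").split(","):
                key, _, value = item.partition(":")
                filters[key.strip()] = value.strip()
        parsed.append((match.group("name"), filters))
    return parsed


datasets = parse_dataset_spec(SPEC)
for name, filters in datasets:
    print(name, filters)
```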

Pipeline A (LLaDA2, cross-tokenizer):

bash scripts/preprocess_llada2_data.sh \
    --dataset tatsu-lab/alpaca \
    --output_dir data/distill_llada2_preprocessed

Pipeline B (WeDLM, shared-tokenizer):

bash scripts/preprocess_wedlm_data.sh \
    --dataset tatsu-lab/alpaca \
    --output_dir data/distill_wedlm_preprocessed

2. Distillation Training

The recommended command for each pipeline runs the native strategy (paper-best per §3.2).

Pipeline A — LLaDA2 teacher, TIDE-Cross (Reverse CALM):

bash scripts/distill_llada2.sh \
    --data_path data/distill_llada2_preprocessed \
    --distill_mode reverse_alm \
    --num_gpus 8

Pipeline B — WeDLM teacher, TIDE-Shared (TIDAL + CompDemo):

bash scripts/distill_wedlm.sh \
    --data_path data/distill_wedlm_preprocessed \
    --distill_mode taid_aligned \
    --use_comp_demo True \
    --num_gpus 8

📋 All training script parameters

Both distill_llada2.sh and distill_wedlm.sh support:

Parameter	Default	Description
`--data_path`	required	Preprocessed data directory or HF dataset name
`--output_dir`	`output/distill_*`	Checkpoint output directory
`--num_gpus`	`8`	Number of GPUs
`--distill_mode`	`alm` / `taid_aligned`	Distillation mode (see Paper Variants ↔ Code Modes table above)
`--use_comp_demo`	`False`	Enable CompDemo (complementary demonstration)
`--epochs`	`2` / `3`	Number of training epochs
`--lr`	`5e-5`	Learning rate
`--batch_size`	`8` / `10`	Per-device batch size
`--student_model`	`dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1`	Student model
`--teacher_model`	`inclusionAI/LLaDA2.0-mini` / `tencent/WeDLM-8B-Instruct`	Teacher model

WeDLM-specific (TIDAL controls):

Parameter	Default	Description
`--taid_axis_mode`	`both`	TIDAL axis: `both`, `training_only`, `timestep_only`
`--taid_timestep_weight`	`midrange`	Timestep weighting: `uniform`, `midrange`
`--shared_vocab_size`	`151646`	Shared vocabulary size
`--teacher_mask_token_id`	`151665`	Teacher mask token ID

3. Evaluation

Run all 8 benchmarks on a trained checkpoint:

bash scripts/eval_all.sh --model_path /path/to/checkpoint --num_gpus 8

Benchmarks: mmlu_generative_dream, mmlu_pro, hellaswag_gen, gsm8k_cot, bbh, minerva_math, humaneval_instruct, mbpp_instruct.

Evaluation protocol: block size 32, CFG scale 0.0, and sampling steps ranging from 3 (HellaSwag, MMLU) up to 256 (all other benchmarks). Results are saved to eval_results/ by default (override with --output_dir).

📋 Training Hyperparameters

Training settings used for the paper experiments.

Parameter	Cross-Tokenizer (Pipeline A)	Shared-Tokenizer (Pipeline B)
Teacher	LLaDA2.0-mini (16B MoE)	WeDLM-8B-Instruct (8B)
Student init	Qwen3-0.6B-BD3LM SFT v0.1	Qwen3-0.6B-BD3LM SFT v0.1
Native method	Reverse CALM	TIDAL + CompDemo
Learning rate	5e-5	5e-5
Epochs	10	10
Student / teacher seq length	512 / 1024	512 / 768
Block size	32	32
Precision	bfloat16	bfloat16
TIDAL $\lambda_{\text{init}} \to \lambda_{\max}$	—	$0.1 \to 0.9$, cosine, midrange weighting
CompDemo demo_ratio	—	0.5
Temperature $T$	—	2.0
Dataset	Tulu-3 SFT + SmolTalk + OpenCoder-SFT-1/2 (Python)	(same)
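The TIDAL row above can be made concrete with a small schedule sketch. Only the endpoints (0.1 and 0.9), the word "cosine", and the axis-mode names come from this README. Everything else is an assumption for illustration: the exact ramp shape, the linear `timestep_factor` standing in for the paper's midrange weighting, the product used to combine the two axes, and the interpolation between student and teacher distributions.

```python
import math


def training_lambda(progress, lam_init=0.1, lam_max=0.9):
    """Cosine ramp from lam_init at progress=0 to lam_max at progress=1."""
    progress = min(max(progress, 0.0), 1.0)
    return lam_init + (lam_max - lam_init) * 0.5 * (1 - math.cos(math.pi * progress))


def timestep_factor(mask_ratio):
    """Hypothetical deweighting: trust the teacher less as more tokens are masked."""
    return 1.0 - min(max(mask_ratio, 0.0), 1.0)


def teacher_weight(progress, mask_ratio, axis_mode="both"):
    """Combine the two axes; the product rule is an assumption."""
    if axis_mode == "training_only":
        return training_lambda(progress)
    if axis_mode == "timestep_only":
        return timestep_factor(mask_ratio)
    if axis_mode == "both":
        return training_lambda(progress) * timestep_factor(mask_ratio)
    raise ValueError(f"unknown axis_mode: {axis_mode!r}")


def interpolated_target(student_probs, teacher_probs, lam):
    """Blend the student's own distribution with the teacher's."""
    return [(1 - lam) * s + lam * t for s, t in zip(student_probs, teacher_probs)]


for progress in (0.0, 0.5, 1.0):
    row = [f"{teacher_weight(progress, m):.3f}" for m in (0.1, 0.5, 0.9)]
    print(f"progress={progress:.1f} weights at mask ratio 0.1/0.5/0.9: {row}")

target = interpolated_target([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], teacher_weight(1.0, 0.1))
print("blended target:", [round(x, 3) for x in target])
```

Under these assumptions the teacher's share of the target grows over training and shrinks at high masking ratios, which is the qualitative behavior the framework table describes.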

🛠️ Troubleshooting

ValueError: Sequence length N exceeds pad_to_length M during training

For *_aligned modes (Pipeline B) the preprocessing script does not truncate samples to --max_length — it only filters samples whose prompt alone exceeds it. The training --max_length (and --teacher_max_length) must therefore be at least as large as the value used during preprocessing. The simplest rule: pass the same --max_length to both preprocess_wedlm_data.sh and distill_wedlm.sh.
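Before launching training, the preprocessed samples can be scanned for sequences that would exceed the training limits. The sketch below is a hypothetical pre-flight check: `teacher_input_ids` is a field named in this README, while `input_ids` as the student-side field and the in-memory list of dicts are assumptions.

```python
def find_overlong(samples, max_length, teacher_max_length=None):
    """Return indices of samples longer than the training limits."""
    offenders = []
    for idx, sample in enumerate(samples):
        too_long = len(sample["input_ids"]) > max_length
        if teacher_max_length is not None and "teacher_input_ids" in sample:
            too_long = too_long or len(sample["teacher_input_ids"]) > teacher_max_length
        if too_long:
            offenders.append(idx)
    return offenders


# Toy samples; the limits match the Pipeline B sequence lengths (512 / 768).
samples = [
    {"input_ids": [0] * 300, "teacher_input_ids": [0] * 500},
    {"input_ids": [0] * 600, "teacher_input_ids": [0] * 700},
    {"input_ids": [0] * 400, "teacher_input_ids": [0] * 900},
]
offenders = find_overlong(samples, max_length=512, teacher_max_length=768)
print("samples exceeding the limits:", offenders)
```

A non-empty result means the training limits need to be raised or the data preprocessed again with a matching --max_length.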

Pipeline B taid_aligned requires aligned preprocessed data

The default --align_mode of preprocess_wedlm_data.sh is kl_aligned, which produces the dual-tokenizer fields (teacher_input_ids, align_student, align_teacher) needed by *_aligned training modes. If you preprocessed with --align_mode none, training in any *_aligned mode will crash with KeyError: 'teacher_input_ids'. Re-run preprocessing without overriding --align_mode.
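The presence of the aligned fields can be checked on a single sample before starting a run. This is a minimal sketch that assumes samples are available as dictionaries; the field names are the ones listed above and the helper name is hypothetical.

```python
ALIGNED_FIELDS = ("teacher_input_ids", "align_student", "align_teacher")


def missing_aligned_fields(sample):
    """List the aligned-mode fields that a preprocessed sample lacks."""
    return [field for field in ALIGNED_FIELDS if field not in sample]


aligned_sample = {"input_ids": [1, 2], "teacher_input_ids": [3, 4],
                  "align_student": [[0, 1]], "align_teacher": [[0, 1]]}
plain_sample = {"input_ids": [1, 2]}

print("aligned sample missing:", missing_aligned_fields(aligned_sample))
print("plain sample missing:  ", missing_aligned_fields(plain_sample))
```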

📁 File Structure

dllm/core/trainers/
├── distill_bd3lm.py        # DistillBD3LMTrainer — all distillation modes (TIDAL, CompDemo, CALM, Reverse CALM, plus baselines)
├── distill_collator.py     # DistillCollator — chunk-level CALM alignment via tokenkit (paper §2.3)
├── bd3lm.py                # BD3LMTrainer (base block diffusion trainer)
├── mdlm.py                 # MDLMTrainer (base masked diffusion trainer)
└── losses/
    └── taid.py             # TIDAL loss implementation (paper §2.1)

examples/a2d/bd3lm/
├── distill.py              # Pipeline A entry: LLaDA2 cross-tokenizer distillation
├── distill_wedlm.py        # Pipeline B entry: WeDLM shared-tokenizer distillation
├── distill_utils.py        # Shared utilities (alignment, tokenization)
├── preprocess_distill_data.py       # Data preprocessing for Pipeline A
└── preprocess_distill_wedlm_data.py # Data preprocessing for Pipeline B

scripts/
├── distill_llada2.sh       # One-click training: Pipeline A
├── distill_wedlm.sh        # One-click training: Pipeline B
├── eval_all.sh             # One-click evaluation (8 benchmarks)
├── preprocess_llada2_data.sh   # One-click preprocessing: Pipeline A
└── preprocess_wedlm_data.sh    # One-click preprocessing: Pipeline B

📝 Citation

If you find TIDE useful for your research, please consider citing:

@misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
}

🙏 Acknowledgements

Built on the dLLM library; cross-tokenizer alignment via tokenkit; evaluation through lm-evaluation-harness.
