April 30, 2026

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

🌊 The first cross-architecture distillation framework for diffusion LLMs: 8B dense and 16B MoE teachers into a 0.6B student 🌊

Gongbo Zhang¹ · Wen Wang² · Ye Tian¹ · Li Yuan¹,*

¹ Peking University · ² Zhejiang University (* corresponding author)

This repository is the official implementation of TIDE, the first framework for cross-architecture distillation of diffusion LLMs (dLLMs). Prior work focuses on step compression within a single architecture; TIDE instead bridges teachers and students that differ in architecture, attention mechanism, and tokenizer, using three modular components: TIDAL, CompDemo, and Reverse CALM.

[Figure: TIDE cross-architecture distillation overview]

✨ Highlights

  1. +1.53 average gain over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
  2. +16.48 on HumanEval over the equivalent-size AR baseline (48.78 vs. 32.30); distilled dLLMs are especially strong at code generation.
  3. 22× peak-memory reduction vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and 5.2× faster inference (6.25 s vs. 32.55 s for 256 tokens on an H100), enabling deployment on commodity hardware.

All numbers are as reported in the paper; see arxiv.org/abs/2604.26951 for the full setup and ablations.
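
The headline figures can be re-derived from the raw numbers quoted in the list. The snippet below is only a sanity check; the variable names are illustrative and the values are copied from the Highlights.

```python
# Values copied from the Highlights list; names are illustrative only.
tide_avg, bd3lm_avg = 34.20, 32.67
tide_humaneval, ar_humaneval = 48.78, 32.30
teacher_mem_gb, student_mem_gb = 31.3, 1.4
teacher_latency_s, student_latency_s = 32.55, 6.25

avg_gain = round(tide_avg - bd3lm_avg, 2)
humaneval_gain = round(tide_humaneval - ar_humaneval, 2)
memory_ratio = teacher_mem_gb / student_mem_gb
speedup = teacher_latency_s / student_latency_s

print(f"average gain:   +{avg_gain:.2f}")
print(f"HumanEval gain: +{humaneval_gain:.2f}")
print(f"memory ratio:   {memory_ratio:.1f}x")
print(f"speedup:        {speedup:.1f}x")
```

The memory ratio comes out slightly above 22, which the list rounds down to 22×.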

🌊 The TIDE Framework

[Figure: TIDE framework (TIDAL + CompDemo + Reverse CALM)]

| Component | Paper | Role | One-line description |
| --- | --- | --- | --- |
| TIDAL | §2.1 | Scheduling: when to learn | Dual-axis interpolation along both the training-progress and diffusion-timestep axes; down-weights the teacher at high masking ratios, where it is unreliable. Generalizes prior single-axis interpolation to the diffusion setting. |
| CompDemo | §2.2 | Contextual: what to enrich | Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context, raising teacher signal quality at high noise. |
| Reverse CALM | §2.3 | Output: how to project | Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient coefficient (depends only on the fixed teacher) and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective. |
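
To make the CompDemo row concrete, here is a minimal sketch of a complementary mask split. It is an illustration under assumptions, not the repository's code: the function name `complementary_views`, the `<mask>` placeholder, and the random split are all hypothetical, and how the teacher outputs of the two passes are merged is not shown.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, not the real mask token

def complementary_views(tokens, masked_positions, demo_ratio=0.5, seed=0):
    """Build two teacher inputs whose revealed/masked halves are complementary.

    Returns [(view, targets), (view, targets)], where `targets` are the
    positions the teacher is asked to predict in that pass.
    """
    rng = random.Random(seed)
    positions = sorted(masked_positions)
    rng.shuffle(positions)
    cut = round(len(positions) * demo_ratio)
    half_a, half_b = set(positions[:cut]), set(positions[cut:])

    def build(still_masked):
        return [MASK if i in still_masked else tok for i, tok in enumerate(tokens)]

    # Pass 1 reveals half A and predicts half B; pass 2 does the opposite.
    return [(build(half_b), half_b), (build(half_a), half_a)]

tokens = "the quick brown fox jumps over the lazy dog".split()
masked = {1, 2, 4, 7}
for view, targets in complementary_views(tokens, masked):
    print(sorted(targets), " ".join(view))
```

With `demo_ratio=0.5`, each masked position is predicted exactly once, in a pass where about half of the other masked positions are revealed as context.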

🔄 Two Pipelines × Two Strategies

Headline finding (§3.2): each pipeline favors its native strategy.

  • Cross-Tokenizer (LLaDA2 → BD3LM): native = TIDE-Cross = Reverse CALM. Bounded-gradient mode-seeking tolerates the alignment noise from chunk-level cross-tokenizer matching. Beats the swapped TIDE-Shared by +0.37 on average.
  • Shared-Tokenizer (WeDLM → BD3LM): native = TIDE-Shared = TIDAL + CompDemo (over forward KL). Progressive scheduling and enriched signals work best when token-level alignment is exact. Beats the swapped TIDE-Cross by +2.76 on average.

| Pipeline | Teacher | Student | Tokenizer | Native strategy | Paper avg |
| --- | --- | --- | --- | --- | --- |
| A: Cross-Tokenizer | LLaDA2.0-mini (16B MoE) | Qwen3-0.6B-BD3LM | Cross (chunk alignment via tokenkit) | TIDE-Cross = Reverse CALM | 34.20 |
| B: Shared-Tokenizer | WeDLM-8B-Instruct (8B dense) | Qwen3-0.6B-BD3LM | Shared (vocab 151646) | TIDE-Shared = TIDAL + CompDemo | 33.55 |
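
The "bounded-gradient" property mentioned for Reverse CALM can be illustrated with a toy scalar version of a reverse-direction binary cross-entropy. This is a sketch under assumptions: the per-chunk probabilities `q_student` and `p_teacher`, the clipping constant `eps`, and the exact form of the loss are illustrative and are not taken from the repository.

```python
import math

def reverse_bce(q_student, p_teacher, eps=1e-4):
    """Binary cross-entropy with the student as the weighting distribution."""
    p = min(max(p_teacher, eps), 1.0 - eps)  # clip the fixed teacher probability
    return -(q_student * math.log(p) + (1.0 - q_student) * math.log(1.0 - p))

def grad_wrt_student(p_teacher, eps=1e-4):
    """d(reverse_bce)/d(q_student): depends on the teacher only."""
    p = min(max(p_teacher, eps), 1.0 - eps)
    return math.log((1.0 - p) / p)

# The coefficient is the same for any student value and is bounded by the clip.
bound = math.log((1.0 - 1e-4) / 1e-4)
for p in (0.0, 0.2, 0.9, 1.0):
    print(f"teacher p={p:.1f}  coefficient={grad_wrt_student(p):+.3f}  (bound {bound:.3f})")
```

In this toy form, subtracting the student's Bernoulli entropy from `reverse_bce` gives KL(Bern(q) || Bern(p)), the reverse, mode-seeking direction of the KL divergence.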

📊 Main Results

Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.

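| Benchmark | Qwen3-0.6B AR | Qwen3-0.6B BD3LM | Shared-Tok KL | Shared-Tok TIDE-Cross | Shared-Tok TIDE-Shared | Cross-Tok CALM | Cross-Tok TIDE-Shared | Cross-Tok TIDE-Cross |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | *49.89* | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | *13.14* | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | *26.85* | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | *14.48* | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | *40.42* | 39.57 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | *39.59* |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | *48.78* | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | *38.40* | **38.60** |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | *33.83* | **34.20** |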

See the paper (§3.2) at arxiv.org/abs/2604.26951 for the full discussion.
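
The average row and the two "native beats swapped" margins can be recomputed from the per-benchmark scores. The snippet below copies the four TIDE columns from the table; the dictionary keys are illustrative labels.

```python
from statistics import mean

# Per-benchmark scores copied from the table, in the order
# GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP.
scores = {
    "Shared/TIDE-Cross":  [45.03, 9.76, 26.00, 12.88, 39.50, 39.09, 42.68, 31.40],
    "Shared/TIDE-Shared": [48.98, 11.16, 26.79, 14.48, 40.50, 39.92, 48.78, 37.80],
    "Cross/TIDE-Shared":  [49.89, 12.98, 26.85, 14.02, 39.57, 39.54, 49.39, 38.40],
    "Cross/TIDE-Cross":   [52.24, 13.20, 27.37, 14.52, 39.88, 39.59, 48.17, 38.60],
}
averages = {name: round(mean(vals), 2) for name, vals in scores.items()}

# Margin of each pipeline's native strategy over the swapped one.
cross_margin = round(averages["Cross/TIDE-Cross"] - averages["Cross/TIDE-Shared"], 2)
shared_margin = round(averages["Shared/TIDE-Shared"] - averages["Shared/TIDE-Cross"], 2)

for name, avg in averages.items():
    print(f"{name:20s} {avg:.2f}")
print(f"cross-tokenizer margin:  +{cross_margin:.2f}")
print(f"shared-tokenizer margin: +{shared_margin:.2f}")
```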

🧭 Paper Variants ↔ Code Modes

This is the only place in the README where the legacy CLI strings alm / taid appear, because the --distill_mode flag values include them.

| Paper variant | Pipeline | Command | Notes |
| --- | --- | --- | --- |
| CALM (baseline, Cross-Tok) | A | distill_llada2.sh --distill_mode alm | - |
| TIDE-Cross (native, Cross-Tok) | A | distill_llada2.sh --distill_mode reverse_alm | - |
| TIDE-Shared (in Cross-Tok pipeline) | A | distill_llada2.sh --distill_mode alm_taid --use_comp_demo True | TIDAL + CompDemo |
| KL (baseline, Shared-Tok) | B | distill_wedlm.sh --distill_mode kl_aligned | - |
| TIDE-Shared (native, Shared-Tok) | B | distill_wedlm.sh --distill_mode taid_aligned --use_comp_demo True | TIDAL + CompDemo |
| TIDE-Cross (in Shared-Tok pipeline) | B | distill_wedlm.sh --distill_mode reverse_kl_aligned | - |
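
For scripted sweeps over all six variants, the table can be encoded as a lookup. The following is a small convenience sketch: `VARIANTS` and `build_command` are hypothetical helpers, not part of the repository, and the only flags used are the ones that appear in this README.

```python
import shlex

# (script, flags) per paper variant, copied from the table above.
VARIANTS = {
    ("A", "CALM"):        ("scripts/distill_llada2.sh", {"--distill_mode": "alm"}),
    ("A", "TIDE-Cross"):  ("scripts/distill_llada2.sh", {"--distill_mode": "reverse_alm"}),
    ("A", "TIDE-Shared"): ("scripts/distill_llada2.sh",
                           {"--distill_mode": "alm_taid", "--use_comp_demo": "True"}),
    ("B", "KL"):          ("scripts/distill_wedlm.sh", {"--distill_mode": "kl_aligned"}),
    ("B", "TIDE-Shared"): ("scripts/distill_wedlm.sh",
                           {"--distill_mode": "taid_aligned", "--use_comp_demo": "True"}),
    ("B", "TIDE-Cross"):  ("scripts/distill_wedlm.sh", {"--distill_mode": "reverse_kl_aligned"}),
}

def build_command(pipeline, variant, data_path, num_gpus=8):
    """Return the shell command line for one (pipeline, variant) pair."""
    script, flags = VARIANTS[(pipeline, variant)]
    argv = ["bash", script, "--data_path", data_path]
    for flag, value in flags.items():
        argv += [flag, value]
    argv += ["--num_gpus", str(num_gpus)]
    return shlex.join(argv)

print(build_command("B", "TIDE-Shared", "data/distill_wedlm_preprocessed"))
```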

💡 Note on combinations. TIDAL is applied only to forward objectives. As discussed in the paper's gradient-analysis appendix, combining TIDAL with reverse objectives is counterproductive: the late-training $(1-\lambda_t)$ factor suppresses the self-selection mechanism of Reverse CALM.

⚙️ Setup

# Create environment
conda create -n dllm python=3.10 -y && conda activate dllm

# Install PyTorch (CUDA 12.4)
conda install cuda=12.4 -c nvidia
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu124

# Install dllm
pip install -e .

# Initialize submodules (lm-evaluation-harness + tokenkit)
git submodule update --init --recursive

# Install eval harness
pip install -e "lm-evaluation-harness[ifeval,math]"

# Install tokenkit (required for Pipeline A cross-tokenizer distillation)
pip install -e "tokenkit[full]"
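
After installation, a quick check that the packages are visible to the interpreter can save a failed launch. This is a sketch with assumed import names: `dllm` is inferred from the repository layout, `lm_eval` is the import name of lm-evaluation-harness, and `tokenkit` is assumed to match the submodule name.

```python
import importlib.util
import sys

# Import names are assumptions, see the note above.
REQUIRED = ("torch", "dllm", "lm_eval", "tokenkit")

def missing_packages(names=REQUIRED):
    """Return the names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("python", ".".join(map(str, sys.version_info[:3])))
missing = missing_packages()
print("missing:", ", ".join(missing) if missing else "none")
```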

📦 Released Models & Data

Six distilled student checkpoints (three per pipeline) are released under 🤗 TIDE-dllm Models, and two preprocessed SFT datasets are released under 🤗 TIDE-dllm Datasets.

Distilled student checkpoints

| Pipeline | Variant | 🤗 Repo |
| --- | --- | --- |
| A: Cross-Tokenizer (LLaDA2 teacher) | TIDE-Cross (native) | distill-LLaDA2-TIDE_Cross |
| A: Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A: Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B: Shared-Tokenizer (WeDLM teacher) | TIDE-Shared (native) | distill-WeDLM-TIDE_Shared |
| B: Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B: Shared-Tokenizer (WeDLM teacher) | KL baseline | distill-WeDLM-KL |

Preprocessed SFT datasets

Both datasets share the same composition as dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2), but they are tokenized for each teacher in advance to avoid NCCL timeouts during distillation.

| Pipeline | 🤗 Repo |
| --- | --- |
| A (for the LLaDA2 teacher) | distill_llada2_sft |
| B (for the WeDLM teacher) | distill_wedlm_sft |

Download

pip install "huggingface_hub[cli]"

# Distilled checkpoint (example: native TIDE-Cross from Pipeline A)
huggingface-cli download TIDE-dllm/distill-LLaDA2-TIDE_Cross \
    --local-dir ckpts/distill-LLaDA2-TIDE_Cross

# Preprocessed datasets
huggingface-cli download TIDE-dllm/distill_llada2_sft \
    --repo-type dataset --local-dir data/distill_llada2_sft
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
    --repo-type dataset --local-dir data/distill_wedlm_sft
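
The same downloads can be driven from Python through `huggingface_hub.snapshot_download`. The sketch below mirrors the three commands above; `DOWNLOADS` and `download_all` are illustrative names, and the function defaults to a dry run so that nothing is fetched unless requested.

```python
# Repo IDs and target folders copied from the commands above.
DOWNLOADS = [
    {"repo_id": "TIDE-dllm/distill-LLaDA2-TIDE_Cross", "repo_type": "model",
     "local_dir": "ckpts/distill-LLaDA2-TIDE_Cross"},
    {"repo_id": "TIDE-dllm/distill_llada2_sft", "repo_type": "dataset",
     "local_dir": "data/distill_llada2_sft"},
    {"repo_id": "TIDE-dllm/distill_wedlm_sft", "repo_type": "dataset",
     "local_dir": "data/distill_wedlm_sft"},
]

def download_all(items=DOWNLOADS, dry_run=True):
    """Fetch each repo with snapshot_download (skipped when dry_run is True)."""
    if not dry_run:
        from huggingface_hub import snapshot_download  # imported lazily
    for item in items:
        print(f"{item['repo_type']:8s} {item['repo_id']} -> {item['local_dir']}")
        if not dry_run:
            snapshot_download(**item)

download_all()  # pass dry_run=False to actually download
```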

Project page: pku-yuangroup.github.io/TIDE-Page.

🚀 Quick Start

1. Data Preprocessing

Distillation requires offline-preprocessed data to avoid NCCL timeouts during tokenization. The fastest path is to download our preprocessed datasets from TIDE-dllm (see 📦 Released Models & Data above):

huggingface-cli download TIDE-dllm/distill_llada2_sft \
    --repo-type dataset --local-dir data/distill_llada2_preprocessed
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
    --repo-type dataset --local-dir data/distill_wedlm_preprocessed

If you'd rather preprocess from scratch, the examples below use tatsu-lab/alpaca for a quick smoke test. To reproduce the paper, replace the --dataset value with:

allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]
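
The --dataset value joins several datasets with `+` and attaches an optional bracketed filter such as `[lang:python]`. As a reading aid, the sketch below splits such a string into names and options. The parser is hypothetical: the repository's own parsing may differ, and comma-separated options inside the brackets are an assumption.

```python
import re

SPEC = (
    "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk"
    "+OpenCoder-LLM/opc-sft-stage1[lang:python]"
    "+OpenCoder-LLM/opc-sft-stage2[lang:python]"
)

def parse_dataset_spec(spec):
    """Split 'a+b[key:value]' into a list of (name, options) pairs."""
    parsed = []
    for part in spec.split("+"):
        match = re.fullmatch(r"([^\[\]]+)(?:\[([^\[\]]*)\])?", part.strip())
        if match is None:
            raise ValueError(f"cannot parse dataset entry: {part!r}")
        name, raw = match.group(1), match.group(2)
        options = dict(item.split(":", 1) for item in raw.split(",")) if raw else {}
        parsed.append((name, options))
    return parsed

for name, options in parse_dataset_spec(SPEC):
    print(name, options)
```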

Pipeline A (LLaDA2, cross-tokenizer):

bash scripts/preprocess_llada2_data.sh \
    --dataset tatsu-lab/alpaca \
    --output_dir data/distill_llada2_preprocessed

Pipeline B (WeDLM, same-tokenizer):

bash scripts/preprocess_wedlm_data.sh \
    --dataset tatsu-lab/alpaca \
    --output_dir data/distill_wedlm_preprocessed

2. Distillation Training

The recommended command for each pipeline runs its native strategy (the best-performing one in the paper, §3.2).

Pipeline A (LLaDA2 teacher), TIDE-Cross (Reverse CALM):

bash scripts/distill_llada2.sh \
    --data_path data/distill_llada2_preprocessed \
    --distill_mode reverse_alm \
    --num_gpus 8

Pipeline B (WeDLM teacher), TIDE-Shared (TIDAL + CompDemo):

bash scripts/distill_wedlm.sh \
    --data_path data/distill_wedlm_preprocessed \
    --distill_mode taid_aligned \
    --use_comp_demo True \
    --num_gpus 8

📋 All training script parameters

Both distill_llada2.sh and distill_wedlm.sh support the parameters below. Where two defaults are listed, the first applies to distill_llada2.sh and the second to distill_wedlm.sh.

| Parameter | Default | Description |
| --- | --- | --- |
| --data_path | required | Preprocessed data directory or HF dataset name |
| --output_dir | output/distill_* | Checkpoint output directory |
| --num_gpus | 8 | Number of GPUs |
| --distill_mode | alm / taid_aligned | Distillation mode (see the Paper Variants ↔ Code Modes table above) |
| --use_comp_demo | False | Enable CompDemo (complementary demonstration) |
| --epochs | 2 / 3 | Number of training epochs |
| --lr | 5e-5 | Learning rate |
| --batch_size | 8 / 10 | Per-device batch size |
| --student_model | dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1 | Student model |
| --teacher_model | inclusionAI/LLaDA2.0-mini / tencent/WeDLM-8B-Instruct | Teacher model |

WeDLM-specific (TIDAL controls):

| Parameter | Default | Description |
| --- | --- | --- |
| --taid_axis_mode | both | TIDAL axis: both, training_only, timestep_only |
| --taid_timestep_weight | midrange | Timestep weighting: uniform, midrange |
| --shared_vocab_size | 151646 | Shared vocabulary size |
| --teacher_mask_token_id | 151665 | Teacher mask token ID |
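
To give a feel for how the two TIDAL axes could interact, here is a rough sketch of a teacher-weight schedule. It is a guess at the general shape, not the code in taid.py: the cosine ramp, the `4·t·(1−t)` midrange weight, the product rule, and the default endpoints 0.1 and 0.9 (the paper settings from the Training Hyperparameters table) are all assumptions made for illustration.

```python
import math

def training_lambda(progress, lam_init=0.1, lam_max=0.9):
    """Cosine ramp from lam_init to lam_max as training progress goes 0 -> 1."""
    progress = min(max(progress, 0.0), 1.0)
    return lam_init + (lam_max - lam_init) * 0.5 * (1.0 - math.cos(math.pi * progress))

def timestep_weight(t, mode="midrange"):
    """Weight over the masking ratio t in [0, 1]; 'midrange' peaks at t = 0.5."""
    if mode == "uniform":
        return 1.0
    if mode == "midrange":
        return 4.0 * t * (1.0 - t)  # assumed shape, equals 1 at t = 0.5
    raise ValueError(f"unknown timestep weight: {mode}")

def effective_lambda(progress, t, axis_mode="both", weight="midrange"):
    """Teacher share of the interpolated target for one (progress, t) pair."""
    lam_train = training_lambda(progress) if axis_mode in ("both", "training_only") else 1.0
    lam_time = timestep_weight(t, weight) if axis_mode in ("both", "timestep_only") else 1.0
    return lam_train * lam_time

for progress in (0.0, 0.5, 1.0):
    row = [f"{effective_lambda(progress, t):.3f}" for t in (0.1, 0.5, 0.95)]
    print(f"progress={progress:.1f}  t=0.1/0.5/0.95 -> {row}")
```

Under these assumptions the teacher's share grows over training and shrinks at very high masking ratios, which is the qualitative behaviour described for TIDAL.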

3. Evaluation

Run all 8 benchmarks on a trained checkpoint:

bash scripts/eval_all.sh --model_path /path/to/checkpoint --num_gpus 8

Benchmarks: mmlu_generative_dream, mmlu_pro, hellaswag_gen, gsm8k_cot, bbh, minerva_math, humaneval_instruct, mbpp_instruct.

Evaluation protocol: block size 32, CFG scale 0.0, sampling steps from 3 (HellaSwag/MMLU) up to 256 (everything else). Results are saved to eval_results/ by default (override with --output_dir).

📋 Training Hyperparameters

Training settings used for the paper experiments.

| Parameter | Cross-Tokenizer (Pipeline A) | Shared-Tokenizer (Pipeline B) |
| --- | --- | --- |
| Teacher | LLaDA2.0-mini (16B MoE) | WeDLM-8B-Instruct (8B) |
| Student init | Qwen3-0.6B-BD3LM SFT v0.1 | Qwen3-0.6B-BD3LM SFT v0.1 |
| Native method | Reverse CALM | TIDAL + CompDemo |
| Learning rate | 5e-5 | 5e-5 |
| Epochs | 10 | 10 |
| Student / teacher seq length | 512 / 1024 | 512 / 768 |
| Block size | 32 | 32 |
| Precision | bfloat16 | bfloat16 |
| TIDAL $\lambda_{\text{init}} \to \lambda_{\max}$ | - | $0.1 \to 0.9$, cosine, midrange weighting |
| CompDemo demo_ratio | - | 0.5 |
| Temperature $T$ | - | 2.0 |
| Dataset | Tulu-3 SFT + SmolTalk + OpenCoder-SFT-1/2 (Python) | (same) |
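
The temperature in the Pipeline B column softens both distributions before they are compared. As a reminder of what that does, the snippet below computes a generic, textbook temperature-scaled forward KL for a single position. It is not the repository's loss: the toy logits are made up, and whether the real loss rescales by $T^2$ is not stated in this README.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for one position, both softened by `temperature`."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher = [4.0, 1.0, 0.5]  # toy logits, for illustration only
student = [2.0, 2.0, 0.0]
for T in (1.0, 2.0):
    print(f"T={T}: KL={forward_kl(teacher, student, T):.4f}")
```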

🛠️ Troubleshooting

ValueError: Sequence length N exceeds pad_to_length M during training

For *_aligned modes (Pipeline B), the preprocessing script does not truncate samples to --max_length; it only filters out samples whose prompt alone exceeds it. The training --max_length (and --teacher_max_length) must therefore be at least as large as the value used during preprocessing. The simplest rule: pass the same --max_length to both preprocess_wedlm_data.sh and distill_wedlm.sh.
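
The rule can be checked before launching a run by comparing sample lengths against the training limit. The helper below is a hypothetical illustration; `check_pad_length` and the example token counts are not part of the repository.

```python
def check_pad_length(sample_lengths, max_length):
    """Return indices of samples longer than the training --max_length."""
    return [i for i, n in enumerate(sample_lengths) if n > max_length]

# Hypothetical token counts from data preprocessed with --max_length 768.
lengths = [312, 768, 540, 701]
for train_max in (512, 768):
    too_long = check_pad_length(lengths, train_max)
    status = "ok" if not too_long else f"{len(too_long)} sample(s) too long: {too_long}"
    print(f"--max_length {train_max}: {status}")
```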

Pipeline B taid_aligned requires aligned preprocessed data

The default --align_mode of preprocess_wedlm_data.sh is kl_aligned, which produces the dual-tokenizer fields (teacher_input_ids, align_student, align_teacher) needed by *_aligned training modes. If you preprocessed with --align_mode none, training in any *_aligned mode will crash with KeyError: 'teacher_input_ids'. Re-run preprocessing without overriding --align_mode.
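
A quick inspection of one preprocessed record shows whether the alignment fields are present. The sketch assumes each record behaves like a dict; the three field names come from the paragraph above, while `input_ids` and the example values are made up.

```python
REQUIRED_FIELDS = ("teacher_input_ids", "align_student", "align_teacher")

def missing_aligned_fields(sample):
    """List the alignment fields absent from one preprocessed sample (a dict)."""
    return [name for name in REQUIRED_FIELDS if name not in sample]

# Hypothetical records: the first mimics --align_mode none, the second kl_aligned.
plain = {"input_ids": [1, 2, 3]}
aligned = {"input_ids": [1, 2, 3], "teacher_input_ids": [4, 5],
           "align_student": [[0, 3]], "align_teacher": [[0, 2]]}
print(missing_aligned_fields(plain))
print(missing_aligned_fields(aligned))
```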

📁 File Structure

dllm/core/trainers/
├── distill_bd3lm.py        # DistillBD3LMTrainer: all distillation modes (TIDAL, CompDemo, CALM, Reverse CALM, plus baselines)
├── distill_collator.py     # DistillCollator: chunk-level CALM alignment via tokenkit (paper §2.3)
├── bd3lm.py                # BD3LMTrainer (base block diffusion trainer)
├── mdlm.py                 # MDLMTrainer (base masked diffusion trainer)
└── losses/
    └── taid.py             # TIDAL loss implementation (paper §2.1)

examples/a2d/bd3lm/
├── distill.py              # Pipeline A entry: LLaDA2 cross-tokenizer distillation
├── distill_wedlm.py        # Pipeline B entry: WeDLM same-tokenizer distillation
├── distill_utils.py        # Shared utilities (alignment, tokenization)
├── preprocess_distill_data.py       # Data preprocessing for Pipeline A
└── preprocess_distill_wedlm_data.py # Data preprocessing for Pipeline B

scripts/
├── distill_llada2.sh       # One-click training: Pipeline A
├── distill_wedlm.sh        # One-click training: Pipeline B
├── eval_all.sh             # One-click evaluation (8 benchmarks)
├── preprocess_llada2_data.sh   # One-click preprocessing: Pipeline A
└── preprocess_wedlm_data.sh    # One-click preprocessing: Pipeline B

📝 Citation

If you find TIDE useful for your research, please consider citing:

@misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
}

🙏 Acknowledgements

Built on the dLLM library; cross-tokenizer alignment via tokenkit; evaluation through lm-evaluation-harness.