minWM: Full-Stack Open-Source Video World Model Framework
June 3, 2026
A full-stack framework and tutorial for newcomers, rather than a specific model.
minWM is our contribution to the world-model community: a full-stack open-source framework that walks you end-to-end through turning a bidirectional T2V foundation model into an action-conditioned video world model, with example data, runnable scripts, Claude skills capturing our hands-on experience, and onboarding knowledge for newcomers. We hope more researchers and developers join us in growing the community together.
Demo
https://github.com/user-attachments/assets/99c25915-7fe7-4a20-a2c4-9d291502fccf
News
- 2026-05-29: We release the technical report.
- 2026-05-17: We release minWM, the first full-stack open-source world model framework.
Table of Contents
- Demo
- News
- Why minWM?
- Installation
- Model Checkpoints
- Quick Start
- Data & Training & Reproduction
- Citation
- Contact
- Acknowledgements
Why minWM?
1. Full-Stack Framework
The complete data → training → inference pipeline is open-sourced; every stage exposes input/output checkpoints so you can stop, swap, or fork anywhere.
1.1 Data. We walk you through how to construct training-ready datasets paired with camera poses, and the full data processing pipeline that turns them into latents.
1.2 Training. We cover FSDP + sequence parallelism, single-/multi-node training, and the full distillation pipeline from a bidirectional diffusion model to a 4-step AR student:
- Phase 1: Bidirectional SFT
- Phase 2: Distillation to Causal Few-Step
  - Stage 1: Teacher Forcing AR Diffusion
  - Stage 2a: Causal ODE (proposed in [Causal Forcing](https://arxiv.org/abs/2602.02214))
  - Stage 2b: Causal CD (proposed in [Causal Forcing++](https://arxiv.org/abs/2605.15141))
  - Stage 3: Asymmetric DMD with Self Rollout
- Result: 4-step real-time model
1.3 Inference.
- ✅ 4-step DMD inference for HY Action2V / HY TI2V / Wan Action2V, multi-GPU sequence parallelism, camera-trajectory control via pose strings ("a*4,w*8,s*7") or JSON files
- 🚧 Inference acceleration [TBD]
2. Multi-Backbone Support
minWM supports two paths to an interactive world model.
2.1 From Scratch: Bidirectional T2V Foundation → Real-Time World Model
The HunyuanVideo 1.5 and Wan 2.1 lines walk through the full 4-stage pipeline, starting from a bidirectional T2V foundation model and ending at a 4-step autoregressive world model.
| Backbone | Architecture | Params | Training | Inference |
|---|---|---|---|---|
| Wan 2.1 | Cross-attention + DiT | 1.3B | ✅ all 4 stages | ✅ 4-step DMD |
| HunyuanVideo 1.5 | MMDiT | 8B | ✅ all 4 stages | ✅ 4-step DMD |
Both lines share the same trainer / loss / dataset abstractions, so adding a third backbone is structurally a wrapper-and-config exercise.
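As a rough illustration of what a wrapper-and-config extension point can look like, the sketch below uses a small registry keyed by backbone name. Everything in it (`BackboneSpec`, `register_backbone`, `build_backbone`, and the placeholder factories) is hypothetical and not taken from the minWM codebase; only the backbone names, architectures, and parameter counts come from the table above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class BackboneSpec:
    """Static facts about a backbone (values below mirror the table)."""
    name: str
    architecture: str
    params_billion: float


Factory = Callable[[Dict[str, Any]], Any]

# Hypothetical registry: config key -> (spec, wrapper factory).
_REGISTRY: Dict[str, Tuple[BackboneSpec, Factory]] = {}


def register_backbone(key: str, spec: BackboneSpec) -> Callable[[Factory], Factory]:
    """Decorator that registers a wrapper factory under a config key."""
    def decorator(factory: Factory) -> Factory:
        if key in _REGISTRY:
            raise ValueError(f"backbone {key!r} is already registered")
        _REGISTRY[key] = (spec, factory)
        return factory
    return decorator


@register_backbone("wan21", BackboneSpec("Wan 2.1", "Cross-attention + DiT", 1.3))
def build_wan(config: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder: a real wrapper would construct and return the model.
    return {"backbone": "wan21", **config}


@register_backbone("hy15", BackboneSpec("HunyuanVideo 1.5", "MMDiT", 8.0))
def build_hy(config: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder: a real wrapper would construct and return the model.
    return {"backbone": "hy15", **config}


def list_backbones() -> List[str]:
    """Return the registered config keys in sorted order."""
    return sorted(_REGISTRY)


def build_backbone(key: str, config: Dict[str, Any]) -> Any:
    """Look up a backbone by key and build it from its config."""
    if key not in _REGISTRY:
        raise KeyError(f"unknown backbone {key!r}; known: {list_backbones()}")
    _, factory = _REGISTRY[key]
    return factory(config)
```

Under this pattern, a third backbone would add one decorated factory and one config entry, while the shared trainer only ever calls `build_backbone`.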
2.2 Finetuning an Existing Video World Model 🚧 [TBD]
The forthcoming worldplay-finetune entry will let you start from an already-trained video world model and adapt it to new conditions, scenes, or resolutions, without rerunning the 4-stage pipeline from scratch.
3. Multi-Condition Injection
We aim to support both multiple condition types and multiple injection methods, mixable along either axis.
3.1 Supported Conditions
- ✅ Camera pose
- 🚧 Human pose [TBD]
3.2 Supported Injection Methods
- ✅ ProPE
- 🚧 Latent concat [TBD]
- 🚧 Cross-attention [TBD]
4. Claude Skills: Modify the Framework with an LLM Assistant
We are packaging our project experience across the CF / CF++ pipeline as Claude skills, so that an LLM assistant can help users debug failures and integrate new models without reverse-engineering the whole repo.
- debug-world-model: collected failure modes from the training pipeline (loss NaN, frame-to-frame jitter, camera drift, memory attenuation, distillation collapse, ...). Claude diagnoses likely root causes from your symptoms instead of guessing.
- integrate-new-backbone: step-by-step recipe for plugging a new video DiT into minWM, grounded in the HunyuanVideo and Wan reference integrations, e.g. "look at how HY does teacher forcing here, do the same for your model there".
5. Onboarding Knowledge: for Newcomers to World Models
onboarding-world-model is a third Claude skill, aimed at researchers entering the world-model space for the first time. It has two parts:
- Foundations: the minimal background to follow the pipeline, namely Teacher Forcing for AR diffusion training and Causal Forcing & Causal Forcing++ for AR diffusion distillation.
- Pitfalls: the non-obvious mistakes we hit while building minWM, distilled so you don't repeat them.
Intended audience: graduate students, independent researchers, and junior labs that want to enter the world-model space without spending three months reverse-engineering existing repos.
Installation
conda create -n minwm python=3.10 -y
conda activate minwm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
export PYTHONPATH="$PWD/HY15:$PWD/Wan21:$PWD/shared:$PYTHONPATH"
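After running the commands above, a short Python check can confirm that the environment matches them. The sketch below is not part of minWM: the function names are made up for illustration, and it assumes you run it from the repository root, that `flash-attn` is importable as `flash_attn`, and that the three `PYTHONPATH` entries are the ones exported above.

```python
import importlib.util
import os
import sys
from pathlib import Path
from typing import List

# Repo subdirectories that the export line above puts on PYTHONPATH.
REQUIRED_DIRS = ("HY15", "Wan21", "shared")


def missing_pythonpath_dirs(repo_root: Path, pythonpath: str) -> List[str]:
    """Return the required subdirectories that are absent from PYTHONPATH."""
    entries = {Path(p).resolve() for p in pythonpath.split(os.pathsep) if p}
    return [d for d in REQUIRED_DIRS if (repo_root / d).resolve() not in entries]


def check_environment(repo_root: Path) -> List[str]:
    """Collect human-readable warnings; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"install commands use Python 3.10, found {found}")
    if importlib.util.find_spec("flash_attn") is None:
        problems.append("flash_attn is not importable")
    for name in missing_pythonpath_dirs(repo_root, os.environ.get("PYTHONPATH", "")):
        problems.append(f"{name} is not on PYTHONPATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment(Path.cwd()):
        print("WARNING:", problem)
```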
Model Checkpoints
All weights live under ./ckpts/ after download.
| Checkpoint | Backbone | Stage | Use case | Download |
|---|---|---|---|---|
| HunyuanVideo-1.5 (base) | HY 1.5 | - | Required by both HY pipelines | HF |
| Wan2.1-T2V-1.3B (base) | Wan 2.1 | - | Required by Wan pipeline | HF |
| HY15/Action2V/bidirectional | HY 1.5 | Phase 1 SFT | Starting point for HY Action2V Phase 2 | HF |
| HY15/Action2V/ar_diffusion_tf | HY 1.5 | Phase 2 Stage 1 | Teacher Forcing AR diffusion | HF |
| HY15/Action2V/causal_ode | HY 1.5 | Phase 2 Stage 2a (proposed in Causal Forcing) | DMD initialization | HF |
| HY15/Action2V/causal_cd | HY 1.5 | Phase 2 Stage 2b (proposed in Causal Forcing++) | DMD initialization | HF |
| HY15/Action2V/dmd | HY 1.5 | Phase 2 Stage 3 | 4-step real-time inference | HF |
| HY15/TI2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd} | HY 1.5 | Same 4 stages, TI2V variant | TI2V pipeline | HF |
| Wan21/Action2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd} | Wan 2.1 | Same 4 stages | Wan pipeline | HF |
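The checkpoint names in the table follow a regular `<backbone>/<task>/<stage>` layout under `./ckpts/`, so paths can be built programmatically. The helper below is a minimal sketch assuming that layout holds after download; the dictionary keys and function names are illustrative and do not come from the minWM code.

```python
from pathlib import Path
from typing import Iterable, List

# Checkpoint subdirectory for each training stage, as named in the table.
STAGE_DIRS = {
    "phase1_sft": "bidirectional",
    "stage1_teacher_forcing": "ar_diffusion_tf",
    "stage2a_causal_ode": "causal_ode",
    "stage2b_causal_cd": "causal_cd",
    "stage3_dmd": "dmd",
}

# Model lines from the table: (backbone directory, task directory).
MODEL_LINES = {
    "hy_action2v": ("HY15", "Action2V"),
    "hy_ti2v": ("HY15", "TI2V"),
    "wan_action2v": ("Wan21", "Action2V"),
}

DEFAULT_ROOT = Path("./ckpts")


def checkpoint_dir(line: str, stage: str, root: Path = DEFAULT_ROOT) -> Path:
    """Build the expected directory for one model line and stage."""
    backbone, task = MODEL_LINES[line]
    return root / backbone / task / STAGE_DIRS[stage]


def missing_checkpoints(
    lines: Iterable[str], stages: Iterable[str], root: Path = DEFAULT_ROOT
) -> List[Path]:
    """Return expected checkpoint directories that do not exist on disk."""
    stages = list(stages)
    missing = []
    for line in lines:
        for stage in stages:
            path = checkpoint_dir(line, stage, root)
            if not path.is_dir():
                missing.append(path)
    return missing
```

For example, `missing_checkpoints(MODEL_LINES, ["stage3_dmd"])` lists which of the three DMD checkpoints used for inference still need to be downloaded.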
Quick Start
The fastest path: install → download three DMD checkpoints → run three demo commands. Full reproduction (all 4 training stages × 3 model lines) is in § Data & Training & Reproduction.
1. Download the demo checkpoints
# Wan base (T2V-1.3B)
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpts/Wan2.1-T2V-1.3B
# Code hardcodes the load path; create a symlink.
mkdir -p Wan21/wan_models
ln -s "$(realpath ./ckpts/Wan2.1-T2V-1.3B)" Wan21/wan_models/Wan2.1-T2V-1.3B
# HY base + text/vision encoders (required by HY pipelines)
hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts/HunyuanVideo-1.5 \
--include "vae/*" "scheduler/*" "transformer/480p_i2v/*"
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/HunyuanVideo-1.5/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/HunyuanVideo-1.5/text_encoder/byt5-small
modelscope download --model AI-ModelScope/Glyph-SDXL-v2 \
--local_dir ./ckpts/HunyuanVideo-1.5/text_encoder/Glyph-SDXL-v2
hf download black-forest-labs/FLUX.1-Redux-dev \
--local-dir ./ckpts/HunyuanVideo-1.5/vision_encoder/siglip --token <your_hf_token>
# 4-step DMD checkpoints
## Wan Action2V (DMD, 4-step)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "Wan21/Action2V/dmd/*"
## HY Action2V (DMD, 4-step, worldplay teacher)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "HY15/Action2V/dmd/*"
# HY Action2V (DMD, 4-step, our bidirectional teacher)
# hf download MIN-Lab/minWM --local-dir ./ckpts \
# --include "HY15/Action2V/dmd_ourbi/*"
## HY TI2V (DMD, 4-step)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "HY15/TI2V/dmd/*"
2. Run the three demos
# 2.1 Wan Action2V (4-step DMD, camera control)
OUTPUT_FOLDER=./outputs/quickstart_wan_action2v \
TRAJECTORY_PATH="Wan21/prompts/trajectories.txt" \
bash Wan21/scripts/inference/run_infer_causal_camera.sh
# 2.2 HY Action2V (4-step DMD, camera control)
TRANSFORMER_DIR=./ckpts/HY15/Action2V/dmd \
OUTPUT_DIR=./outputs/quickstart_hy_action2v \
bash HY15/scripts/inference/run_infer_causal_camera.sh
# 2.3 HY TI2V (4-step DMD)
TRANSFORMER_DIR=./ckpts/HY15/TI2V/dmd \
OUTPUT_DIR=./outputs/quickstart_hy_ti2v \
bash HY15/scripts/inference/run_infer_causal.sh
Camera control. For HY Action2V, trajectories are read per-sample from assets/example.json under the "trajectory" field. Format: w/s/a/d keys with *N repeats; comma-separated segments, e.g. "a*4,w*8,s*7".
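To make the trajectory format concrete, here is a minimal sketch of a parser that expands a pose string into one key per step. It is not minWM's own parser: `expand_pose_string` is a hypothetical name, treating a bare key without `*N` as a single step is an assumption, and what one step corresponds to (a frame, a latent, or a chunk) is left open because the format description does not say.

```python
import re
from typing import List

# One segment: a single w/a/s/d key, optionally followed by "*N".
_SEGMENT = re.compile(r"([wasd])(?:\*(\d+))?")


def expand_pose_string(pose: str) -> List[str]:
    """Expand e.g. "a*4,w*8" into ["a", "a", "a", "a", "w", ...]."""
    keys: List[str] = []
    for raw in pose.split(","):
        segment = raw.strip()
        match = _SEGMENT.fullmatch(segment)
        if match is None:
            raise ValueError(f"bad trajectory segment: {segment!r}")
        key = match.group(1)
        # Assumption: a key with no "*N" suffix counts as one step.
        count = int(match.group(2)) if match.group(2) else 1
        if count < 1:
            raise ValueError(f"repeat count must be >= 1: {segment!r}")
        keys.extend([key] * count)
    return keys


if __name__ == "__main__":
    steps = expand_pose_string("a*4,w*8,s*7")
    print(len(steps))  # 4 + 8 + 7 = 19 steps
```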
Data & Training & Reproduction
Three model lines × two phases × four stages, each documented as (1) Model download → (2) Data preparation → (3) Training script → (4) Validation. Full reproduction guides are split by backbone:
- training_wan.md
  - Wan Action2V (Wan 2.1 backbone)
- training_hunyuan.md
  - HY Action2V (HY1.5-8B backbone)
  - HY TI2V (HY1.5-8B backbone)
Citation
If minWM helps your research, please cite:
# ICML 2026
@article{zhu2026causal,
title={Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation},
author={Zhu, Hongzhou and Zhao, Min and He, Guande and Su, Hang and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2602.02214},
year={2026}
}
# Technical Report
@article{zhao2026causal,
title={Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation},
author={Zhao, Min and Zhu, Hongzhou and Zheng, Kaiwen and Zhou, Zihan and Yan, Bokai and Li, Xinyuan and Yang, Xiao and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2605.15141},
year={2026}
}
# Technical Report
@article{zhao2026minwm,
title={minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models},
author={Zhao, Min and Zhu, Hongzhou and Yan, Bokai and Zhou, Zihan and Chen, Yimin and Sun, Wenqiang and Zheng, Kaiwen and He, Guande and Yang, Xiao and Li, Chongxuan and others},
journal={arXiv preprint arXiv:2605.30263},
year={2026}
}
Contact
For questions, suggestions, or collaboration, please open a GitHub issue or contact: gracezhao1997@gmail.com.
Acknowledgements
minWM stands on the shoulders of giants. We thank the authors and maintainers of HunyuanVideo 1.5, HY-WorldPlay, Wan 2.1, Causal-Forcing, and FastVideo for their open-source contributions, which made this framework possible.