April 29, 2026
Out of Sight but Not Out of Mind:
Hybrid Memory for Dynamic Video World Models
Kaijin Chen¹, Dingkang Liang¹, Xin Zhou¹, Yikang Ding¹, Xiaoqiang Liu², Pengfei Wan², Xiang Bai¹
¹Huazhong University of Science and Technology, ²Kling Team, Kuaishou Technology
Overview
Recent video world models are good at simulating static environments, but they still struggle with a core challenge of the real world: dynamic subjects frequently move out of view and later re-enter the scene. In these cases, many existing methods lose subject identity or motion continuity, producing frozen, distorted, or disappearing objects.
HyDRA is built for this setting. It introduces a hybrid memory mechanism that treats world modeling as both:
- remembering stable scene structure, and
- tracking dynamic subjects through unseen intervals.
To support this direction, we also introduce HM-World, a large-scale dataset designed for studying dynamic memory in video world models.
Abstract
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. We introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses contexts into memory tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
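To make the retrieval idea concrete, here is a toy sketch of relevance-driven memory retrieval in general terms. It is not HyDRA's implementation: the function names, the dot-product relevance score, and the top-k selection are all assumptions chosen for illustration.

```python
import math


def relevance_scores(query, memory_keys):
    """Dot-product relevance between one query vector and each memory key."""
    return [sum(q * k for q, k in zip(query, key)) for key in memory_keys]


def retrieve_top_k(query, memory_keys, memory_values, k=2):
    """Softmax-weighted read over the k most relevant memory tokens (toy example)."""
    scores = relevance_scores(query, memory_keys)
    # Indices of the k highest-scoring memory tokens, best first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only; subtract the max for stability.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(memory_values[0])
    readout = [
        sum(w * memory_values[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, readout
```

The point of the sketch is only that a query reads from a small, relevant subset of stored tokens rather than from the full context.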
Highlights
- A new problem setting for video world models: preserving subject identity and motion after out-of-view intervals.
- HM-World dataset with 59K high-fidelity clips for hybrid memory research.
- HyDRA architecture with memory tokenization and spatiotemporal relevance-based retrieval.
- Open-source release of inference code, training skeleton, examples, and model checkpoints.
Generation Results
More visual results are available on the project homepage.
Experimental Results
News and Roadmap
- Paper released
- HM-World dataset released
- HyDRA checkpoints and inference code released
- HyDRA training code released
Getting Started
1. Clone the repository
git clone https://github.com/H-EmbodVis/HyDRA.git
cd HyDRA
2. Create the environment
conda create -n hydra python=3.10 -y
conda activate hydra
pip install -r requirements.txt
3. Download the base video model
HyDRA builds on Wan2.1-T2V-1.3B.
- Base model: Wan-AI/Wan2.1-T2V-1.3B
- Recommended location:
./ckpts/Wan2.1-T2V-1.3B/
Expected structure:
HyDRA/
|- ckpts/
|  |- hydra.ckpt
|  |- Wan2.1-T2V-1.3B/
|     |- Wan2.1_VAE.pth
|     |- diffusion_pytorch_model.safetensors
|     |- models_t5_umt5-xxl-enc-bf16.pth
|     |- ...
|- assets/
|- diffsynth/
|- examples/
|- infer_hydra.py
|- train_hydra.py
|- requirements.txt
4. Download the HyDRA checkpoint
- Checkpoint: H-EmbodVis/HyDRA
- Recommended path:
./ckpts/hydra.ckpt
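Once both downloads are in place, a quick check can confirm the files landed where the expected structure puts them. The sketch below is a hypothetical helper, not part of the repository; the file list is taken from the expected structure in this README, with the three Wan2.1 files assumed to sit inside ./ckpts/Wan2.1-T2V-1.3B/.

```python
from pathlib import Path

# Paths taken from the expected structure in this README.
REQUIRED_FILES = [
    "ckpts/hydra.ckpt",
    "ckpts/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
    "ckpts/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "ckpts/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
]


def missing_files(repo_root="."):
    """Return the required checkpoint files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root before the sanity check so that a misplaced checkpoint shows up as a clear message rather than a load error.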
5. Run a quick sanity check
The repository already includes example videos, camera trajectories, and captions under ./examples.
python infer_hydra.py
Generated videos will be saved to ./outputs by default.
Inference
Run all packaged examples
python infer_hydra.py \
  --examples_dir ./examples \
  --ckpt_path ./ckpts/hydra.ckpt
Run a single custom case
python infer_hydra.py \
  --cond_video ./path/to/cond_video.mp4 \
  --cond_json ./path/to/camera.json \
  --caption_txt ./path/to/prompt.txt \
  --ckpt_path ./ckpts/hydra.ckpt \
  --output_path ./outputs/custom_concat.mp4
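If you have several custom cases, the single-case command can be driven from a small script. The sketch below is a hypothetical wrapper: it uses only the flags shown above and assumes each case lives in its own directory containing cond_video.mp4, camera.json, and prompt.txt, which is a layout chosen for illustration rather than one the repository prescribes.

```python
import subprocess
import sys
from pathlib import Path


def build_command(case_dir, ckpt_path="./ckpts/hydra.ckpt", output_dir="./outputs"):
    """Build the infer_hydra.py command for one case directory.

    Assumes (hypothetically) that the directory holds cond_video.mp4,
    camera.json, and prompt.txt.
    """
    case_dir = Path(case_dir)
    output_path = Path(output_dir) / f"{case_dir.name}.mp4"
    return [
        sys.executable, "infer_hydra.py",
        "--cond_video", str(case_dir / "cond_video.mp4"),
        "--cond_json", str(case_dir / "camera.json"),
        "--caption_txt", str(case_dir / "prompt.txt"),
        "--ckpt_path", ckpt_path,
        "--output_path", str(output_path),
    ]


def run_cases(cases_root):
    """Run inference once per subdirectory, stopping on the first failure."""
    case_dirs = sorted(p for p in Path(cases_root).iterdir() if p.is_dir())
    for case_dir in case_dirs:
        subprocess.run(build_command(case_dir), check=True)
```

Calling run_cases("./my_cases") from the repository root would write one video per case directory into ./outputs, named after the directory.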
Training
This repository provides the HyDRA model definition and a training skeleton. You supply your own dataset and DataLoader and run training with PyTorch Lightning.
Data preparation
For each training sample:
- Encode the source video into latent representations with the VAE.
- Encode the caption into text embeddings with the text encoder.
- Convert camera poses into the relative coordinate system expected by HyDRA.
- Save the processed sample in a format your custom dataset loader can read.
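The camera-pose step above depends on HyDRA's own convention, which this section does not spell out. As a minimal sketch of one common choice, the code below re-expresses every pose relative to the first frame, assuming 4x4 rigid camera-to-world matrices. The function names and the first-frame reference are assumptions for illustration, and plain Python lists are used so that the example runs without dependencies.

```python
def invert_rigid(pose):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    rot_t = [[pose[j][i] for j in range(3)] for i in range(3)]
    trans = [pose[i][3] for i in range(3)]
    new_trans = [-sum(rot_t[i][k] * trans[k] for k in range(3)) for i in range(3)]
    rows = [rot_t[i] + [new_trans[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def to_relative(poses):
    """Express each camera-to-world pose relative to the first one.

    Assumption: the first frame is the reference, so it maps to the identity.
    """
    ref_inv = invert_rigid(poses[0])
    return [matmul4(ref_inv, pose) for pose in poses]
```

With this convention the first pose always becomes the identity, so trajectories from different clips share a common starting frame. Check the convention against the camera files under ./examples before relying on it.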
Initialize the training model
python train_hydra.py \
  --dit_path ./ckpts/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors \
  --hydra \
  --use_gradient_checkpointing
train_hydra.py initializes the training module but does not ship with a built-in dataset or trainer loop. To train end-to-end, connect it to your own Dataset, DataLoader, and pl.Trainer(...).fit(...).
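As a starting point for the dataset side, here is a minimal sketch of a map-style dataset over preprocessed samples. Everything about it is an assumption for illustration: the class name, the one-pickle-file-per-sample layout, and the sample contents. It defines only `__len__` and `__getitem__`, the same two methods a map-style torch.utils.data.Dataset defines.

```python
import pickle
from pathlib import Path


class PreprocessedSampleDataset:
    """Map-style dataset over preprocessed samples, one pickle file per sample.

    The on-disk layout and the sample keys are hypothetical.
    """

    def __init__(self, root, pattern="*.pkl"):
        # Sort so that sample order is stable across runs.
        self.paths = sorted(Path(root).glob(pattern))
        if not self.paths:
            raise FileNotFoundError(f"No samples matching {pattern!r} under {root}")

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        with open(self.paths[index], "rb") as handle:
            return pickle.load(handle)
```

In a real setup you would likely subclass torch.utils.data.Dataset, load tensors with torch.load, wrap the dataset in a DataLoader, and pass that loader to pl.Trainer(...).fit(...) together with the module that train_hydra.py initializes.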
Dataset
We release HM-World, a large-scale dataset tailored for hybrid memory research in dynamic video world models.
- Dataset page: KlingTeam/HM-World
- Focus: decoupled camera motion and subject motion
- Designed for: exit-entry events, dynamic continuity, and long-horizon memory evaluation
If you use HM-World in your work, please cite the paper below.
Acknowledgement
We thank the authors and teams behind the following projects and open-source efforts:
Citation
If you find this project useful in your research, please consider citing:
@article{chen2026out,
  title = {Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models},
  author = {Chen, Kaijin and Liang, Dingkang and Zhou, Xin and Ding, Yikang and Liu, Xiaoqiang and Wan, Pengfei and Bai, Xiang},
  journal = {arXiv preprint arXiv:2603.25716},
  year = {2026}
}