April 29, 2026

Out of Sight but Not Out of Mind:
Hybrid Memory for Dynamic Video World Models

Kaijin Chen¹, Dingkang Liang¹, Xin Zhou¹, Yikang Ding¹, Xiaoqiang Liu², Pengfei Wan², Xiang Bai¹

¹Huazhong University of Science and Technology ²Kling Team, Kuaishou Technology

Overview

Recent video world models are good at simulating static environments, but they still struggle with a core challenge of the real world: dynamic subjects frequently move out of view and later re-enter the scene. In these cases, many existing methods lose subject identity or motion continuity, producing frozen, distorted, or disappearing objects.

HyDRA is built for this setting. It introduces a hybrid memory mechanism that treats world modeling as two tasks at once: remembering stable scene structure and tracking dynamic subjects through unseen intervals.

To support this direction, we also introduce HM-World, a large-scale dataset designed for studying dynamic memory in video world models.

Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. We introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses contexts into memory tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
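To make the retrieval idea in the abstract more concrete, here is a toy sketch of relevance-driven lookup over memory tokens. It is not HyDRA's implementation: the function name `retrieve_memory_tokens`, the cosine-similarity score, and the plain top-k selection are assumptions chosen for illustration, and the paper's spatiotemporal relevance measure may differ.

```python
import numpy as np


def retrieve_memory_tokens(query, memory, k=2):
    """Return indices and scores of the k memory tokens most similar to query.

    query:  (d,) feature vector for the current context (hypothetical).
    memory: (n, d) array of compressed memory tokens (hypothetical).
    """
    query = np.asarray(query, dtype=np.float64)
    memory = np.asarray(memory, dtype=np.float64)
    # Cosine similarity stands in for the relevance score (an assumption).
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    scores = m @ q
    k = min(k, len(scores))
    # Sort by descending score and keep the k most relevant tokens.
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

The point of the sketch is only that generation attends to a small, relevant subset of stored tokens rather than to the full history.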

Highlights

A new problem setting for video world models: preserving subject identity and motion after out-of-view intervals.
HM-World dataset with 59K high-fidelity clips for hybrid memory research.
HyDRA architecture with memory tokenization and spatiotemporal relevance-based retrieval.
Open-source release of inference code, training skeleton, examples, and model checkpoints.

Overview
News and Roadmap
Getting Started
Inference
Training
Dataset
Citation

Generation Results

More visual results are available on the project homepage.

Experimental Results

News and Roadmap

Paper released
HM-World dataset released
HyDRA checkpoints and inference code released
HyDRA training code released

Getting Started

1. Clone the repository

git clone https://github.com/H-EmbodVis/HyDRA.git
cd HyDRA

2. Create the environment

conda create -n hydra python=3.10 -y
conda activate hydra
pip install -r requirements.txt

3. Download the base video model

HyDRA builds on Wan2.1-T2V-1.3B.

Base model: Wan-AI/Wan2.1-T2V-1.3B
Recommended location: ./ckpts/Wan2.1-T2V-1.3B/

Expected structure:

HyDRA/
|- ckpts/
|  |- hydra.ckpt
|  |- Wan2.1-T2V-1.3B/
|     |- Wan2.1_VAE.pth
|     |- diffusion_pytorch_model.safetensors
|     |- models_t5_umt5-xxl-enc-bf16.pth
|     |- ...
|- assets/
|- diffsynth/
|- examples/
|- infer_hydra.py
|- train_hydra.py
|- requirements.txt
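
Once steps 3 and 4 are both done, a short script can confirm that the layout matches the tree above. This is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and it only checks the paths spelled out in the tree (the `...` entries are skipped).

```python
from pathlib import Path

# Paths copied from the expected structure; "..." entries are not checked.
EXPECTED_FILES = [
    "ckpts/hydra.ckpt",
    "ckpts/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
    "ckpts/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "ckpts/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "infer_hydra.py",
    "train_hydra.py",
    "requirements.txt",
]
EXPECTED_DIRS = ["assets", "diffsynth", "examples"]


def missing_paths(repo_root="."):
    """Return the expected files and directories that are absent under repo_root."""
    root = Path(repo_root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing


if __name__ == "__main__":
    gaps = missing_paths(".")
    if gaps:
        print("Missing:")
        for path in gaps:
            print("  " + path)
    else:
        print("Layout looks complete.")
```

Run it from the HyDRA/ root. Any path it lists is one that the inference and training commands in this README expect to find.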

4. Download the HyDRA checkpoint

Checkpoint: H-EmbodVis/HyDRA
Recommended path: ./ckpts/hydra.ckpt
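
If both repositories are fetched from the Hugging Face Hub, the two downloads can be scripted. The sketch below assumes that `huggingface_hub` is installed and that both repo IDs resolve on the Hub; the helper names `plan_downloads` and `download_all` are hypothetical, and the file name of the checkpoint inside H-EmbodVis/HyDRA is not stated in this README.

```python
from pathlib import Path

# Repo IDs come from this README; hosting on the Hugging Face Hub is assumed.
DOWNLOADS = [
    {"repo_id": "Wan-AI/Wan2.1-T2V-1.3B", "local_dir": "ckpts/Wan2.1-T2V-1.3B"},
    {"repo_id": "H-EmbodVis/HyDRA", "local_dir": "ckpts"},
]


def plan_downloads(repo_root="."):
    """Pair each repo ID with its target directory under repo_root."""
    root = Path(repo_root)
    return [(d["repo_id"], root / d["local_dir"]) for d in DOWNLOADS]


def download_all(repo_root="."):
    """Fetch every repo in the plan (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download  # imported lazily

    for repo_id, target in plan_downloads(repo_root):
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(target))


if __name__ == "__main__":
    download_all(".")
```

If the downloaded checkpoint file is not named hydra.ckpt, either move it to ./ckpts/hydra.ckpt or pass its actual location through --ckpt_path.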

5. Run a quick sanity check

The repository already includes example videos, camera trajectories, and captions under ./examples.

python infer_hydra.py

Generated videos will be saved to ./outputs by default.

Inference

Run all packaged examples

python infer_hydra.py \
  --examples_dir ./examples \
  --ckpt_path ./ckpts/hydra.ckpt

Run a single custom case

python infer_hydra.py \
  --cond_video ./path/to/cond_video.mp4 \
  --cond_json ./path/to/camera.json \
  --caption_txt ./path/to/prompt.txt \
  --ckpt_path ./ckpts/hydra.ckpt \
  --output_path ./outputs/custom_concat.mp4
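
To run several custom cases in a row, the single-case command can be wrapped in a small driver. This is a sketch under assumptions: it uses only the flags shown above, and it presumes a hypothetical layout in which each case lives in its own directory containing cond_video.mp4, camera.json, and prompt.txt. The helper names `build_command` and `run_cases` are not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(case_dir, ckpt_path="./ckpts/hydra.ckpt", output_dir="./outputs"):
    """Build the infer_hydra.py argument list for one case directory."""
    case_dir = Path(case_dir)
    return [
        sys.executable, "infer_hydra.py",
        "--cond_video", str(case_dir / "cond_video.mp4"),
        "--cond_json", str(case_dir / "camera.json"),
        "--caption_txt", str(case_dir / "prompt.txt"),
        "--ckpt_path", str(ckpt_path),
        # One output file per case, named after the case directory.
        "--output_path", str(Path(output_dir) / f"{case_dir.name}.mp4"),
    ]


def run_cases(cases_root):
    """Run inference once per subdirectory of cases_root, stopping on failure."""
    for case_dir in sorted(p for p in Path(cases_root).iterdir() if p.is_dir()):
        subprocess.run(build_command(case_dir), check=True)
```

Calling `run_cases("./my_cases")` from the repository root would then write one video per case into ./outputs.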

Training

This repository provides the HyDRA model definition and a training skeleton. You can plug in your own dataset and DataLoader with PyTorch Lightning.

Data preparation

For each training sample:

Encode the source video into latent representations with the VAE.
Encode the caption into text embeddings with the text encoder.
Convert camera poses into the relative coordinate system expected by HyDRA.
Save the processed sample in a format your custom dataset loader can read.
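
The camera-pose step can be sketched as follows. This README does not define the relative coordinate system, so the code assumes one common convention: poses are 4x4 homogeneous camera-to-world matrices, and every pose is re-expressed relative to the first frame. The function name `to_relative_poses` is hypothetical; check the camera JSON files under ./examples and the repository code before relying on this convention.

```python
import numpy as np


def to_relative_poses(c2w):
    """Express camera-to-world poses relative to the first frame.

    c2w: (T, 4, 4) array of homogeneous camera-to-world matrices (assumed).
    Returns a (T, 4, 4) array whose first entry is the identity.
    """
    c2w = np.asarray(c2w, dtype=np.float64)
    if c2w.ndim != 3 or c2w.shape[1:] != (4, 4):
        raise ValueError("expected poses with shape (T, 4, 4)")
    # Left-multiplying by the inverse of frame 0 moves the world origin
    # to the first camera, so every pose becomes relative to it.
    first_inv = np.linalg.inv(c2w[0])
    return first_inv @ c2w
```

With this convention the first relative pose is always the identity, which makes a quick sanity check for processed samples.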

Initialize the training model

python train_hydra.py \
  --dit_path ./ckpts/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors \
  --hydra \
  --use_gradient_checkpointing

train_hydra.py initializes the training module but does not ship with a built-in dataset or trainer loop. To train end-to-end, connect it to your own Dataset, DataLoader, and pl.Trainer(...).fit(...).
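
One way to do that wiring is sketched below. It assumes each preprocessed sample was saved with `torch.save` as a dict of tensors in its own .pt file, and that the training module built by train_hydra.py is a LightningModule; the class name `PreprocessedClips`, the helpers `make_loader` and `fit`, and the file layout are all hypothetical.

```python
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class PreprocessedClips(Dataset):
    """Map-style dataset over .pt files, one preprocessed sample per file."""

    def __init__(self, root):
        self.files = sorted(Path(root).glob("*.pt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # Each file is assumed to hold a dict of tensors (hypothetical keys).
        return torch.load(self.files[index], map_location="cpu")


def make_loader(root, batch_size=1, num_workers=0):
    """Wrap the dataset in a shuffling DataLoader."""
    return DataLoader(PreprocessedClips(root), batch_size=batch_size,
                      shuffle=True, num_workers=num_workers)


def fit(module, root, max_epochs=1):
    """Train a LightningModule on the preprocessed samples."""
    import pytorch_lightning as pl  # imported lazily; must be installed

    trainer = pl.Trainer(max_epochs=max_epochs)
    trainer.fit(module, train_dataloaders=make_loader(root))
```

The default collate function stacks each dict entry along a new batch dimension, so all samples in a batch need matching tensor shapes per key.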

Dataset

We release HM-World, a large-scale dataset tailored for hybrid memory research in dynamic video world models.

Dataset page: KlingTeam/HM-World
Focus: decoupled camera motion and subject motion
Designed for: exit-entry events, dynamic continuity, and long-horizon memory evaluation

If you use HM-World in your work, please cite the paper below.

Acknowledgement

We thank the authors and teams behind the open-source projects that this work builds on.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2026out,
  title   = {Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models},
  author  = {Chen, Kaijin and Liang, Dingkang and Zhou, Xin and Ding, Yikang and Liu, Xiaoqiang and Wan, Pengfei and Bai, Xiang},
  journal = {arXiv preprint arXiv:2603.25716},
  year    = {2026}
}
