InSpatio-World

April 13, 2026

Requirements

  • Python 3.10
  • CUDA 12.1

1. Create conda environment:

conda env create -f environment.yml
conda activate inspatio_world

2. Install flash-attn:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Model Weights

Download the following model checkpoints into the checkpoints/ directory:

| Model | Purpose | Source |
|---|---|---|
| InSpatio-World-1.3B | v2v inference, 1.3B (Step 3) | HuggingFace |
| Wan2.1-T2V-1.3B | Text encoder + VAE + base model for 1.3B (Step 3) | HuggingFace |
| DA3 (Depth-Anything-3) | Depth estimation (Step 2) | HuggingFace |
| Florence-2-large | Video captioning (Step 1) | HuggingFace |
| TAEHV | Speed-up (optional) | GitHub |

bash scripts/download.sh

Expected directory structure after downloading:

checkpoints/
├── InSpatio-World-1.3B/
│   └── InSpatio-World-1.3B.safetensors
├── Wan2.1-T2V-1.3B/
├── DA3/
├── Florence-2-large/
└── taehv/
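
Before running inference, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: missing_checkpoints is a hypothetical helper, and the entries it checks are copied from the directory tree above (taehv/ is treated as optional because it is only needed for the TAE speed-up).

```python
from pathlib import Path

# Entries copied from the expected directory tree, relative to checkpoints/.
REQUIRED = [
    "InSpatio-World-1.3B/InSpatio-World-1.3B.safetensors",
    "Wan2.1-T2V-1.3B",
    "DA3",
    "Florence-2-large",
]
OPTIONAL = ["taehv"]  # only needed when using the TAE speed-up


def missing_checkpoints(root="checkpoints", include_optional=False):
    """Return the expected entries that do not exist under root."""
    expected = REQUIRED + (OPTIONAL if include_optional else [])
    return [rel for rel in expected if not (Path(root) / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("All checkpoints found." if not missing else f"Missing: {missing}")
```

Run it from the repository root; pass include_optional=True if you plan to use --use_tae.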

Inference

The full pipeline runs in three steps:

  1. Step 1: Generate video captions with Florence-2.
  2. Step 2: Estimate depth with DA3, convert the results to the inference format, and render point clouds.
  3. Step 3: Run InSpatio-World v2v inference.

All steps are wrapped in a single script:

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt 

Quick Start

# 1. Place your .mp4 video(s) in a folder
mkdir -p my_videos
cp your_video.mp4 my_videos/

# 2. Run the full pipeline
bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt

# 3. Results will be saved to ./output/my_videos/x_y_circle_cycle/
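
The result path in the example follows the pattern ./output/<name>/<traj>. As a small sketch of how that path appears to be derived, assuming <name> is the input folder's base name and <traj> is the trajectory file name without its extension (an inference from this example, not confirmed against the script), a hypothetical helper could look like this:

```python
from pathlib import Path


def default_output_folder(input_dir, traj_txt_path, root="./output"):
    """Guess the default output folder from the input dir and trajectory file.

    Assumption: <name> is the base name of input_dir and <traj> is the
    trajectory file stem, as the Quick Start example suggests.
    """
    name = Path(input_dir).name
    traj = Path(traj_txt_path).stem
    return f"{root}/{name}/{traj}"


if __name__ == "__main__":
    print(default_output_folder("./my_videos", "./traj/x_y_circle_cycle.txt"))
```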

Trajectory Control

The --traj_txt_path argument controls the camera trajectory for novel-view synthesis. Predefined trajectories are provided in the traj/ directory:

| File | Motion |
|---|---|
| x_y_circle_cycle.txt | Cyclic combined pitch + yaw orbit |
| zoom_out_in.txt | Dolly zoom out, then dolly zoom in |

Trajectory File Format

A trajectory file is a plain text file with 3 lines, each containing space-separated keyframe values that are automatically interpolated to match the output frame count:

<line 1>  pitch (degrees): positive = orbit up, negative = orbit down
<line 2>  yaw (degrees):   positive = orbit left, negative = orbit right
<line 3>  displacement:    relative camera displacement scale

Line 3 (displacement) is a relative scale multiplied by the scene's estimated foreground depth:

  • When pitch/yaw are non-zero, it controls the orbit radius (typically set to 1)
  • When both pitch and yaw are zero, it becomes a dolly zoom: positive = move forward (zoom in), negative = move backward (zoom out)
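
As a rough illustration of the file format and the displacement rules above, the sketch below writes and reads a three-line trajectory file, resamples keyframes to a target frame count, and classifies a keyframe as an orbit or a dolly move. Linear interpolation is an assumption here (the pipeline's actual scheme may differ), and write_trajectory, read_trajectory, resample, and camera_motion are hypothetical helper names, not functions from the InSpatio-World code.

```python
from pathlib import Path


def write_trajectory(path, pitch, yaw, displacement):
    """Write three lines of space-separated keyframes: pitch, yaw, displacement."""
    rows = (pitch, yaw, displacement)
    text = "\n".join(" ".join(str(v) for v in row) for row in rows) + "\n"
    Path(path).write_text(text)


def read_trajectory(path):
    """Return (pitch, yaw, displacement) as lists of floats."""
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 lines, found {len(lines)}")
    return tuple([float(v) for v in ln.split()] for ln in lines)


def resample(keyframes, num_frames):
    """Linearly resample keyframes to num_frames values (assumed scheme)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if len(keyframes) == 1 or num_frames == 1:
        return [keyframes[0]] * num_frames
    last = len(keyframes) - 1
    out = []
    for i in range(num_frames):
        pos = i * last / (num_frames - 1)  # fractional keyframe position
        lo = min(int(pos), last - 1)       # left neighbour index
        frac = pos - lo
        out.append(keyframes[lo] * (1 - frac) + keyframes[lo + 1] * frac)
    return out


def camera_motion(pitch_deg, yaw_deg, displacement, foreground_depth):
    """Classify one keyframe following the two displacement rules above.

    Returns (mode, magnitude), where magnitude is the displacement scaled
    by the scene's estimated foreground depth.
    """
    scaled = displacement * foreground_depth
    if pitch_deg != 0 or yaw_deg != 0:
        return ("orbit", scaled)  # scaled value acts as the orbit radius
    if displacement > 0:
        return ("dolly_in", scaled)
    if displacement < 0:
        return ("dolly_out", -scaled)
    return ("static", 0.0)
```

For example, write_trajectory("my_traj.txt", [0, 10, 0], [0, 20, 0], [1, 1, 1]) produces a file that orbits up and left and then returns, with the orbit radius equal to the foreground depth. The keyframe values are illustrative only.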

All Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| --input_dir | Yes | | Input folder containing .mp4 files |
| --traj_txt_path | Yes | | Trajectory file (e.g. ./traj/x_y_circle_cycle.txt) |
| --checkpoint_path | No | ./checkpoints/InSpatio-World/InSpatio-World.safetensors | InSpatio-World checkpoint |
| --config_path | No | configs/inference.yaml | Config file (inference_1.3b.yaml for 1.3B) |
| --da3_model_path | No | ./checkpoints/DA3 | DA3 depth model path |
| --florence_model_path | No | ./checkpoints/Florence-2-large | Florence-2 model path |
| --step1_gpus | No | 0 | GPU ID(s) for Step 1 (comma-separated for parallel) |
| --step2_gpus | No | 0 | GPU ID(s) for Step 2 (comma-separated for parallel) |
| --step3_gpus | No | 0 | GPU ID(s) for Step 3 |
| --step3_nproc | No | 1 | Number of GPUs for Step 3 |
| --output_folder | No | ./output/<name>/<traj> | Custom output directory |
| --master_port | No | 29513 | Master port for torchrun (Step 3) |
| --skip_step1 | No | false | Skip caption generation |
| --skip_step2 | No | false | Skip depth estimation |
| --skip_step3 | No | false | Skip v2v inference |
| --relative_to_source | No | false | Compose trajectory poses relative to initial view |
| --rotation_only | No | false | Only apply rotation from trajectory, ignore translation (tripod pan/tilt) |
| --disable_adaptive_frame | No | false | Disable adaptive frame expansion/subsampling (use original frame count as-is) |
| --freeze_repeat | No | 0 | Repeat a specific frame N extra times to create a time-freeze (pause) effect |
| --freeze_frame | No | middle frame | Frame index to freeze; defaults to the middle frame if not specified |
| --use_tae | No | false | Use Tiny Auto Encoder (TAE) instead of WanVAE |
| --tae_checkpoint_path | No | ./checkpoints/taehv/taew2_1.pth | Path to TAE checkpoint file (required when --use_tae is set) |
| --compile_dit | No | false | Apply torch.compile to the DiT model |
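
If you launch the pipeline from Python (for batch jobs, say), the arguments above can be assembled programmatically. This is a sketch under stated assumptions: build_pipeline_command is a hypothetical helper, boolean options are passed as bare switches (as in the shell examples in this README), and only argument names from the table are meaningful to the script.

```python
def build_pipeline_command(input_dir, traj_txt_path, **options):
    """Build an argv list for run_test_pipeline.sh.

    True  -> bare switch (e.g. --use_tae)
    False or None -> omitted
    other -> "--name value"
    """
    cmd = [
        "bash", "run_test_pipeline.sh",
        "--input_dir", str(input_dir),
        "--traj_txt_path", str(traj_txt_path),
    ]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            cmd.append(flag)
        elif value is False or value is None:
            continue
        else:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    print(build_pipeline_command(
        "./my_videos", "./traj/zoom_out_in.txt",
        use_tae=True, disable_adaptive_frame=True,
    ))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.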

Skip Already-Completed Steps

If Step 1 or Step 2 outputs already exist, you can skip them:

bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --skip_step1 --skip_step2

Generate Temporal Control Videos

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --freeze_repeat 150 \
  --output_folder ./output/example_freeze_repeat_150 \
  --disable_adaptive_frame

Two parameters control the time-freeze effect: --freeze_frame selects which frame to freeze (default: the middle frame), and --freeze_repeat sets how many extra times that frame is repeated, which determines the length of the pause.
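
To make the effect of these two parameters concrete, the sketch below lists which source frame each output frame would show. It only illustrates the semantics described above: frozen_frame_indices is a hypothetical helper, and taking the middle frame as num_frames // 2 is an assumption about how the default is chosen.

```python
def frozen_frame_indices(num_frames, freeze_repeat, freeze_frame=None):
    """Return source-frame indices with one frame held for extra repeats."""
    if freeze_frame is None:
        freeze_frame = num_frames // 2  # assumed definition of "middle frame"
    if not 0 <= freeze_frame < num_frames:
        raise ValueError("freeze_frame is out of range")
    before = list(range(freeze_frame + 1))          # up to and including the frozen frame
    pause = [freeze_frame] * freeze_repeat          # the extra repeats
    after = list(range(freeze_frame + 1, num_frames))
    return before + pause + after


if __name__ == "__main__":
    print(frozen_frame_indices(5, 2))
```

With 5 input frames and freeze_repeat=2, frame 2 is shown three times in total, so the output has 7 frames.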

Autonomous Driving Applications

bash run_test_pipeline.sh \
  --input_dir ./test/example3 \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --relative_to_source \
  --rotation_only \
  --disable_adaptive_frame

Speed Up

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --use_tae \
  --disable_adaptive_frame 

Switching from the WanVAE to TAE with --use_tae accelerates inference. Adding --compile_dit boosts speed further, reaching 24 fps on an H-series NVIDIA GPU with the 1.3B model. Note that compilation requires a relatively long warm-up the first time it is triggered, so it is best suited to service deployments where maximum speed matters.
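
Because of that warm-up, throughput is best measured after the first call has finished. The following generic timing sketch shows one way to do that; steady_state_fps and generate_chunk are hypothetical names, and generate_chunk stands in for whatever callable produces a fixed number of frames in your own setup.

```python
import time


def steady_state_fps(generate_chunk, frames_per_call, warmup_calls=1, timed_calls=5):
    """Measure frames per second, excluding warm-up calls from the timing."""
    for _ in range(warmup_calls):
        generate_chunk()  # first calls may include compilation time
    start = time.perf_counter()
    for _ in range(timed_calls):
        generate_chunk()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return frames_per_call * timed_calls / elapsed
```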

License

This project is licensed under the Apache-2.0 License. The license applies only to the code in our library; its dependencies and submodules (Depth-Anything-3, Florence-2, TAEHV) are separate and individually licensed.


Citation

If you use InSpatio-World in your research, please use the following BibTeX entry.

@article{inspatio-world,
    title={INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling},
    author={{InSpatio Team}},
    journal={arXiv preprint arXiv:2604.07209},
    year={2026}
}

Acknowledgement

InSpatio-World uses a backbone based on Wan2.1, and its training code references Self-Forcing. The TAE component for inference speed-up is built upon TAEHV. We sincerely thank the Self-Forcing, Wan, and TAEHV teams for their foundational work and open-source contributions. We also gratefully acknowledge Depth-Anything-3, Florence-2, and ReCamMaster for their excellent work that inspired and supported this project.