InSpatio-World

July 24, 2026

Requirements

Python 3.10
CUDA 12.1
FlashAttention-3 is optional and requires Hopper GPUs such as H100/H800 and a CUDA toolkit with nvcc >= 12.3.

1. Create conda environment:

conda env create -f environment.yml
conda activate inspatio_world

2. Install flash-attn:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

3. Optional: install FlashAttention-3 on Hopper GPUs:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install

The inference code automatically uses FlashAttention-3 when flash_attn_interface can be imported and the current GPU is a Hopper GPU (e.g. H100); otherwise it falls back to FlashAttention-2. FlashAttention-3 can further improve inference speed.
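The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `select_attention_backend` and the returned labels are hypothetical, and the compute capability is passed in as a plain tuple so the snippet runs without a GPU (with PyTorch installed it could come from `torch.cuda.get_device_capability()`). The sketch relies on Hopper GPUs reporting compute capability 9.x.

```python
import importlib.util


def select_attention_backend(capability, fa3_module="flash_attn_interface"):
    """Return "flash_attn_3" or "flash_attn_2" (labels are hypothetical).

    capability: (major, minor) CUDA compute capability, e.g. (9, 0) for H100.
    fa3_module: module whose presence signals a FlashAttention-3 install.
    """
    is_hopper = capability[0] == 9  # Hopper (H100/H800) reports 9.x
    has_fa3 = importlib.util.find_spec(fa3_module) is not None
    return "flash_attn_3" if (is_hopper and has_fa3) else "flash_attn_2"


if __name__ == "__main__":
    # An A100 reports (8, 0), so it falls back regardless of what is installed.
    print(select_attention_backend((8, 0)))
```

Checking for the module with `importlib.util.find_spec` avoids importing a compiled extension just to test whether it is present.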

Model Weights

Download the following model checkpoints into the checkpoints/ directory:

Model	Purpose	Source
InSpatio-World-1.3B	v2v inference — 1.3B (Step 3)	HuggingFace
Wan2.1-T2V-1.3B	Text encoder + VAE + base model for 1.3B (Step 3)	HuggingFace
DA3 (Depth-Anything-3)	Depth estimation (Step 2)	HuggingFace
Florence-2-large	Video captioning (Step 1)	HuggingFace
TAEHV	Speed-up (optional)	GitHub

bash scripts/download.sh

Expected directory structure after downloading:

checkpoints/
├── InSpatio-World-1.3B/
│   └── InSpatio-World-1.3B.safetensors
├── Wan2.1-T2V-1.3B/
├── DA3/
├── Florence-2-large/
└── taehv/
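Before running the pipeline, it can help to confirm that the layout above is in place. The following is a small sketch of such a check; the helper name `missing_checkpoints` is hypothetical and not part of the repository, and `taehv/` is treated as optional because the table lists TAEHV as an optional speed-up.

```python
from pathlib import Path

# Entries taken from the directory tree above (relative to checkpoints/).
EXPECTED = [
    "InSpatio-World-1.3B/InSpatio-World-1.3B.safetensors",
    "Wan2.1-T2V-1.3B",
    "DA3",
    "Florence-2-large",
]
OPTIONAL = ["taehv"]  # only relevant when --use_tae is set


def missing_checkpoints(root="checkpoints", include_optional=False):
    """Return the expected entries that do not exist under `root`."""
    names = EXPECTED + (OPTIONAL if include_optional else [])
    return [name for name in names if not (Path(root) / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints found" if not missing else f"missing: {missing}")
```

This only checks that the paths exist; it does not validate the contents of each model directory.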

Inference

The full pipeline runs in three steps:

Step 1: Generate video captions using Florence-2.
Step 2: Estimate depth with DA3, convert to inference format, and render point clouds.
Step 3: Run InSpatio-World v2v inference.

All steps are wrapped in a single script:

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt

Quick Start

# 1. Place your .mp4 video(s) in a folder
mkdir -p my_videos
cp your_video.mp4 my_videos/

# 2. Run the full pipeline
bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt

# 3. Results will be saved to ./output/my_videos/x_y_circle_cycle/
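The default output location follows the pattern ./output/<name>/<traj>, where <name> is the input folder name and <traj> is the trajectory file name without its extension. The sketch below shows how that path can be derived for scripting purposes; `default_output_folder` is a hypothetical helper and only mirrors the documented pattern, not the script's internals.

```python
from pathlib import Path


def default_output_folder(input_dir, traj_txt_path, root="./output"):
    """Mirror the documented ./output/<name>/<traj> layout."""
    name = Path(input_dir).name      # "./my_videos" -> "my_videos"
    traj = Path(traj_txt_path).stem  # "x_y_circle_cycle.txt" -> "x_y_circle_cycle"
    return (Path(root) / name / traj).as_posix()


if __name__ == "__main__":
    print(default_output_folder("./my_videos", "./traj/x_y_circle_cycle.txt"))
```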

Trajectory Control

The --traj_txt_path argument controls the camera trajectory for novel-view synthesis. Predefined trajectories are provided in the traj/ directory:

File	Motion
`x_y_circle_cycle.txt`	Cyclic combined pitch + yaw orbit
`zoom_out_in.txt`	Dolly zoom out followed by dolly zoom in

Trajectory File Format

A trajectory file is a plain text file with 3 lines, each containing space-separated keyframe values that are automatically interpolated to match the output frame count:

<line 1>  pitch (degrees): positive = orbit up, negative = orbit down
<line 2>  yaw (degrees):   positive = orbit left, negative = orbit right
<line 3>  displacement:    relative camera displacement scale

Line 3 (displacement) is a relative scale multiplied by the scene's estimated foreground depth:

When pitch or yaw is non-zero, it controls the orbit radius (typically set to 1).
When both pitch and yaw are zero, it becomes a dolly zoom: positive = move forward (zoom in), negative = move backward (zoom out).
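To make the format concrete, here is a minimal sketch that parses a three-line trajectory and resamples its keyframes to a target frame count. The function names are hypothetical, and linear interpolation is an assumption: the text above says keyframes are interpolated automatically but does not state which scheme the pipeline uses. `scaled_displacement` only restates the rule that line 3 is multiplied by the estimated foreground depth.

```python
def parse_trajectory(text):
    """Parse the three-line format into pitch, yaw, and displacement keyframes."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != 3:
        raise ValueError(f"expected 3 non-empty lines, got {len(rows)}")
    return dict(zip(("pitch", "yaw", "displacement"), rows))


def resample(keyframes, num_frames):
    """Linearly interpolate keyframes to num_frames values (assumed scheme)."""
    if num_frames < 1 or not keyframes:
        raise ValueError("need at least one keyframe and one output frame")
    if len(keyframes) == 1 or num_frames == 1:
        return [keyframes[0]] * num_frames
    last = len(keyframes) - 1
    out = []
    for i in range(num_frames):
        pos = i * last / (num_frames - 1)  # fractional keyframe index
        lo = min(int(pos), last - 1)       # left neighbour, clamped at the end
        t = pos - lo
        out.append(keyframes[lo] * (1 - t) + keyframes[lo + 1] * t)
    return out


def scaled_displacement(displacement, foreground_depth):
    """Relative displacement times estimated foreground depth."""
    return displacement * foreground_depth


if __name__ == "__main__":
    # Illustrative values only, not one of the predefined trajectory files.
    traj = parse_trajectory("0 15 0\n0 30 0\n1 1 1\n")
    print(resample(traj["yaw"], 5))
```

To read an actual file, pass `Path(path).read_text()` from `pathlib` to `parse_trajectory`.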

All Arguments

Argument	Required	Default	Description
`--input_dir`	Yes	—	Input folder containing `.mp4` files
`--traj_txt_path`	Yes	—	Trajectory file (e.g. `./traj/x_y_circle_cycle.txt`)
`--checkpoint_path`	No	`./checkpoints/InSpatio-World/InSpatio-World.safetensors`	InSpatio-World checkpoint
`--config_path`	No	`configs/inference.yaml`	Config file (`inference_1.3b.yaml` for 1.3B)
`--da3_model_path`	No	`./checkpoints/DA3`	DA3 depth model path
`--florence_model_path`	No	`./checkpoints/Florence-2-large`	Florence-2 model path
`--step1_gpus`	No	`0`	GPU ID(s) for Step 1 (comma-separated for parallel)
`--step2_gpus`	No	`0`	GPU ID(s) for Step 2 (comma-separated for parallel)
`--step3_gpus`	No	`0`	GPU ID(s) for Step 3
`--step3_nproc`	No	`1`	Number of GPUs for Step 3
`--output_folder`	No	`./output/<name>/<traj>`	Custom output directory
`--master_port`	No	`29513`	Master port for torchrun (Step 3)
`--skip_step1`	No	false	Skip caption generation
`--skip_step2`	No	false	Skip depth estimation
`--skip_step3`	No	false	Skip v2v inference
`--relative_to_source`	No	false	Compose trajectory poses relative to initial view
`--rotation_only`	No	false	Only apply rotation from trajectory, ignore translation (tripod pan/tilt)
`--render_backend`	No	`warper`	Rendering backend: `warper` uses fast depth-image warping and skips point-cloud saving; `ply` uses the original point-cloud renderer
`--disable_adaptive_frame`	No	false	Disable adaptive frame expansion/subsampling (use original frame count as-is)
`--freeze_repeat`	No	`0`	Repeat a specific frame N extra times to create a time-freeze (pause) effect
`--freeze_frame`	No	middle frame	Frame index to freeze; defaults to the middle frame if not specified
`--use_tae`	No	false	Use Tiny Auto Encoder (TAE) instead of WanVAE
`--tae_checkpoint_path`	No	`./checkpoints/taehv/taew2_1.pth`	Path to TAE checkpoint file (required when --use_tae is set)
`--compile_dit`	No	false	Apply torch.compile to the DiT model
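When the pipeline is driven from another program, the arguments in the table can be assembled programmatically. The sketch below builds an argument list for `run_test_pipeline.sh`; `build_pipeline_command` is a hypothetical helper, and it assumes that boolean options are bare switches, as in the `--skip_step1 --skip_step2` and `--use_tae` examples in this README.

```python
import shlex


def build_pipeline_command(input_dir, traj_txt_path, **options):
    """Assemble an argument list for run_test_pipeline.sh.

    True emits a bare switch (skip_step1=True -> --skip_step1),
    False/None emits nothing, and any other value becomes "--name value".
    """
    cmd = ["bash", "run_test_pipeline.sh",
           "--input_dir", str(input_dir),
           "--traj_txt_path", str(traj_txt_path)]
    for name, value in options.items():
        if value is None or value is False:
            continue
        cmd.append(f"--{name}")
        if value is not True:
            cmd.append(str(value))
    return cmd


if __name__ == "__main__":
    cmd = build_pipeline_command(
        "./my_videos", "./traj/x_y_circle_cycle.txt",
        skip_step1=True, freeze_repeat=150,
    )
    print(shlex.join(cmd))  # the list can also be passed to subprocess.run(cmd)
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.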

Skip Already-Completed Steps

If Step 1 or Step 2 outputs already exist, you can skip them:

bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --skip_step1 --skip_step2

Generate Temporal Control Videos

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --freeze_repeat 150 \
  --output_folder ./output/example_freeze_repeat_150 \
  --disable_adaptive_frame

Two parameters control the time-freeze effect: --freeze_frame selects which frame to freeze (the middle frame by default), and --freeze_repeat sets how many extra times that frame is repeated, which determines the length of the pause.
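The effect of the two parameters on the frame sequence can be illustrated with a short sketch. This is not the pipeline's implementation: `freeze_frame_indices` is a hypothetical helper, and taking `num_frames // 2` as the "middle frame" is an assumption.

```python
def freeze_frame_indices(num_frames, freeze_repeat=0, freeze_frame=None):
    """Return source-frame indices with one frame repeated freeze_repeat extra times."""
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if freeze_frame is None:
        freeze_frame = num_frames // 2  # assumed definition of the middle frame
    if not 0 <= freeze_frame < num_frames:
        raise ValueError("freeze_frame is out of range")
    indices = list(range(num_frames))
    # Keep everything up to the frozen frame, pad with repeats, then continue.
    return (indices[:freeze_frame + 1]
            + [freeze_frame] * freeze_repeat
            + indices[freeze_frame + 1:])


if __name__ == "__main__":
    print(freeze_frame_indices(5, freeze_repeat=2))
```

The output length is always `num_frames + freeze_repeat`, so a repeat count of 150 extends the sequence by 150 frames.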

Autonomous Driving Applications

bash run_test_pipeline.sh \
  --input_dir ./test/example3 \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --relative_to_source \
  --rotation_only \
  --disable_adaptive_frame

Speed Up

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --use_tae \
  --disable_adaptive_frame

Switching from the WanVAE to TAE with --use_tae accelerates inference. Adding --compile_dit boosts speed further, reaching 24 fps on an H-series NVIDIA GPU with the 1.3B model. Note that torch.compile needs a relatively long warm-up the first time it is triggered, so this option is best suited to service deployments where maximum speed matters.

License

This project is licensed under the Apache-2.0 License. This license applies only to the code in our library; its dependencies and submodules (Depth-Anything-3, Florence-2, TAEHV) are separate and individually licensed.

Citation

If you use InSpatio-World in your research, please use the following BibTeX entry.

@misc{inspatio-world,
    title={INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling},
    author={InSpatio Team},
    journal={arXiv preprint arXiv:2604.07209},
    year={2026}
}

Acknowledgement

InSpatio-World uses a backbone based on Wan2.1, and its training code references Self-Forcing. The TAE component for inference speed-up is built upon TAEHV. We sincerely thank the Self-Forcing, Wan and TAEHV teams for their foundational work and open-source contributions. We also gratefully acknowledge Depth-Anything-3, Florence-2 and ReCamMaster for their excellent work, which inspired and supported this project.