🧬 Generative Spatial Transformer (GST)

August 9, 2025

PyTorch implementation of GST from "Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction".

arXiv (coming soon) | project page | huggingface weights

✨ News

  • 2025-2: Code is released.

🛠️ Installation

  1. Environment setup

conda create -n gst python=3.8
conda activate gst
pip install -r requirements.txt

  2. Model weight download

We provide the image tokenizer, camera tokenizer, and auto-regressive model as huggingface weights. Please download the following three checkpoints and place them in the ./ckpts folder.

image-16.pt   # Image tokenizer, adopted from LlamaGen
camera-4.pt   # Camera tokenizer
gst.pt        # Auto-regressive model
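
Before running inference, it can help to confirm that all three checkpoints are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it only assumes the file names and the ./ckpts location given above.

```python
from pathlib import Path

# File names come from the list above; ./ckpts is the folder this README suggests.
EXPECTED_CKPTS = ("image-16.pt", "camera-4.pt", "gst.pt")


def missing_checkpoints(ckpt_dir="./ckpts"):
    """Return the expected checkpoint names that are not present in ckpt_dir.

    Hypothetical helper for illustration; it checks file existence only
    and does not validate the contents of the checkpoints.
    """
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_CKPTS if not (root / name).is_file()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list once all three files have been downloaded.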

🚀 Inference

GST models the joint distribution of images and their corresponding camera perspectives. Use the following command to sample --num-sample perspectives and images conditioned on the observation given by --image-path.

python run_sample_camera_image.py \
    --image-ckpt   /path/to/image-16.pt  \
    --gpt-ckpt     /path/to/gst.pt \
    --camera-ckpt  /path/to/camera-4.pt \
    --image-path assets/hydrant.jpg \
    --num-sample 16 
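
To sample from several input images in a row, the command above can be assembled from Python. This is a rough sketch under stated assumptions: the helper names `build_sample_command` and `run_sampling` are hypothetical, only the flags shown above are used, and the checkpoints are assumed to live in ./ckpts.

```python
import subprocess
import sys


def build_sample_command(image_path, num_sample=16, ckpt_dir="./ckpts"):
    """Assemble the argument list for run_sample_camera_image.py.

    Only the flags documented in this README are used; ckpt_dir is an
    assumption about where the checkpoints were placed.
    """
    return [
        sys.executable, "run_sample_camera_image.py",
        "--image-ckpt", f"{ckpt_dir}/image-16.pt",
        "--gpt-ckpt", f"{ckpt_dir}/gst.pt",
        "--camera-ckpt", f"{ckpt_dir}/camera-4.pt",
        "--image-path", str(image_path),
        "--num-sample", str(num_sample),
    ]


def run_sampling(image_path, num_sample=16, ckpt_dir="./ckpts"):
    """Run the sampling script; raises CalledProcessError if it fails."""
    subprocess.run(build_sample_command(image_path, num_sample, ckpt_dir), check=True)
```

For example, `run_sampling("assets/hydrant.jpg")` issued from the repository root is intended to be equivalent to the shell command above.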

Additional optional parameters are listed in the script run_sample_camera_image.py. After sampling, the results are saved in the sample folder, which is structured as follows:

sample
├── camera.ply      # 3D position and orientation of the sampled perspectives
├── images.obj      # Images corresponding to each perspective
│
├── material_0.png  # Texture
├── material_1.png
├── ...
├── material.mtl    # Texture mapping of 3D files
│
├── sample_0.png    # Sampled image
├── sample_0.npy    # Camera matrix converted from the sampled camera
├── sample_1.png
├── sample_1.npy
└── ...
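
For downstream processing, each sampled image can be matched with its camera file by index. The sketch below is illustrative only: the helper name `pair_samples` is hypothetical, and it relies solely on the sample_N.png / sample_N.npy naming shown in the tree above.

```python
import re
from pathlib import Path


def pair_samples(sample_dir="sample"):
    """Return [(index, png_path, npy_path), ...] sorted by sample index.

    Hypothetical helper: images without a matching .npy file are skipped,
    and files such as material_0.png are ignored.
    """
    root = Path(sample_dir)
    pairs = []
    for png in root.glob("sample_*.png"):
        match = re.fullmatch(r"sample_(\d+)", png.stem)
        if match is None:
            continue
        npy = png.with_suffix(".npy")
        if npy.is_file():
            pairs.append((int(match.group(1)), png, npy))
    # Sort numerically so that sample_10 comes after sample_2.
    return sorted(pairs, key=lambda item: item[0])
```

Each .npy file can then be read with `numpy.load`. The exact layout of the stored camera matrix is not specified here, so inspect its shape before relying on it.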

GST uses the RDF coordinate system: the positive x-axis points right (R), the positive y-axis points down (D), and the positive z-axis points forward (F). The sampled ply and obj files can be opened in MeshLab or other 3D software.
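
Some 3D tools expect a y-up, z-backward camera convention (right, up, back, as commonly used with OpenGL) rather than RDF. As a minimal sketch, assuming that target convention is the one you need, a point can be converted by negating its y and z components. The function name `rdf_to_rub` is hypothetical and not part of the repository.

```python
def rdf_to_rub(point):
    """Convert an (x, y, z) point from RDF (right, down, forward)
    to RUB (right, up, back) by negating y and z.

    The same operation converts back from RUB to RDF, because
    negating twice restores the original values.
    """
    x, y, z = point
    return (x, -y, -z)
```

For instance, a point one unit in front of the camera in RDF, `(0.0, 0.0, 1.0)`, sits at z = -1 in the RUB convention.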

📃 License

The majority of this project is licensed under the MIT License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.

✨ Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{chen2024and,
  title={Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction},
  author={Chen, Junyi and Huang, Di and Ye, Weicai and Ouyang, Wanli and He, Tong},
  journal={arXiv preprint arXiv:2410.18962},
  year={2024}
}

💖 Acknowledgement

We would like to express our gratitude to the contributors of the codebase provided by LlamaGen, which served as the foundation for our work. Special thanks are extended to the pioneering contributions of Zero123, ZeroNVS and RayDiffusion within the field, which have enriched our understanding and inspired our endeavors.