February 18, 2026

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy

Introduction

We introduce Puffin, a camera-centric unified multimodal model designed to advance spatial intelligence. A single model both understands and generates views of the world from arbitrary viewpoints and orientations.

📝 Changelog & News

2026.01.26: Puffin has been accepted at ICLR 2026.
2026.01.15: Puffin-4M dataset reached 20,000 downloads on Hugging Face within three months of release.
2026.01.10: The camera-centric evaluation scripts have been released.
2025.10.10: The paper, project page, code, model, dataset, and demo of Puffin are online.
To do: release the scripts of the dataset construction pipeline.
To do: release the camera captions (produced by our method) for commonly used large-scale text-to-image datasets, such as megalith-10m.

🖥️ Requirements and Installation

The code has been implemented with PyTorch 2.7.0 and CUDA 12.6.

An example of installation commands is provided as follows:

# git clone this repository
git clone https://github.com/KangLiao929/Puffin
cd Puffin

# create new anaconda env
conda create -n Puffin python=3.10
conda activate Puffin

# install python dependencies
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
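
After installation, a quick check can confirm that the environment picked up the pinned PyTorch build. The snippet below is a minimal sketch, not part of the repository: the helper name `matches_pinned` is hypothetical, and it only compares the version string while ignoring a local build tag such as `+cu126`.

```python
EXPECTED_TORCH = "2.7.0"  # version pinned in the install command above


def matches_pinned(installed: str, expected: str = EXPECTED_TORCH) -> bool:
    """Compare versions while ignoring a local build tag such as '+cu126'."""
    return installed.split("+", 1)[0] == expected


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print("torch version:", torch.__version__)
        print("matches pinned version:", matches_pinned(torch.__version__))
        print("CUDA build:", torch.version.cuda)
        print("CUDA available:", torch.cuda.is_available())
```

If the CUDA build is reported as `None`, a CPU-only wheel was installed and the `--index-url` in the install command is worth rechecking.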

🏂 Demo & Quick Inference

We release three model variants: Puffin-Base, Puffin-Thinking, and Puffin-Instruct, to accommodate different application needs. Puffin-Base provides a foundation model for unified camera-centric understanding and generation; Puffin-Thinking enhances spatial reasoning and generation by thinking with camera; and Puffin-Instruct is optimized by instruction tuning, supporting cross-view tasks and complex multimodal interactions.

Download the model checkpoints from 🤗 KangLiao/Puffin and organize them as follows:

Puffin/
├── checkpoints
    ├── Puffin-Align.pth # provided for customized SFT
    ├── Puffin-Base.pth
    ├── Puffin-Thinking.pth
    └── Puffin-Instruct.pth

We recommend downloading the checkpoints with the following command:

# pip install -U "huggingface_hub[cli]"
huggingface-cli download KangLiao/Puffin  --local-dir checkpoints --repo-type model
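
Before running the demos, it can help to confirm that the download produced the layout shown above. The following is a small sketch under the assumption that the four `.pth` files sit directly inside `checkpoints/`; the function name `missing_checkpoints` is illustrative and not part of the repository.

```python
from pathlib import Path

# File names taken from the directory layout shown above.
EXPECTED_CHECKPOINTS = (
    "Puffin-Align.pth",
    "Puffin-Base.pth",
    "Puffin-Thinking.pth",
    "Puffin-Instruct.pth",
)


def missing_checkpoints(checkpoint_dir: str = "checkpoints") -> list[str]:
    """Return the expected checkpoint files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All expected checkpoints are present.")
```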

Camera-controllable Image Generation

Images are generated from a text prompt and camera prompts (roll: -r, pitch: -p, vertical field of view: -f, all in radians) using the following command:

export PYTHONPATH=./:$PYTHONPATH
python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
          --checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
          --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
          -r -0.3939 -p 0.0277 -f 0.7595
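
Because the camera prompts are angles in radians, it may be more convenient to think in degrees and convert. The helper below is a sketch of that conversion; `camera_flags` is a hypothetical name, and only the `-r`, `-p`, and `-f` flags come from the command above.

```python
import math


def camera_flags(roll_deg: float, pitch_deg: float, vfov_deg: float) -> list[str]:
    """Convert angles in degrees to the radian-valued -r/-p/-f arguments."""
    values = {"-r": roll_deg, "-p": pitch_deg, "-f": vfov_deg}
    flags = []
    for flag, degrees in values.items():
        flags += [flag, f"{math.radians(degrees):.4f}"]
    return flags


if __name__ == "__main__":
    # Roughly the camera used in the command above, expressed in degrees.
    print(" ".join(camera_flags(roll_deg=-22.57, pitch_deg=1.59, vfov_deg=43.5)))
```

The printed string can be appended to the generation command in place of the hand-written `-r ... -p ... -f ...` arguments.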

To enable the thinking mode for image generation, switch to the thinking config and checkpoint and append the --thinking flag:

python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
          --checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
          --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
          -r -0.3939 -p 0.0277 -f 0.7595 \
          --thinking

Camera Understanding

The camera understanding results (scene descriptions and camera parameters) can be obtained using the following command:

python scripts/demo/understanding.py configs/pipelines/stage_2_base.py \
          --checkpoint checkpoints/Puffin-Base.pth --image_path assets/test_img/test.jpg \
          --save_dir vis_results/

The visualization results (pixel-wise camera maps) can also be found at --save_dir.
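
To give an intuition for what a pixel-wise camera map can encode, the sketch below computes the latitude of the viewing ray through a single pixel from roll, pitch, and vertical field of view. It assumes an ideal pinhole camera and a particular axis and sign convention (stated in the docstring); the exact map format and conventions used by Puffin are not specified here, so treat this as an illustration rather than the repository's implementation.

```python
import math


def pixel_latitude(u, v, width, height, roll, pitch, vfov):
    """Latitude (radians) of the ray through pixel (u, v) for a pinhole camera.

    Assumed conventions: x right, y down, z forward; positive pitch looks up;
    all angles in radians.
    """
    focal = (height / 2.0) / math.tan(vfov / 2.0)  # focal length in pixels
    # Ray direction through the pixel, relative to the image centre.
    ray = ((u - width / 2.0) / focal, (v - height / 2.0) / focal, 1.0)
    norm = math.sqrt(sum(c * c for c in ray))
    # World "up" direction expressed in the camera frame.
    up = (
        -math.sin(roll) * math.cos(pitch),
        -math.cos(roll) * math.cos(pitch),
        math.sin(pitch),
    )
    cos_to_up = sum(r * g for r, g in zip(ray, up)) / norm
    return math.asin(max(-1.0, min(1.0, cos_to_up)))
```

Under these assumptions the centre pixel has a latitude equal to the pitch, and with zero roll and pitch the top edge of the image sits at half the vertical field of view.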

As with camera-controllable generation, the thinking mode is enabled by switching to the thinking config and checkpoint and appending the --thinking flag:

python scripts/demo/understanding.py configs/pipelines/stage_3_thinking.py \
          --checkpoint checkpoints/Puffin-Thinking.pth --image_path assets/test_img/test.jpg \
          --save_dir vis_results/ \
          --thinking

World Exploration

The target view is generated from an initial view and camera prompts (roll: -r, pitch: -p, yaw: -y, all in radians) using the following command:

python scripts/demo/world_exploration.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --init_image assets/test_img/test_cross_view.jpg \
          --output world_exploration_result.jpg \
          -r 0.1 -p -0.1 -y 0.2
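
To explore several viewpoints from the same initial view, the command above can be generated in a loop. The sketch below only builds and prints the commands; the helper name `exploration_command`, the `sweep` output directory, and the chosen yaw values are assumptions, and the range of yaw angles the model handles well is not stated here.

```python
import shlex


def exploration_command(yaw, roll=0.0, pitch=0.0,
                        init_image="assets/test_img/test_cross_view.jpg",
                        output_dir="sweep"):
    """Build the argument list for one world-exploration call (angles in radians)."""
    return [
        "python", "scripts/demo/world_exploration.py",
        "configs/pipelines/stage_4_instruction_tuning.py",
        "--checkpoint", "checkpoints/Puffin-Instruct.pth",
        "--init_image", init_image,
        "--output", f"{output_dir}/yaw_{yaw:+.2f}.jpg",  # hypothetical naming scheme
        "-r", f"{roll:.4f}", "-p", f"{pitch:.4f}", "-y", f"{yaw:.4f}",
    ]


if __name__ == "__main__":
    for yaw in (-0.4, -0.2, 0.2, 0.4):
        print(shlex.join(exploration_command(yaw)))
```

Each printed line can be run from the repository root; create the output directory first, since the script may not create it.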

The same process extends to 3D world generation in the style of world models (e.g., Figure A8 in the paper), where multi-view results are generated around an initial view:

python scripts/demo/world_exploration_3D.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --init_view_path assets/test_img/ \
          --output world_exploration_3D/

Spatial Imagination

Given an initial view and the expected location (left, behind, or right), Puffin can imagine the scene description of the target view using the following command:

python scripts/demo/spatial_imagination.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --image assets/test_img/test_cross_view.jpg \
          --location behind

Photographic Guidance

Puffin can suggest camera parameter adjustments from an initial view to achieve images with higher photographic aesthetics. The deviation (pitch and yaw) between the target image and initial image can be obtained using the following command:

python scripts/demo/photographic_guidance.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --image assets/test_img/test_cross_view.jpg

Puffin-4M Dataset

Datasets and benchmarks that span vision, language, and camera modalities remain scarce in the domain of spatial multimodal intelligence. To address this gap, we introduce Puffin-4M, a large-scale, high-quality dataset comprising 4 million vision-language-camera triplets. We release the training data and evaluation benchmark on 🤗 KangLiao/Puffin-4M. The whole dataset is approximately 449 GB in size. Note that we omit the camera maps from the uploaded training data due to their large total size (~3 MB each, amounting to ~11.4 TB in total). However, these maps can be easily generated from the captions using the following command:

python scripts/camera/cam_dataset.py \
          --input_root Puffin-4M/training_data/cap_folder \
          --output_root Puffin-4M/training_data/cam_folder
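
The storage estimate for the camera maps can be reproduced with a short calculation, which is useful when planning disk space before regenerating them. This sketch assumes exactly 4 million maps at 3 MiB each and binary units; the real per-map size is only approximate.

```python
NUM_SAMPLES = 4_000_000      # "4 million" triplets, taken as an exact count
MAP_SIZE_MIB = 3             # "~3 MB each", read here as 3 MiB

total_mib = NUM_SAMPLES * MAP_SIZE_MIB
total_tib = total_mib / 1024**2   # MiB -> TiB

print(f"{total_tib:.1f} TiB")  # prints "11.4 TiB"
```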

The construction pipeline scripts for Puffin-4M will be released in Dataset Pipeline soon.

✈️ Training

We adopt a multi-stage training strategy. In the first stage, the vision encoder, LLM, and diffusion model are aligned. Then, in the SFT stage, the models are jointly optimized using both base and thinking datasets. Finally, an instruction-tuning stage covers various cross-view generation and understanding tasks. The implementation details are provided in Training.

🖼️ Evaluation

We evaluate our camera-centric generation and understanding performance on public datasets and our constructed benchmark (🤗 KangLiao/Puffin-4M/benchmark).

For camera understanding, we conduct evaluations on three common datasets: MegaDepth, TartanAir, and LaMAR. Notably, images from these datasets are primarily captured or simulated in well-structured environments, and the camera parameters in some of them cover only a limited distribution. To complement these settings, we construct a more challenging dataset, Puffin-Und, designed for a comprehensive assessment of camera understanding. This dataset contains 1,000 images spanning diverse camera configurations and scenarios (🤗 KangLiao/Puffin-4M/benchmark/Puffin-Und). Additionally, since no benchmark dataset exists for text-to-image generation with precise camera parameters, we construct Puffin-Gen to fill this gap. The dataset consists of 650 caption–camera pairs spanning diverse scenarios and camera configurations (🤗 KangLiao/Puffin-4M/benchmark/Puffin-Gen). The evaluation details are provided in Evaluation.

📚 Citation

If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

  @article{liao2025puffin,
    title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
    author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
    journal={arXiv preprint arXiv:2510.08673},
    year={2025}
  }

🗞️ License

This project is licensed under the NTU S-Lab License 1.0.

🙏 Acknowledgement

The project builds upon OpenUni, MetaQuery, Qwen2.5, RADIOv3, SD3, and GeoCalib.