RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

July 16, 2025

https://github.com/user-attachments/assets/0c4448a4-93f3-4a63-acc7-488657439e37

Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao†
University of Science and Technology of China
† corresponding author

🎉🎉 Our paper, “RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models”, has been accepted by ICCV 2025! Our project page.

Requirements

Training was conducted on 2 A100 GPUs (80GB VRAM); inference was tested on 1 A100 GPU.

Setup

git clone https://github.com/Lyne1/Realgeneral.git
cd Realgeneral

Environment

All tests were conducted on Linux, and we recommend running the code there. To set up the environment, run:

conda create -n RealGeneral python=3.10 -y
conda activate RealGeneral

bash env.sh

🔗 Checkpoints

  1. CogVideoX-1.5 T2V Checkpoint. Download the pre-trained CogVideoX-1.5 (Text-to-Video) checkpoint from this link and place the entire folder under the pretrained_weights directory. The resulting structure should look like this:

    ./pretrained_weights/CogVideoX1.5-5B
    
  2. RealGeneral Checkpoints. Download the pre-trained RealGeneral checkpoints from this link. We provide two LoRA checkpoints for different tasks:

    • Subject-driven generation
    • Canny-to-image translation

    Place them under pretrained_weights:

    ./pretrained_weights/IP-LoRA
    ./pretrained_weights/Canny2image-LoRA
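
Before running inference, it can help to confirm that the three folders listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the folder names are simply copied from the layout above.

```python
from pathlib import Path

# Folder names copied from the layout above; this helper is a hypothetical
# convenience check and is not shipped with the repository.
EXPECTED_DIRS = ("CogVideoX1.5-5B", "IP-LoRA", "Canny2image-LoRA")


def missing_checkpoints(root="./pretrained_weights"):
    """Return the expected checkpoint folders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list when every folder is present, and otherwise the names that still need to be downloaded.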
    

🎨 Inference

cd inference
bash run_ip.sh          # For subject-driven image generation

# For other tasks:
# bash run_canny.sh

Note: For tasks other than subject-driven generation, a task-specific description is appended to the prompt automatically during inference. This is handled internally according to the task type, so you do not need to add it yourself. For example, a canny2image task appends “The image has the specific canny map” to your original prompt.

The supported task types and their corresponding additions are:

Task Type      Appended Description
canny2image    The image has the specific canny map
depth2image    The image has the specific depth map
image2depth    The image has the specific depth map
deblurring     The image has a blur map
filling        The image has the specific filling map
coloring       The image has the specific grey map
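
To make the effect of this mapping concrete, here is a small illustrative sketch of how a suffix could be attached to a prompt. It is not the repository's actual implementation: the names `TASK_SUFFIXES` and `build_prompt` are hypothetical, and joining with a single space is an assumption. Only the task names and descriptions come from the table above.

```python
# Task names and descriptions copied from the table above.
TASK_SUFFIXES = {
    "canny2image": "The image has the specific canny map",
    "depth2image": "The image has the specific depth map",
    "image2depth": "The image has the specific depth map",
    "deblurring": "The image has a blur map",
    "filling": "The image has the specific filling map",
    "coloring": "The image has the specific grey map",
}


def build_prompt(prompt, task_type):
    """Append the task description, if the task type has one.

    Assumption: the description is joined to the prompt with one space.
    Task types without an entry (such as subject-driven generation)
    leave the prompt unchanged.
    """
    suffix = TASK_SUFFIXES.get(task_type)
    if suffix is None:
        return prompt
    return f"{prompt.rstrip()} {suffix}"
```

Knowing the final prompt text is mainly useful when writing your own prompts, so that you do not repeat the task description by hand.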

๐Ÿ‹๏ธ Train on Your Own Data

1. Dataset Preparation

Your dataset directory should be structured as follows:

.
├── videos/                 # Folder containing video files
├── videos.txt              # List of video file paths
├── prompts.txt             # Text prompts for each video
└── instance.txt            # (Optional) Subject words for subject-driven generation
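
A quick sanity check of this layout can catch mistakes before a training run starts. The sketch below is a hypothetical helper, not part of the repository. It assumes that videos.txt and prompts.txt hold one entry per line and that line i of prompts.txt describes line i of videos.txt; the text above does not state the file format explicitly, so treat that as an assumption.

```python
from pathlib import Path


def check_dataset(root):
    """Return a list of problems found in a dataset folder laid out as above."""
    root = Path(root)
    problems = []

    if not (root / "videos").is_dir():
        problems.append("missing videos/ directory")

    counts = {}
    for name in ("videos.txt", "prompts.txt"):
        path = root / name
        if not path.is_file():
            problems.append(f"missing {name}")
            continue
        # Assumption: one entry per non-empty line.
        with path.open(encoding="utf-8") as handle:
            counts[name] = sum(1 for line in handle if line.strip())

    # Only compare counts when both list files were readable.
    if len(counts) == 2 and counts["videos.txt"] != counts["prompts.txt"]:
        problems.append(
            f"videos.txt has {counts['videos.txt']} entries "
            f"but prompts.txt has {counts['prompts.txt']}"
        )
    return problems
```

An empty return value means the required pieces are present and the two list files have matching lengths. The optional instance.txt is deliberately not checked.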

2. Training

💡 Tip: You can adjust the number of GPUs used for training by modifying the num_processes value in finetune/accelerate_config_machine_single.yaml.
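
Editing the file by hand is usually enough, but if you script your runs, the change can be made programmatically. The sketch below is a hypothetical helper that rewrites the value in the YAML text using only the standard library. It assumes the config contains a line of the form `num_processes: <integer>`, which the tip above implies but which you should confirm in your copy of the file.

```python
import re

# Matches a line such as "num_processes: 2", keeping the key and spacing.
_NUM_PROCESSES = re.compile(r"^([ \t]*num_processes[ \t]*:[ \t]*)\d+", re.MULTILINE)


def set_num_processes(yaml_text, n):
    """Return yaml_text with the num_processes value replaced by n."""
    new_text, count = _NUM_PROCESSES.subn(
        lambda match: f"{match.group(1)}{int(n)}", yaml_text, count=1
    )
    if count == 0:
        raise ValueError("no 'num_processes: <integer>' line found")
    return new_text
```

To apply it, read finetune/accelerate_config_machine_single.yaml as text, pass the contents through `set_num_processes`, and write the result back. The rest of the file is left untouched.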

To start training:

cd finetune
bash finetune_ip.sh         # For subject-driven generation

# For other tasks:
# bash finetune_other_task.sh

Note: For tasks beyond subject-driven generation, you’ll need to modify the --purpose argument to specify the task type.

Citation:

Don't forget to cite this source if it proves useful in your research!

@misc{lin2025realgeneralunifyingvisualgeneration,
      title={RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models}, 
      author={Yijing Lin and Mengqi Huang and Shuhan Zhuang and Zhendong Mao},
      year={2025},
      eprint={2503.10406},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10406}, 
}