RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

July 16, 2025

https://github.com/user-attachments/assets/0c4448a4-93f3-4a63-acc7-488657439e37

Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao^†
University of Science and Technology of China
^†corresponding author

🎉🎉 Our paper, “RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models”, has been accepted by ICCV 2025! See our project page.

Requirements

Training was conducted on 2 A100 GPUs (80GB VRAM); inference was tested on 1 A100 GPU.

Setup

git clone https://github.com/Lyne1/Realgeneral.git
cd Realgeneral

Environment

All tests were conducted on Linux, and we recommend running the code on Linux. To set up the environment, run:

conda create -n RealGeneral python=3.10 -y
conda activate RealGeneral

bash env.sh

🔗 Checkpoints

CogVideoX-1.5 T2V Checkpoint: Download the pre-trained CogVideoX-1.5 (Text-to-Video) checkpoint from this link and place the entire folder under the pretrained_weights directory. The resulting structure should look like this:
```
./pretrained_weights/CogVideoX1.5-5B
```
RealGeneral Checkpoints: Download the pre-trained RealGeneral checkpoints from this link. We provide two LoRA checkpoints for different tasks:
- Subject-driven generation
- Canny-to-image translation
Place them under pretrained_weights:
```
./pretrained_weights/IP-LoRA
./pretrained_weights/Canny2image-LoRA
```
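
Before running inference, it can help to confirm that the three folders are where the scripts expect them. The following is a minimal sketch, not part of the repository: the folder names come from the layout above, while the helper name `missing_checkpoints` and the default root path are assumptions for illustration.

```python
from pathlib import Path

# Folder names taken from the layout above; the helper itself is hypothetical.
EXPECTED_DIRS = ("CogVideoX1.5-5B", "IP-LoRA", "Canny2image-LoRA")


def missing_checkpoints(root="./pretrained_weights"):
    """Return the expected checkpoint folders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint folders:", ", ".join(missing))
    else:
        print("All checkpoint folders found.")
```

Run it from the repository root so that the relative path resolves to the same pretrained_weights directory used above.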

🎨 Inference

cd inference
bash run_ip.sh          # For subject-driven image generation

# For other tasks:
# bash run_canny.sh

Note: For tasks other than subject-driven generation, a task-specific description is appended to the prompt automatically during inference. This is handled internally according to the task type, so you do not need to add it yourself. For example, a canny2image task will append “The image has the specific canny map” to your original prompt.

The supported task types and their corresponding additions are:

| Task Type | Appended Description |
|---|---|
| canny2image | The image has the specific canny map |
| depth2image | The image has the specific depth map |
| image2depth | The image has the specific depth map |
| deblurring | The image has a blur map |
| filling | The image has the specific filling map |
| coloring | The image has the specific grey map |
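
The prompt handling described above can be pictured as a simple lookup. This is only an illustrative sketch: the task names and descriptions are copied from the table, but the function name `build_prompt`, the single-space separator, and the pass-through behavior for unlisted tasks are assumptions, and the repository's internal logic may differ.

```python
# Descriptions copied from the table above; everything else is illustrative.
TASK_SUFFIXES = {
    "canny2image": "The image has the specific canny map",
    "depth2image": "The image has the specific depth map",
    "image2depth": "The image has the specific depth map",
    "deblurring": "The image has a blur map",
    "filling": "The image has the specific filling map",
    "coloring": "The image has the specific grey map",
}


def build_prompt(prompt, task=None):
    """Append the task description to the prompt, if the task has one.

    Assumption: the two parts are joined with a single space; the
    repository may use a different separator.
    """
    suffix = TASK_SUFFIXES.get(task)
    if suffix is None:
        # Subject-driven generation (or an unlisted task): prompt unchanged.
        return prompt
    return f"{prompt.rstrip()} {suffix}"
```

For instance, `build_prompt("A red car.", "canny2image")` yields the original prompt followed by the canny description, while `build_prompt("A red car.")` returns the prompt untouched.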

🏋️ Train on Your Own Data

1. Dataset Preparation

Your dataset directory should be structured as follows:

.
├── videos/                 # Folder containing video files
├── videos.txt              # List of video file paths
├── prompts.txt             # Text prompts for each video
└── instance.txt            # (Optional) Subject words for subject-driven generation
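
A quick consistency check on this layout can catch mismatched files before a training run starts. The sketch below is hypothetical and not part of the repository. It assumes that each list file holds one entry per line, that entry i in every file refers to the same sample, and that paths in videos.txt are relative to the dataset root; none of these are confirmed here, so adjust them to match the actual loader.

```python
from pathlib import Path


def _read_lines(path):
    """Return the non-empty, stripped lines of a text file."""
    text = path.read_text(encoding="utf-8")
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def check_dataset(root):
    """Return a list of problems found under root (empty if none).

    Assumptions: one entry per line, aligned across files, and video
    paths relative to the dataset root.
    """
    root = Path(root)
    problems = []
    for name in ("videos.txt", "prompts.txt"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    if problems:
        return problems

    videos = _read_lines(root / "videos.txt")
    prompts = _read_lines(root / "prompts.txt")
    if len(videos) != len(prompts):
        problems.append(f"{len(videos)} videos but {len(prompts)} prompts")

    instance = root / "instance.txt"  # optional file
    if instance.is_file():
        n = len(_read_lines(instance))
        if n != len(videos):
            problems.append(f"{len(videos)} videos but {n} instance entries")

    for rel in videos:
        if not (root / rel).is_file():
            problems.append(f"video not found: {rel}")
    return problems
```

Calling `check_dataset("path/to/dataset")` returns an empty list when the files line up under these assumptions.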

2. Training

💡 Tip: You can adjust the number of GPUs used for training by modifying the num_processes value in finetune/accelerate_config_machine_single.yaml.

To start training:

cd finetune
bash finetune_ip.sh         # For subject-driven generation

# For other tasks:
# bash finetune_other_task.sh

Note: For tasks beyond subject-driven generation, you’ll need to modify the --purpose argument to specify the task type.

Citation

If this work proves useful in your research, please cite it:

@misc{lin2025realgeneralunifyingvisualgeneration,
      title={RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models}, 
      author={Yijing Lin and Mengqi Huang and Shuhan Zhuang and Zhendong Mao},
      year={2025},
      eprint={2503.10406},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10406}, 
}