In-Video Instructions: Visual Signals as Generative Control
November 25, 2025
Gongfan Fang, Xinyin Ma, Xinchao Wang
National University of Singapore
Preprint 2025
“Visuals as Precise Control”
The textual prompt is fixed as "Follow the instructions step by step," while the model synthesizes content purely from the embedded visual signals within the input frames.
📖 Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction.
In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Because distinct instructions can be assigned to different objects, this enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions.
Figure 1. Overview of the proposed In-Video Instruction framework.
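To make the idea of encoding guidance in the visual domain concrete, the following is a minimal sketch of how per-object instructions (a text label plus an optional arrow) might be described and rendered as an overlay. It is not the authors' pipeline: the `Instruction` class, the `render_overlay` function, and the choice of SVG as the overlay format are all assumptions made for illustration, and compositing the overlay onto the first frame is left to whatever image tool is at hand. Only the fixed text prompt is taken from the caption above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from xml.sax.saxutils import escape

# Fixed text prompt quoted in the teaser caption; control comes from the overlay.
TEXT_PROMPT = "Follow the instructions step by step."


@dataclass
class Instruction:
    """Hypothetical in-frame instruction: a label and an optional arrow."""
    label: str
    anchor: Tuple[int, int]                     # pixel position of the label
    arrow_to: Optional[Tuple[int, int]] = None  # arrow tip, if an arrow is wanted


def render_overlay(width: int, height: int, instructions: List[Instruction]) -> str:
    """Return an SVG string with one <text> per instruction and one <line> per arrow."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        # Arrowhead marker reused by every arrow line.
        '<defs><marker id="head" markerWidth="6" markerHeight="6" '
        'refX="5" refY="3" orient="auto">'
        '<path d="M0,0 L6,3 L0,6 z" fill="red"/></marker></defs>',
    ]
    for ins in instructions:
        x, y = ins.anchor
        # escape() keeps characters such as "<" from breaking the markup.
        parts.append(
            f'<text x="{x}" y="{y}" fill="red" font-size="24">{escape(ins.label)}</text>'
        )
        if ins.arrow_to is not None:
            tx, ty = ins.arrow_to
            # The arrow starts at the label anchor and ends at arrow_to.
            parts.append(
                f'<line x1="{x}" y1="{y}" x2="{tx}" y2="{ty}" '
                'stroke="red" stroke-width="3" marker-end="url(#head)"/>'
            )
    parts.append("</svg>")
    return "\n".join(parts)


# Two subjects in one frame, each with its own instruction.
overlay = render_overlay(1280, 720, [
    Instruction("jump", (200, 300), arrow_to=(200, 150)),
    Instruction("turn left", (900, 400)),
])
```

In this sketch each subject receives its own label at its own location, which is the spatial correspondence that a single global text prompt cannot express. The frame with the overlay drawn on it, together with `TEXT_PROMPT`, would then be passed to an image-to-video model.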
🎥 Generative Results
[Video gallery: Examples 01 to 09]
[Video gallery: Examples 10 to 12 (Kling 2.5)]
📝 BibTeX
@article{fang2025invideo,
title={In-Video Instructions: Visual Signals as Generative Control},
author={Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
journal={arXiv preprint arXiv:2511.19401},
year={2025}
}