In-Video Instructions: Visual Signals as Generative Control

November 25, 2025

Gongfan Fang Xinyin Ma Xinchao Wang

National University of Singapore

Preprint 2025

“Visuals as Precise Control”
The textual prompt is fixed as "Follow the instructions step by step," while the model synthesizes content purely from the embedded visual signals within the input frames.

📖 Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction.

In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Because distinct instructions can be assigned to different objects, this enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions.
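To make the idea of encoding guidance in the visual domain concrete, here is a minimal sketch of drawing an arrow onto an input frame before it is handed to an image-to-video model. It is an illustration only, not the authors' pipeline: the frame is a plain grid of RGB tuples standing in for a real image, and the helper names (`blank_frame`, `draw_line`, `draw_arrow`) are hypothetical. Overlaid text is omitted because rendering glyphs would need an imaging library; in practice an image editor or such a library would likely be used for both.

```python
import math

# Fixed text prompt quoted on this page; all control lives in the frame itself.
FIXED_PROMPT = "Follow the instructions step by step"


def blank_frame(width, height, color=(0, 0, 0)):
    """Return a height x width grid of RGB tuples (a stand-in for a real image)."""
    return [[color for _ in range(width)] for _ in range(height)]


def draw_line(frame, start, end, color):
    """Rasterize a line with Bresenham's algorithm, clipping to the frame."""
    (x0, y0), (x1, y1) = start, end
    height, width = len(frame), len(frame[0])
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= x0 < width and 0 <= y0 < height:
            frame[y0][x0] = color
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += step_x
        if e2 <= dx:
            err += dx
            y0 += step_y


def draw_arrow(frame, start, end, color=(255, 0, 0), head=4):
    """Draw a shaft from start to end plus two short strokes forming the head."""
    draw_line(frame, start, end, color)
    angle = math.atan2(end[1] - start[1], end[0] - start[0])
    # Each head stroke points back from the tip, 30 degrees off the shaft.
    for offset in (math.radians(150), math.radians(-150)):
        stroke_end = (
            round(end[0] + head * math.cos(angle + offset)),
            round(end[1] + head * math.sin(angle + offset)),
        )
        draw_line(frame, end, stroke_end, color)
    return frame


# Hypothetical usage: tell the subject near (2, 8) to move right toward (20, 8).
frame = blank_frame(32, 16)
draw_arrow(frame, start=(2, 8), end=(20, 8))
```

Because the arrow is anchored at pixel coordinates, each subject in the frame can receive its own arrow, which is the kind of per-object correspondence that a single global text prompt cannot express as directly.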

Figure 1. Overview of the proposed In-Video Instruction framework.

🎥 Generative Results

Example 01	Example 02
Example 03	Example 04
Example 05	Example 06
Example 07	Example 08
Example 09	Example 10 (Kling 2.5)
Example 11 (Kling 2.5)	Example 12 (Kling 2.5)

📝 BibTeX

@article{fang2025invideo,
  title={In-Video Instructions: Visual Signals as Generative Control},
  author={Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.19401},
  year={2025}
}