In-Video Instructions: Visual Signals as Generative Control

November 25, 2025

Gongfan Fang    Xinyin Ma    Xinchao Wang

National University of Singapore

Preprint 2025

arXiv Project Page


Teaser Video

“Visuals as Precise Control”
The textual prompt is fixed as "Follow the instructions step by step," while the model synthesizes content purely from the embedded visual signals within the input frames.


📖 Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction.

In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects.

Method Framework

Figure 1. Overview of the proposed In-Video Instruction framework.
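
To make the idea of embedding instructions in the input frame concrete, here is a minimal sketch of how overlaid text and arrows could be authored. It is not the authors' tooling: it assumes the instructions are described as an SVG overlay that is later rasterized onto the first frame with any image tool. The names `TextInstruction`, `ArrowInstruction`, and `build_overlay_svg` are hypothetical, as are the color, font size, and stroke width. Only the fixed text prompt is taken from the description above.

```python
from dataclasses import dataclass
from xml.etree import ElementTree as ET

# Fixed text prompt quoted on this page; all control comes from the overlay.
TEXT_PROMPT = "Follow the instructions step by step"


@dataclass
class TextInstruction:
    """Hypothetical: a short command written next to a subject."""
    text: str
    x: int
    y: int


@dataclass
class ArrowInstruction:
    """Hypothetical: an arrow or trajectory as a polyline of pixel coordinates."""
    points: list  # [(x, y), ...], at least two points


def build_overlay_svg(width, height, instructions, color="red"):
    """Return an SVG string holding the visual instructions for one frame."""
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(width), height=str(height),
                     viewBox=f"0 0 {width} {height}")
    # One arrowhead marker, reused by every arrow.
    defs = ET.SubElement(svg, "defs")
    marker = ET.SubElement(defs, "marker", id="head", markerWidth="10",
                           markerHeight="10", refX="9", refY="5", orient="auto")
    ET.SubElement(marker, "path", d="M0,0 L10,5 L0,10 z", fill=color)

    for ins in instructions:
        if isinstance(ins, TextInstruction):
            # Text placed near its subject ties the command to that subject.
            node = ET.SubElement(svg, "text", x=str(ins.x), y=str(ins.y), fill=color)
            node.set("font-size", "24")
            node.text = ins.text
        elif isinstance(ins, ArrowInstruction):
            if len(ins.points) < 2:
                raise ValueError("an arrow needs at least two points")
            pts = " ".join(f"{x},{y}" for x, y in ins.points)
            node = ET.SubElement(svg, "polyline", points=pts, fill="none", stroke=color)
            node.set("stroke-width", "4")
            node.set("marker-end", "url(#head)")
        else:
            raise TypeError(f"unsupported instruction: {ins!r}")
    return ET.tostring(svg, encoding="unicode")


# Illustrative usage: two subjects, each with its own instruction.
overlay = build_overlay_svg(1280, 720, [
    TextInstruction("wave the hand", 120, 80),               # made-up command
    ArrowInstruction([(600, 400), (700, 350), (820, 360)]),  # made-up trajectory
])
```

Because each instruction carries its own coordinates, different objects in the same frame can receive different commands, which is the spatial correspondence the abstract describes. The composited frame would then be passed to an image-to-video model together with `TEXT_PROMPT`; that step depends on the model and is not shown here.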


🎥 Generative Results


Example 01

Example 02

Example 03

Example 04

Example 05

Example 06

Example 07

Example 08

Example 09

Example 10 (Kling 2.5)

Example 11 (Kling 2.5)

Example 12 (Kling 2.5)

📝 BibTeX

@article{fang2025invideo,
  title={In-Video Instructions: Visual Signals as Generative Control},
  author={Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.19401},
  year={2025}
}