From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition (ICCV2025)
September 25, 2025
Abstract: Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions, through introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions.
Highlights
Example of Video Generation with Attribute Transitions Using the Same Base Model. The base model generates static appearances throughout the video. Prompt interpolation leads to inconsistencies, such as abrupt changes in the buildings, while our method ensures smoother and more consistent attribute transitions.
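For readers unfamiliar with the prompt-interpolation baseline mentioned above, the following is a minimal sketch of one common reading of it: the text embeddings of the initial and final prompts are blended linearly, one weight per frame. It is not the method proposed in this work, and the function names and the toy 2-D vectors are illustrative assumptions rather than code from this repository.

```python
from typing import List


def frame_weights(num_frames: int) -> List[float]:
    """Linear schedule from 0.0 (first frame) to 1.0 (last frame)."""
    if num_frames < 2:
        raise ValueError("need at least two frames to describe a transition")
    return [f / (num_frames - 1) for f in range(num_frames)]


def interpolate_embeddings(
    start: List[float], end: List[float], num_frames: int
) -> List[List[float]]:
    """Blend two prompt embeddings frame by frame (baseline, not the proposed method)."""
    if len(start) != len(end):
        raise ValueError("embeddings must have the same dimensionality")
    return [
        [(1.0 - w) * s + w * e for s, e in zip(start, end)]
        for w in frame_weights(num_frames)
    ]


# Toy 2-D vectors standing in for real text-encoder outputs (hypothetical values).
per_frame = interpolate_embeddings([0.0, 1.0], [1.0, 0.0], num_frames=5)
print(per_frame[0], per_frame[-1])
```

Each frame is conditioned on its own blended embedding, which is why nothing in this baseline ties the content of neighbouring frames together.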
Key Contributions:
- Prompt2Progress: A novel method for text-to-video generation that improves temporal consistency for attribute transitions without further training.
- CATbench: A comprehensive benchmark for evaluating temporal consistency in video generation with attribute transitions, using two novel metrics: Wholistic Transition Score and Frame-wise Transition Score.
- Extensive Evaluation: Systematic comparison using diverse prompts across multiple attributes on video diffusion models.
Quick Start
Installation
# Clone the repository
git clone https://github.com/lynn-ling-lo/Prompt2Progression.git
cd Prompt2Progression
# Create conda environment
conda create -n p2p python=3.8.5
conda activate p2p
# Install dependencies
pip install -r requirements.txt
Quick Demo
Prompt2Progress Text-to-Video Generation with Attribute Transition
cd prompt2progress
# Download pretrained VideoCrafter2 weights
# Put the model.ckpt in checkpoints/base_512_v2/model.ckpt
wget https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt -P checkpoints/base_512_v2/
# Or manually download and place model.ckpt in checkpoints/base_512_v2/
# Run text-to-video generation
sh scripts/run_text2video_mp.sh
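Before launching the script, it can help to confirm that the checkpoint landed where the comments above say it should. The helper below is a small sketch, not part of the repository; `checkpoint_ready` is a hypothetical name, and the default path is taken from the download step above and assumes the check runs from inside the prompt2progress directory.

```python
from pathlib import Path


def checkpoint_ready(path: str = "checkpoints/base_512_v2/model.ckpt") -> bool:
    """Return True if the checkpoint file exists and is not empty."""
    ckpt = Path(path)
    return ckpt.is_file() and ckpt.stat().st_size > 0


if __name__ == "__main__":
    if checkpoint_ready():
        print("checkpoint found")
    else:
        print("checkpoint missing or empty; download model.ckpt first")
```

A size check is included because an interrupted download can leave an empty file behind.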
CATbench Evaluation
cd CATbench
# Evaluate temporal consistency using our metrics
python evaluate.py --videos_path /path/to/videos \
    --prompt_file /path/to/prompt/file
Results
Quantitative Results
| Method | Wholistic Transition Score ↑ | Frame-wise Transition Score ↑ |
|---|---|---|
| AnimateDiff | 0.0082 | 0.0004 |
| Modelscope | 0.0042 | 0.0001 |
| Latte | 0.0019 | -0.0002 |
| VideoCrafter2 | 0.0022 | 0.0003 |
| Free-Bloom | 0.1077 | -0.0020 |
| VideoTetris | 0.0134 | 0.0012 |
| Gen-L | 0.1166 | 0.0135 |
| FreeNoise | 0.0578 | 0.0066 |
| Ours | 0.1486 | 0.0201 |
Qualitative Results
A bustling cityscape at midday, bathing in bright sunlight -> A nighttime cityscape under heavy rain
A ship sailing on the ocean in the dark light -> A ship sailing on the ocean in the bright light
Usage
Prompt File Format
Both Prompt2Progression inference and CATbench evaluation use the same prompt format:
[initial state prompt];[final state prompt];[neutral prompt]
Example (age.txt):
a young girl is rowing a boat; an old girl is rowing a boat; a girl is rowing a boat
a young man walking in the park; an old man walking in the park; a man walking in the park
a young woman reading a book; an old woman reading a book; a woman reading a book
Example (weather.txt):
a house in sunny weather; a house in rainy weather; a house
a car driving in clear sky; a car driving in stormy weather; a car driving
a garden on a bright day; a garden on a cloudy day; a garden
Prompt Guidelines:
- Use semicolons (;) to separate the three components
- Initial state: Starting condition/attribute
- Final state: Target condition/attribute to transition to
- Neutral prompt: Base prompt without specific attributes
- Keep consistent actions/objects across all three prompts
- One prompt triplet per line
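The guidelines above can be checked mechanically before running inference or evaluation. The following sketch parses the triplet format described here; `PromptTriplet`, `parse_prompt_line`, and `parse_prompt_text` are hypothetical helper names and are not part of the repository, which may read the file differently.

```python
from typing import List, NamedTuple


class PromptTriplet(NamedTuple):
    initial: str  # starting attribute state
    final: str    # target attribute state
    neutral: str  # same scene without the attribute


def parse_prompt_line(line: str) -> PromptTriplet:
    """Split one 'initial;final;neutral' line into its three prompts."""
    parts = [p.strip() for p in line.split(";")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 3 non-empty prompts separated by ';', got: {line!r}"
        )
    return PromptTriplet(*parts)


def parse_prompt_text(text: str) -> List[PromptTriplet]:
    """Parse the contents of a prompt file, skipping blank lines."""
    return [parse_prompt_line(ln) for ln in text.splitlines() if ln.strip()]


example = "a house in sunny weather; a house in rainy weather; a house\n"
triplets = parse_prompt_text(example)
print(triplets[0].neutral)
```

Whitespace around each semicolon is stripped, so lines with uneven spacing still parse to the same three prompts.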
Prompt2Progress Inference
Generate videos with temporal consistency using your prompts:
cd prompt2progress
# 1. Configure hyperparameters in the script
# Edit scripts/run_text2video_mp.sh to set:
# - Input prompt file path
# - Output directory
# 2. Run inference
sh scripts/run_text2video_mp.sh
CATbench Evaluation
Evaluate temporal consistency of generated videos:
cd CATbench
# Evaluate with custom prompts
python evaluate.py --prompt_file custom_prompts.txt \
    --videos_path /path/to/your/videos
Citation
If you find our work useful for your research, please consider citing:
@inproceedings{lo2025p2p,
title={From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition},
author={Lo, Ling and Chan, Kelvin CK and Cheng, Wen-Huang and Yang, Ming-Hsuan},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}
Acknowledgments
We extend our heartfelt gratitude to the open-source community whose foundational work enabled this research. We particularly thank:
⭐ If this work is helpful for your research, please consider giving us a star! ⭐