February 28, 2026
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Junhao Cheng¹†,
Liang Hou²,
Xin Tao²,
Jing Liao¹
¹City University of Hong Kong, ²Kling Team, Kuaishou Technology
† This work was conducted during the author's internship at Kling Team, Kuaishou Technology
🔎 Introduction
We pioneer Video-Next-Event Prediction (VNEP), extending text-based next-event prediction to dynamic video responses. This shift from telling to showing enables more intuitive and customized answers for procedural learning and creative exploration.
To tackle VNEP, we propose VANS, a model that aligns a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) through our Joint-GRPO post-training approach. Our method bridges the semantic-to-visual gap between the VLM and the VDM, enabling high-quality video event prediction and generation.
🏗️ Method
VANS Architecture: Dual-path processing with VLM for reasoning and VDM for generation
Joint-GRPO: Two-stage co-steering optimization
Key Components
VANS Architecture: Processes input videos and questions through dual pathways:
- VLM Path: Performs instruction-grounded reasoning to generate textual captions
- VDM Path: Synthesizes videos conditioned on semantic captions and visual context
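The data flow through the two paths can be sketched as follows. This is a minimal illustration, not the VANS implementation: `answer_with_video`, `VansAnswer`, `toy_vlm`, and `toy_vdm` are hypothetical names, and the toy models only stand in for real checkpoints so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = bytes  # placeholder type for one decoded video frame (assumption)


@dataclass
class VansAnswer:
    caption: str         # text produced by the VLM path
    frames: List[Frame]  # video produced by the VDM path


def answer_with_video(
    input_frames: Sequence[Frame],
    question: str,
    vlm: Callable[[Sequence[Frame], str], str],
    vdm: Callable[[str, Sequence[Frame]], List[Frame]],
) -> VansAnswer:
    """Chain the two paths: reason in text first, then render the video."""
    # VLM path: instruction-grounded reasoning yields a caption of the next event.
    caption = vlm(input_frames, question)
    # VDM path: synthesis conditioned on the caption and on the visual context.
    frames = vdm(caption, input_frames)
    return VansAnswer(caption=caption, frames=frames)


# Hypothetical stand-in models so the sketch runs without any checkpoints.
def toy_vlm(frames: Sequence[Frame], question: str) -> str:
    return f"next event for: {question}"


def toy_vdm(caption: str, context_frames: Sequence[Frame]) -> List[Frame]:
    # Repeat the last context frame to mimic conditioning on visual context.
    return [context_frames[-1]] * 4


result = answer_with_video([b"f0", b"f1"], "What should I do next?", toy_vlm, toy_vdm)
print(result.caption, len(result.frames))
```

Passing the models in as callables keeps the sketch independent of any particular VLM or VDM interface.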
Joint-GRPO: Our two-stage reinforcement learning approach:
- Stage 1: Visualization-friendly VLM tuning - optimizes captions for visual plausibility
- Stage 2: Context-faithful VDM adaptation - ensures semantic alignment and visual coherence
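For readers new to GRPO, the sketch below shows the group-relative advantage normalization that GRPO-style methods generally use: each sampled output is scored, and its advantage is its reward relative to the other samples in the same group. The reward values are made up for illustration, the use of the population standard deviation is an assumption, and this README does not specify how Joint-GRPO scores captions or videos in either stage.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group to zero mean and unit scale."""
    mu = mean(rewards)
    # Population standard deviation is an assumption; some implementations
    # use the sample standard deviation instead.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
print(advantages)
```

Samples that score above the group mean receive positive advantages and are reinforced; samples below the mean are discouraged.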
🚩 Plan
- Release training code
- Release VANS-Data-100K construction pipelines
- Release VANS model checkpoints
- Release inference code
- Release paper
🎬 Results
🍳 Procedural Teaching
| Case | Input Video | Question | VANS Output |
|---|---|---|---|
| 1 | ![]() | "Show me the next step for baked chicken Parmesan." | ![]() |
| 2 | ![]() | "Hi, I want to make slime. What should I do next?" | ![]() |
| 3 | ![]() | "Hey AI assistant, I'm making a paper windmill and just uploaded a video. What should I do next?" | ![]() |
🔮 Multi-Future Prediction
Same input video, different questions lead to diverse future predictions:
| Input Video | Question | VANS Output |
|---|---|---|
| ![]() | "What if she gets burned in her daily life?" | ![]() |
| (same video) | "What if she gets burned in an exaggerated movie?" | ![]() |
| (same video) | "What if she eats something spicy in an exaggerated movie?" | ![]() |

| Input Video | Question | VANS Output |
|---|---|---|
| ![]() | "Show her reaction if she sees her grandson." | ![]() |
| (same video) | "Show her reaction if she sees her husband." | ![]() |
| (same video) | "Show her reaction if she sees the personification of death." | ![]() |
🚀 Quick Start
🎯 Environment Setup
To set up the environment for inference, run the following commands:
git clone https://github.com/KlingTeam/VANS.git
cd VANS
conda create -n VANS python==3.12 -y
conda activate VANS
pip install -r requirements.txt
cd vans/models_mllm/qwen-vl-utils
pip install -e ".[decord]"
cd ../../..
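After installation, a quick sanity check can confirm that the interpreter and key packages are visible in the active environment. This is a minimal sketch: `check_environment` is a hypothetical helper, and the module names `torch`, `gradio`, and `qwen_vl_utils` are assumptions about what the setup installs, so adjust the list to match `requirements.txt`.

```python
import importlib.util
import sys
from typing import Dict, Iterable, Tuple


def check_environment(
    modules: Iterable[str],
    expected_python: Tuple[int, int] = (3, 12),
) -> Dict[str, bool]:
    """Report whether the Python version matches and each module can be found."""
    report = {"python": tuple(sys.version_info[:2]) == tuple(expected_python)}
    for name in modules:
        # find_spec returns None for a top-level module that is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    # Assumed module names; replace with the packages you actually need.
    for item, ok in check_environment(["torch", "gradio", "qwen_vl_utils"]).items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```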
🌎 Download Models
To get started, download the VANS base models:
- Qwen2.5-VL-3B - The Vision-Language Model
- Wan2.1-T2V-1.3B - The Video Diffusion Model
Then download the complete VANS model:
VANS Model Download
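If you prefer to fetch the base models from a script, the sketch below uses `snapshot_download` from `huggingface_hub`. The repository IDs and the `checkpoints/` layout are assumptions (this README only names the models), so check them against the download links and against the paths the VANS code expects. `plan_downloads` and `download_all` are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict

# Assumed Hugging Face repo IDs; the README only gives the model names.
BASE_MODELS: Dict[str, str] = {
    "vlm": "Qwen/Qwen2.5-VL-3B-Instruct",
    "vdm": "Wan-AI/Wan2.1-T2V-1.3B",
}


def plan_downloads(root: str = "checkpoints") -> Dict[str, Path]:
    """Map each repo ID to a (hypothetical) local directory under root."""
    return {
        repo_id: Path(root) / repo_id.split("/")[-1]
        for repo_id in BASE_MODELS.values()
    }


def download_all(root: str = "checkpoints") -> None:
    """Download every base model into its planned directory."""
    # Imported lazily so plan_downloads works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(target))


print(plan_downloads())
# download_all()  # uncomment to fetch the weights (requires network access)
```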
🧸 Demo
To run the local Gradio demo:
python app.py
📜 Citation
If you find our work helpful, please consider giving us a star 🌟 and citing our paper 📝
@article{cheng2025video,
title={Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO},
author={Cheng, Junhao and Hou, Liang and Tao, Xin and Liao, Jing},
journal={arXiv preprint arXiv:2511.16669},
year={2025}
}











