February 28, 2026

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Junhao Cheng¹†, Liang Hou², Xin Tao², Jing Liao¹
¹City University of Hong Kong ²Kling Team, Kuaishou Technology
† This work was conducted during the author's internship at Kling Team, Kuaishou Technology

🔎 Introduction

We pioneer Video-Next-Event Prediction (VNEP), extending text-based next-event prediction to dynamic video responses. This shift from telling to showing enables more intuitive and customized answers for procedural learning and creative exploration.

To tackle VNEP, we propose VANS, a model that aligns a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) through our Joint-GRPO post-training approach. Our method bridges the semantic-to-visual gap between the VLM and the VDM, enabling high-quality video event prediction and generation.

🏗️ Method

VANS Architecture: Dual-path processing with VLM for reasoning and VDM for generation

Joint-GRPO: Two-stage co-steering optimization

Key Components

VANS Architecture: Processes input videos and questions through dual pathways:

VLM Path: Performs instruction-grounded reasoning to generate textual captions
VDM Path: Synthesizes videos conditioned on semantic captions and visual context
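The data flow between the two paths can be illustrated with a minimal sketch. Everything named here (`predict_next_event`, `VansOutput`, the `fake_vlm` and `fake_vdm` stand-ins, and the `Frame` type) is hypothetical and chosen for illustration; it is not the repository's API, and the real models are replaced by stubs so the example runs without checkpoints.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = bytes  # hypothetical placeholder for one decoded video frame


@dataclass
class VansOutput:
    caption: str         # textual next-event description from the VLM path
    frames: List[Frame]  # generated next-event video from the VDM path


def predict_next_event(
    input_frames: Sequence[Frame],
    question: str,
    vlm_caption: Callable[[Sequence[Frame], str], str],
    vdm_generate: Callable[[str, Sequence[Frame]], List[Frame]],
) -> VansOutput:
    """Run the two paths in sequence: reason in text, then render as video."""
    # VLM path: instruction-grounded reasoning over the video and the question.
    caption = vlm_caption(input_frames, question)
    # VDM path: synthesis conditioned on the caption and the visual context.
    frames = vdm_generate(caption, input_frames)
    return VansOutput(caption=caption, frames=frames)


# Stand-in models (assumptions) so the sketch runs without any checkpoints.
def fake_vlm(frames: Sequence[Frame], question: str) -> str:
    return f"next step for: {question}"


def fake_vdm(caption: str, context: Sequence[Frame]) -> List[Frame]:
    # Repeat the last context frame four times as a dummy "video".
    return list(context[-1:]) * 4


result = predict_next_event([b"f0", b"f1"], "make slime", fake_vlm, fake_vdm)
print(result.caption, len(result.frames))
```

Passing the two models in as callables keeps the sketch independent of any particular VLM or VDM implementation.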

Joint-GRPO: Our two-stage reinforcement learning approach:

Stage 1: Visualization-friendly VLM tuning - optimizes captions for visual plausibility
Stage 2: Context-faithful VDM adaptation - ensures semantic alignment and visual coherence
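Both stages build on GRPO, which scores each sampled output relative to the other samples drawn for the same input. The sketch below shows only that generic group-relative normalization, not the Joint-GRPO reward design, which this README does not describe. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each sample's reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants exist
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from the same input.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(advantages)
```

Samples that score above their group's mean get a positive advantage and are reinforced; samples below it get a negative one. In Stage 1 the group would consist of captions sampled from the VLM, and in Stage 2 of videos sampled from the VDM.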

Case	Input Video	Question	VANS Output
1		"Show me the next step for baked chicken Parmesan."
2		"Hi, I want to make slime. What should I do next?"
3		"Hey AI assistant, I'm making a paper windmill and just uploaded a video. What should I do next?"

🔮 Multi-Future Prediction

Given the same input video, different questions lead to different future predictions:

Input Video

"What if she gets burned in her daily life?"	"What if she gets burned in an exaggerated movie?"	"What if she eats something spicy in an exaggerated movie?"

Input Video

"Show her reaction if she sees her grandson."	"Show her reaction if she sees her husband."	"Show her reaction if she sees the personification of death."

🚀 Quick Start

🎯 Environment Setup

To set up the environment for inference, run the following commands:

git clone https://github.com/KlingTeam/VANS.git
cd VANS

conda create -n VANS python==3.12 -y
conda activate VANS

pip install -r requirements.txt
cd vans/models_mllm/qwen-vl-utils
pip install -e ".[decord]"
cd ../../..
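After installation, a quick import check can confirm that the environment is usable. This is a minimal sketch, not part of the repository. The module list is an assumption: `qwen_vl_utils` and `decord` follow from the install commands above and `gradio` from the demo, while `torch` is a guess at what `requirements.txt` contains.

```python
import importlib.util
import sys
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    # Assumed module names; adjust to match requirements.txt.
    expected = ["torch", "qwen_vl_utils", "decord", "gradio"]
    for name, found in check_modules(expected).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for large packages.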

🌎 Download Models

To get started, download the VANS base models:

Qwen2.5-VL-3B - The Vision-Language Model
Wan2.1-T2V-1.3B - The Video Diffusion Model

Then download the complete VANS model:
VANS Model Download
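If the base models are fetched from the Hugging Face Hub, the download can be scripted with `huggingface_hub.snapshot_download`. This is a sketch under stated assumptions: the repo IDs below are guesses at where the two named base models live, the `checkpoints` directory is hypothetical, and the VANS model itself is left out because its location is not given here. Check the repository's own download links and expected paths before relying on it.

```python
from pathlib import Path
from typing import Dict

# Assumed Hugging Face repo IDs; the README names the models, not their hub paths.
BASE_MODELS: Dict[str, str] = {
    "Qwen2.5-VL-3B": "Qwen/Qwen2.5-VL-3B-Instruct",
    "Wan2.1-T2V-1.3B": "Wan-AI/Wan2.1-T2V-1.3B",
}


def plan_downloads(root: str = "checkpoints") -> Dict[str, Path]:
    """Return the local target directory for each assumed repo ID."""
    return {repo_id: Path(root) / name for name, repo_id in BASE_MODELS.items()}


def download_all(root: str = "checkpoints") -> None:
    """Download every planned repo; requires huggingface_hub and network access."""
    # Imported lazily so the plan can be inspected without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(target))


# Print the plan only; call download_all() explicitly to start the transfer.
for repo_id, target in plan_downloads().items():
    print(f"{repo_id} -> {target}")
```

Separating the plan from the transfer makes it possible to review the target paths before several gigabytes are downloaded.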

🧸 Demo

To run the local Gradio demo:

python app.py

📜 Citation

If you find our work helpful, please consider giving the repository a star 🌟 and citing our paper 📝

@article{cheng2025video,
  title={Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO},
  author={Cheng, Junhao and Hou, Liang and Tao, Xin and Liao, Jing},
  journal={arXiv preprint arXiv:2511.16669},
  year={2025}
}