🎬 Fostering Video Reasoning via Next-Event Prediction

August 9, 2025

🚀 Toward Video Reasoning via Future Prediction 🌟

Welcome to the official repository for Fostering Video Reasoning via Next-Event Prediction! 🚀
Read our paper on arXiv: 📖 2505.22457
Browse the dataset on Hugging Face: 📂 V1-33K

Video Reasoning via Future Prediction

To advance the reasoning ability of multimodal LLMs, we introduce a future prediction task and its corresponding dataset. Predicting upcoming events from historical video data remains a significant challenge for current multimodal LLMs. Our task asks a model to infer future events from the first part of a video, with the second part serving as open-ended ground truth, which makes the task self-supervised.

🤔 Why isn’t factual answering ideal for video reasoning?
Research indicates that reasoning models like DeepSeek R1 often “over-think”, which can lead to hallucinations. When applied to video data, similar pitfalls emerge if the model is restricted to answering straightforward factual questions. For instance, querying “Where is the cat in the video?” might prompt an overly extended reasoning process, inadvertently increasing the risk of hallucinated outputs.

💡 Why is future prediction a compelling case for video reasoning?
Much like Doctor Strange’s foresight in Avengers: Infinity War (2018), predicting the future demands reasoning over multiple potential outcomes. This challenge is analogous to techniques such as Monte Carlo tree search (MCTS), which systematically explores a wide array of possible scenarios. The inherent complexity of future prediction makes it a powerful task for evaluating and enhancing video reasoning capabilities.

📽️ Video Future Prediction: A Self-Supervised Task for Multimodal Reasoning
This task is a form of Self-Supervised Learning (SSL) that leverages the causal logic already present in video data. By dividing videos into sequential segments, we create implicit labels that embody the natural flow of cause and effect, so models can learn from the logical progression of events without manual annotations.

Much like Image Contrastive Learning, which uses inherent data structures to construct labels and guide what a model should capture, Video Future Prediction is grounded in the philosophy that real-world events unfold through a chain of cause and effect. It drives the model to focus on the temporal and causal dimensions that underpin real-world scenarios, enhancing multimodal reasoning capabilities. By integrating visual cues, the model develops a holistic reasoning ability to more accurately predict and interpret the progression of complex events.

Moreover, as with other self-supervised and unsupervised learning tasks, data construction is relatively cheap, which makes this a scalable way to enhance multimodal reasoning capabilities.
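The label construction described above can be illustrated with a minimal sketch. It only computes the timestamps of an observed span and a held-out span for one video. The names `NextEventSample`, `split_video`, and `observed_ratio` are hypothetical, and the 50/50 cut is an assumption for illustration, not necessarily how V1-33K was built.

```python
from dataclasses import dataclass


@dataclass
class NextEventSample:
    """One self-supervised sample: observed span as input, held-out span as target."""
    observed: tuple[float, float]  # (start, end) in seconds shown to the model
    held_out: tuple[float, float]  # (start, end) in seconds used as ground truth


def split_video(duration_s: float, observed_ratio: float = 0.5) -> NextEventSample:
    """Cut a video of length duration_s at observed_ratio (hypothetical parameter)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not 0.0 < observed_ratio < 1.0:
        raise ValueError("observed_ratio must be strictly between 0 and 1")
    cut = duration_s * observed_ratio
    # No manual annotation is needed: the later span itself acts as the label.
    return NextEventSample(observed=(0.0, cut), held_out=(cut, duration_s))


sample = split_video(60.0, observed_ratio=0.5)
print(sample.observed, sample.held_out)  # (0.0, 30.0) (30.0, 60.0)
```

The point of the sketch is that the supervision signal comes from the video's own temporal order, so adding more videos adds more training pairs at little extra cost.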

📦 Features

🔍 Next-Event Prediction for video reasoning
🎓 Demo scripts for instruction tuning & reinforcement learning
🛠️ Easy to use with LLaMA-Factory & EasyR1

🐍 Setup

1. Create a Conda environment

```
conda create -n video_llm python=3.10 -y
conda activate video_llm
```

2. Download the V1-33K dataset

```
python v1_data_download.py
```

You should now see a folder named V1-33K/ containing:

first_part_video/

video_dataset/
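If the download was interrupted, one of these folders may be missing. The following is a small optional check, assuming the two folder names listed above and that it is run from the directory containing V1-33K/. The helper name `missing_subdirs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the listing above.
EXPECTED_SUBDIRS = ("first_part_video", "video_dataset")


def missing_subdirs(root: str, expected=EXPECTED_SUBDIRS) -> list[str]:
    """Return the expected subfolder names that are absent under root."""
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subdirs("V1-33K")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("V1-33K layout looks complete.")
```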

🔧 LLaMA-Factory Integration

Clone the repo

```
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
```

Install dependencies

```
pip install -e ".[torch,metrics]" --no-build-isolation
```

🗄️ Preparing Next-Event Prediction Data

```
# From the project root
python video_data_generation.py
```

The generated data will be placed in ./LLaMA-Factory/data/

Move necessary files

```
mv dataset_info.json LLaMA-Factory/data/
mv qwen2_5vl_7B_full_sft_5K.yaml LLaMA-Factory/examples/train_full/
```
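After moving the files, it can help to confirm that every dataset registered in dataset_info.json points at a file that actually exists in LLaMA-Factory/data/. The sketch below assumes the usual LLaMA-Factory layout, where each top-level key is a dataset name and local datasets carry a `file_name` field. The helper `missing_dataset_files` is hypothetical and not part of either repository.

```python
import json
from pathlib import Path


def missing_dataset_files(data_dir: str) -> dict[str, str]:
    """Map dataset name -> file_name for entries whose local file is absent."""
    base = Path(data_dir)
    with open(base / "dataset_info.json", encoding="utf-8") as f:
        info = json.load(f)
    missing = {}
    for name, entry in info.items():
        file_name = entry.get("file_name")
        # Entries without file_name (for example hub-hosted datasets) are skipped.
        if file_name and not (base / file_name).exists():
            missing[name] = file_name
    return missing


if __name__ == "__main__":
    data_dir = "LLaMA-Factory/data"
    if (Path(data_dir) / "dataset_info.json").exists():
        print(missing_dataset_files(data_dir) or "All registered files found.")
```

An empty result means every locally registered dataset file is in place. A non-empty result names the entries to regenerate or move.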

🚀 Demo Training

Instruction Tuning
```
bash video_instruction_tuning_demo.sh
```

🤖 Reinforcement Learning with GRPO

Install the RL environment
```
cd EasyR1
pip install -e .
```
Run the GRPO training demo
```
bash video_GRPO_training_demo.sh
```

🔥 Evaluation

We run all evaluations with lmms-eval. In addition to the benchmarks already implemented in lmms-eval, we have integrated our FutureBench and SeedBench-R1 into it. To start,

Install lmms-eval

```
# eval with lmms-eval
cd third_party/lmms-eval
pip install -e .
```

Preparing Dataset

The futurebench.json file should also be in the V1-33K/ folder.
```
# make dataset from futurebench.json 
python gen_dataset.py
```
Run the inference

Before running the following eval script, check that the dataset_path and cache_dir values in third_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yaml are correct.
```
bash third_party/lmms-eval/examples/eval_futurebench.sh
```
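The path check mentioned before the eval script can be done by hand, or with a quick scan like the sketch below. It reads the YAML file as plain text and prints whatever follows the `dataset_path` and `cache_dir` keys. It is deliberately not a full YAML parser: it assumes each key sits on its own line and does not strip trailing comments. The helper `read_path_settings` is hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "dataset_path: /some/dir" or "  cache_dir: name".
KEY_PATTERN = re.compile(r"^\s*(dataset_path|cache_dir)\s*:\s*(.*?)\s*$")


def read_path_settings(yaml_path: str) -> dict[str, str]:
    """Collect the raw values of dataset_path and cache_dir from a task YAML."""
    settings = {}
    for line in Path(yaml_path).read_text(encoding="utf-8").splitlines():
        match = KEY_PATTERN.match(line)
        if match:
            # Drop surrounding quotes so the value can be compared as a path.
            settings[match.group(1)] = match.group(2).strip("'\"")
    return settings


if __name__ == "__main__":
    cfg = Path("third_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yaml")
    if cfg.exists():
        for key, value in read_path_settings(str(cfg)).items():
            print(f"{key}: {value}")
```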
To run evaluations on other benchmarks, see the additional settings in third_party/lmms-eval/examples/.

📚 Citation

If you find this repository useful, please cite our paper:

```
@misc{wang2025fosteringvideoreasoningnextevent,
      title={Fostering Video Reasoning via Next-Event Prediction},
      author={Haonan Wang and Hongfu Liu and Xiangyan Liu and Chao Du and Kenji Kawaguchi and Ye Wang and Tianyu Pang},
      year={2025},
      eprint={2505.22457},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.22457},
}
```

😊 Happy exploring & feel free to open an issue or pull request! 🎉