π¬ Fostering Video Reasoning via Next-Event Prediction
August 9, 2025 Β· View on GitHub
π¬ Fostering Video Reasoning via Next-Event Prediction
Welcome to the official repository for Fostering Video Reasoning via Next-Event Prediction! π
Read our paper on arXiv: π 2505.22457
Browse the dataset on Hugging Face: π V1-33K
Video Reasoning via Future Prediction
To advance multimodal LLMs' reasoning ability, we introduce a future prediction task and its corresponding dataset. Predicting upcoming events from historical video data presents significant challenges for current multimodal LLMs. Our task pushes these models to infer future events based on the first part of a video, with the second part serving as open-ended ground truth (Self-Supervised Learning).
π€ Why isnβt factual answering ideal for video reasoning?
Research indicates that reasoning models like DeepSeek R1 often βover-thinkβ, which can lead to hallucinations. When applied to video data, similar pitfalls emerge if the model is restricted to answering straightforward factual questions. For instance, querying βWhere is the cat in the video?β might prompt an overly extended reasoning process, inadvertently increasing the risk of hallucinated outputs.
π‘ Why is future prediction a compelling case for video reasoning?
Much like Doctor Strangeβs foresight inAvengers 3: Infinity War (2018), predicting the future demands reasoning over multiple potential outcomes. This challenge is analogous to techniques such as Monte Carlo tree search (MCTS), which systematically explores a wide array of possible scenarios. The inherent complexity of future prediction makes it a powerful task for evaluating and enhancing video reasoning capabilities.
π½οΈ Video Future Prediction: A Self-Supervised Task for Multimodal Reasoning
This task is inherently Self-Supervised Learning (SSL). It leverages the inherent causal logic present in video data. By dividing videos into sequential segments, we create implicit labels that embody the natural flow of cause and effectβallowing models to learn from the logical progression of events without the need for manual annotations.Much like
Image Contrastive Learning, which uses inherent data structures to construct labels and guide what a model should capture,Video Future Predictionis grounded in the philosophy that real-world events unfold through a chain of cause and effect. It drives the model to focus on the temporal and causal dimensions that underpin real-world scenarios, enhancing multimodal reasoning capabilities. By integrating visual cues, the model develops a holistic reasoning ability to more accurately predict and interpret the progression of complex events.Moreover, like other self-supervised learning tasks and unsupervised learning, the data construction is relatively cheap, making it a scalable solution for enhancing multimodal reasoning capabilities.
π¦ Features
- π Next-Event Prediction for video reasoning
- π Demo scripts for instruction tuning & reinforcement learning
- π οΈ Easy use with LLaMA-Factory on GitHub & EasyR1
π Setup
1. Create a Conda environment
conda create -n video_llm python=3.10 -y
conda activate video_llm
2. Download the V1-33K dataset
python v1_data_download.py
You should now see a folder named
V1-33K/containing:
first_part_video/video_dataset/
π§ LLaMA-Factory Integration
-
Clone the repo
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git cd LLaMA-Factory -
Install dependencies
pip install -e ".[torch,metrics]" --no-build-isolation
ποΈ Preparing Next-Event Prediction Data
# From the project root
python video_data_generation.py
The generated data will be placed in
./LLaMA-Factory/data/
Move necessary files
mv dataset_info.json LLaMA-Factory/data/
mv qwen2_5vl_7B_full_sft_5K.yaml LLaMA-Factory/examples/train_full/
π Demo Training
-
Instruction Tuning
bash video_instruction_tuning_demo.sh
π€ Reinforcement Learning with GRPO
-
Install RL Env
cd EasyR1 pip install -e . -
Run the GRPO training demo
bash video_GRPO_training_demo.sh
π₯ Evaluation
We run all our evaluations based on the lmms-eval. Besides those benchmarks that have been implemented in lmms-eval, we also incorporate evaluations of our FutureBench as well as SeedBench-R1 into it. To start,
-
Install lmms-eval
# eval with lmms-eval cd third_party/lmms-eval pip install -e . -
Preparing Dataset
You should also find the
futurebench.jsonunder the same folder namedV1-33K/.# make dataset from futurebench.json python gen_dataset.py -
Run the inference
Before running the following eval script, check the
dataset_pathandcache_dirinthird_party/lmms-eval/lmms_eval/tasks/futurebench/futurebench.yamlare correct.bash third_party/lmms-eval/examples/eval_futurebench.shTo run evaluations on other benchamarks, see more settings in
third_party/lmms-eval/examples/.
π Citation
If you find this repository useful, please cite our paper:
@misc{wang2025fosteringvideoreasoningnextevent,
title={Fostering Video Reasoning via Next-Event Prediction},
author={Haonan Wang and Hongfu Liu and Xiangyan Liu and Chao Du and Kenji Kawaguchi and Ye Wang and Tianyu Pang},
year={2025},
eprint={2505.22457},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.22457},
}
π Happy exploring & feel free to open an issue or pull request! π
