Video-KTR: Key-Token Reinforcement for Video Reasoning
January 28, 2026
Video-KTR is a reinforcement learning framework designed for complex video reasoning.
It identifies and amplifies critical visual-temporal tokens via selective gradient reinforcement, significantly improving video reasoning performance.
[Paper] [Video-KTR-7B-model]
[2026.01.26] Our work has been accepted by ICLR 2026.
Highlights
- State-of-the-art performance on multiple video reasoning benchmarks (Video-Holmes, VideoMMMU, MMVU, VideoMME)
- Key Token Reinforcement (KTR): amplifies signal on high-entropy / visual-aware / temporal-aware tokens
- Better temporal and causal reasoning demonstrated by detailed case studies
Video-KTR improves video reasoning by identifying the tokens that are truly critical to the reasoning process. It selects three kinds of key tokens:
- visual-aware tokens, found by counterfactual probing with masked images
- temporal-aware tokens, found by counterfactual probing with shuffled frames
- high-entropy tokens, selected by entropy filtering as uncertain but informative
During training, non-critical tokens are masked, and gradients are reinforced only on key tokens, enabling more stable and accurate temporal-causal reasoning.
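The selection and masking idea can be illustrated with a minimal sketch. This is not the repository's implementation: the function names, the use of absolute log-probability change as the counterfactual score, the top-fraction threshold, and taking the union of the three criteria are all assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_fraction(scores, fraction):
    """Indices of the highest-scoring `fraction` of positions (at least one)."""
    k = max(1, int(round(fraction * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def select_key_tokens(entropy, logp_full, logp_no_visual, logp_shuffled,
                      fraction=0.2):
    """Return a 0/1 mask over response tokens (hypothetical selection rule).

    entropy        : per-token entropy under the original video input
    logp_full      : per-token log-prob with the original video
    logp_no_visual : per-token log-prob with the visual input masked
    logp_shuffled  : per-token log-prob with the frames shuffled
    """
    # Assumption: a token is "visual-aware" if masking the images changes
    # its log-prob a lot, and "temporal-aware" if shuffling frames does.
    visual = [abs(a - b) for a, b in zip(logp_full, logp_no_visual)]
    temporal = [abs(a - b) for a, b in zip(logp_full, logp_shuffled)]
    # Assumption: the key set is the union of the three criteria.
    key = (top_fraction(entropy, fraction)
           | top_fraction(visual, fraction)
           | top_fraction(temporal, fraction))
    return [1 if i in key else 0 for i in range(len(entropy))]


def masked_mean_loss(per_token_loss, mask):
    """Average the loss over key tokens only; masked tokens contribute nothing."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(per_token_loss, mask)) / max(kept, 1)


if __name__ == "__main__":
    # Toy values for a five-token response.
    entropy = [0.1, 1.2, 0.05, 0.3, 0.2]
    logp_full = [-0.1, -1.0, -0.2, -0.5, -0.3]
    logp_no_visual = [-0.1, -1.1, -0.2, -2.5, -0.3]
    logp_shuffled = [-0.1, -1.0, -0.2, -0.6, -1.8]
    mask = select_key_tokens(entropy, logp_full, logp_no_visual, logp_shuffled)
    print(mask, masked_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask))
```

In an actual GRPO setup the mask would multiply the per-token policy loss before averaging, so tokens outside the key set receive zero gradient.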
Main Results
- On Video-Holmes, Video-KTR reaches 42.7, surpassing GPT-4o (42.0) and approaching Gemini-2.5-Pro (45.0), both closed-source models.
- At the 7B scale, it also substantially outperforms all existing open-source baselines, highlighting the effectiveness of our key-token reinforcement approach.
Ablation on Different Post-Training Methods
Results from applying multiple post-training methods on the same dataset show that our approach consistently delivers superior performance.
Analysis of Token Selection
- We decompose the gradients of the final layer and show that the tokens we mask out contribute low-magnitude, highly scattered gradients, indicating weak and noisy supervision.
- The reduced training loss variance further confirms that our method leads to more stable and efficient optimization.
Qualitatively, our word-cloud and POS analyses further confirm that masked tokens are largely function words while the selected tokens are informative.
Case Studies on Event Causality Reasoning & Temporal Ordering
Installation
# build environment
git clone https://github.com/ziyue1999/Video-KTR.git
cd Video-KTR
conda create -n video-r1 python=3.11
conda activate video-r1
bash setup.sh
# download training dataset
git lfs install
git clone https://huggingface.co/datasets/Video-R1/Video-R1-data
Place the downloaded dataset in src/r1-v/Video-R1-data/. The Video-R1-260k.json file is used for RL training.
Then unzip the data:
python ./src/unzip.py
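Before launching training, it can help to confirm that the dataset landed where the scripts expect it. The snippet below is a small sketch, not part of the repository; the helper name `check_dataset` is hypothetical, and the default paths are taken from the instructions above and assume the script is run from the repository root.

```python
from pathlib import Path


def check_dataset(root="src/r1-v/Video-R1-data",
                  annotation="Video-R1-260k.json"):
    """Report whether the dataset directory and RL annotation file exist."""
    root = Path(root)
    return {
        "root_exists": root.is_dir(),
        "annotation_exists": (root / annotation).is_file(),
    }


if __name__ == "__main__":
    # Run from the repository root (assumption).
    print(check_dataset())
```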
Qwen2.5-VL is updated frequently in the Transformers library, which may cause version-related bugs or inconsistencies. Our code is compatible with the following version; please download it here.
Then install the provided version of Transformers:
unzip transformers-main.zip
cd ./transformers-main
pip install .
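Since version mismatches are the stated risk, one way to confirm which Transformers build is active in the environment is to query the installed package metadata. This is a generic sketch using the Python standard library; the helper name `installed_version` is hypothetical, and the expected version string is whatever the provided archive installs, which is not stated here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Compare this against the version shipped in transformers-main.zip.
    print("transformers:", installed_version("transformers"))
```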
Training
Our training pipeline builds upon Video-R1 and Qwen2.5-VL-SFT.
The script for GRPO training is as follows:
cd src/r1-v
bash ../scripts/run_grpo_video_ktr.sh
Inference & Evaluation
Run evaluation on Video-Holmes / VideoMMMU / MMVU:
bash ./src/eval_bench.sh
For inference on a single example, use:
python ./src/inference_example.py
Citation
If you find our work helpful for your research, please consider citing it.
@misc{wang2026videoktrreinforcingvideoreasoning,
title={Video-KTR: Reinforcing Video Reasoning via Key Token Attribution},
author={Ziyue Wang and Sheng Jin and Zhongrong Zuo and Jiawei Wu and Han Qiu and Qi She and Hao Zhang and Xudong Jiang},
year={2026},
eprint={2601.19686},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.19686},
}