🦈 Video-KTR: Key-Token Reinforcement for Video Reasoning

January 28, 2026

Video-KTR is a reinforcement learning framework designed for complex video reasoning.
It identifies and amplifies critical visual-temporal tokens via selective gradient reinforcement, significantly improving video reasoning performance.

[📖 Paper]   [🤗 Video-KTR-7B-model]

[2026.01.26] 🎉 Our work has been accepted by ICLR 2026.

🌟 Highlights

  • 🚀 State-of-the-art performance on multiple video reasoning benchmarks (Video-Holmes, VideoMMMU, MMVU, VideoMME)
  • 🎯 Key Token Reinforcement (KTR): amplifies signal on high-entropy / visual-aware / temporal-aware tokens
  • 🔍 Better temporal and causal reasoning demonstrated by detailed case studies

Video-KTR improves video reasoning by identifying the tokens that are truly critical to the reasoning process. We use counterfactual probing to find two kinds of tokens:

  • 👀 visual-aware tokens (probed by masking images)
  • ⏰ temporal-aware tokens (probed by shuffling frames)

We then apply entropy filtering to select uncertain but informative tokens.

During training, non-critical tokens are masked, and gradients are reinforced only on key tokens, enabling more stable and accurate temporal-causal reasoning.
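The selection and masking steps described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names, the use of the absolute log-probability shift as a counterfactual sensitivity score, the top-ratio cutoff, and taking the union of the three token sets are all assumptions made for the example.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_fraction(scores, ratio):
    """Indices of the highest-scoring tokens; keeps ceil(ratio * n) of them."""
    k = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def select_key_tokens(logp_full, logp_no_image, logp_shuffled, entropies, ratio=0.2):
    """Return a 0/1 mask over response tokens (hypothetical selection rule).

    logp_full      log-prob of each sampled token given the original video
    logp_no_image  same tokens scored with the visual input masked
    logp_shuffled  same tokens scored with the frames shuffled
    entropies      entropy of the policy distribution at each position
    """
    visual_shift = [abs(a - b) for a, b in zip(logp_full, logp_no_image)]
    temporal_shift = [abs(a - b) for a, b in zip(logp_full, logp_shuffled)]
    key = (
        top_fraction(visual_shift, ratio)      # visual-aware tokens
        | top_fraction(temporal_shift, ratio)  # temporal-aware tokens
        | top_fraction(entropies, ratio)       # high-entropy tokens
    )
    return [1 if i in key else 0 for i in range(len(logp_full))]


def masked_policy_loss(logp_full, advantage, mask):
    """Policy-gradient style loss averaged over key tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return -sum(m * advantage * lp for m, lp in zip(mask, logp_full)) / kept
```

In a real training loop the three score lists would come from three forward passes over the same sampled response (original video, masked images, shuffled frames), and the mask would multiply the per-token loss before backpropagation so that only key tokens receive gradient.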


📊 Main Results

  • On Video-Holmes, Video-KTR reaches 42.7, on par with closed-source models such as GPT-4o (42.0) and approaching Gemini-2.5-Pro (45.0) 🎯.
  • At the 7B scale, it also substantially outperforms all existing open-source baselines, highlighting the effectiveness of our key-token reinforcement approach.

⚖️ Data Ablation on Different

Results from applying multiple post-training methods on the same dataset show that our approach consistently delivers superior performance.

🔍 Analysis of Token Selection

  • We decompose the gradients of the final layer and show that the tokens we mask out contribute low-magnitude, highly scattered gradients, indicating weak and noisy supervision.
  • The reduced training loss variance further confirms that our method leads to more stable and efficient optimization.
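One way to quantify "low-magnitude, highly scattered" is to compare, for each group of tokens, the mean gradient norm and the mean pairwise cosine similarity of the per-token gradient vectors. The sketch below assumes the per-token gradients are already available as plain lists of floats; the choice of metrics and the function names are illustrative and are not taken from the paper.

```python
import math
from itertools import combinations


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = norm(u) * norm(v)
    return sum(a * b for a, b in zip(u, v)) / denom if denom > 0.0 else 0.0


def gradient_stats(grads):
    """Mean norm and mean pairwise cosine similarity of a group of gradients."""
    mean_norm = sum(norm(g) for g in grads) / len(grads)
    pairs = list(combinations(grads, 2))
    alignment = sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 1.0
    return mean_norm, alignment
```

Under this reading, a group with a small mean norm and an alignment near or below zero contributes weak gradients that largely cancel each other, while a group with a larger norm and an alignment near one pushes the parameters in a consistent direction.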

Qualitatively, our word-cloud and POS analyses further confirm that masked tokens are largely function words while the selected tokens are informative.

🎬 Case Studies on Event Causality Reasoning & Temporal Ordering


🔧 Installation

# build environment
git clone https://github.com/ziyue1999/Video-KTR.git
cd Video-KTR

conda create -n video-r1 python=3.11 
conda activate video-r1
bash setup.sh

# download training dataset
git lfs install
git clone https://huggingface.co/datasets/Video-R1/Video-R1-data

Place the downloaded dataset in src/r1-v/Video-R1-data/. The Video-R1-260k.json file is used for RL training. Then unzip the data:

python ./src/unzip.py
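If the unzip step or a later training run fails, a quick layout check can rule out a misplaced dataset. The helper below is a hypothetical convenience, not part of the repository; it assumes Video-R1-260k.json sits directly inside src/r1-v/Video-R1-data/, which the instructions above suggest but do not state explicitly.

```python
from pathlib import Path


def missing_dataset_files(repo_root="."):
    """Return the expected dataset paths that are absent under repo_root."""
    data_dir = Path(repo_root) / "src" / "r1-v" / "Video-R1-data"
    # Assumed layout: the RL annotation file lives at the top of the data directory.
    expected = [data_dir, data_dir / "Video-R1-260k.json"]
    return [str(p) for p in expected if not p.exists()]
```

Calling `missing_dataset_files()` from the repository root returns an empty list when both paths exist, and otherwise lists what is missing.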

Qwen2.5-VL support in the Transformers library is updated frequently, which may cause version-related bugs or inconsistencies. Our code is compatible with a specific version, which you can download here.

Then install the provided version of Transformers:

unzip transformers-main.zip
cd ./transformers-main
pip install .
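After installing, it can help to confirm which Transformers build the environment actually resolves, since a previously installed release can shadow the provided one. The sketch below uses only the standard library; the README does not state the expected version string, so it reports what is installed and does not compare it against a target.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("transformers:", installed_version("transformers"))
```

Run it inside the activated video-r1 environment. A result of `None` means the package is not visible to that interpreter.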

🚀 Training

Our training pipeline builds upon Video-R1 and Qwen2.5-VL-SFT.

Run GRPO training with the following script:

cd src/r1-v
bash ../scripts/run_grpo_video_ktr.sh

Inference & Evaluation

Run evaluation on Video-Holmes / VideoMMMU / MMVU:

bash ./src/eval_bench.sh

For inference on a single example, you can use:

python ./src/inference_example.py

📑 Citation

If you find our work helpful for your research, please consider citing it.

@misc{wang2026videoktrreinforcingvideoreasoning,
      title={Video-KTR: Reinforcing Video Reasoning via Key Token Attribution}, 
      author={Ziyue Wang and Sheng Jin and Zhongrong Zuo and Jiawei Wu and Han Qiu and Qi She and Hao Zhang and Xudong Jiang},
      year={2026},
      eprint={2601.19686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.19686}, 
}