February 4, 2026
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
A training recipe that optimizes the reasoning capability of VLMs with SFT and RL on general-domain text-only data.
Paper: arXiv link

🗞️ News
- Feb 3, 2026: Our model is released 🤗 model link
- May 6, 2025: Our paper is released on arXiv: arXiv link
Results

Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across a range of general and medical benchmarks.
Method

X-Reasoner is built with a two-stage, text-only post-training pipeline that adds robust reasoning skills to a vision-language model (VLM) and lets those skills generalize across both modalities and domains.
Step 1: Supervised fine-tuning (SFT) with long chains-of-thought
- Starting point: We begin with the instruction-tuned Qwen2.5-VL-7B-Instruct checkpoint, which already follows prompts but lacks explicit reasoning ability.
- Data: We fine-tune on OpenThoughts-114k, a 114k-example dataset of math, coding, and science questions whose long chain-of-thought (CoT) rationales were distilled from the stronger DeepSeek-R1 model. Link: open-thoughts
- Algorithm: We train the model with SFT for 4 epochs.
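The SFT stage is standard next-token training on the question plus its long CoT rationale, with the loss restricted to the rationale and answer tokens. A minimal sketch of such a masked cross-entropy loss (PyTorch-style; `prompt_len` and the tensor shapes are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy over CoT + answer tokens only.

    logits:     (batch, seq_len, vocab) model outputs
    labels:     (batch, seq_len) token ids of prompt + rationale + answer
    prompt_len: number of leading prompt tokens excluded from the loss
    """
    labels = labels.clone()
    labels[:, :prompt_len] = -100          # ignore prompt tokens
    shift_logits = logits[:, :-1, :]       # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```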
Step 2: Reinforcement learning with verifiable rewards (RLVR)
- Why RL? SFT yields structured reasoning but can over-explain and drift; RL trims length and sharpens correctness.
- Data: We use Orz-Math-57k, a curated math dataset for RLVR from Open Reasoner Zero.
- Algorithm: We adopt Group Relative Policy Optimization (GRPO), a PPO-style update that compares multiple sampled answers per query. Outcome supervision is applied with exact answer correctness (reward 1 or 0).
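GRPO replaces a learned critic with a group baseline: each of the G sampled answers to a query is scored with the 1/0 outcome reward, and the advantage of each answer is its reward standardized within the group. A minimal sketch of that advantage computation (pure Python; the normalization details are a common GRPO formulation and may differ from the exact implementation used here):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages for one query's G sampled answers.

    Each reward is the verifiable outcome score (1.0 if the final answer is
    exactly correct, else 0.0). Answers above the group mean get a positive
    advantage, answers below it a negative one.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:                  # all answers equally good/bad: no signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```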
Domain-specific extension: X-Reasoner-Med
To demonstrate that X-Reasoner serves as a strong foundation for domain specialization, we further train it on the MedQA dataset using the same SFT → RLVR recipe. The resulting X-Reasoner-Med sets new 7B-scale state-of-the-art scores on a suite of textual and image-based medical-reasoning tasks.
Key design choices
- Pure textual supervision: All optimization steps use only text; the frozen vision encoder still benefits from better language-side reasoning.
- General-domain supervision: We leverage general-domain data to promote cross-domain generalization and to initialize domain specialization.
- Long CoT + verifiable RL: SFT teaches rich, step-by-step reasoning; RLVR then rewards final-answer accuracy, delivering strong, robust gains.
References
The codebase for SFT is a fork of Llama Cookbook (we forked it when it was still named llama-recipes).
The codebase for RL is a fork of the veRL project, with support for vision-language models.
We thank the authors of both projects for providing such high-performance SFT and RL training frameworks.
Citation
@misc{liu2025xreasonergeneralizablereasoningmodalities,
title={X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains},
author={Qianchu Liu and Sheng Zhang and Guanghui Qin and Timothy Ossowski and Yu Gu and Ying Jin and Sid Kiblawi and Sam Preston and Mu Wei and Paul Vozila and Tristan Naumann and Hoifung Poon},
year={2025},
eprint={2505.03981},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.03981}
}