R1-Vision: Let's first take a look at the image

February 16, 2025

[🤗 Cold-Start Dataset] [📜 Report (Coming Soon)]

DeepSeek-R1 demonstrates outstanding reasoning abilities when tackling math, coding, puzzle, and science problems, as well as responding to general inquiries. However, as a text-only reasoning model, R1 cannot process multimodal inputs like images, which limits its practicality in certain situations. Exploring the potential for multimodal reasoning is an intriguing prospect.

This project aims to build a model that can reason over both text and images.

🔥 News

2025/02/08: We are excited to announce the release of the initial version of our cold-start dataset on 🤗 HuggingFace. This first trial employs the DeepSeek-R1-Distill-Qwen-32B model for reasoning and the GPT-4o-mini model for image captioning and data formatting.

Note

We are actively working on developing enhanced versions that will:

Incorporate more powerful models.
Increase task diversity.
Improve sample quality.

Stay tuned for updates!

🚀 Stage1: Cold Start

We explore distilling the strong reasoning capabilities of a Large Language Model (LLM) such as R1 into a Large Vision-Language Model (LVLM). Specifically, we use three kinds of data:

Text Data: Text-only reasoning datasets.
Text Rendering Data: Curated from text-only reasoning datasets using a reformatting and rendering pipeline. We use this data to encourage identical responses to the same question presented in different modalities.
Multimodal Data: Curated from raw multimodal datasets. To compensate for the lack of vision capabilities in text-only reasoning models, we adopt a simple strategy called Caption-Prefixing.

Type	Source Dataset	Samples
Text	Bespoke-Stratos-17k	16.7k
Text Rendering	Bespoke-Stratos-17k	12.6k
Multimodal	AI2D	7.8k
Text / Multimodal	ScienceQA	9.9k
Multimodal	PixMo-Cap-QA	19.4k
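
As a quick sanity check on the mixture, the rounded counts in the table add up to about 66.4k samples. The sketch below only restates the table's numbers; the variable and function names are illustrative and not part of the project.

```python
# Sample counts (in thousands), copied from the table above.
cold_start_sources = [
    ("Text", "Bespoke-Stratos-17k", 16.7),
    ("Text Rendering", "Bespoke-Stratos-17k", 12.6),
    ("Multimodal", "AI2D", 7.8),
    ("Text / Multimodal", "ScienceQA", 9.9),
    ("Multimodal", "PixMo-Cap-QA", 19.4),
]

# Total size of the cold-start set, in thousands of samples.
total_k = sum(count_k for _, _, count_k in cold_start_sources)


def share(count_k: float) -> float:
    """Fraction of the whole cold-start set contributed by one source."""
    return count_k / total_k


for data_type, source, count_k in cold_start_sources:
    print(f"{data_type:<18} {source:<20} {count_k:>5.1f}k  {share(count_k):6.1%}")
print(f"{'Total':<39} {total_k:>5.1f}k")
```

Because the table reports rounded values, the total and the shares are approximate.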

👁 Caption-Prefixing

Similar to the reasoning-forcing trick, we make the model pretend to "see" the image by placing a caption of the image at the beginning of the thinking process. We use a simple template:

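# image_caption holds the caption text generated for the image.

# English
prefix_en = "<think>\nHmm, to solve this problem, let's first take a look at the image. {}\n\nNow".format(image_caption)

# Chinese: "Hmm, to answer this question, let me first look at the image. {} First"
prefix_zh = "<think>\n嗯，为了回答这个问题，让我先看一下图片。{}\n\n首先".format(image_caption)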

There are two important switches for entering and exiting the caption mode:

Enter: let's first take a look at the image.
Exit: Now
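
The template and its two switches can be wrapped in a small helper. This is a minimal sketch that reproduces the English template shown earlier in this section; the names `ENTER_SWITCH`, `EXIT_SWITCH`, and `build_caption_prefix` are hypothetical, and how the prefix is fed to the reasoning model (presumably as the forced start of its response) is an assumption.

```python
# The two switches named above.
ENTER_SWITCH = "let's first take a look at the image."
EXIT_SWITCH = "Now"

# English template: enter switch, then the caption, then the exit switch.
PREFIX_TEMPLATE = (
    "<think>\nHmm, to solve this problem, " + ENTER_SWITCH + " {}\n\n" + EXIT_SWITCH
)


def build_caption_prefix(image_caption: str) -> str:
    """Return the opening of the thinking process for one image caption."""
    return PREFIX_TEMPLATE.format(image_caption.strip())


prefix = build_caption_prefix("The image shows a red and cream striped scarf.")
print(prefix)
```

The prefix ends on the exit switch, so whatever the model generates next continues the sentence that starts with "Now" rather than extending the caption.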

Tip

The exit switch is taken from the original R1 thought process; it helps the model stop captioning and avoid hallucinations.

This method yields well-formatted thoughts and solutions without heavy post-processing such as LLM reformatting. In addition, the switch phrasing can be diversified by simple string replacement.
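
Diversifying a switch by string replacement could look like the sketch below. The alternative phrasings in `ENTER_VARIANTS` are invented for illustration and are not taken from the project, and `diversify_enter_switch` is a hypothetical helper. Only the enter switch is handled here, because the exit switch "Now" is a common word and would need to be matched by position rather than by plain replacement.

```python
import random

ENTER_SWITCH = "let's first take a look at the image."

# Hypothetical alternative phrasings (assumptions, not from the project).
ENTER_VARIANTS = [
    ENTER_SWITCH,
    "let me look at the image first.",
    "I should start by examining the image.",
]


def diversify_enter_switch(response: str, rng: random.Random) -> str:
    """Swap the first occurrence of the enter switch for a randomly chosen variant."""
    return response.replace(ENTER_SWITCH, rng.choice(ENTER_VARIANTS), 1)


sample = (
    "<think>\nHmm, to solve this problem, let's first take a look at the image. "
    "A frog sits on a wooden surface.\n\nNow, considering the question"
)
print(diversify_enter_switch(sample, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the replacement reproducible across runs of a data pipeline.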

Examples

Image	Caption	Thought	Solution
	The image features a close-up view of the legs of a frog, prominently displaying its unique anatomical characteristics. The frog is positioned on a wooden surface, which introduces a warm brown hue that contrasts with the pale yellow of the frog's legs. ...	Now, considering the question: which animal's feet are also adapted for sticking to smooth surfaces? The options are meerkat and Madagascar day gecko. Given that the image shows a frog, which uses sticky pads on its feet to adhere to surfaces, the most likely answer is the Madagascar day gecko, as geckos are known for their ability to climb smooth surfaces using specialized pads. Meerkats, on the other hand, are more adapted for digging and do not have such sticky feet.	The correct answer is: B. Madagascar day gecko Geckos, including the Madagascar day gecko, have specialized toe pads that allow them to stick to smooth surfaces. This adaptation is due to millions of tiny hairs called setae, which create a strong adhesive force through molecular attraction.
	The image depicts a stylized representation of a scarf that features a bold pattern of alternating stripes. The scarf predominantly displays red and cream-colored stripes, with the red stripes being more prominent. The design consists of several equal-width horizontal stripes: the upper part of the scarf shows a vibrant red stripe followed by a creamy ivory stripe, and this pattern continues down the length of the scarf. ...	Now, considering the question: Is a scarf a solid or a liquid? The options are A. a solid and B. a liquid. From the description and the image, it's clear that a scarf is a physical object made from materials like wool or yarn. These materials are solid in their form. Liquids, on the other hand, take the shape of their container and flow freely, which does not apply to a scarf. Therefore, the scarf is a solid.	A. a solid
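
The Caption, Thought, and Solution columns above can be recovered from a complete response by splitting on the two switches. The following is a minimal sketch under stated assumptions: the reasoning is closed by a `</think>` tag (the usual R1 output format), the exit switch is the first "Now" that follows a blank line after the caption, and `split_response` and `ParsedResponse` are hypothetical names rather than project code.

```python
from typing import NamedTuple

ENTER_SWITCH = "let's first take a look at the image."
EXIT_MARKER = "\n\nNow"   # blank line followed by the exit switch
THINK_CLOSE = "</think>"  # assumed closing tag of the thinking process


class ParsedResponse(NamedTuple):
    caption: str
    thought: str
    solution: str


def split_response(response: str) -> ParsedResponse:
    """Split a caption-prefixed response into caption, thought, and solution."""
    thinking, sep, solution = response.partition(THINK_CLOSE)
    if not sep:
        raise ValueError("missing closing think tag")
    _, sep, after_enter = thinking.partition(ENTER_SWITCH)
    if not sep:
        raise ValueError("missing enter switch")
    caption, sep, rest = after_enter.partition(EXIT_MARKER)
    if not sep:
        raise ValueError("missing exit switch")
    # The thought keeps the exit switch, as in the Thought column above.
    return ParsedResponse(caption.strip(), "Now" + rest.rstrip(), solution.strip())


response = (
    "<think>\nHmm, to solve this problem, let's first take a look at the image. "
    "The image depicts a striped scarf.\n\n"
    "Now, considering the question: Is a scarf a solid or a liquid? "
    "Therefore, the scarf is a solid.\n</think>\n\nA. a solid"
)
parsed = split_response(response)
print(parsed.caption)
print(parsed.solution)
```

A caption that itself contains a blank line followed by "Now" would be cut early by this sketch, so a real pipeline would likely record the caption length when building the prefix instead of searching for the marker.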

📝 Text Rendering

Reformatting the original question with an LLM.
Rendering the reformatted LaTeX files into images.
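
The rendering step could be sketched as follows. The project does not state which toolchain it uses, so everything here is an assumption: the `standalone` document class, compilation with `pdflatex`, and rasterization with `pdftoppm` are one common way to turn LaTeX into a PNG. `build_latex_document` and `render_commands` are hypothetical helpers, and the commands are only constructed, not executed.

```python
from pathlib import Path

# Assumed wrapper document; QUESTION_BODY is replaced by the reformatted question.
LATEX_TEMPLATE = r"""\documentclass[preview,border=10pt]{standalone}
\usepackage{amsmath,amssymb}
\begin{document}
QUESTION_BODY
\end{document}
"""


def build_latex_document(question_latex: str) -> str:
    """Wrap a reformatted question in a minimal standalone LaTeX document."""
    return LATEX_TEMPLATE.replace("QUESTION_BODY", question_latex.strip())


def render_commands(tex_path: Path, dpi: int = 200) -> list[list[str]]:
    """Build (but do not run) the commands that compile and rasterize one file."""
    pdf_path = tex_path.with_suffix(".pdf")
    png_stem = tex_path.with_suffix("")  # pdftoppm appends ".png" itself
    return [
        ["pdflatex", "-interaction=nonstopmode",
         f"-output-directory={tex_path.parent}", str(tex_path)],
        ["pdftoppm", "-png", "-r", str(dpi), "-singlefile",
         str(pdf_path), str(png_stem)],
    ]


document = build_latex_document(r"$P(x)$ is a polynomial of degree $3n$. Determine $n$.")
for command in render_commands(Path("rendered/question_0001.tex")):
    print(" ".join(command))
```

Each command list can be passed to `subprocess.run(command, check=True)` once a LaTeX distribution and Poppler are installed.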

Examples

Original Text	Rendered Image
$P(x)$ is a polynomial of degree $3n$ such that\n\begin{eqnarray} P(0) = P(3) = \cdots &=& P(3n) = 2, \\ P(1) = P(4) = \cdots &=& P(3n-2) = 1, \\ P(2) = P(5) = \cdots &=& P(3n-1) = 0, \quad\text{ and }\\ && P(3n+1) = 730.\end{eqnarray}\nDetermine $n$ .
Jose, Thuy, and Kareem each start with the number 10. Jose subtracts 1 from the number 10, doubles his answer, and then adds 2. Thuy doubles the number 10, subtracts 1 from her answer, and then adds 2. Kareem subtracts 1 from the number 10, adds 2 to his number, and then doubles the result. Who gets the largest final answer?

👨🏻‍💻 Citation

@misc{yu25r1vision,
  author       = {Ya{-}Qi Yu and Minghui Liao and Feilong Chen and Jihao Wu and Chao Weng},
  title        = {R1-Vision: Let's first take a look at the image},
  howpublished = {\url{https://github.com/yuyq96/R1-Vision}},
  note         = {Accessed: 2025-02-08},
  year         = {2025}
}

🤝 Acknowledgement

R1-Vision is built with reference to the code or data of the following projects: DeepSeek-R1, Bespoke-Stratos-17k, AI2D, ScienceQA, PixMo. Thanks for their awesome work!
