R1-Vision: Let's first take a look at the image

February 16, 2025 ยท View on GitHub

R1-Vision: Let's first take a look at the image

image

[๐Ÿค— Cold-Start Dataset] [๐Ÿ“œ Report (Coming Soon)]

DeepSeek-R1 demonstrates outstanding reasoning abilities when tackling math, coding, puzzle, and science problems, as well as responding to general inquiries. However, as a text-only reasoning model, R1 cannot process multimodal inputs like images, which limits its practicality in certain situations. Exploring the potential for multimodal reasoning is an intriguing prospect.

We build this project to create a model that can reason with both text and images.

๐Ÿ”ฅ News

  • 2025/02/08: We are excited to announce the release of the initial version of our cold-start dataset on ๐Ÿค— HuggingFace. This first trial employs the DeepSeek-R1-Distill-Qwen-32B model for reasoning and the GPT-4o-mini model for image captioning and data formatting.

Note

We are actively working on developing enhanced versions that will:

  • Incorporate more powerful models.
  • Increase task diversity.
  • Improve sample quality.

Stay tuned for updates!

๐Ÿš€ Stage1: Cold Start

We explore distilling the strong reasoning capabilities from a Large Language Model (LLM) such as R1 to a Large Vision-Language Model (LVLM). Specifically, we utilize three kinds of data, including:

  • Text Data: Text-only reasoning datasets.
  • Text Rendering Data: Curated from text-only reasoning datasets, ultilizing a reformatting and rendering pipeline. We adopt these data to encourage identical response to different modality inputs.
  • Multimodal Data: Curated from raw multimodal datasets. We adopt a simple strategy to mitigate the absence of vision capabilities in text-only reasoning models, called Caption-Prefixing.
TypeSource DatasetNumbers
TextBespoke-Stratos-17k16.7k
Text RenderingBespoke-Stratos-17k12.6k
MultimodalAI2D7.8k
Text / MultimodalScienceQA9.9k
MultimodalPixMo-Cap-QA19.4k

๐Ÿ‘ Caption-Prefixing

Similar to the reasoning forcing trick, we make the model pretend to "see" the image by captioning the image in the beginning of the thinking process. We use a simple template:

# English
prefix_en = "<think>\nHmm, to solve this problem, let's first take a look at the image. {}\n\nNow".format(image_caption)

# Chinese
prefix_zh = "<think>\nๅ—ฏ๏ผŒไธบไบ†ๅ›ž็ญ”่ฟ™ไธช้—ฎ้ข˜๏ผŒ่ฎฉๆˆ‘ๅ…ˆ็œ‹ไธ€ไธ‹ๅ›พ็‰‡ใ€‚{}\n\n้ฆ–ๅ…ˆ".format(image_caption)

There are two important switches for entering and exiting the caption mode:

  • Enter: let's first take a look at the image.
  • Exit: Now

Tip

It is worth noting that the exit switch originates from the original R1 thought process, which helps the model stop captioning and avoids hallucinations.

This method achieved well-formatted thoughts and solutions, without the need for heavy post-processings like LLM reformatting. To clarify, we can diversify the switch styles by string replacement.

Examples

ImageCaptionThoughtSolution
c8e3f7c11306f3dd1cf3d73118963eabThe image features a close-up view of the legs of a frog, prominently displaying its unique anatomical characteristics. The frog is positioned on a wooden surface, which introduces a warm brown hue that contrasts with the pale yellow of the frog's legs.
...
Now, considering the question: which animal's feet are also adapted for sticking to smooth surfaces? The options are meerkat and Madagascar day gecko.
Given that the image shows a frog, which uses sticky pads on its feet to adhere to surfaces, the most likely answer is the Madagascar day gecko, as geckos are known for their ability to climb smooth surfaces using specialized pads. Meerkats, on the other hand, are more adapted for digging and do not have such sticky feet.
The correct answer is:
B. Madagascar day gecko
Geckos, including the Madagascar day gecko, have specialized toe pads that allow them to stick to smooth surfaces. This adaptation is due to millions of tiny hairs called setae, which create a strong adhesive force through molecular attraction.
51a35e884e8aac91c5419251fe27dbb9The image depicts a stylized representation of a scarf that features a bold pattern of alternating stripes. The scarf predominantly displays red and cream-colored stripes, with the red stripes being more prominent. The design consists of several equal-width horizontal stripes: the upper part of the scarf shows a vibrant red stripe followed by a creamy ivory stripe, and this pattern continues down the length of the scarf.
...
Now, considering the question: Is a scarf a solid or a liquid? The options are A. a solid and B. a liquid. From the description and the image, it's clear that a scarf is a physical object made from materials like wool or yarn. These materials are solid in their form. Liquids, on the other hand, take the shape of their container and flow freely, which does not apply to a scarf. Therefore, the scarf is a solid.A. a solid

๐Ÿ“ Text Rendering

  • Reformatting the original question with an LLM.
  • Rendering the reformatted LaTeX files on images.

Examples

Original TextRendered Image
</span>P(x)<span></span>P(x)<span> is a polynomial of degree </span>3n<span></span>3n<span> such that\n\begin{eqnarray*} P(0) = P(3) = \cdots &=& P(3n) = 2, \\ P(1) = P(4) = \cdots &=& P(3n-2) = 1, \\ P(2) = P(5) = \cdots &=& P(3n-1) = 0, \quad\text{ and }\\ && P(3n+1) = 730.\end{eqnarray*}\nDetermine nn.2
Jose, Thuy, and Kareem each start with the number 10. Jose subtracts 1 from the number 10, doubles his answer, and then adds 2. Thuy doubles the number 10, subtracts 1 from her answer, and then adds 2. Kareem subtracts 1 from the number 10, adds 2 to his number, and then doubles the result. Who gets the largest final answer?8

๐Ÿ“ˆ Performance

TODO: Train and evaluate TextHawk2-7B and Qwen2.5-VL-7B.

๐Ÿง  Stage2: RL

TODO: Explore RL for LVLMs.

๐Ÿ‘จ๐Ÿปโ€๐Ÿ’ป Citation

If you find this project useful in your research, please consider cite:

@misc{yu25r1vision,
  author       = {Ya{-}Qi Yu and Minghui Liao and and Feilong Chen and Jihao Wu and Chao Weng},
  title        = {R1-Vision: Let's first take a look at the image},
  howpublished = {\url{https://github.com/yuyq96/R1-Vision}},
  note         = {Accessed: 2025-02-08},
  year         = {2025}
}

๐Ÿค Acknowledgement

R1-Vision is built with reference to the code or data of the following projects: DeepSeek-R1, Bespoke-Stratos-17k, AI2D, ScienceQA, PixMo. Thanks for their awesome work!