R1-Vision: Let's first take a look at the image
February 16, 2025
DeepSeek-R1 demonstrates outstanding reasoning abilities when tackling math, coding, puzzle, and science problems, as well as responding to general inquiries. However, as a text-only reasoning model, R1 cannot process multimodal inputs like images, which limits its practicality in certain situations. Exploring the potential for multimodal reasoning is an intriguing prospect.
We build this project to create a model that can reason with both text and images.
News
2025/02/08: We are excited to announce the release of the initial version of our cold-start dataset on HuggingFace. This first trial employs the DeepSeek-R1-Distill-Qwen-32B model for reasoning and the GPT-4o-mini model for image captioning and data formatting.
Note
We are actively working on developing enhanced versions that will:
- Incorporate more powerful models.
- Increase task diversity.
- Improve sample quality.
Stay tuned for updates!
Stage1: Cold Start
We explore distilling the strong reasoning capabilities of a Large Language Model (LLM) such as R1 into a Large Vision-Language Model (LVLM). Specifically, we use three kinds of data:
- Text Data: Text-only reasoning datasets.
- Text Rendering Data: Curated from text-only reasoning datasets, utilizing a reformatting and rendering pipeline. We adopt these data to encourage identical responses to the same question across input modalities.
- Multimodal Data: Curated from raw multimodal datasets. We adopt a simple strategy to mitigate the absence of vision capabilities in text-only reasoning models, called Caption-Prefixing.
| Type | Source Dataset | Samples |
|---|---|---|
| Text | Bespoke-Stratos-17k | 16.7k |
| Text Rendering | Bespoke-Stratos-17k | 12.6k |
| Multimodal | AI2D | 7.8k |
| Text / Multimodal | ScienceQA | 9.9k |
| Multimodal | PixMo-Cap-QA | 19.4k |
Caption-Prefixing
Similar to the reasoning forcing trick, we make the model pretend to "see" the image by captioning it at the beginning of the thinking process. We use a simple template:
# English
prefix_en = "<think>\nHmm, to solve this problem, let's first take a look at the image. {}\n\nNow".format(image_caption)
# Chinese (in English: "Hmm, to answer this question, let me first take a look at the image. {}\n\nFirst")
prefix_zh = "<think>\n嗯，为了回答这个问题，让我先看一下图片。{}\n\n首先".format(image_caption)
There are two important switches for entering and exiting the caption mode:
- Enter: `let's first take a look at the image.`
- Exit: `Now`
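As a minimal sketch of how the two switches frame the caption, the helpers below build the English prefix and then recover the caption from a thought that starts with it. The names `build_prefix` and `split_caption` are hypothetical and not part of the released code, and the sketch assumes the caption itself never contains the exit marker.

```python
from __future__ import annotations

ENTER = "let's first take a look at the image."
EXIT = "Now"
TEMPLATE = "<think>\nHmm, to solve this problem, " + ENTER + " {}\n\n" + EXIT


def build_prefix(image_caption: str) -> str:
    """Return the forced start of the thinking process for one sample."""
    return TEMPLATE.format(image_caption.strip())


def split_caption(thought: str) -> tuple[str, str]:
    """Split a thought into (caption, remaining reasoning) at the two switches."""
    # Everything after the enter switch starts with the caption.
    _, _, after_enter = thought.partition(ENTER)
    # The exit switch follows a blank line and ends the caption.
    caption, sep, rest = after_enter.partition("\n\n" + EXIT)
    if not sep:
        raise ValueError("exit switch not found")
    return caption.strip(), EXIT + rest
```

The reasoning model would be asked to continue generating from the string returned by `build_prefix`, so its first free token comes right after the exit switch.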
Tip
It is worth noting that the exit switch originates from the original R1 thought process, which helps the model stop captioning and avoids hallucinations.
This method yields well-formatted thoughts and solutions without heavy post-processing such as LLM reformatting. The wording of the switches can also be diversified by simple string replacement.
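The string-replacement idea can be sketched as follows. The alternative phrasings in `ENTER_VARIANTS` and the function `diversify_enter` are illustrative assumptions; the project does not list which variants it uses.

```python
import random

ENTER = "let's first take a look at the image."
# Hypothetical alternative phrasings of the enter switch (assumption).
ENTER_VARIANTS = [
    ENTER,
    "let me start by looking at the image.",
    "I should examine the image first.",
]


def diversify_enter(thought: str, rng: random.Random) -> str:
    """Replace the first occurrence of the enter switch with a random variant."""
    return thought.replace(ENTER, rng.choice(ENTER_VARIANTS), 1)
```

Passing a seeded `random.Random` keeps the rewritten dataset reproducible. A thought that does not contain the enter switch is returned unchanged.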
Examples
| Image | Caption | Thought | Solution |
|---|---|---|---|
| (image not shown) | The image features a close-up view of the legs of a frog, prominently displaying its unique anatomical characteristics. The frog is positioned on a wooden surface, which introduces a warm brown hue that contrasts with the pale yellow of the frog's legs. ... | Now, considering the question: which animal's feet are also adapted for sticking to smooth surfaces? The options are meerkat and Madagascar day gecko. Given that the image shows a frog, which uses sticky pads on its feet to adhere to surfaces, the most likely answer is the Madagascar day gecko, as geckos are known for their ability to climb smooth surfaces using specialized pads. Meerkats, on the other hand, are more adapted for digging and do not have such sticky feet. | The correct answer is: B. Madagascar day gecko. Geckos, including the Madagascar day gecko, have specialized toe pads that allow them to stick to smooth surfaces. This adaptation is due to millions of tiny hairs called setae, which create a strong adhesive force through molecular attraction. |
| (image not shown) | The image depicts a stylized representation of a scarf that features a bold pattern of alternating stripes. The scarf predominantly displays red and cream-colored stripes, with the red stripes being more prominent. The design consists of several equal-width horizontal stripes: the upper part of the scarf shows a vibrant red stripe followed by a creamy ivory stripe, and this pattern continues down the length of the scarf. ... | Now, considering the question: Is a scarf a solid or a liquid? The options are A. a solid and B. a liquid. From the description and the image, it's clear that a scarf is a physical object made from materials like wool or yarn. These materials are solid in their form. Liquids, on the other hand, take the shape of their container and flow freely, which does not apply to a scarf. Therefore, the scarf is a solid. | A. a solid |
Text Rendering
- Reformatting the original question with an LLM.
- Rendering the reformatted LaTeX files into images.
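As a rough sketch of the second step, the helper below wraps an already reformatted question in a compilable LaTeX document. The function name `build_latex_document`, the choice of the `standalone` class, and the page width are assumptions made for illustration; the project does not document its actual rendering setup.

```python
def build_latex_document(question: str, width_cm: float = 12.0) -> str:
    """Wrap a reformatted question in a minimal standalone LaTeX document."""
    return "\n".join([
        # `varwidth` wraps long text at the given width; `border` adds padding.
        r"\documentclass[varwidth=%.1fcm, border=10pt]{standalone}" % width_cm,
        r"\usepackage{amsmath, amssymb}",
        r"\begin{document}",
        question,
        r"\end{document}",
    ])
```

The resulting file could then be compiled to PDF and rasterized with whatever toolchain is available, for example `pdflatex` followed by a PDF-to-PNG converter. That part depends on the local environment and is left out here.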
Examples
| Original Text | Rendered Image |
|---|---|
| $P(x)$ is a polynomial of degree $3n$ such that\n\begin{eqnarray*} P(0) = P(3) = \cdots &=& P(3n) = 2, \\ P(1) = P(4) = \cdots &=& P(3n-2) = 1, \\ P(2) = P(5) = \cdots &=& P(3n-1) = 0, \quad\text{ and }\\ && P(3n+1) = 730.\end{eqnarray*}\nDetermine $n$. | (image not shown) |
| Jose, Thuy, and Kareem each start with the number 10. Jose subtracts 1 from the number 10, doubles his answer, and then adds 2. Thuy doubles the number 10, subtracts 1 from her answer, and then adds 2. Kareem subtracts 1 from the number 10, adds 2 to his number, and then doubles the result. Who gets the largest final answer? | (image not shown) |
Performance
TODO: Train and evaluate TextHawk2-7B and Qwen2.5-VL-7B.
Stage2: RL
TODO: Explore RL for LVLMs.
Citation
If you find this project useful in your research, please consider citing:
@misc{yu25r1vision,
author = {Ya{-}Qi Yu and Minghui Liao and Feilong Chen and Jihao Wu and Chao Weng},
title = {R1-Vision: Let's first take a look at the image},
howpublished = {\url{https://github.com/yuyq96/R1-Vision}},
note = {Accessed: 2025-02-08},
year = {2025}
}
Acknowledgement
R1-Vision is built with reference to the code or data of the following projects: DeepSeek-R1, Bespoke-Stratos-17k, AI2D, ScienceQA, PixMo. Thanks for their awesome work!