Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

August 7, 2025 · View on GitHub


The official repo for "Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback".

🤗 Cold-start Dataset | 🤗 RRVF-Chart Model | 🤗 RRVF-Web Model | 📑 Paper

🔥 Next up: Scaling RRVF for more diverse and complex visual applications!


✨ Framework Overview

RRVF is a training framework that enhances the visual reasoning capabilities of MLLMs using purely visual signals. Its core is a closed-loop system comprising three key components: an iterative visual reasoner, a visual feedback mechanism, and a final visual judge.

Figure: The RRVF framework.

Iterative Visual Reasoning

The reasoning process is iterative. The model receives an image and produces a response containing its internal thoughts (in `<think>` tags) and a specific action (in a `<tool_call>` tag). After the tool executes the code, the resulting visual feedback is appended to the conversation history to inform the model's next turn; the loop repeats until the model emits a final solution (in an `<answer>` tag).
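The turn loop above can be sketched as a small driver. Here `model` and `execute_tool` are hypothetical stand-ins for the MLLM call and the rendering tool, not this repo's actual API:

```python
import re

# Tag patterns for the response parts (hypothetical parsing; the repo's
# actual prompt and parsing code may differ).
TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def run_episode(model, execute_tool, image, max_turns=8):
    """Drive one iterative-reasoning episode.

    `model` maps a message history to a response string; `execute_tool` runs
    the code inside <tool_call> and returns natural-language visual feedback.
    Both are stand-ins for the real MLLM call and rendering engine.
    """
    history = [{"role": "user", "content": image}]
    for _ in range(max_turns):
        response = model(history)
        history.append({"role": "assistant", "content": response})
        answer = ANSWER.search(response)
        if answer:                      # final solution reached
            return answer.group(1).strip()
        call = TOOL_CALL.search(response)
        if call:                        # run the code, feed visual feedback back in
            feedback = execute_tool(call.group(1).strip())
            history.append({"role": "user", "content": feedback})
    return None                         # no answer within the turn budget
```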

Visual Feedback Mechanism

This is the key to guiding the model's learning.

  1. Rendering: The model-generated code is executed by a domain-specific engine (e.g., Matplotlib for charts, Playwright for web pages) to render an image.
  2. Comparison & Feedback: A more powerful "teacher" MLLM compares the rendered image to the original, articulating the visual discrepancies (e.g., color, layout, missing elements) in natural language. This descriptive feedback provides actionable guidance for the model.
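A minimal sketch of the rendering step, assuming the model-generated Matplotlib code arrives as a string. The subprocess isolation, the `render_chart` helper, and the prompt wording are illustrative assumptions, not this repo's implementation:

```python
import os
import subprocess
import sys
import tempfile

def render_chart(code, out_path, timeout=60):
    """Execute model-generated Matplotlib code in a subprocess and save the
    figure to `out_path`. Returns (ok, message); on failure, `message` carries
    the last traceback line so it can be folded into the feedback.
    """
    # Force a save of the current figure after the generated code runs.
    script = code + f"\nimport matplotlib.pyplot as plt\nplt.savefig({out_path!r})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        script_path = f.name
    try:
        proc = subprocess.run([sys.executable, script_path],
                              capture_output=True, text=True, timeout=timeout)
        if proc.returncode != 0:
            return False, proc.stderr.strip().splitlines()[-1]
        return True, "rendered"
    finally:
        os.unlink(script_path)

# Illustrative instruction for the "teacher" MLLM that compares the two images.
FEEDBACK_PROMPT = (
    "Compare the rendered chart with the reference chart and describe the "
    "visual discrepancies (color, layout, missing elements) as concrete, "
    "actionable edits to the plotting code."
)
```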

Reinforcement Learning Optimization

The entire closed-loop process is formulated as a reinforcement learning task and optimized with the GRPO algorithm. We designed a hybrid reward function to guide the learning:

  • Visual Similarity Reward (R_vision): Provided by the visual judge, this quantifies the fidelity between the final rendered image and the original input.
  • Format Correctness Reward (R_format): Penalizes improper output formatting and non-executable code.
  • Tool-Use Reward (R_tool): Encourages exploration and iterative refinement by rewarding successful tool calls.
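The hybrid reward can be sketched as a weighted sum of the three terms. The weights and the tool-bonus cap below are illustrative assumptions, not the paper's coefficients:

```python
def hybrid_reward(vision_score, format_ok, code_ran, n_successful_tool_calls,
                  w_format=0.5, w_tool=0.25, tool_bonus_cap=0.5):
    """Combine R_vision, R_format, and R_tool into one scalar for GRPO.

    `vision_score` is the visual judge's similarity score in [0, 1]; the
    weights and cap are illustrative, not the paper's values.
    """
    r_vision = vision_score
    r_format = 0.0 if (format_ok and code_ran) else -w_format   # penalty term
    r_tool = min(w_tool * n_successful_tool_calls, tool_bonus_cap)  # capped bonus
    return r_vision + r_format + r_tool
```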

📊 Main Results

Chart-to-Code Task

Results on the ChartMimic test set:

| Model | Exec rate | Text | Layout | Type | Color | GPT-4o score | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-Source MLLMs** | | | | | | | |
| (2024/02) Gemini-1.0-Pro-Vision | 68.2* | 52.6* | 64.2* | 51.3* | 47.1* | 53.3* | 53.6* |
| (2024/11) GPT-4o-2024-11-20 | 90.00 | 66.55 | 79.31 | 71.83 | 60.84 | 82.50 | 76.06 |
| (2025/04) OpenAI o3 | 90.17 | 74.17 | 80.58 | 71.37 | 63.74 | 86.45 | 79.46 |
| (2025/05) Claude-4-Sonnet | 91.83 | 68.87 | 82.43 | 67.13 | 57.59 | 85.46 | 77.23 |
| (2025/06) Gemini-2.5-Pro | 93.33 | 84.95 | 83.37 | 75.05 | 66.90 | 90.58 | 84.07 |
| **Open-Source MLLMs** | | | | | | | |
| (2025/02) Qwen2.5-VL-72B-Instruct | 83.83 | 34.44 | 61.71 | 45.49 | 35.12 | 50.41 | 47.30 |
| (2024/03) DeepSeek-VL-7B | 41.3* | 15.3* | 26.6* | 19.7* | 14.5* | 20.4* | 19.7* |
| (2025/02) LLaVA-OneVision-7B | 17.28 | 7.97 | 13.55 | 9.15 | 7.36 | 10.01 | 9.76 |
| (2025/02) Qwen2.5-VL-7B-Instruct | 68.83 | 30.01 | 55.79 | 36.50 | 26.91 | 39.04 | 38.17 |
| (2025/04) InternVL3-8B | 71.67 | 45.03 | 57.89 | 45.87 | 38.88 | 54.91 | 50.91 |
| SFT [with text labels] | 69.00 | 56.97 | 63.60 | 60.53 | 51.89 | 62.09 | 60.17 |
| Δ (vs Qwen2.5-VL-7B-Instruct) | +0.17 | +26.96 | +7.81 | +24.03 | +24.98 | +23.05 | +22.00 |
| RRVF (Ours) [without text labels] | 97.83 | 62.47 | 80.97 | 53.56 | 46.41 | 67.87 | 64.36 |
| Δ (vs Qwen2.5-VL-7B-Instruct) | +29.00 | +32.46 | +25.18 | +17.06 | +19.50 | +28.83 | +26.19 |

note: Performance comparison on the ChartMimic benchmark. We report the metrics from the original ChartMimic benchmark. The best and second-best results among open-source models under 10B parameters are bolded and underlined, respectively. Results marked with * are reported by the original benchmark.


Results on Plot2Code (Zero-Shot):

| Model | Exec Rate | Text | GPT-4o Score | Text (pass) | GPT-4o Score (pass) |
| --- | --- | --- | --- | --- | --- |
| **Closed-Source MLLMs** | | | | | |
| (2023/09) GPT-4V | 84.1* | 48.53* | 5.45* | 57.7* | 6.48* |
| (2024/02) Gemini-1.0-Pro-Vision | 68.2* | 36.56* | 3.45* | 53.6* | 5.06* |
| (2024/06) Claude-3-Sonnet | 75.8* | 35.40* | 4.08* | 46.7* | 5.38* |
| (2024/11) GPT-4o-2024-11-20 | 90.15 | 48.91 | 6.09 | 54.25 | 6.76 |
| (2025/04) OpenAI o3 | 87.12 | 57.65 | 6.70 | 66.17 | 7.69 |
| (2025/05) Claude-4-Sonnet | 92.42 | 56.86 | 6.16 | 61.52 | 6.76 |
| (2025/06) Gemini-2.5-Pro | 87.88 | 71.70 | 7.65 | 81.59 | 8.71 |
| **Open-Source MLLMs** | | | | | |
| (2025/02) Qwen2.5-VL-72B-Instruct | 83.33 | 56.74 | 5.79 | 68.09 | 6.95 |
| (2024/03) Mini-Gemini-8x7B-HD | 73.5* | 29.91* | 2.84* | 40.7* | 3.87* |
| (2025/02) LLaVA-OneVision-7B | 84.09 | 26.72 | 2.75 | 31.78 | 3.27 |
| (2025/02) Qwen2.5-VL-7B-Instruct | 70.46 | 35.80 | 3.40 | 50.81 | 4.82 |
| (2025/04) InternVL3-8B | 76.52 | 30.67 | 3.25 | 40.08 | 4.25 |
| SFT [with text labels, ChartMimic trained] | 49.24 | 21.63 | 2.47 | 43.93 | 5.02 |
| Δ (vs Qwen2.5-VL-7B-Instruct) | -21.22 | -14.17 | -0.93 | - | - |
| RRVF (Ours) [without text labels] | 96.21 | 39.89 | 4.44 | 41.46 | 4.61 |
| Δ (vs Qwen2.5-VL-7B-Instruct) | +25.75 | +4.09 | +1.04 | - | - |

note: Performance comparison on the Plot2Code benchmark. The best and second-best results on the primary metrics (Exec Rate, Text, GPT-4o Score) among open-source models under 10B parameters are bolded and underlined, respectively. Results marked with * are reported by the original benchmark.


Web-to-Code Task

Results on the WebSight test set:

| Model | CLIP Score | GPT Score |
| --- | --- | --- |
| **Closed-Source MLLMs** | | |
| GPT-4o-2024-11-20 | 88.94 | 94.55 |
| OpenAI o3 | 91.58 | 96.49 |
| Claude-4-Sonnet | 92.30 | 96.46 |
| Gemini-2.5-Pro | 77.83 | 75.88 |
| **Open-Source MLLMs** | | |
| LLaVA-OneVision-7B | 79.74 | 72.61 |
| Qwen2.5-VL-7B-Instruct | 83.50 | 84.17 |
| InternVL3-8B | 84.17 | 85.54 |
| RRVF (Ours) | 88.29 | 91.50 |

note: Performance comparison on the WebSight benchmark for web interface generation. The best results among open-source models under 10B parameters are bolded.


πŸ“ Citation

If you use our work in your research, please cite our paper:

@misc{chen2025learningimagesvisualreinforcement,
      title={Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback}, 
      author={Yang Chen and Yufan Shen and Wenxuan Huang and Sheng Zhou and Qunshu Lin and Xinyu Cai and Zhi Yu and Jiajun Bu and Botian Shi and Yu Qiao},
      year={2025},
      eprint={2507.20766},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20766}, 
}

πŸ™ Acknowledgements

  • We thank the teams behind the Verl and DeepEyes frameworks.
  • We thank the creators of the ChartMimic, Plot2Code, and WebSight datasets.
  • We thank the VlmEvalkit team.