π Introduction
December 1, 2025 Β· View on GitHub
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Maoyuan Ye1, Haibin He1, Qihuang Zhong1, Jing Zhang1 βοΈ, Juhua Liu1 βοΈ, Bo Du1
1 Wuhan University
π Introduction
We introduce LogicOCR, a benchmark comprising 2,780 questions with two subsets, i.e., LogicOCR-Gen with 1,100 multi-choice questions on generated images, and LogicOCR-Real with 1,680 meticulously designed free-form questions on real-world images, to evaluate the logical reasoning abilities of Large Multimodal Models (LMMs) on text-rich images, while minimizing reliance on complex STEM knowledge. For constructing LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, and customize an automatic pipeline to steer GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. Moreover, we propose TextCue, a training-free method that enhances LMMsβ perception of image regions containing important text cues for solving questions. We leverage LMMs' attention maps and an off-the-shelf text segmentation specialist to determine the region, which is then cropped and enlarged to augment the original image.
π Key Findings
- CoT does not consistently improve accuracy on LogicOCRβmost models fail to reason better step-by-step, suggesting flaws in their reasoning paths.
- Test-time scaling significantly improves performance on LogicOCR, though the efficiency of open-source LMMs still leaves room for improvement
- State-of-the-art LMMs still fall short of fully integrating visual reading and reasoning. While vision-language alignment suffices for perception tasks like OCR, it remains inadequate for more complex reasoning, especially as model size grows.
- The perception robustness of LMMs across different visual-text orientations needs further improvement. Perturbations like image rotation can reduce accuracy to near-random levels.
For main results and detailed analysis, please refer to the paper.
π₯ News
-
[
11/28/2025]: A new version of paper is updated. LogicOCR consists of two subsets now, i.e., LogicOCR-Gen with 1,100 multi-choice questions on generated images, and LogicOCR-Real with 1,680 meticulously designed free-form questions on real-world images. -
[
05/16/2025]: Release the dataset on huggingface. Release the codes.
π Main Results


π¨ Evaluation
- Setup
Clone this repo and download the images and JSON file:
git clone https://github.com/MiliLab/LogicOCR
cd LogicOCR
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_gen.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_real.zip
unzip LogicOCR_gen.zip && rm LogicOCR_gen.zip
unzip LogicOCR_real.zip && rm LogicOCR_real.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_gen.json
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_real.json
- Recommed Environment
python>=3.10, torch 2.5.1, torchvision 0.20.1, transformers >= 4.49.0, flash-attn 2.7.4.post1, and see requirement.txt
- Evaluate LMMs
Some evaluation scripts are provided in infer_models and infer_models_real.
For evaluation on LogicOCR-Gen:
bash eval_gen.sh
For evaluation on LogicOCR-Real:
bash eval_real.sh
Report the overall and detailed accuracy, for example:
python get_score.py \
--gen_json res/LLaVA-OneVision-1.5-8B-Instruct_image-text_cot.json \
--real_json res_real/LLaVA-OneVision-1.5-8B-Instruct_image-text_cot.json \
- (Optional) Evaluate OCR and Two-Step Performance
bash eval_ocr.sh
You can also find the existing OCR evaluation results in huggingface repo.
βΆοΈ Text-to-Image Generation
If you want to generate images in yourself, a JSON file with 3 samples and a simple script are provided for reference. You can run the following commands. The generated images will be saved in gen_images/saved_folder
cd gen_images
python gpt_generate.py samples.json $YOUR_API_KEY $YOUR_BASE_URL $NUM_WORKERS
π License
LogicOCR is licensed under CC BY-NC-SA 4.0.
π Acknowledgement
The raw text corpora for constructing LogicOCR-Gen are collected from LogiQA and LogiQA2.0.
The inference script is modified from OCRBench. The OCR evaluation tool is modified from Fox.
:black_nib: Citation
If you find LogicOCR helpful, please consider giving this repo a :star: and citing:
@article{ye2025logicocr,
title={LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?},
author={Maoyuan Ye and Haibin He and Qihuang Zhong and Jing Zhang and Juhua Liu and Bo Du},
journal={arXiv preprint arXiv:2505.12307},
year={2025}
}