December 1, 2025

LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

Maoyuan Ye¹, Haibin He¹, Qihuang Zhong¹, Jing Zhang¹ ✉️, Juhua Liu¹ ✉️, Bo Du¹
¹ Wuhan University

👋 Introduction

We introduce LogicOCR, a benchmark of 2,780 questions that evaluates the logical reasoning abilities of Large Multimodal Models (LMMs) on text-rich images while minimizing reliance on complex STEM knowledge. It has two subsets: LogicOCR-Gen, with 1,100 multiple-choice questions on generated images, and LogicOCR-Real, with 1,680 meticulously designed free-form questions on real-world images. To construct LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, then build an automatic pipeline that steers GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism. The generated images are then manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs perform worse on multimodal inputs than on text-only inputs, indicating that they have not fully bridged visual reading with reasoning. Moreover, we propose TextCue, a training-free method that enhances LMMs' perception of image regions containing important text cues for solving questions. TextCue uses the LMM's attention maps and an off-the-shelf text segmentation specialist to locate the region, which is then cropped and enlarged to augment the original image.
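The region-selection idea behind TextCue can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names `cue_region` and `to_pixel_box`, the `keep_ratio` threshold, and the rule of multiplying attention by a binary text mask are all assumptions made for illustration. It only shows how an attention grid and a text mask might be combined into a crop box; the actual cropping and enlarging would be done with an image library.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def cue_region(attention: Grid, text_mask: Grid, keep_ratio: float = 0.5) -> Box:
    """Bounding box (in grid cells) of text cells with high attention.

    Hypothetical rule: keep cells whose attention * mask value is at
    least keep_ratio times the strongest such value.
    """
    scored = [
        [a * m for a, m in zip(att_row, mask_row)]
        for att_row, mask_row in zip(attention, text_mask)
    ]
    peak = max(max(row) for row in scored)
    if peak <= 0:
        # No attended text cell: fall back to the whole grid.
        return 0, 0, len(scored), len(scored[0])
    cells = [
        (r, c)
        for r, row in enumerate(scored)
        for c, value in enumerate(row)
        if value >= keep_ratio * peak
    ]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def to_pixel_box(region: Box, grid_size: Tuple[int, int],
                 image_size: Tuple[int, int]) -> Box:
    """Scale a grid-cell box to pixel coordinates (height, width order)."""
    top, left, bottom, right = region
    grid_h, grid_w = grid_size
    img_h, img_w = image_size
    return (
        top * img_h // grid_h,
        left * img_w // grid_w,
        bottom * img_h // grid_h,
        right * img_w // grid_w,
    )
```

The resulting pixel box could then be cropped from the image, upscaled, and supplied to the model alongside the original image, which is the augmentation step described above.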

📌 Key Findings

CoT does not consistently improve accuracy on LogicOCR—most models fail to reason better step-by-step, suggesting flaws in their reasoning paths.
Test-time scaling significantly improves performance on LogicOCR, though the efficiency of open-source LMMs still leaves room for improvement.
State-of-the-art LMMs still fall short of fully integrating visual reading and reasoning. While vision-language alignment suffices for perception tasks like OCR, it remains inadequate for more complex reasoning, especially as model size grows.
The perception robustness of LMMs across different visual-text orientations needs further improvement. Perturbations like image rotation can reduce accuracy to near-random levels.

For main results and detailed analysis, please refer to the paper.

🔥 News

[11/28/2025]: An updated version of the paper is released. LogicOCR now consists of two subsets: LogicOCR-Gen, with 1,100 multiple-choice questions on generated images, and LogicOCR-Real, with 1,680 meticulously designed free-form questions on real-world images.
[05/16/2025]: Released the dataset on Hugging Face and released the code.

📖 Main Results

[Figure: main_results_fig]

[Figure: main_results]

🔨 Evaluation

Setup

Clone this repo and download the images and JSON file:

git clone https://github.com/MiliLab/LogicOCR
cd LogicOCR
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_gen.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_real.zip
unzip LogicOCR_gen.zip && rm LogicOCR_gen.zip
unzip LogicOCR_real.zip && rm LogicOCR_real.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_gen.json
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_real.json
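After downloading, a quick sanity check can confirm that the JSON files are present and match the subset sizes stated in the introduction (1,100 and 1,680 questions). The sketch below is not part of the repository. It assumes each file holds one top-level entry per question, which the README does not confirm, and the helper names `count_questions` and `check_downloads` are hypothetical.

```python
import json
from pathlib import Path

# Subset sizes as stated in the introduction.
EXPECTED_COUNTS = {"LogicOCR_gen.json": 1100, "LogicOCR_real.json": 1680}


def count_questions(json_path):
    """Count top-level entries (assumption: one entry per question)."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data)


def check_downloads(root="."):
    """Return a {filename: status} report for the two annotation files."""
    report = {}
    for name, expected in EXPECTED_COUNTS.items():
        path = Path(root) / name
        if not path.exists():
            report[name] = "missing"
            continue
        found = count_questions(path)
        if found == expected:
            report[name] = "ok"
        else:
            report[name] = f"expected {expected}, found {found}"
    return report


if __name__ == "__main__":
    for name, status in check_downloads().items():
        print(f"{name}: {status}")
```

If the files turn out to use a different layout (for example, a dict with a nested list of questions), `count_questions` would need to be adjusted accordingly.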

Recommended Environment

Python >= 3.10, torch 2.5.1, torchvision 0.20.1, transformers >= 4.49.0, and flash-attn 2.7.4.post1. See requirement.txt for the remaining dependencies.
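The versions listed above can be compared against the active environment with a short script. This is a minimal sketch using only the standard library, not a tool shipped with the repository. The helper names are hypothetical, and the comparison looks only at the leading numeric parts of a version string, so it is a rough check and not a full version-specifier parser.

```python
import sys
from importlib import metadata

# (package, operator, version) taken from the recommended environment.
RECOMMENDED = [
    ("torch", "==", "2.5.1"),
    ("torchvision", "==", "0.20.1"),
    ("transformers", ">=", "4.49.0"),
    ("flash-attn", "==", "2.7.4.post1"),
]


def numeric_prefix(version):
    """Turn '4.49.0' or '2.7.4.post1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def satisfies(installed, op, wanted):
    """Check one requirement; local build tags such as '+cu124' are ignored."""
    if op == "==":
        return installed.split("+")[0] == wanted
    return numeric_prefix(installed) >= numeric_prefix(wanted)


def check_environment():
    """Return a list of human-readable mismatches (empty if all match)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"python {sys.version.split()[0]} is older than 3.10")
    for name, op, wanted in RECOMMENDED:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if not satisfies(installed, op, wanted):
            problems.append(f"{name} {installed} does not satisfy {op} {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the recommendation"]:
        print(line)
```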

Evaluate LMMs

Some evaluation scripts are provided in infer_models and infer_models_real.

For evaluation on LogicOCR-Gen:

bash eval_gen.sh

For evaluation on LogicOCR-Real:

bash eval_real.sh

Report the overall and detailed accuracy, for example:

python get_score.py \
    --gen_json res/LLaVA-OneVision-1.5-8B-Instruct_image-text_cot.json \
    --real_json res_real/LLaVA-OneVision-1.5-8B-Instruct_image-text_cot.json
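For readers who want a feel for what "overall and detailed accuracy" means here, the sketch below computes per-group and overall accuracy from a list of result records. It does not reproduce the logic of get_score.py. The record fields `type`, `answer`, and `prediction` are hypothetical, and the exact-match rule only suits multiple-choice answers; free-form answers such as those in LogicOCR-Real would likely need a different matching rule.

```python
from collections import defaultdict


def accuracy_report(records, group_key="type"):
    """Return {group: accuracy, ..., 'overall': accuracy}.

    Each record is a dict with hypothetical keys 'answer' and
    'prediction', plus an optional grouping key.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec.get(group_key, "unknown")
        totals[group] += 1
        # Case-insensitive exact match on the answer string.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            hits[group] += 1
    report = {group: hits[group] / totals[group] for group in totals}
    n = sum(totals.values())
    report["overall"] = sum(hits.values()) / n if n else 0.0
    return report


if __name__ == "__main__":
    demo = [
        {"type": "categorical", "answer": "A", "prediction": "a"},
        {"type": "categorical", "answer": "B", "prediction": "C"},
        {"type": "conditional", "answer": "D", "prediction": "D "},
        {"type": "conditional", "answer": "C", "prediction": "C"},
    ]
    for group, acc in accuracy_report(demo).items():
        print(f"{group}: {acc:.2%}")
```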

(Optional) Evaluate OCR and Two-Step Performance

bash eval_ocr.sh

Existing OCR evaluation results are also available in the Hugging Face repo.

▶️ Text-to-Image Generation

If you want to generate images yourself, a JSON file with 3 samples and a simple script are provided for reference. Run the following commands; the generated images will be saved in gen_images/saved_folder.

cd gen_images
python gpt_generate.py samples.json $YOUR_API_KEY $YOUR_BASE_URL $NUM_WORKERS

📜 License

LogicOCR is licensed under CC BY-NC-SA 4.0.

💗 Acknowledgement

The raw text corpora for constructing LogicOCR-Gen are collected from LogiQA and LogiQA2.0.

The inference script is modified from OCRBench. The OCR evaluation tool is modified from Fox.

✒️ Citation

If you find LogicOCR helpful, please consider giving this repo a ⭐ and citing:

@article{ye2025logicocr,
  title={LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?},
  author={Maoyuan Ye and Haibin He and Qihuang Zhong and Jing Zhang and Juhua Liu and Bo Du},
  journal={arXiv preprint arXiv:2505.12307},
  year={2025}
}
