Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
📢 News and Updates
- 2025.11.26: We have released our paper! You can read the PDF file here.
- 2025.11.17: We uploaded our model weights FD-RL to HuggingFace.
🤗 Models
| Model | Download Link |
|---|---|
| FD-RL | DocTron/FD-RL |
📚 Dataset
Overview of our training data construction pipeline. We construct the dataset via (1) open-source dataset quality enhancement, (2) real-world PDF construction, and (3) synthetic OCR data generation, yielding a large-scale dataset covering nine common document categories: notes, financial reports, slides, exam papers, synthetic data, magazines, academic papers, books, and newspapers.

🤖 Pipeline
Overview of the FD-RL training pipeline. Our approach comprises two stages. Stage 1 (SFT) trains FD-RL(SFT) on large-scale document data via supervised fine-tuning, freezing the vision encoder and projector while updating the large language model. Stage 2 (RL) introduces two innovations: (1) an entropy-based data filtration strategy that combines type-based filtering (enriching structured data and balancing languages) with entropy-based filtering (retaining high-entropy samples), and (2) format-decoupled RL that applies a separate reward to each content type: a string-matching reward for plain text, an expression-correctness reward for formulas, and a structural-coherence reward for tables. A minimal sketch of the reward dispatch is given below.
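To make the format-decoupled idea concrete, here is a minimal sketch (not the released training code) of how rewards could be dispatched by content type. The specific reward functions below (edit-ratio similarity, whitespace-normalized LaTeX comparison, HTML tag-sequence overlap) are illustrative stand-ins for the string-matching, expression-correctness, and structural-coherence rewards described above.

from difflib import SequenceMatcher
import re

# Illustrative stand-in rewards; the actual FD-RL rewards (e.g., CDM for formulas,
# TEDS for tables) are more sophisticated.
def text_reward(pred: str, ref: str) -> float:
    # String-matching reward for plain text: similarity ratio in [0, 1].
    return SequenceMatcher(None, pred, ref).ratio()

def formula_reward(pred: str, ref: str) -> float:
    # Expression-correctness stand-in: exact match after stripping whitespace.
    normalize = lambda s: re.sub(r"\s+", "", s)
    return 1.0 if normalize(pred) == normalize(ref) else 0.0

def table_reward(pred: str, ref: str) -> float:
    # Structural-coherence stand-in: overlap of HTML tag sequences, ignoring cell
    # text (a real implementation would use a tree edit distance such as TEDS).
    tags = lambda s: re.findall(r"</?\w+", s)
    return SequenceMatcher(None, tags(pred), tags(ref)).ratio()

def format_decoupled_reward(pred: str, ref: str, content_type: str) -> float:
    dispatch = {"text": text_reward, "formula": formula_reward, "table": table_reward}
    return dispatch[content_type](pred, ref)

# A table prediction is rewarded for structural agreement rather than exact cell text.
print(format_decoupled_reward("<table><tr><td>1</td></tr></table>",
                              "<table><tr><td>one</td></tr></table>", "table"))  # 1.0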

📊 Performance
FD-RL is evaluated on OmniDocBench v1.5, a comprehensive document OCR benchmark covering diverse real-world scenarios (e.g., office documents, academic papers, scanned materials). Results show that FD-RL achieves the best performance among end-to-end VLMs, demonstrating the effectiveness of our Format Decoupled RL approach.
End-to-End Evaluation
We conduct an end-to-end evaluation to measure the model's parsing accuracy on full PDF pages. The prediction is the Markdown output the model generates for the entire page. In the table below, ↑ marks metrics where higher is better and ↓ marks metrics where lower is better; the Overall metric aggregates text, formula, table, and reading-order quality into a single score. A minimal sketch of the edit-distance metric used in the Edit columns is given after the table.
| Model Type | Methods | Release Date | End to End | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | ✗ | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | MinerU2-pipeline | 2025 | ✗ | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2025 | ✗ | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2023 | ✓ | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | ✓ | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | ✓ | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | ✓ | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | ✓ | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | ✗ | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | ✗ | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | ✗ | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.07 | ✗ | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | ✗ | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | ✗ | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | ✗ | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | ✓ | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | ✓ | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | ✓ | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | ✓ | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | ✓ | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | ✓ | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | ✓ | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | FD-RL | 2025.11 | ✓ | 4B | 90.41 | 0.049 | 88.67 | 87.35 | 92.10 | 0.055 |
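For readers unfamiliar with the Edit columns, the following is a minimal sketch of a normalized edit-distance metric (lower is better). OmniDocBench's official evaluation applies additional text normalization and matching before scoring, so this is only an illustration of the underlying quantity.

def normalized_edit_distance(pred: str, ref: str) -> float:
    # Levenshtein distance between prediction and reference,
    # normalized by the longer string's length so the score lies in [0, 1].
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, n)

print(normalized_edit_distance("FD-RL parses documents", "FD-RL parses document"))  # ~0.045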
Performance Across Diverse Page Types
To better understand the FD-RL model's capabilities across different document scenarios, the following table reports its text recognition performance (Edit Distance) on 9 document types. This category-wise analysis helps identify the model's strengths and weaknesses in handling diverse real-world documents.
| Model Type | Models | End to End | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | ✗ | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| | MinerU2-pipeline | ✗ | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | 0.0411 |
| | PP-StructureV3 | ✗ | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
| General VLMs | GPT-4o | ✓ | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.142 | 0.6254 | 0.2611 | 0.3343 |
| | InternVL3-76B | ✓ | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
| | InternVL3.5-241B | ✓ | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
| | Qwen2.5-VL-72B | ✓ | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.238 | 0.1232 | 0.0264 |
| | Gemini-2.5 Pro | ✓ | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
| Specialized VLMs | Dolphin | ✗ | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
| | MinerU2-VLM | ✗ | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
| | MonkeyOCR-pro-1.2B | ✗ | 0.0961 | 0.0354 | 0.053 | 0.111 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
| | MonkeyOCR-pro-3B | ✗ | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
| | MinerU2.5 | ✗ | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.054 | 0.1161 | 0.0104 |
| | OCRFlux | ✓ | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
| | Mistral-OCR | ✓ | 0.0917 | 0.0531 | 0.0610 | 0.1341 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
| | POINTS-Reader | ✓ | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
| | olmOCR-7B | ✓ | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.122 | 0.0459 |
| | Nanonets-OCR-s | ✓ | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
| | dots.ocr | ✓ | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
| | FD-RL | ✓ | 0.0235 | 0.0258 | 0.0300 | 0.0867 | 0.0464 | 0.0235 | 0.1069 | 0.0881 | 0.0091 |
🚀 Usage Example
Inference
Below is a simple example of how to use FD-RL for document parsing tasks.
Using 🤗 Transformers to Chat
Please first install transformers using the following command:
pip install "transformers>=4.57.0"
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/FD-RL'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/FD-RL_test.jpg"
# We recommend the following prompt for better performance, since it is the prompt used throughout training.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$
Online Serving
We suggest using vLLM for efficient FD-RL inference. Make sure to install vllm==0.11.0 or above for full compatibility.
For detailed instructions on serving and inference with multimodal models, visit the vLLM documentation.
pip install "vllm>=0.11.0"
- vLLM server
MODEL_PATH=path/to/your/model
TENSOR_PARALLEL_SIZE=depends/on/your/hardware
python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--host 0.0.0.0 \
--port 22002 \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
--max-num-seqs 80 \
--max-model-len 16384 \
--gpu-memory-utilization 0.9
- Image Request Example
from openai import OpenAI
MODEL_PATH="path/to/your/model"
FILE_PATH="path/to/your/image"
client = OpenAI(
    base_url="http://127.0.0.1:22002/v1",
    api_key="EMPTY"
)
response = client.chat.completions.create(
    model=MODEL_PATH,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"file://{FILE_PATH}"
                    }
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
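Note that, depending on your vLLM version and server configuration, the server may refuse to read local file:// URLs unless local media paths are explicitly allowed. A simple workaround, sketched below under that assumption, is to send the image as a base64 data URL instead.

import base64
from openai import OpenAI

MODEL_PATH = "path/to/your/model"
FILE_PATH = "path/to/your/image"

# Encode the image as a base64 data URL so no local file access is needed server-side.
with open(FILE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

client = OpenAI(base_url="http://127.0.0.1:22002/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model=MODEL_PATH,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)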
Fine-tuning
SFT
For users interested in further fine-tuning our model, we suggest using LLaMA-Factory. Complete installation and usage instructions are available in the official repository. Below is an example fine-tuning configuration:
PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}
# Set parameters
GPUS_PER_NODE=8 # Number of GPUs per node
NNODES=1 # Total number of nodes
NODE_RANK=0 # Rank of the current node (starts from 0)
MASTER_ADDR=localhost # IP address of the master node
MASTER_PORT=12345 # Port for communication between nodes
MODEL_DIR=/path/to/pretrained_model # Path to the pre-trained model to fine-tune
DATA=/name/of/your/dataset # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output # Directory to save fine-tuned results
# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
src/train.py \
--model_name_or_path "$MODEL_DIR" \
--stage sft \
--do_train True \
--finetuning_type full \
--dataset "$DATA" \
--template qwen3_vl_nothink \
--cutoff_len 8192 \
--preprocessing_num_workers 128 \
--preprocessing_batch_size 256 \
--dataloader_num_workers 128 \
--output_dir "$OUTPUT_DIR" \
--logging_steps 1 \
--save_steps 5000 \
--plot_loss True \
--save_only_model False \
--report_to none \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--bf16 True
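The dataset passed via --dataset must be registered in LLaMA-Factory's data/dataset_info.json. Below is a minimal sketch of preparing a custom OCR dataset in LLaMA-Factory's sharegpt-style multimodal format; the file name and dataset key (fd_rl_ocr) are illustrative, the entry should be merged into your existing dataset_info.json, and the fields should be verified against your LLaMA-Factory version.

import json

# One training record: the <image> placeholder marks where the page image is inserted.
record = {
    "messages": [
        {
            "role": "user",
            "content": "<image>Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML.",
        },
        {
            "role": "assistant",
            "content": "# Page title\n\nRecognized body text ...\n\n<table><tr><td>...</td></tr></table>",
        },
    ],
    "images": ["images/page_0001.jpg"],
}
with open("data/fd_rl_ocr.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)

# Dataset registration entry; merge this into data/dataset_info.json.
entry = {
    "fd_rl_ocr": {
        "file_name": "fd_rl_ocr.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}
print(json.dumps(entry, indent=2))

With the entry registered, setting DATA=fd_rl_ocr in the script above selects this dataset.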
📧 Contact
For any questions or collaboration inquiries, please don't hesitate to contact me at zhongyufeng21@mails.ucas.ac.cn.
🙏 Acknowledgement
We sincerely appreciate LLaMA-Factory for providing the reference training framework.
📖 Citation
If you find this project useful, please feel free to leave a star and cite our paper:
@article{zhong2025reading,
title={Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR},
author={Zhong, Yufeng and Chen, Lei and Zeng, Zhixiong and Zhao, Xuanle and Jiang, Deyang and Zheng, Liming and Huang, Jing and Qiu, Haibo and Shi, Peng and Yang, Siqi and others},
journal={arXiv preprint arXiv:2601.08834},
year={2025}
}