Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

📢 News and Updates

  • 2025.11.26 We have released our paper! You can read the PDF file here.
  • 2025.11.17 We uploaded our model weights FD-RL to Hugging Face.

🤗 Models

| Model | Download Link |
|-------|---------------|
| FD-RL | DocTron/FD-RL |

📚 Dataset

Overview of our training data construction pipeline. We construct the dataset via (1) open-source dataset quality enhancement, (2) real-world PDF construction, and (3) synthetic OCR data generation, yielding a large-scale dataset covering nine common document categories: notes, financial reports, slides, exam papers, synthetic data, magazines, academic papers, books, and newspapers.

🤖 Pipeline

Overview of the FD-RL training pipeline. Our approach comprises two stages: Stage 1 (SFT) trains FD-RL(SFT) on large-scale document data via supervised fine-tuning, freezing the vision encoder and projector while updating the large language model. Stage 2 (RL) introduces two innovations: (1) an entropy-based data filtration strategy that filters data through type-based filtering (enriching structured data and balancing languages) and entropy-based filtering (retaining high-entropy samples), and (2) format decoupled RL that applies a separate reward to each content type: a string matching reward for plain text, an expression correctness reward for formulas, and a structural coherence reward for tables.
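
To make the format decoupled reward concrete, here is a minimal sketch of how a per-sample reward could dispatch on content type. It is illustrative only: the function names and the specific similarity measures (a string-similarity ratio for plain text, whitespace-normalized exact match for formulas, and a tag-sequence comparison as a rough stand-in for a TEDS-style score for tables) are simplifying assumptions, not the exact reward implementations used in training.

from difflib import SequenceMatcher
import re

def text_reward(pred: str, ref: str) -> float:
    # String-matching reward for plain text: similarity ratio in [0, 1].
    return SequenceMatcher(None, pred, ref).ratio()

def formula_reward(pred: str, ref: str) -> float:
    # Expression-correctness reward for LaTeX formulas.
    # Placeholder: exact match after whitespace normalization.
    normalize = lambda s: "".join(s.split())
    return 1.0 if normalize(pred) == normalize(ref) else 0.0

def table_reward(pred: str, ref: str) -> float:
    # Structural-coherence reward for HTML tables.
    # Placeholder: compare the sequence of table/tr/td/th tags as a cheap
    # stand-in for a TEDS-style tree edit similarity.
    tags = lambda s: re.findall(r"</?(?:table|tr|td|th)\b", s.lower())
    pred_tags, ref_tags = tags(pred), tags(ref)
    if not pred_tags and not ref_tags:
        return 1.0
    return SequenceMatcher(None, pred_tags, ref_tags).ratio()

REWARD_FNS = {"text": text_reward, "formula": formula_reward, "table": table_reward}

def format_decoupled_reward(pred: str, ref: str, content_type: str) -> float:
    # Dispatch the reward by content type instead of applying a single
    # format-agnostic reward to every sample.
    return REWARD_FNS[content_type](pred, ref)

# Example usage
print(format_decoupled_reward("The quick brown fox", "The quick brown fax", "text"))

Decoupling the reward this way means a formula sample is never scored by a text-oriented string metric and a table sample is never scored as plain text, which is the core idea behind the format decoupled design.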

📊 Performance

FD-RL is evaluated on OmniDocBench v1.5, a comprehensive document OCR benchmark covering diverse real-world scenarios (e.g., office documents, academic papers, scanned materials). Results show that FD-RL achieves the best performance among end-to-end VLMs, demonstrating the effectiveness of our Format Decoupled RL approach.

End-to-End Evaluation

We conduct an end-to-end evaluation to measure the model's parsing accuracy on PDF page content. The prediction is derived from the model's Markdown output generated by parsing the entire PDF page. The Overall metric is defined as:

$$\text{Overall} = \frac{(1-\text{Text}^{\text{Edit Distance}}) \times 100 + \text{Formula}^{\text{CDM}} + \text{Table}^{\text{TEDS}}}{3}$$
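
For reference, the Overall score can be recomputed from its three components as in the short helper below. This is only an illustrative sketch; if the benchmark averages per-sample scores rather than combining the aggregate metrics, the value obtained from table-level numbers may differ slightly from the reported Overall.

def overall_score(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    # Overall = ((1 - Text Edit Distance) * 100 + Formula CDM + Table TEDS) / 3
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0

# Example with FD-RL's aggregate metrics from the table below
print(round(overall_score(0.049, 88.67, 87.35), 2))  # ~90.37
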
| Model Type | Methods | Release Date | End to End | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | ❌ | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
|  | Mineru2-pipeline | 2025 | ❌ | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
|  | PP-StructureV3 | 2025 | ❌ | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2023 | ✅ | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
|  | InternVL3-76B | 2025 | ✅ | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
|  | InternVL3.5-241B | 2025 | ✅ | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
|  | Qwen2.5-VL-72B | 2025 | ✅ | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
|  | Gemini-2.5 Pro | 2025 | ✅ | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | ❌ | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
|  | MinerU2-VLM | 2025.06 | ❌ | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
|  | MonkeyOCR-pro-1.2B | 2025.07 | ❌ | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
|  | MonkeyOCR-3B | 2025.07 | ❌ | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
|  | MonkeyOCR-pro-3B | 2025.07 | ❌ | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
|  | MinerU2.5 | 2025.09 | ❌ | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
|  | PaddleOCR-VL | 2025.10 | ❌ | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
|  | OCRFlux-3B | 2025.06 | ✅ | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
|  | Mistral OCR | 2025.03 | ✅ | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
|  | POINTS-Reader | 2025.08 | ✅ | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
|  | olmOCR-7B | 2025.02 | ✅ | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
|  | Nanonets-OCR-s | 2025.06 | ✅ | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
|  | Deepseek-OCR | 2025.10 | ✅ | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
|  | dots.ocr | 2025.07 | ✅ | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
|  | FD-RL | 2025.11 | ✅ | 4B | 90.41 | 0.049 | 88.67 | 87.35 | 92.10 | 0.055 |

Performance Across Diverse Page Types

To better understand the FD-RL model's capabilities across different document scenarios, the following table reports its text recognition performance (Edit Distance) on 9 document types. This category-wise analysis helps identify the model's strengths and weaknesses in handling diverse real-world documents.

| Model Type | Models | End to End | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | ❌ | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
|  | MinerU2-pipeline | ❌ | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | 0.0411 |
|  | PP-StructureV3 | ❌ | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
| General VLMs | GPT-4o | ✅ | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.142 | 0.6254 | 0.2611 | 0.3343 |
|  | InternVL3-76B | ✅ | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
|  | InternVL3.5-241B | ✅ | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
|  | Qwen2.5-VL-72B | ✅ | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.238 | 0.1232 | 0.0264 |
|  | Gemini-2.5 Pro | ✅ | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
| Specialized VLMs | Dolphin | ❌ | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
|  | MinerU2-VLM | ❌ | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
|  | MonkeyOCR-pro-1.2B | ❌ | 0.0961 | 0.0354 | 0.053 | 0.111 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
|  | MonkeyOCR-pro-3B | ❌ | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
|  | MinerU2.5 | ❌ | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.054 | 0.1161 | 0.0104 |
|  | OCRFlux | ✅ | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
|  | Mistral-OCR | ✅ | 0.0917 | 0.0531 | 0.0610 | 0.1341 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
|  | POINTS-Reader | ✅ | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
|  | olmOCR-7B | ✅ | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.122 | 0.0459 |
|  | Nanonets-OCR-s | ✅ | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
|  | dots.ocr | ✅ | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
|  | FD-RL | ✅ | 0.0235 | 0.0258 | 0.0300 | 0.0867 | 0.0464 | 0.0235 | 0.1069 | 0.0881 | 0.0091 |

🔍 Usage Example

Inference

Below is a simple example of how to use FD-RL for document parsing tasks.

Using 🤗 Transformers to Chat

Please first install transformers using the following command:

pip install "transformers>=4.57.0"
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/FD-RL'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/FD-RL_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$

Online Serving

We suggest using vLLM for efficient FD-RL inference. Make sure to install vLLM 0.11.0 or above for full compatibility. For detailed instructions on serving and inference with multimodal models, see the vLLM documentation.

pip install "vllm>=0.11.0"
  • vLLM server
MODEL_PATH=path/to/your/model
TENSOR_PARALLEL_SIZE=depends/on/your/hardware

python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --host 0.0.0.0 \
    --port 22002 \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --max-num-seqs 80 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9 
  • Image Request Example
from openai import OpenAI
 
MODEL_PATH="path/to/your/model"
FILE_PATH="path/to/your/image"

client = OpenAI(
    base_url="http://127.0.0.1:22002/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model=MODEL_PATH,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url":  f"file://{FILE_PATH}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Fine-tuning

SFT

For users interested in further fine-tuning our model, we suggest employing LLaMA-Factory. Complete installation and usage information is documented in the official repository. Below we present an example fine-tuning configuration:

PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}

# Set parameters
GPUS_PER_NODE=8                  # Number of GPUs per node
NNODES=1                         # Total number of nodes
NODE_RANK=0                      # Rank of the current node (starts from 0)
MASTER_ADDR=localhost            # IP address of the master node
MASTER_PORT=12345                # Port for communication between nodes

MODEL_DIR=/path/to/pretrained_model      # Path to the pre-trained base model
DATA=/name/of/your/dataset               # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output              # Directory to save fine-tuned results

# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
    src/train.py \
    --model_name_or_path "$MODEL_DIR" \
    --stage sft \
    --do_train True \
    --finetuning_type full \
    --dataset "$DATA" \
    --template qwen3_vl_nothink \
    --cutoff_len 8192 \
    --preprocessing_num_workers 128 \
    --preprocessing_batch_size 256 \
    --dataloader_num_workers 128 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_steps 5000 \
    --plot_loss True \
    --save_only_model False \
    --report_to none \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 True

📧 Contact

For any questions or collaboration inquiries, please don't hesitate to contact me at zhongyufeng21@mails.ucas.ac.cn.

📌 Acknowledgement

We sincerely appreciate LLaMA-Factory for providing the reference training framework.

📖 Citation

If you find this project useful, please feel free to leave a star and cite our paper:

@article{zhong2025reading,
  title={Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR},
  author={Zhong, Yufeng and Chen, Lei and Zeng, Zhixiong and Zhao, Xuanle and Jiang, Deyang and Zheng, Liming and Huang, Jing and Qiu, Haibo and Shi, Peng and Yang, Siqi and others},
  journal={arXiv preprint arXiv:2601.08834},
  year={2025}
}