Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

📢 News and Updates

  • 2025.11.26 We have released our paper! You can read the PDF file here.
  • 2025.11.17 We uploaded our model weights FD-RL to Hugging Face.

🤗 Models

| Model | Download Link |
|-------|---------------|
| FD-RL | DocTron/FD-RL |

📚 Dataset

Overview of our training data construction pipeline. We construct the dataset via (1) open-source dataset quality enhancement, (2) real-world PDF construction, and (3) synthetic OCR data generation, yielding a large-scale dataset covering nine common document categories: notes, financial reports, slides, exam papers, synthetic data, magazines, academic papers, books, and newspapers.

🤖 Pipeline

Overview of the FD-RL training pipeline. Our approach comprises two stages: Stage 1 (SFT) trains FD-RL(SFT) on large-scale document data via supervised fine-tuning, freezing the vision encoder and projector while updating the large language model. Stage 2 (RL) introduces two innovations: (1) an entropy-based data filtration strategy that filters data through type-based filtering (enriching structured data and balancing languages) and entropy-based filtering (retaining high-entropy samples), and (2) format decoupled RL that applies a separate reward to each content type: a string matching reward for plain text, an expression correctness reward for formulas, and a structural coherence reward for tables.
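
To make the format decoupled reward concrete, here is a minimal sketch of how a per-sample reward could dispatch on content type. It is illustrative only: the function names and the specific similarity measures (a string-similarity ratio for plain text, whitespace-normalized exact match for formulas, and a tag-sequence comparison as a rough stand-in for a TEDS-style score for tables) are simplifying assumptions, not the exact reward implementations used in training.

from difflib import SequenceMatcher
import re

def text_reward(pred: str, ref: str) -> float:
    # String-matching reward for plain text: similarity ratio in [0, 1].
    return SequenceMatcher(None, pred, ref).ratio()

def formula_reward(pred: str, ref: str) -> float:
    # Expression-correctness reward for LaTeX formulas.
    # Placeholder: exact match after whitespace normalization.
    normalize = lambda s: "".join(s.split())
    return 1.0 if normalize(pred) == normalize(ref) else 0.0

def table_reward(pred: str, ref: str) -> float:
    # Structural-coherence reward for HTML tables.
    # Placeholder: compare the sequence of table/tr/td/th tags as a cheap
    # stand-in for a TEDS-style tree edit similarity.
    tags = lambda s: re.findall(r"</?(?:table|tr|td|th)\b", s.lower())
    pred_tags, ref_tags = tags(pred), tags(ref)
    if not pred_tags and not ref_tags:
        return 1.0
    return SequenceMatcher(None, pred_tags, ref_tags).ratio()

REWARD_FNS = {"text": text_reward, "formula": formula_reward, "table": table_reward}

def format_decoupled_reward(pred: str, ref: str, content_type: str) -> float:
    # Dispatch the reward by content type instead of applying a single
    # format-agnostic reward to every sample.
    return REWARD_FNS[content_type](pred, ref)

# Example usage
print(format_decoupled_reward("The quick brown fox", "The quick brown fax", "text"))

Decoupling the reward this way means a formula sample is never scored by a text-oriented string metric and a table sample is never scored as plain text, which is the core idea behind the format decoupled design.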

📊 Performance

FD-RL is evaluated on OmniDocBench v1.5, a comprehensive document OCR benchmark covering diverse real-world scenarios (e.g., office documents, academic papers, scanned materials). Results show that FD-RL achieves the best performance among end-to-end VLMs, demonstrating the effectiveness of our Format Decoupled RL approach.

End-to-End Evaluation

We conduct an end-to-end evaluation to measure the model's parsing accuracy on PDF page content. The prediction is derived from the model's Markdown output generated by parsing the entire PDF page. The Overall metric is defined as:

$$\text{Overall} = \frac{(1-\text{Text}^{\text{Edit Distance}}) \times 100 + \text{Formula}^{\text{CDM}} + \text{Table}^{\text{TEDS}}}{3}$$
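
For reference, the Overall score can be recomputed from its three components as in the short helper below. This is only an illustrative sketch; if the benchmark averages per-sample scores rather than combining the aggregate metrics, the value obtained from table-level numbers may differ slightly from the reported Overall.

def overall_score(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    # Overall = ((1 - Text Edit Distance) * 100 + Formula CDM + Table TEDS) / 3
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0

# Example with FD-RL's aggregate metrics from the table below
print(round(overall_score(0.049, 88.67, 87.35), 2))  # ~90.37
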
| Model Type | Methods | Release Date | End to End | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | ❌ | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
|  | Mineru2-pipeline | 2025 | ❌ | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
|  | PP-StructureV3 | 2025 | ❌ | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2023 | ✅ | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
|  | InternVL3-76B | 2025 | ✅ | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
|  | InternVL3.5-241B | 2025 | ✅ | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
|  | Qwen2.5-VL-72B | 2025 | ✅ | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
|  | Gemini-2.5 Pro | 2025 | ✅ | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | ❌ | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
|  | MinerU2-VLM | 2025.06 | ❌ | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
|  | MonkeyOCR-pro-1.2B | 2025.07 | ❌ | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
|  | MonkeyOCR-3B | 2025.07 | ❌ | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
|  | MonkeyOCR-pro-3B | 2025.07 | ❌ | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
|  | MinerU2.5 | 2025.09 | ❌ | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
|  | PaddleOCR-VL | 2025.10 | ❌ | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
|  | OCRFlux-3B | 2025.06 | ✅ | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
|  | Mistral OCR | 2025.03 | ✅ | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
|  | POINTS-Reader | 2025.08 | ✅ | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
|  | olmOCR-7B | 2025.02 | ✅ | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
|  | Nanonets-OCR-s | 2025.06 | ✅ | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
|  | Deepseek-OCR | 2025.10 | ✅ | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
|  | dots.ocr | 2025.07 | ✅ | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
|  | FD-RL | 2025.11 | ✅ | 4B | 90.41 | 0.049 | 88.67 | 87.35 | 92.10 | 0.055 |

Performance Across Diverse Page Types

To better understand the FD-RL model's capabilities across different document scenarios, the following table reports its text recognition performance (Edit Distance) on 9 document types. This category-wise analysis helps identify the model's strengths and weaknesses in handling diverse real-world documents.

| Model Type | Models | End to End | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | ❌ | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
|  | MinerU2-pipeline | ❌ | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | 0.0411 |
|  | PP-StructureV3 | ❌ | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
| General VLMs | GPT-4o | ✅ | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.142 | 0.6254 | 0.2611 | 0.3343 |
|  | InternVL3-76B | ✅ | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
|  | InternVL3.5-241B | ✅ | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
|  | Qwen2.5-VL-72B | ✅ | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.238 | 0.1232 | 0.0264 |
|  | Gemini-2.5 Pro | ✅ | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
| Specialized VLMs | Dolphin | ❌ | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
|  | MinerU2-VLM | ❌ | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
|  | MonkeyOCR-pro-1.2B | ❌ | 0.0961 | 0.0354 | 0.053 | 0.111 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
|  | MonkeyOCR-pro-3B | ❌ | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
|  | MinerU2.5 | ❌ | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.054 | 0.1161 | 0.0104 |
|  | OCRFlux | ✅ | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
|  | Mistral-OCR | ✅ | 0.0917 | 0.0531 | 0.0610 | 0.1341 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
|  | POINTS-Reader | ✅ | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
|  | olmOCR-7B | ✅ | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.122 | 0.0459 |
|  | Nanonets-OCR-s | ✅ | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
|  | dots.ocr | ✅ | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
|  | FD-RL | ✅ | 0.0235 | 0.0258 | 0.0300 | 0.0867 | 0.0464 | 0.0235 | 0.1069 | 0.0881 | 0.0091 |

🔍 Usage Example

Inference

Below is a simple example of how to use FD-RL for document parsing tasks.

Using 🤗 Transformers to Chat

Please first install transformers using the following command:

pip install "transformers>=4.57.0"
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/FD-RL'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/FD-RL_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$

Online Serving

We suggest using vLLM for efficient FD-RL inference. Make sure to install vLLM 0.11.0 or above for full compatibility. For detailed instructions on serving and inference with multimodal models, see the vLLM documentation.

pip install "vllm>=0.11.0"
  • vLLM server
MODEL_PATH=path/to/your/model
TENSOR_PARALLEL_SIZE=depends/on/your/hardware

python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --host 0.0.0.0 \
    --port 22002 \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --max-num-seqs 80 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9 
  • Image Request Example
from openai import OpenAI
 
MODEL_PATH="path/to/your/model"
FILE_PATH="path/to/your/image"

client = OpenAI(
    base_url="http://127.0.0.1:22002/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model=MODEL_PATH,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url":  f"file://{FILE_PATH}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Fine-tuning

SFT

For users interested in further fine-tuning our model, we suggest employing LLaMA-Factory. Complete installation and usage information is documented in the official repository. Below we present an example fine-tuning configuration:

PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}

# Set parameters
GPUS_PER_NODE=8                  # Number of GPUs per node
NNODES=1                         # Total number of nodes
NODE_RANK=0                      # Rank of the current node (starts from 0)
MASTER_ADDR=localhost            # IP address of the master node
MASTER_PORT=12345                # Port for communication between nodes

MODEL_DIR=/path/to/pretrained_model      # Path to the pre-trained base model
DATA=/name/of/your/dataset               # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output              # Directory to save fine-tuned results

# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
    src/train.py \
    --model_name_or_path "$MODEL_DIR" \
    --stage sft \
    --do_train True \
    --finetuning_type full \
    --dataset "$DATA" \
    --template qwen3_vl_nothink \
    --cutoff_len 8192 \
    --preprocessing_num_workers 128 \
    --preprocessing_batch_size 256 \
    --dataloader_num_workers 128 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_steps 5000 \
    --plot_loss True \
    --save_only_model False \
    --report_to none \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 True

📧 Contact

For any questions or collaboration inquiries, please don't hesitate to contact me at zhongyufeng21@mails.ucas.ac.cn.

📌 Acknowledgement

We sincerely appreciate LLaMA-Factory for providing the reference training framework.

📖 Citation

If you find this project useful, please feel free to leave a star and cite our paper:

@article{zhong2025reading,
  title={Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR},
  author={Zhong, Yufeng and Chen, Lei and Zeng, Zhixiong and Zhao, Xuanle and Jiang, Deyang and Zheng, Liming and Huang, Jing and Qiu, Haibo and Shi, Peng and Yang, Siqi and others},
  journal={arXiv preprint arXiv:2601.08834},
  year={2025}
}