INF-MLLM

May 11, 2026

Introduction

INF-MLLM is a series of open-source multimodal large language models developed by INF Tech. This repository contains the code, models, and documentation for our projects, which aim to advance the state of the art in visual-language understanding and document intelligence. We are committed to open research and have released our models and datasets to the community to foster collaboration and innovation.

Updates

Models

Here is a brief overview of the models available in this repository. For more details, please refer to the respective project directories.

Infinity-Parser2

Infinity-Parser2 is our latest flagship document parsing model. It comes in two variants: Infinity-Parser2-Pro, optimized for maximum accuracy, and Infinity-Parser2-Flash, engineered for high-speed inference (3.68x faster than Infinity-Parser-7B). Infinity-Parser2 is built on an upgraded data engine supporting nearly 5 million diverse document samples and a novel multi-task reinforcement learning framework with joint verification rewards. It achieves state-of-the-art results on olmOCR-Bench (87.6%) and ParseBench (74.3%), surpassing frontier models including DeepSeek-OCR-2, PaddleOCR-VL-1.5, and MinerU-2.5.

Infinity-Parser

Infinity-Parser is an end-to-end scanned document parsing model trained with reinforcement learning. It is designed to maintain the original document's structure and content with high fidelity by incorporating verifiable rewards based on layout and content. Infinity-Parser demonstrates state-of-the-art performance on various benchmarks for text recognition, table and formula extraction, and reading-order detection.
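To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It is not the reward used to train Infinity-Parser: the function names, the paragraph-count layout check, and the equal weighting are all assumptions chosen for illustration. It only shows how a parsed output can be scored automatically against a reference on both content and layout.

```python
from difflib import SequenceMatcher


def content_reward(predicted: str, reference: str) -> float:
    """Text similarity in [0, 1] between the parsed output and the reference."""
    if not predicted and not reference:
        return 1.0
    return SequenceMatcher(None, predicted, reference).ratio()


def layout_reward(predicted: str, reference: str) -> float:
    """Score in [0, 1] for how closely the paragraph counts match.

    Assumption: paragraphs are separated by blank lines. A real layout
    check would look at far more than the number of blocks.
    """
    n_pred = len([p for p in predicted.split("\n\n") if p.strip()])
    n_ref = len([p for p in reference.split("\n\n") if p.strip()])
    if n_ref == 0:
        return 1.0 if n_pred == 0 else 0.0
    return max(0.0, 1.0 - abs(n_pred - n_ref) / n_ref)


def parsing_reward(predicted: str, reference: str, w_content: float = 0.5) -> float:
    """Weighted sum of the two rewards; the 0.5 default is an arbitrary choice."""
    return (w_content * content_reward(predicted, reference)
            + (1.0 - w_content) * layout_reward(predicted, reference))


if __name__ == "__main__":
    reference = "Title\n\nFirst paragraph.\n\nSecond paragraph."
    merged = "Title First paragraph. Second paragraph."
    print(parsing_reward(reference, reference))  # exact match scores 1.0
    print(parsing_reward(merged, reference))     # lost structure scores lower
```

Because both components are computed from the output and a reference alone, a reward of this kind can be checked without a learned judge, which is the property the term "verifiable" refers to.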

VL-Rethinker

VL-Rethinker is a project designed to incentivize the self-reflection capabilities of Vision-Language Models (VLMs) through reinforcement learning. The research introduces a novel technique called Selective Sample Replay (SSR) to enhance the GRPO algorithm, addressing the "vanishing advantages" problem. It also employs "Forced Rethinking" to explicitly guide the model through a self-reflection reasoning step. By combining these methods, VL-Rethinker significantly advances the state of the art on multiple vision-language benchmarks, including MathVista, MathVerse, and MathVision.
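The "vanishing advantages" problem can be illustrated with a short Python sketch. It assumes the common GRPO formulation in which each response's advantage is its reward normalized within the group sampled for the same prompt. The `ReplayBuffer` class, its `alpha` parameter, and the sampling rule weighted by absolute advantage are hypothetical stand-ins for a selective replay mechanism, not the VL-Rethinker implementation.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards normalized within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical buffer that stores only samples with non-zero advantage."""

    def __init__(self, alpha=1.0, seed=0):
        self.alpha = alpha      # assumed knob: how strongly to favor large |advantage|
        self.items = []         # list of (sample, advantage) pairs
        self.rng = random.Random(seed)

    def add_group(self, samples, advantages, tol=1e-8):
        for sample, adv in zip(samples, advantages):
            if abs(adv) > tol:  # zero-advantage samples carry no learning signal
                self.items.append((sample, adv))

    def sample(self, k):
        """Draw k stored samples with probability proportional to |advantage|**alpha."""
        if not self.items:
            return []
        weights = [abs(adv) ** self.alpha for _, adv in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)


if __name__ == "__main__":
    # Every response is correct: all advantages are zero, so the group gives no gradient.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
    # Mixed outcomes: the correct response gets a positive advantage, the rest negative.
    mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
    buffer = ReplayBuffer()
    buffer.add_group(["r0", "r1", "r2", "r3"], mixed)
    print(buffer.sample(2))
```

When a group's rewards are all identical, every advantage is zero and the batch contributes nothing to the update. Replaying stored samples that still carry signal is one way to keep the effective batch informative as more groups saturate during training.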

INF-MLLM2

INF-MLLM2 is an advanced multimodal model with significant improvements in high-resolution image processing and document understanding. It supports dynamic image resolutions up to 1344x1344 pixels and features enhanced OCR capabilities for robust document parsing, table and formula recognition, and key information extraction.
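As a rough illustration of working within a resolution ceiling, the sketch below computes a target size that fits inside 1344x1344 while preserving aspect ratio. Only the 1344 bound comes from the description above. The helper name `fit_within` and the choice to leave smaller images unchanged are assumptions, and the model's actual preprocessing may differ.

```python
MAX_SIDE = 1344  # upper bound on each side, as stated in the model description


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return a (width, height) that fits inside max_side x max_side.

    The aspect ratio is preserved. Images already within the bound are
    returned unchanged (an assumption: no upscaling is performed).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(800, 600))    # already fits, so the size is unchanged
    print(fit_within(2688, 1344))  # the wider side is scaled down to 1344
```

A caller could apply the returned size with any image library before passing the image to the model.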

INF-MLLM1

INF-MLLM1 is a unified model for a wide range of visual-language tasks. It is designed to handle both multitask and instruction-tuning scenarios, demonstrating strong performance on various VQA and visual grounding datasets.