TextHawk: 🥇 LVLM with 16x Compression Ratio

November 1, 2024


Base Models

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

GUI Agents

UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents

Introduction

The TextHawk series is a family of Large Vision-Language Models (LVLMs) designed for highly efficient fine-grained perception. Notably, TextHawk is the first LVLM to achieve a 16x token compression ratio. This is made possible by integrating four key components:

  • Scalable Positional Embeddings (SPEs)
  • Query Proposal Network (QPN)
  • ReSampling and ReArrangement (ReSA)
  • Multi-Level Cross-Attention (MLCA)
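To make the 16x ratio concrete, here is a minimal arithmetic sketch of the visual token budget. It is not the TextHawk implementation: the function name `visual_token_count` and the patch size of 14 pixels are assumptions chosen for illustration, and only the 16x ratio comes from the text above.

```python
import math


def visual_token_count(height, width, patch_size=14, compression=16):
    """Return (patch tokens before compression, tokens after compression).

    patch_size=14 is an assumed ViT patch size, not a value stated in this README.
    compression=16 is the ratio quoted for TextHawk.
    """
    # A ViT-style encoder emits one token per image patch.
    patches = math.ceil(height / patch_size) * math.ceil(width / patch_size)
    # A 16x compression keeps one token for every 16 patch tokens.
    compressed = math.ceil(patches / compression)
    return patches, compressed


raw, kept = visual_token_count(448, 448)
print(raw, kept)  # 1024 64
```

Under these assumptions, a 448x448 input yields 1024 patch tokens, of which only 64 would be passed on to the language model.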

[Figure: architecture]

Building on the same architecture, TextHawk2 enhances performance by leveraging greater data diversity and reinforcing the visual encoder. This iteration achieves state-of-the-art results across multiple benchmarks, excelling in tasks related to general multimodal understanding, Optical Character Recognition (OCR), and visual grounding.

For instance, TextHawk2 reaches 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.
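For readers unfamiliar with the two less common metrics above, the following is a minimal sketch of how ANLS (Average Normalized Levenshtein Similarity) and accuracy@0.5 are commonly computed. It is not the evaluation code behind the reported numbers. The function names, the lowercasing and whitespace stripping, the `(x1, y1, x2, y2)` box format, and the use of `>=` at the IoU threshold are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best similarity against any accepted answer.

    references[i] is a list of accepted answers for predictions[i].
    A normalized distance at or above the threshold scores 0.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()  # assumed normalization
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

ANLS gives partial credit for near-miss OCR answers, while accuracy@0.5 counts a grounding prediction as correct only when its box overlaps the ground truth sufficiently.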

[Figure: compression]

The TextHawk series can compress several times more words, displayed on a small image where each character measures under 8 pixels, into a few tokens while still allowing accurate recovery. It's reminiscent of the futuristic gadgets in the Doraemon anime.

[Figure: examples]

DocGemini

We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:

  • A brief summary of the document topics.
  • Short QA pairs, up to 10.
  • Insights behind each answer.
  • [Optional] An imaginary conversation between two researchers.

DocGemini consists of 30K images and 195K QA pairs with insights.
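The sample layout described above can be pictured as a small record type. This is a hypothetical sketch only: the class names, field names, and the file path in the usage example are invented for illustration and do not reflect the actual DocGemini file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_QA_PAIRS = 10  # "up to 10" short QA pairs per sample, per the description above


@dataclass
class QAPair:
    question: str
    answer: str
    insight: str  # the reasoning behind the answer


@dataclass
class DocGeminiSample:
    image_path: str  # hypothetical field: location of the document image
    summary: str     # brief summary of the document topics
    qa_pairs: List[QAPair] = field(default_factory=list)
    conversation: Optional[List[str]] = None  # optional dialogue turns

    def __post_init__(self):
        # Enforce the stated upper bound on QA pairs.
        if len(self.qa_pairs) > MAX_QA_PAIRS:
            raise ValueError(f"expected at most {MAX_QA_PAIRS} QA pairs")


sample = DocGeminiSample(
    image_path="docs/example.png",  # placeholder path
    summary="Quarterly revenue chart for a fictional company.",
    qa_pairs=[QAPair("Which quarter is highest?", "Q3", "The Q3 bar is the tallest.")],
)
print(len(sample.qa_pairs))  # 1
```

Keeping the conversation optional mirrors the "[Optional]" marker in the list, so samples without a dialogue remain valid.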

| Dataset | QA | Conversation |
| --- | --- | --- |
| DocVQA | link | link |
| ChartQA | link | link |
| InfoVQA | link | link |

Note: Alternatively, you can produce data on your own using the scripts we provide.

Benchmarks

[Figure: ocr]

[Figure: grounding]

[Figure: proprietary]

TextHawk
| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Donut | Swin-B (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| Pix2Struct | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| InternLM-XC | EVA-G (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| LLaVA-1.5-7B | CLIP-L (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| Shikra-7B | CLIP-L (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| Qwen-VL-Chat | CLIP-G (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| Monkey | CLIP-G (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| UReader | CLIP-L (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| TextMonkey | CLIP-G (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| TextHawk* | SigLIP-SO (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| TextHawk | SigLIP-SO (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |

Note: TextHawk* is fine-tuned without DocGemini.

Visualization

[Figure: markdown]

[Figure: reg]

BibTeX

@article{yu24texthawk2,
  author       = {Ya{-}Qi Yu and Minghui Liao and Jiwen Zhang and Jihao Wu},
  title        = {TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens},
  journal      = {CoRR},
  volume       = {abs/2410.05261},
  year         = {2024}
}
@article{yu24texthawk,
  author       = {Ya{-}Qi Yu and Minghui Liao and Jihao Wu and Yongxin Liao and Xiaoyu Zheng and Wei Zeng},
  title        = {TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2404.09204},
  year         = {2024}
}
@article{zhang24uihawk,
  title        = {{UI-Hawk}: Unleashing the Screen Stream Understanding for GUI Agents},
  author       = {Jiwen Zhang and Yaqi Yu and Minghui Liao and Wentao Li and Jihao Wu and Zhongyu Wei},
  journal      = {Preprints},
  volume       = {manuscript/202408.2137},
  year         = {2024}
}