Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
November 28, 2024 ยท View on GitHub
TGDoc: Text-Grounding Document Understanding with MLLMs
๐ Overview
TGDoc is a model that enhances document understanding by implementing text-grounding capabilities in Multimodal Large Language Models (MLLMs). Our approach improves the ability to process and comprehend document-based information.
๐ ๏ธ Quick Start
Prerequisites
This project builds upon LLaVA. Set up your environment:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
pip install flash-attn --no-build-isolation
Dependencies
deepspeed==0.9.5
peft==0.4.0
transformers==4.31.0
accelerate==0.21.0
bitsandbytes==0.41.0
Resources
Pretrained Models
| Model Name | Download Link | Access Code |
|---|---|---|
| tgdoc-7b-finetune-336 | Download | xhif |
Dataset
- Complete dataset: Download (Access Code: gxqt)
- Includes LLaVA and LLaVAR datasets
Dataset Examples
-
PPT Data Examples:

-
CC3M Finetuning Examples:

Usage
Training
- Configure settings in
llava/data/config.py - Select appropriate training script:
bash scripts/pretrain.shandbash scripts/finetune.sh
Inference
python test.py
We use MultimodalOCR for validation, appending each question with: "Support your reasoning with the coordinates [xmin, ymin, xmax, ymax]"
Results
Qualitative Examples

Citation
@article{wang2023towards,
title={Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs},
author={Wang, Yonghui and Zhou, Wengang and Feng, Hao and Zhou, Keyi and Li, Houqiang},
journal={arXiv preprint arXiv:2311.13194},
year={2023}
}
Acknowledgments
This project builds upon LLaVA. We thank the original authors for their great work.