MMTIT-Bench

April 13, 2026

A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation

Chinese version · Paper · GitHub · HuggingFace

Overview

MMTIT-Bench is a human-verified benchmark for end-to-end Text-Image Machine Translation (TIMT). It contains 1,400 images spanning 14 non-English and non-Chinese languages across diverse real-world scenarios, with bilingual (Chinese & English) translation annotations.

We also propose CPR-Trans (Cognition–Perception–Reasoning for Translation), a reasoning-oriented data paradigm that unifies scene cognition, text perception, and translation reasoning within a structured chain-of-thought framework.

Figure: MMTIT-Bench overview.

Benchmark Statistics

| Item | Details |
| --- | --- |
| Total Images | 1,400 |
| Languages | 14 (AR, DE, ES, FR, ID, IT, JA, KO, MS, PT, RU, TH, TR, VI) |
| Translation Directions | Other→Chinese, Other→English |
| Scenarios | Documents, Menus, Books, Attractions, Posters, Commodities, etc. |
| Annotation | Human-verified OCR + Bilingual translations |

Data Format

Directory Structure

MMTIT-Bench/
├── README.md
├── README_ZH.md
├── annotation.jsonl        # Benchmark annotations
├── images.zip              # Benchmark images
├── eval_comet_demo.py      # COMET evaluation script
└── prediction_demo.jsonl   # Example prediction file

Annotation (annotation.jsonl)

Each line is a JSON object:

{
    "image_id": "Korea_Menu_20843.jpg",
    "parsing_anno": "멜로우스트리트\n\n위치: 서울특별시 관악구...",
    "translation_zh": "梅尔街\n\n位置:首尔特别市 冠岳区...",
    "translation_en": "Mellow Street\n\nLocation: 1st Floor, 104 Gwanak-ro..."
}

| Field | Description |
| --- | --- |
| image_id | Image filename, formatted as {Language}_{Scenario}_{ID}.jpg |
| parsing_anno | OCR text parsing annotation (source language) |
| translation_zh | Chinese translation |
| translation_en | English translation |
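
As a rough illustration of the annotation format above, the following sketch reads JSONL records and splits `image_id` into its language, scenario, and ID parts. The helper names (`parse_jsonl`, `split_image_id`, `count_by_language`) are hypothetical and not part of the repository. The sketch also assumes that the language and ID fields contain no underscores.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_image_id(image_id: str) -> tuple[str, str, str]:
    """Split '{Language}_{Scenario}_{ID}.jpg' into its three parts.

    Assumption: Language and ID contain no underscore, so any extra
    underscores are treated as part of the scenario name.
    """
    stem = image_id.rsplit(".", 1)[0]          # drop the file extension
    language, rest = stem.split("_", 1)        # first field is the language
    scenario, sample_id = rest.rsplit("_", 1)  # last field is the ID
    return language, scenario, sample_id


def count_by_language(records: Iterable[dict]) -> Counter:
    """Count annotation records per language prefix of image_id."""
    return Counter(split_image_id(r["image_id"])[0] for r in records)
```

To get a per-language count for the benchmark, open `annotation.jsonl` with `encoding="utf-8"` and pass the file object through `parse_jsonl` into `count_by_language`.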

Prediction File

Your prediction file should be a JSONL file with one JSON object per line, each containing the following fields:

{"image_id": "Korea_Menu_20843.jpg", "pred": "Your model's translation output"}
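
Before running the evaluation, it may help to confirm that the prediction file lines up with the annotations. The sketch below is a hypothetical pre-check, not part of the repository. It reports `image_id` values that are missing, unknown, or duplicated, and counts rows lacking a string `image_id` or `pred`. How `eval_comet_demo.py` itself handles such cases is not stated here.

```python
import json
from typing import Iterable


def check_predictions(pred_lines: Iterable[str], anno_lines: Iterable[str]) -> dict:
    """Compare the image_ids in a prediction file against the annotations."""
    anno_ids = {json.loads(line)["image_id"] for line in anno_lines if line.strip()}
    seen, duplicates, malformed = set(), set(), 0
    for line in pred_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        image_id = row.get("image_id") if isinstance(row, dict) else None
        pred = row.get("pred") if isinstance(row, dict) else None
        # A usable row needs both fields as strings.
        if not isinstance(image_id, str) or not isinstance(pred, str):
            malformed += 1
            continue
        if image_id in seen:
            duplicates.add(image_id)
        seen.add(image_id)
    return {
        "missing": sorted(anno_ids - seen),   # annotated but not predicted
        "unknown": sorted(seen - anno_ids),   # predicted but not annotated
        "duplicates": sorted(duplicates),
        "malformed": malformed,
    }
```

Both arguments accept any iterable of lines, so open file objects for `your_prediction.jsonl` and `annotation.jsonl` can be passed directly.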

Evaluation

We use COMET (Unbabel/wmt22-comet-da), a learned reference-based metric, for automatic evaluation.

Install

pip install unbabel-comet

Run

# Other → Chinese
python eval_comet_demo.py \
    --prediction your_prediction.jsonl \
    --annotation annotation.jsonl \
    --direction other2zh \
    --batch_size 16 --gpus 0

# Other → English
python eval_comet_demo.py \
    --prediction your_prediction.jsonl \
    --annotation annotation.jsonl \
    --direction other2en \
    --batch_size 16 --gpus 1

Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --prediction | (required) | Path to your prediction JSONL |
| --annotation | annotation.jsonl | Path to benchmark annotations |
| --direction | (required) | other2zh or other2en |
| --batch_size | 16 | Batch size for inference |
| --gpus | 0 | Number of GPUs (0 = CPU) |
| --output | comet_results_{direction}.jsonl | Output path for per-sample scores |
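
The internals of `eval_comet_demo.py` are not shown in this README. The following is only a minimal sketch of how COMET scoring over these files could look, using the `unbabel-comet` Python API. The helper names and the `REF_FIELD` mapping are hypothetical, and using `parsing_anno` as the COMET source text is an assumption, not something the README confirms.

```python
import json
from typing import Iterable

# Assumed mapping from --direction to the reference field in annotation.jsonl.
REF_FIELD = {"other2zh": "translation_zh", "other2en": "translation_en"}


def build_comet_inputs(pred_lines: Iterable[str], anno_lines: Iterable[str],
                       direction: str) -> list[dict]:
    """Pair each prediction with its source text and reference translation."""
    ref_field = REF_FIELD[direction]
    annos = {}
    for line in anno_lines:
        if line.strip():
            row = json.loads(line)
            annos[row["image_id"]] = row
    samples = []
    for line in pred_lines:
        if not line.strip():
            continue
        pred = json.loads(line)
        anno = annos[pred["image_id"]]
        samples.append({
            "src": anno["parsing_anno"],  # assumption: OCR text serves as the source
            "mt": pred["pred"],
            "ref": anno[ref_field],
        })
    return samples


def score_with_comet(samples: list[dict], batch_size: int = 16, gpus: int = 0):
    """Score samples with Unbabel/wmt22-comet-da (downloads the checkpoint)."""
    # Imported lazily so the pairing helper works without unbabel-comet installed.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    output = model.predict(samples, batch_size=batch_size, gpus=gpus)
    return output.system_score, output.scores
```

Here `system_score` is the corpus-level average and `scores` holds one value per sample, in the same order as the prediction file.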

Citation

@misc{li2026mmtitbench,
      title={MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation},
      author={Gengluo Li and Chengquan Zhang and Yupu Liang and Huawen Shen and Yaping Zhang and Pengyuan Lyu and Weinong Wang and Xingyu Wan and Gangyan Zeng and Han Hu and Can Ma and Yu Zhou},
      year={2026},
      journal={arXiv preprint arXiv:2603.23896},
      url={https://arxiv.org/abs/2603.23896},
}

License

This benchmark is released for research purposes only.