TextEdit: A High-Quality, Multi-Scenario Text Editing Benchmark for Generation Models

March 16, 2026 · View on GitHub


Danni Yang, Sitao Chen, Changyao Tian

If you find our work helpful, please give us a ⭐ or cite our paper. See the InternVL-U technical report appendix for more details.

🎉 News

  • [2026/03/06] TextEdit benchmark released. Try it out and see how well your model performs on challenging text editing tasks!
  • [2026/03/06] Evaluation code released.
  • [2026/03/06] Leaderboard updated with latest models.

๐Ÿ“ Todo

We will release an enhanced version of the TextEdit benchmark next month. Stay tuned!

📖 Introduction

Text editing is a fundamental yet challenging capability for modern image generation and editing models. An increasing number of powerful multimodal generation models, such as Qwen-Image and Nano-Banana-Pro, are emerging with strong text rendering and editing capabilities. Unlike general image editing, text editing requires:
  • Precise spatial alignment
  • Font and style consistency
  • Background preservation
  • Layout-constrained reasoning

We introduce TextEdit, a high-quality, multi-scenario benchmark designed to evaluate fine-grained text editing capabilities in image generation models.

TextEdit covers a diverse set of real-world and virtual scenarios, spanning 18 subcategories with a total of 2,148 high-quality source images and manually annotated edited ground-truth images.

To comprehensively assess model performance, we combine classic OCR and image-fidelity metrics with modern multimodal-LLM-based evaluation across target accuracy, text preservation, scene integrity, local realism, and visual coherence, forming a dual-track evaluation protocol.

Our goal is to provide a standardized, realistic, and scalable benchmark for text editing research.


🎨 Visualization Output Example

๐Ÿ† LeadBoard

📊 Full Benchmark Results

Classic metrics, Real split:

| Models | # Params | OA | OP | OR | F1 | NED | CLIP | AES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.75 | 0.68 | 0.66 | 0.67 | 0.71 | 0.75 | 5.72 |
| GPT-Image-1.5 | - | 0.74 | 0.69 | 0.67 | 0.68 | 0.68 | 0.75 | 5.78 |
| Nano Banana Pro | - | 0.77 | 0.72 | 0.70 | 0.71 | 0.72 | 0.75 | 5.79 |
| Unified Models | | | | | | | | |
| Lumina-DiMOO | 8B | 0.22 | 0.23 | 0.19 | 0.20 | 0.19 | 0.69 | 5.53 |
| Ovis-U1 | 2.4B+1.2B | 0.40 | 0.37 | 0.34 | 0.35 | 0.35 | 0.72 | 5.32 |
| BAGEL | 7B+7B | 0.60 | 0.59 | 0.53 | 0.55 | 0.55 | 0.74 | 5.71 |
| InternVL-U | 2B+1.7B | 0.77 | 0.73 | 0.70 | 0.71 | 0.72 | 0.75 | 5.70 |

Classic metrics, Virtual split:

| Models | # Params | OA | OP | OR | F1 | NED | CLIP | AES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.78 | 0.75 | 0.73 | 0.74 | 0.75 | 0.81 | 5.21 |
| GPT-Image-1.5 | - | 0.73 | 0.72 | 0.71 | 0.71 | 0.70 | 0.80 | 5.28 |
| Nano Banana Pro | - | 0.80 | 0.78 | 0.77 | 0.78 | 0.78 | 0.81 | 5.28 |
| Unified Models | | | | | | | | |
| Lumina-DiMOO | 8B | 0.22 | 0.25 | 0.21 | 0.22 | 0.20 | 0.72 | 4.76 |
| Ovis-U1 | 2.4B+1.2B | 0.37 | 0.40 | 0.38 | 0.39 | 0.33 | 0.75 | 4.66 |
| BAGEL | 7B+7B | 0.57 | 0.60 | 0.56 | 0.57 | 0.54 | 0.78 | 5.19 |
| InternVL-U | 2B+1.7B | 0.79 | 0.77 | 0.75 | 0.75 | 0.77 | 0.80 | 5.12 |

VLM-based metrics, Real split:

| Models | # Params | TA | TP | SI | LR | VC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | |
| Qwen-Image-Edit | 20B | 0.92 | 0.82 | 0.75 | 0.57 | 0.80 | 0.77 |
| GPT-Image-1.5 | - | 0.96 | 0.94 | 0.86 | 0.80 | 0.93 | 0.90 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.88 | 0.93 | 0.91 |
| Unified Models | | | | | | | |
| Lumina-DiMOO | 8B | 0.17 | 0.06 | 0.04 | 0.02 | 0.05 | 0.09 |
| Ovis-U1 | 2.4B+1.2B | 0.31 | 0.12 | 0.12 | 0.07 | 0.18 | 0.18 |
| BAGEL | 7B+7B | 0.68 | 0.60 | 0.38 | 0.35 | 0.56 | 0.53 |
| InternVL-U | 2B+1.7B | 0.94 | 0.90 | 0.71 | 0.80 | 0.80 | 0.88 |

VLM-based metrics, Virtual split:

| Models | # Params | TA | TP | SI | LR | VC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | |
| Qwen-Image-Edit | 20B | 0.57 | 0.79 | 0.92 | 0.80 | 0.77 | 0.77 |
| GPT-Image-1.5 | - | 0.82 | 0.93 | 0.96 | 0.91 | 0.87 | 0.90 |
| Nano Banana Pro | - | 0.87 | 0.92 | 0.96 | 0.94 | 0.89 | 0.92 |
| Unified Models | | | | | | | |
| Lumina-DiMOO | 8B | 0.02 | 0.06 | 0.16 | 0.05 | 0.03 | 0.08 |
| Ovis-U1 | 2.4B+1.2B | 0.06 | 0.16 | 0.31 | 0.14 | 0.13 | 0.19 |
| BAGEL | 7B+7B | 0.38 | 0.51 | 0.68 | 0.62 | 0.42 | 0.54 |
| InternVL-U | 2B+1.7B | 0.87 | 0.86 | 0.91 | 0.82 | 0.62 | 0.83 |
📊 Mini-set Benchmark Results (500 samples)

Classic metrics, Real split:

| Models | # Params | OA | OP | OR | F1 | NED | CLIP | AES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.76 | 0.69 | 0.67 | 0.67 | 0.70 | 0.75 | 5.81 |
| GPT-Image-1.5 | - | 0.72 | 0.68 | 0.66 | 0.67 | 0.67 | 0.75 | 5.85 |
| Nano Banana Pro | - | 0.76 | 0.71 | 0.69 | 0.70 | 0.70 | 0.75 | 5.86 |
| Unified Models | | | | | | | | |
| Lumina-DiMOO | 8B | 0.20 | 0.22 | 0.18 | 0.19 | 0.19 | 0.70 | 5.58 |
| Ovis-U1 | 2.4B+1.2B | 0.37 | 0.34 | 0.32 | 0.32 | 0.33 | 0.72 | 5.39 |
| BAGEL | 7B+7B | 0.61 | 0.59 | 0.52 | 0.54 | 0.54 | 0.74 | 5.79 |
| InternVL-U | 2B+1.7B | 0.77 | 0.74 | 0.70 | 0.71 | 0.71 | 0.76 | 5.79 |

Classic metrics, Virtual split:

| Models | # Params | OA | OP | OR | F1 | NED | CLIP | AES |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.74 | 0.71 | 0.70 | 0.70 | 0.70 | 0.80 | 5.27 |
| GPT-Image-1.5 | - | 0.68 | 0.69 | 0.68 | 0.68 | 0.65 | 0.80 | 5.32 |
| Nano Banana Pro | - | 0.77 | 0.76 | 0.75 | 0.75 | 0.76 | 0.81 | 5.32 |
| Unified Models | | | | | | | | |
| Lumina-DiMOO | 8B | 0.22 | 0.25 | 0.21 | 0.22 | 0.19 | 0.73 | 4.87 |
| Ovis-U1 | 2.4B+1.2B | 0.39 | 0.41 | 0.38 | 0.39 | 0.33 | 0.74 | 4.75 |
| BAGEL | 7B+7B | 0.53 | 0.58 | 0.53 | 0.55 | 0.51 | 0.78 | 5.25 |
| InternVL-U | 2B+1.7B | 0.74 | 0.72 | 0.69 | 0.70 | 0.72 | 0.79 | 5.14 |

VLM-based metrics, Real split:

| Models | # Params | TA | TP | SI | LR | VC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | |
| Qwen-Image-Edit | 20B | 0.93 | 0.85 | 0.77 | 0.55 | 0.78 | 0.80 |
| GPT-Image-1.5 | - | 0.97 | 0.94 | 0.86 | 0.79 | 0.92 | 0.91 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.86 | 0.92 | 0.91 |
| Unified Models | | | | | | | |
| Lumina-DiMOO | 8B | 0.16 | 0.04 | 0.04 | 0.02 | 0.06 | 0.08 |
| Ovis-U1 | 2.4B+1.2B | 0.29 | 0.11 | 0.11 | 0.08 | 0.20 | 0.17 |
| BAGEL | 7B+7B | 0.68 | 0.61 | 0.38 | 0.34 | 0.59 | 0.53 |
| InternVL-U | 2B+1.7B | 0.94 | 0.91 | 0.72 | 0.73 | 0.75 | 0.89 |

VLM-based metrics, Virtual split:

| Models | # Params | TA | TP | SI | LR | VC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Models | | | | | | | |
| Qwen-Image-Edit | 20B | 0.60 | 0.82 | 0.91 | 0.81 | 0.74 | 0.76 |
| GPT-Image-1.5 | - | 0.85 | 0.93 | 0.95 | 0.92 | 0.83 | 0.88 |
| Nano Banana Pro | - | 0.87 | 0.92 | 0.96 | 0.93 | 0.87 | 0.92 |
| Unified Models | | | | | | | |
| Lumina-DiMOO | 8B | 0.02 | 0.05 | 0.19 | 0.07 | 0.03 | 0.10 |
| Ovis-U1 | 2.4B+1.2B | 0.04 | 0.16 | 0.35 | 0.18 | 0.15 | 0.22 |
| BAGEL | 7B+7B | 0.36 | 0.52 | 0.69 | 0.64 | 0.40 | 0.54 |
| InternVL-U | 2B+1.7B | 0.88 | 0.87 | 0.90 | 0.78 | 0.57 | 0.79 |

🛠️ Quick Start

📂 1. Data Preparation

You can download the images from this page. The TextEdit benchmark data is organized under data/ by category:

  • Virtual (categories 1.x.x): Synthetic/virtual scene images
  • Real (categories 2.x): Real-world scene images
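Given the directory layout above, mapping a category id to its split can be sketched as follows. This is a hypothetical helper, not part of the released scripts; the naming convention is inferred from the bullets above.

```python
def split_for_category(category: str) -> str:
    """Map a TextEdit category id (e.g. '1.1.2' or '2.7') to its split.

    Categories starting with '1.' are Virtual, '2.' are Real
    (assumption based on the data/ layout described above).
    """
    if category.startswith("1."):
        return "Virtual"
    if category.startswith("2."):
        return "Real"
    raise ValueError(f"Unknown category id: {category}")
```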

Evaluation prompts are provided under eval_prompts/ in two subsets:

| Subset | Directory | Description |
| --- | --- | --- |
| Fullset | eval_prompts/fullset/ | Complete benchmark with all samples |
| Miniset (500) | eval_prompts/miniset/ | 500-sample subset uniformly sampled from the fullset |

Each .jsonl file contains per-sample fields: id, prompt, original_image, gt_image, source_text, target_text, gt_caption.
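A minimal sketch for reading these prompt files (the example path is illustrative; the field names follow the list above):

```python
import json


def load_samples(jsonl_path):
    """Yield one sample dict per non-empty line of a TextEdit prompt file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example (path is illustrative):
# for sample in load_samples("eval_prompts/fullset/1.1.1.jsonl"):
#     print(sample["id"], sample["source_text"], "->", sample["target_text"])
```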

🤖 2. Model Output Preparation

Run image-editing inference with your model, then organize the outputs in the folder structure shown below to facilitate evaluation.

output/
├── internvl-u/                      # Your Model Name
│   ├── 1.1.1                        # Category Name
│   │   ├── 1007088003726.0.jpg      # Model Output Images
│   │   ├── 1013932004096.0.jpg
│   │   ├── ...
│   ├── 1.1.2
│   ├── 1.1.3
│   ├── ...
│   └── 2.7
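Writing model outputs into this layout can be sketched as below. `save_output` is a hypothetical helper; the `<sample_id>.jpg` naming is inferred from the tree above.

```python
from pathlib import Path


def save_output(output_root, model_name, category, sample_id, image_bytes):
    """Write one edited image into the expected output/<model>/<category>/ layout.

    The '<sample_id>.jpg' file name is an assumption based on the tree above.
    """
    out_dir = Path(output_root) / model_name / category
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{sample_id}.jpg"
    out_path.write_bytes(image_bytes)
    return out_path
```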

๐Ÿ“ 3. Model Evaluation

3.1 Classic Metrics Evaluation

Classic metrics evaluate text editing quality using OCR-based text accuracy, image-text alignment, and aesthetic quality. All metrics are reported separately for Virtual and Real splits.

Evaluated Metrics

| Abbreviation | Metric | Description |
| --- | --- | --- |
| OA | OCR Accuracy | Whether the target text is correctly rendered in the editing region |
| OP | OCR Precision | Precision of text content (target + background) in the generated image |
| OR | OCR Recall | Recall of text content (target + background) in the generated image |
| F1 | OCR F1 | Harmonic mean of OCR Precision and Recall |
| NED | Normalized Edit Distance | ROI-aware normalized edit distance between target and generated text |
| CLIP | CLIPScore | CLIP-based image-text alignment score |
| AES | Aesthetic Score | Predicted aesthetic quality score of the generated image |
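As a reference point for the NED column, a plain normalized Levenshtein distance looks like the sketch below. This is only the textbook form; the benchmark's ROI-aware variant may differ, and the reported score may be 1 minus this distance so that higher is better.

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein edit distance normalized by the longer string's length.

    Returns a value in [0, 1]: 0 means identical strings, 1 means
    completely different. Uses the standard two-row dynamic program.
    """
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))
```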

Usage

Evaluation scripts are provided separately for fullset and miniset:

  • eval_scripts/classic_metrics_eval_full.sh: evaluate on the full benchmark
  • eval_scripts/classic_metrics_eval_mini.sh: evaluate on the 500-sample miniset

Step 1. Modify the configuration variables in the evaluation script to match your project directory (e.g., eval_scripts/classic_metrics_eval_full.sh):

MODELS="model-a,model-b,model-c"                    # Comma-separated list of model names to be evaluated

path="your_project_path_here"
CACHE_DIR="$path/TextEdit/checkpoint"               # Directory for all model checkpoints (OCR, CLIP, etc.)

BENCHMARK_DIR="$path/TextEdit/eval_prompts/fullset"
GT_ROOT_DIR="$path/TextEdit/data"                   # Root path for original & GT images
MODEL_OUTPUT_ROOT="$path/TextEdit/output"           # Root path for model infer outputs
OUTPUT_DIR="$path/TextEdit/result/classic_fullset"  # Evaluation result root path for classic metric

Note: All required model checkpoints (PaddleOCR, CLIP, aesthetic model, etc.) should be placed under the CACHE_DIR directory.

Step 2. Run the evaluation shell script to evaluate your model outputs.

# Fullset evaluation
bash eval_scripts/classic_metrics_eval_full.sh

# Miniset evaluation
bash eval_scripts/classic_metrics_eval_mini.sh

Results are saved as {model_name}.json under the output directory, containing per-sample scores and aggregated metrics for both Virtual and Real splits.
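Aggregating per-sample scores into split-level averages can be sketched generically as below. The actual JSON schema is defined by the eval scripts, so treat the metric keys here as placeholders.

```python
from statistics import mean


def summarize(per_sample_scores):
    """Average a list of per-sample metric dicts into one aggregate dict.

    `per_sample_scores` is a list like [{"OA": 1.0, "NED": 0.9, ...}, ...];
    the real {model_name}.json schema may differ, this only sketches the
    final averaging step.
    """
    keys = per_sample_scores[0].keys()
    return {k: round(mean(s[k] for s in per_sample_scores), 4) for k in keys}
```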


3.2 VLM-based Metrics Evaluation

Our VLM-based evaluation uses Gemini-3-Pro-Preview as an expert judge to score text editing quality across five fine-grained dimensions. The evaluation is a two-step pipeline.

Evaluated Metrics

| Abbreviation | Metric | Description |
| --- | --- | --- |
| TA | Text Accuracy | Spelling correctness and completeness of the target text (1–5) |
| TP | Text Preservation | Preservation of non-target background text (1–5) |
| SI | Scene Integrity | Geometric stability of non-edited background areas (1–5) |
| LR | Local Realism | Inpainting quality, edge cleanness, and seamlessness (1–5) |
| VC | Visual Coherence | Style matching (font, lighting, shadow, texture harmony) (1–5) |
| Avg | Weighted Average | Weighted average of all five dimensions (default weights: 0.4 / 0.3 / 0.1 / 0.1 / 0.1) |

All raw scores (1–5) are normalized to 0–1 for reporting. A cutoff mechanism is available: if TA (Q1) < 4, the remaining dimensions are set to 0, reflecting that a failed text edit invalidates other quality dimensions.
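The normalization, cutoff, and weighted average described above can be sketched as follows. The linear (v − 1) / 4 mapping from 1–5 to 0–1 is an assumption; the scripts' exact mapping may differ.

```python
def normalize_and_cutoff(raw, cutoff=True):
    """Map raw 1-5 judge scores to 0-1 and apply the TA cutoff.

    `raw` holds keys TA, TP, SI, LR, VC with values in 1..5.
    The linear (v - 1) / 4 mapping is an assumption, not the scripts' spec.
    """
    norm = {k: (v - 1) / 4 for k, v in raw.items()}
    if cutoff and raw["TA"] < 4:
        # A failed text edit invalidates the other quality dimensions.
        for k in ("TP", "SI", "LR", "VC"):
            norm[k] = 0.0
    return norm


def weighted_avg(norm, weights=(0.4, 0.3, 0.1, 0.1, 0.1)):
    """Weighted average over TA, TP, SI, LR, VC with the default weights."""
    dims = ("TA", "TP", "SI", "LR", "VC")
    return sum(w * norm[d] for w, d in zip(weights, dims))
```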

Step 1: Gemini API Evaluation

Send (Original Image, GT Image, Edited Image) triplets to the Gemini API for scoring.

Configure and run eval_scripts/vlm_metrics_eval_step1.sh:

API_KEY="your_gemini_api_key_here"
BASE_URL="your_gemini_api_base_url_here"

python eval_pipeline/vlm_metrics_eval_step1.py \
  --input_data_dir <your_path>/TextEdit/eval_prompts/fullset \
  --model_output_root <your_path>/TextEdit/output \
  --gt_data_root <your_path>/TextEdit/data \
  --output_base_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
  --model_name "gemini-3-pro-preview" \
  --models "model-a,model-b,model-c" \
  --api_key "$API_KEY" \
  --base_url "$BASE_URL" \
  --num_workers 64

Per-model .jsonl answer files are saved under the output_base_dir.

Step 2: Score Aggregation & Report

Aggregate the per-sample Gemini responses into a final report.

Configure and run eval_scripts/vlm_metrics_eval_step2.sh:

# Fullset report
python eval_pipeline/vlm_metrics_eval_step2.py \
  --answer_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
  --output_file <your_path>/TextEdit/result/gemini_report_fullset.json \
  --weights 0.4 0.3 0.1 0.1 0.1 \
  --enable_cutoff

# Miniset report
python eval_pipeline/vlm_metrics_eval_step2.py \
  --answer_dir <your_path>/TextEdit/result/vlm_gemini_mini_answers \
  --output_file <your_path>/TextEdit/result/gemini_report_miniset.json \
  --weights 0.4 0.3 0.1 0.1 0.1 \
  --enable_cutoff

Key parameters:

  • --weights: Weights for Q1–Q5 (default: 0.4 0.3 0.1 0.1 0.1).
  • --enable_cutoff: Enable the cutoff mechanism: if Q1 < 4, set Q2–Q5 to 0.

The output includes a JSON report, a CSV table, and a Markdown-formatted leaderboard printed to the console.


Citation

If you find our TextEdit Bench useful, please cite our InternVL-U technical report with the following BibTeX entry.

@article{tian2026internvl,
  title={InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing},
  author={Tian, Changyao and Yang, Danni and Chen, Guanzhou and Cui, Erfei and Wang, Zhaokai and Duan, Yuchen and Yin, Penghao and Chen, Sitao and Yang, Ganlin and Liu, Mingxin and others},
  journal={arXiv preprint arXiv:2603.09877},
  year={2026}
}