ChartArena

June 6, 2026 · View on GitHub

A Comprehensive Bilingual Benchmark for General Chart Parsing across Families, Scenarios, and Formats

中文版PaperGitHub RepoHuggingFace DatasetModelScope Dataset

News

  • [2026.06.01] 📖 Code and data are released!

Overview

ChartArena is a comprehensive bilingual benchmark for evaluating the chart parsing capabilities of vision-language models, spanning the full difficulty spectrum of charts encountered in practice. It covers eight chart families: both numeric charts (bar, line, pie, radar, box plot, combination) and diagrammatic structures (flowchart, mind map), each presented across three visual scenarios (digital renderings, printed photos, and hand-drawn photos) and two languages (Chinese and English).

To enable fair comparison across models that produce mutually incompatible output formats, ChartArena adopts a format-agnostic evaluation protocol: heterogeneous predictions are normalized into two canonical semantic spaces: a triple view for numeric charts and a directed graph view for diagrammatic charts, and scored with structure-aware metrics.

Contents

Benchmark Statistics

ItemDetails
Chart Families8 (bar, line, pie, radar, box plot, combination, flowchart, mind map)
Chart CategoriesNumeric charts, mind maps, flowcharts
Visual Scenarios3 (digital rendering, printed photo, hand-drawn photo)
LanguagesBilingual (Chinese and English)

Leaderboard

We evaluate 26 models across three categories: general-purpose MLLMs, document parsing MLLMs, and expert chart understanding models. Results are reported as mAPhigh_{high} per chart family, with separate EN (English) and ZH (Chinese) scores each averaged over three visual scenarios. Within each category, bold marks the best result per column.

Full leaderboard (click to expand)

General-Purpose MLLMs

ModelDateBar (EN)Bar (ZH)Line (EN)Line (ZH)Pie (EN)Pie (ZH)Radar (EN)Radar (ZH)Box (EN)Box (ZH)Combo (EN)Combo (ZH)Flow (EN)Flow (ZH)Mind (EN)Mind (ZH)Avg (EN)Avg (ZH)
GPT-4o2024.0521.636.327.552.976.774.29.724.919.19.69.940.749.827.164.024.834.836.3
GPT-52025.0835.152.348.165.181.178.932.041.519.812.814.246.558.135.376.633.545.645.8
Qwen2.5-VL-7B-Instruct2025.0215.236.917.939.963.473.18.319.10.92.86.040.629.723.245.429.923.333.2
Qwen2.5-VL-72B-Instruct2025.0227.153.338.266.773.577.010.938.515.015.314.350.550.143.663.855.036.650.0
InternVL3.5-8B2025.0820.949.434.149.963.972.612.635.74.310.77.641.231.524.347.032.227.739.5
InternVL3.5-241B-A28B2025.0827.557.241.355.777.783.315.241.418.721.617.747.843.836.662.645.538.048.6
Qwen3VL-8B-Instruct2025.1033.963.443.167.978.688.316.852.135.730.414.251.950.041.575.262.643.457.3
Qwen3VL-235B-A22B-Ins.2025.1044.571.957.177.185.887.924.652.454.855.129.160.857.949.879.473.754.266.1
Qwen3.5-35B-A3B2026.0248.068.160.477.689.788.725.257.950.150.635.262.162.556.577.175.656.067.1
GLM-4.5V2025.0733.561.451.770.581.283.119.743.132.437.421.252.544.739.666.243.743.853.9
Seed-1.8 (non-thinking)2025.1229.159.746.072.584.788.022.045.916.117.515.059.747.850.376.569.142.257.8
Seed-2.0 Pro (non-thinking)2026.0240.373.356.580.791.590.521.354.744.555.232.462.262.661.383.185.854.070.5
Kimi K2.5 (non-thinking)2026.0245.270.360.979.887.286.730.259.740.647.633.663.659.957.980.879.454.868.1
MiMo-V2-Omni2026.0331.156.941.566.487.085.819.746.119.130.319.454.757.151.076.664.643.957.0
Gemini 2.5 Pro2025.0346.076.556.577.688.687.317.553.010.222.128.757.662.157.871.767.147.762.4
Gemini 3.1 Pro2026.0257.978.767.085.392.595.131.862.732.545.239.770.365.663.186.885.259.273.2

Document Parsing MLLMs

ModelDateBar (EN)Bar (ZH)Line (EN)Line (ZH)Pie (EN)Pie (ZH)Radar (EN)Radar (ZH)Box (EN)Box (ZH)Combo (EN)Combo (ZH)Flow (EN)Flow (ZH)Mind (EN)Mind (ZH)Avg (EN)Avg (ZH)
dots.mocr (3B)2025.0728.340.941.860.168.878.320.343.124.116.026.947.126.220.628.719.633.140.7
PaddleOCR-VL (1B)2025.1031.849.343.051.657.575.214.429.011.720.721.354.0--------23.935.8
HunyuanOCR (1B)2025.1133.060.049.568.271.074.819.041.143.945.220.150.839.935.955.046.641.452.8

Expert Chart Understanding Models

ModelDateBar (EN)Bar (ZH)Line (EN)Line (ZH)Pie (EN)Pie (ZH)Radar (EN)Radar (ZH)Box (EN)Box (ZH)Combo (EN)Combo (ZH)Flow (EN)Flow (ZH)Mind (EN)Mind (ZH)Avg (EN)Avg (ZH)
ChartAst (13B)2024.015.2--4.2--0.3--1.5--0.3--0.0----------1.4--
ChartVLM (8.3B)2024.0211.25.311.54.312.98.22.15.00.70.44.14.4--------5.33.5
TinyChart (3B)2024.046.16.39.73.25.75.40.53.40.21.30.74.2--------2.93.0
ChartMoE (8B)2024.0918.724.414.722.315.048.53.716.12.71.65.119.54.0--4.1--8.516.7
ChartCoder (7B)2025.0123.212.622.019.634.316.75.513.95.411.43.75.15.6--1.0--12.69.9
RRVF (7B)2025.0735.866.541.554.351.675.316.640.314.714.123.561.236.432.468.463.836.051.0
MSRL (7B)2025.0832.745.235.234.341.267.925.948.011.213.016.735.223.212.431.018.827.134.3

Task Definitions

ChartArena groups charts into three categories, each with a default extraction task:

Chart CategoryExamplesDefault Task
Numerical chartsBar / Line / Pie / Radar / Box / Combo …SE_MD
Mind maps (logic diagrams)Tree / hierarchy diagramsSE_MD
FlowchartsProcess / workflow diagramsSE_MERMAID
The eleven extraction tasks (click to expand)
TaskOutput FormatDescription
SE_MDMarkdown table / listNumerical charts → Markdown table; mind maps → Markdown nested list
SE_JSONJSONStructured JSON with title and values
SE_CSVCSVComma-separated values
SE_CODEPython (matplotlib)Reproduce the chart as executable Python code
SE_SVGSVGReproduce the chart as SVG markup
SE_MERMAIDMermaidFlowchart as Mermaid diagram syntax
SE_GRAPHVIZGraphviz DOTFlowchart as DOT language
SE_PLANTUMLPlantUMLFlowchart as PlantUML syntax
SE_DIAGRAMSdiagrams.net XMLFlowchart as draw.io XML
SE_D2D2Flowchart as D2 diagram language
SE_CYTOSCAPECytoscape JSONFlowchart as Cytoscape.js JSON

Scoring metrics: mAP (map_strict / map_slight / map_high) and EM (exact match).

Getting Started

1. Setup

git clone <this-repo>
cd ChartArena
pip install -r requirements.txt
# Optional: only if you plan to use --api_type local_vllm
pip install vllm

2. Download benchmark data

The dataset (jsonl + images) is released as a single archive. Place the files under data/:

data/
├── ChartArena.jsonl
└── images/...

Each line of the jsonl looks like:

{
  "img_path": "images/xxx.png",
  "chart_type": "柱状图",
  "img_type": "电子印刷",
  "lang_type": "中文",
  "anno": "..."
}

img_path is a relative path from the data/ directory and is used as the unique key throughout the pipeline.

3. Inference

Two backends are supported via --api_type: openai_compat for any OpenAI-compatible HTTP service (local or cloud), and local_vllm for in-process model loading. Inference supports resume — re-running the same command skips already-completed samples.

Backend details and task selection (click to expand)

(a) openai_compat — any OpenAI-compatible HTTP service

Works with locally-served models (vllm serve, sglang, lmdeploy) or public APIs that speak the OpenAI Chat Completions protocol (OpenAI, Gemini, Claude, Together, …).

python infer.py \
    --api_type openai_compat \
    --model_name Qwen2.5-VL-72B-Instruct \
    --base_url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --max_workers 64

(b) local_vllm — in-process vLLM, give it a model path

No need to start a server first. The script loads the checkpoint directly with vllm.LLM.

python infer.py \
    --api_type local_vllm \
    --model_path /path/to/Qwen2.5-VL-72B-Instruct \
    --tensor_parallel_size 4 \
    --max_model_len 32768

Task selection

By default, each chart category runs one task. You can override with --task_data, --task_logic, --task_flowchart (each accepts one or more task names):

# Run SE_MD and SE_JSON for numerical charts, SE_MERMAID for flowcharts
python infer.py --api_type openai_compat --model_name ... --base_url ... \
    --task_data SE_MD SE_JSON \
    --task_flowchart SE_MERMAID

Output

Each run writes one jsonl file:

infer_outputs/<model_tag>/results.jsonl

<model_tag> defaults to --model_name / basename of --model_path. You can override it with --output_tag.

4. Judging

# Score all models under infer_outputs/
python judge.py

# Score specific models only
python judge.py --models Qwen2.5-VL-72B-Instruct gemini-2.5-pro

# Force re-score a specific task (e.g. after a scoring algorithm update)
python judge.py --force_rejudge SE_MERMAID

Outputs to judge_outputs/<model_tag>/results.jsonl. The judge step is purely rule-based and fast.

5. Analysis report

python analyze.py
# → judge_outputs/results_analysis.xlsx

Scores are shown to 3 decimal places (e.g. 0.873).

Workbook contents (click to expand)
  • Task overview (Sheet 1) — per-model average score for each task
  • Per-task sheets — model × source file breakdown for each task
  • By chart type (by_chart_type/) — one Excel per task, one sheet per chart type
  • Detailed breakdown (detail_by_category/) — per-model per-task breakdown by (chart_type, img_type, lang_type)

6. End-to-end example

# 1. Run inference
python infer.py \
    --api_type openai_compat \
    --model_name Qwen2.5-VL-72B-Instruct \
    --base_url http://127.0.0.1:8000/v1

# 2. Score
python judge.py

# 3. Generate Excel report
python analyze.py

7. Repo layout

The repository is organized around a three-stage pipeline (inference → judging → analysis), with pluggable API backends, per-format scoring modules, and shared metric utilities.

Full directory tree (click to expand)
ChartArena/
├── README.md / README_zh.md
├── requirements.txt
├── data/                        # ← download benchmark data here
├── infer_outputs/               # inference results (auto-created)
├── judge_outputs/               # scoring results  (auto-created)
├── apis/
│   ├── base.py                  # APIBase abstract class
│   ├── openai_compat.py         # OpenAI-compatible client
│   └── local_vllm.py            # in-process vLLM
├── methods/
│   ├── prompts.py               # prompt templates
│   ├── context.py               # context building utilities
│   ├── normalize.py             # output normalization
│   ├── scoring.py               # scoring entry points
│   └── parsers/                 # per-format output parsers
├── metrics/
│   ├── SCRM.py                  # core MAP / EM metric
│   ├── tree_eval.py             # Markdown list evaluation
│   ├── mermaid_eval.py          # Mermaid diagram evaluation
│   ├── flowchart_common.py      # flowchart multi-format evaluation
│   └── dsl_parsers/             # DSL-specific parser utilities
├── utils/
│   ├── io.py                    # ResultWriter (thread-safe incremental writer)
│   ├── signal_utils.py          # graceful Ctrl+C shutdown
│   └── image_utils.py           # base64 encoding for OpenAI-compat
├── infer.py                     # entry: inference
├── judge.py                     # entry: rule-based scoring
└── analyze.py                   # entry: Excel analysis report

Citation

@article{peng2026chartarena,
  title   = {{ChartArena}: Benchmarking Chart Parsing across Languages, Scenarios, and Formats},
  author  = {Peng, Shangpin and Li, Gengluo and Wan, Xingyu and Zhang, Chengquan and Feng, Hao and Wu, Binghong and Shen, Huawen and Wang, Weinong and Cai, Ziyi and Tian, Zhuotao and others},
  journal = {arXiv preprint arXiv:2606.01348},
  year    = {2026}
}

License

This benchmark is released for research purposes only.