How to Prepare Datasets for ControlMLLM
July 14, 2025
This document provides detailed instructions for preparing the datasets required by ControlMLLM++, covering the ROC, RTC, and Reference Description (RD) tasks.
We recommend placing all datasets under a root folder, e.g., $DATA/, for consistency and ease of path management. You may create symbolic links to reuse existing dataset files.
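As a minimal sketch of the symbolic-link approach, the helper below links an already-downloaded dataset folder into the data root instead of copying it. The function name `link_dataset` and the example paths are illustrative assumptions, not part of the project.

```python
import os
from pathlib import Path


def link_dataset(existing_dir: str, data_root: str, relative_target: str) -> Path:
    """Symlink an existing dataset folder to data_root/relative_target.

    `link_dataset` is a hypothetical helper for illustration only.
    """
    target = Path(data_root) / relative_target
    # Create the parent folders (e.g. RD/RefCOCOg/COCO2014) if they are missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Skip if something (including a dangling link) already sits at the target.
    if not (target.exists() or target.is_symlink()):
        os.symlink(Path(existing_dir).resolve(), target, target_is_directory=True)
    return target


# Example (paths are placeholders):
# link_dataset("/datasets/coco/train2014", "data", "RD/RefCOCOg/COCO2014/train2014")
```

Linking keeps a single copy of large image folders such as train2014 on disk while still matching the expected layout.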
Task Overview
- ROC (Referring Object Classification)
  Given an image and a region, the model classifies the type of the object referred to by the region.
- RTC (Referring Text Classification)
  Given an image and a text region, the model classifies or interprets the text content shown in the image.
- RD (Reference Description)
  The model is asked to generate a natural language description of a referred region, aiming at free-form expression and understanding.
In all tasks, we focus on single-region prompts to keep input precise and interpretable.
Directory Structure Overview
$DATA/
├── ROC/
│   ├── question_roc.json
│   └── LVIS/
│       ├── image/
│       └── mask/
├── RTC/
│   ├── question_rtc.json
│   └── COCO-Text/
│       ├── image/
│       └── mask/
└── RD/
    ├── RefCOCOg/
    │   ├── refcocog.json
    │   └── COCO2014/
    │       ├── train2014/
    │       └── annotations/
    │           └── instances_train2014.json
    └── ScreenSpot/
        ├── question_screenspot.json
        └── image/
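Before running any task, it can help to confirm that the layout matches the tree above. The following is a minimal sketch of such a check; the helper name `find_missing` is hypothetical, and the path list assumes that image/ and mask/ sit inside LVIS/ and COCO-Text/, and that train2014/ and annotations/ sit inside COCO2014/.

```python
from pathlib import Path
from typing import List

# Relative paths read off the directory tree (nesting is an assumption, see lead-in).
EXPECTED_PATHS = [
    "ROC/question_roc.json",
    "ROC/LVIS/image",
    "ROC/LVIS/mask",
    "RTC/question_rtc.json",
    "RTC/COCO-Text/image",
    "RTC/COCO-Text/mask",
    "RD/RefCOCOg/refcocog.json",
    "RD/RefCOCOg/COCO2014/train2014",
    "RD/RefCOCOg/COCO2014/annotations/instances_train2014.json",
    "RD/ScreenSpot/question_screenspot.json",
    "RD/ScreenSpot/image",
]


def find_missing(data_root: str) -> List[str]:
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    # Path.exists() follows symlinks, so linked dataset folders count as present.
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


# Example: print(find_missing("data")) lists whatever still needs to be downloaded.
```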
Dataset Download
ROC + RTC
Download ROC & RTC (Google Drive)
Unzip the contents and place them in:
$DATA/
├── ROC/
└── RTC/
RefCOCOg
- Question file: refcocog.json (Google Drive)
- Image download: COCO2014 Train Images (train2014.zip)
- Annotations: COCO2014 Annotations
Unpack files and organize as:
RD/RefCOCOg/
├── refcocog.json
└── COCO2014/
    ├── train2014/
    └── annotations/
        └── instances_train2014.json
ScreenSpot
- Question file: question_screenspot.json (Google Drive)
- Images: Download Screenshots (NJU Box)
Organize as:
RD/ScreenSpot/
├── question_screenspot.json
└── image/
Prompt Format
ScreenSpot
ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from diverse environments including iOS, Android, macOS, Windows, and Web. Each data point is annotated with element type (Text or Icon).
- For Icon elements: "What is this icon used for?"
- For Text elements: "What does this text say?"
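The mapping from element type to prompt can be sketched as a small lookup. This is an illustration only: the function name `screenspot_prompt` is hypothetical, and how the element type is stored in question_screenspot.json is not specified here, so the function simply takes the type as a string.

```python
# Prompts quoted from the ScreenSpot section, keyed by element type.
SCREENSPOT_PROMPTS = {
    "icon": "What is this icon used for?",
    "text": "What does this text say?",
}


def screenspot_prompt(element_type: str) -> str:
    """Return the prompt for a ScreenSpot element type ("Icon" or "Text")."""
    key = element_type.strip().lower()  # accept "Icon", "icon", " TEXT ", ...
    if key not in SCREENSPOT_PROMPTS:
        raise ValueError(f"Unknown ScreenSpot element type: {element_type!r}")
    return SCREENSPOT_PROMPTS[key]
```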
RefCOCOg
The RefCOCOg dataset is a referring expression generation (REG) benchmark used to evaluate understanding of language that refers to specific objects in natural images.
- Generic prompt: "Can you provide a description of the region in a sentence?"
Prompt Differences by Model
- LLaVA-based models (no localization pretraining): Use direct natural language queries as above.
- Qwen2.5-VL (trained with grounding): Include the box location to enhance region awareness:
  "Can you provide me with a detailed description of the region in the picture marked by box @ [x1, y1, x2, y2]."
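The two prompt styles can be sketched as follows. The names `refcocog_prompt` and `xywh_to_xyxy` are hypothetical. The conversion helper relies on the fact that COCO instance annotations store boxes as [x, y, width, height]; whether refcocog.json uses that layout, and whether coordinates should be pixels or normalized values, is not stated here, so the values are inserted as given.

```python
from typing import Sequence, Tuple

GENERIC_PROMPT = "Can you provide a description of the region in a sentence?"
BOX_PROMPT = (
    "Can you provide me with a detailed description of the region "
    "in the picture marked by box @ [{}, {}, {}, {}]."
)


def xywh_to_xyxy(box: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert a COCO-style [x, y, width, height] box to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def refcocog_prompt(box_xyxy: Sequence[float], grounded: bool) -> str:
    """Build the prompt; grounded=True adds the box (Qwen2.5-VL style)."""
    if not grounded:
        return GENERIC_PROMPT  # LLaVA-style: plain natural language query
    if len(box_xyxy) != 4:
        raise ValueError("box must have four values: x1, y1, x2, y2")
    return BOX_PROMPT.format(*box_xyxy)
```

For example, `refcocog_prompt(xywh_to_xyxy([10, 20, 100, 200]), grounded=True)` yields a prompt ending in `box @ [10, 20, 110, 220].`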
Final Notes
- Make sure all .json files and images/masks follow the specified structure.
- Task scripts expect the default root directory to be data/ (relative to the project root).
- You may modify the --data_path argument to specify a custom location during execution.
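As a rough illustration of how a script might combine the data/ default with a --data_path override, consider the sketch below. It is not the project's actual task script: the `--task` flag and both function names are hypothetical, and it assumes --data_path points at the dataset root.

```python
import argparse
from pathlib import Path

# Question files relative to the data root, taken from the directory layout.
QUESTION_FILES = {
    "ROC": "ROC/question_roc.json",
    "RTC": "RTC/question_rtc.json",
    "RefCOCOg": "RD/RefCOCOg/refcocog.json",
    "ScreenSpot": "RD/ScreenSpot/question_screenspot.json",
}


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Hypothetical task runner")
    # Default mirrors the documented data/ root (relative to the project root).
    parser.add_argument("--data_path", default="data", help="dataset root folder")
    # --task is an assumed flag used only for this illustration.
    parser.add_argument("--task", choices=sorted(QUESTION_FILES), default="ROC")
    return parser.parse_args(argv)


def question_file(args: argparse.Namespace) -> Path:
    """Resolve the question file for the selected task under the data root."""
    return Path(args.data_path) / QUESTION_FILES[args.task]
```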