Data Preparation

March 20, 2025

Our setup follows ReferFormer. Create a new directory named data to store all the datasets.

πŸ–ΌοΈ Ref-COCO

Download the images from the official COCO website.
RefCOCO, RefCOCO+, and RefCOCOg all use the COCO 2014 train split. Download the annotation files from GitHub.

Convert the annotation files:

python3 tools/data/convert_refexp_to_coco.py

Finally, we expect the directory structure to be the following:

SAMWISE
├── data
│   ├── coco
│   │   ├── train2014
│   │   ├── refcoco
│   │   │   ├── instances_refcoco_train.json
│   │   │   ├── instances_refcoco_val.json
│   │   ├── refcoco+
│   │   │   ├── instances_refcoco+_train.json
│   │   │   ├── instances_refcoco+_val.json
│   │   ├── refcocog
│   │   │   ├── instances_refcocog_train.json
│   │   │   ├── instances_refcocog_val.json
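
Before starting a training run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name missing_paths and the list EXPECTED_REFCOCO are hypothetical, and the paths are simply copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_REFCOCO = [
    "data/coco/train2014",
    "data/coco/refcoco/instances_refcoco_train.json",
    "data/coco/refcoco/instances_refcoco_val.json",
    "data/coco/refcoco+/instances_refcoco+_train.json",
    "data/coco/refcoco+/instances_refcoco+_val.json",
    "data/coco/refcocog/instances_refcocog_train.json",
    "data/coco/refcocog/instances_refcocog_val.json",
]


def missing_paths(root, expected):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the SAMWISE repository root.
    missing = missing_paths(".", EXPECTED_REFCOCO)
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Ref-COCO layout looks complete.")
```

The same helper can be reused for the other datasets by passing a different list of expected paths.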

🎥 Ref-Youtube-VOS

Download the dataset from the competition's website. Then extract and organize the files. We expect the directory structure to be the following:

SAMWISE
├── data
│   ├── ref-youtube-vos
│   │   ├── meta_expressions
│   │   ├── train
│   │   │   ├── JPEGImages
│   │   │   ├── Annotations
│   │   │   ├── meta.json
│   │   ├── valid
│   │   │   ├── JPEGImages
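
A quick way to sanity-check the extracted expressions is to count the videos and referring expressions in a meta file. The sketch below rests on assumptions that this guide does not state: that the file has a top-level "videos" key whose entries each hold an "expressions" dictionary and a "frames" list (the layout read by ReferFormer-style loaders), and that the file lives at the hypothetical path shown. The function name summarize_expressions is also hypothetical.

```python
import json
from pathlib import Path


def summarize_expressions(meta):
    """Count videos, referring expressions, and frames in a loaded meta file."""
    videos = meta["videos"]  # assumed top-level key
    num_expressions = sum(len(v["expressions"]) for v in videos.values())
    num_frames = sum(len(v["frames"]) for v in videos.values())
    return {
        "videos": len(videos),
        "expressions": num_expressions,
        "frames": num_frames,
    }


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the file sits after extraction.
    path = Path("data/ref-youtube-vos/meta_expressions/train/meta_expressions.json")
    if path.is_file():
        with path.open() as f:
            print(summarize_expressions(json.load(f)))
    else:
        print(f"{path} not found")
```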

🎬 Ref-DAVIS17

Download the DAVIS2017 dataset from the website. Note that you only need the two zip files DAVIS-2017-Unsupervised-trainval-480p.zip and DAVIS-2017_semantics-480p.zip. Download the text annotations from the website. Then put the zip files in the directory as follows.

SAMWISE
├── data
│   ├── ref-davis
│   │   ├── DAVIS-2017_semantics-480p.zip
│   │   ├── DAVIS-2017-Unsupervised-trainval-480p.zip
│   │   ├── davis_text_annotations.zip

Unzip these zip files.

unzip -o davis_text_annotations.zip
unzip -o DAVIS-2017_semantics-480p.zip
unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip

Convert the dataset to the Ref-Youtube-VOS format (make sure you are in the repository's main directory).

python tools/data/convert_davis_to_ytvos.py

Finally, unzip DAVIS-2017-Unsupervised-trainval-480p.zip again, since the preprocessing step moves the extracted files with mv for efficiency.

unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip
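
On systems without the unzip tool, the extraction steps in this section can be approximated with Python's standard zipfile module. This is a minimal sketch: the helper name extract_archive is hypothetical, and it assumes the archives sit in data/ref-davis and should be extracted in place. Like unzip -o, extractall overwrites files that already exist.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, dest=None):
    """Extract `archive` into `dest` (default: the archive's own directory).

    Existing files are overwritten, similar to `unzip -o`.
    Returns the list of member names stored in the archive.
    """
    archive = Path(archive)
    dest = Path(dest) if dest is not None else archive.parent
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


if __name__ == "__main__":
    # Archive names come from the listing above; run from the repository root.
    names = [
        "davis_text_annotations.zip",
        "DAVIS-2017_semantics-480p.zip",
        "DAVIS-2017-Unsupervised-trainval-480p.zip",
    ]
    for name in names:
        archive = Path("data/ref-davis") / name
        if archive.is_file():
            members = extract_archive(archive)
            print(f"{name}: extracted {len(members)} entries")
        else:
            print(f"{name}: not found")
```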

🐦 MeViS

Download and unzip the dataset.

unzip -o MeViS_release.zip

The dataset follows a structure similar to Refer-Youtube-VOS. Each split consists of three parts: JPEGImages, which holds the frame images; meta_expressions.json, which provides the referring expressions and video metadata; and mask_dict.json, which contains the ground-truth object masks. The ground-truth segmentation masks are stored in COCO RLE format, and the expressions are organized as in Refer-Youtube-VOS.

SAMWISE
├── data
│   ├── MeViS_release
│   │   ├── train
│   │   │   ├── JPEGImages
│   │   │   ├── mask_dict.json
│   │   │   ├── meta_expressions.json
│   │   ├── valid_u
│   │   │   ├── JPEGImages
│   │   │   ├── mask_dict.json
│   │   │   ├── meta_expressions.json
│   │   ├── valid
│   │   │   ├── JPEGImages
│   │   │   ├── mask_dict.json
│   │   │   ├── meta_expressions.json
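
To make the COCO RLE format mentioned above more concrete, here is a dependency-free sketch that decodes a single RLE dictionary of the standard form {"size": [height, width], "counts": ...} into a binary mask. In practice the decode function in pycocotools.mask is the usual tool; this version only illustrates the encoding. It makes no assumption about how mask_dict.json organizes its entries, and the function names decode_counts and rle_to_mask are hypothetical.

```python
def decode_counts(encoded):
    """Decode a compressed COCO RLE string into a list of run lengths."""
    counts = []
    pos = 0
    while pos < len(encoded):
        value, shift, more = 0, 0, True
        while more:
            chunk = ord(encoded[pos]) - 48
            value |= (chunk & 0x1F) << (5 * shift)  # low 5 bits carry data
            more = bool(chunk & 0x20)               # bit 5: another chunk follows
            pos += 1
            shift += 1
            if not more and (chunk & 0x10):         # bit 4 of last chunk: negative
                value |= -1 << (5 * shift)
        if len(counts) > 2:
            value += counts[-2]  # later runs are stored as differences
        counts.append(value)
    return counts


def rle_to_mask(rle):
    """Turn one COCO RLE dict into a height x width nested list of 0/1 values."""
    height, width = rle["size"]
    counts = rle["counts"]
    if isinstance(counts, str):
        counts = decode_counts(counts)
    flat, value = [], 0
    for run in counts:  # runs alternate, starting with background (0)
        flat.extend([value] * run)
        value = 1 - value
    # COCO stores pixels in column-major order.
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


if __name__ == "__main__":
    # Tiny hand-built example: 2 zeros, 3 ones, 1 zero on a 2 x 3 grid.
    for row in rle_to_mask({"size": [2, 3], "counts": "231"}):
        print(row)
```

Runs are read column by column, which is why the output rows are assembled with a stride of height rather than width.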