Data Preparation
March 20, 2025
Our setup follows ReferFormer.
Create a new directory named data to store all the datasets.
🖼️ Ref-COCO
Download the images from the official COCO website.
RefCOCO/+/g use the COCO2014 train split.
Download the annotation files from GitHub.
Convert the annotation files:
python3 tools/data/convert_refexp_to_coco.py
Finally, we expect the directory structure to be the following:
SAMWISE
├── data
│   ├── coco
│   │   ├── train2014
│   │   ├── refcoco
│   │   │   ├── instances_refcoco_train.json
│   │   │   ├── instances_refcoco_val.json
│   │   ├── refcoco+
│   │   │   ├── instances_refcoco+_train.json
│   │   │   ├── instances_refcoco+_val.json
│   │   ├── refcocog
│   │   │   ├── instances_refcocog_train.json
│   │   │   ├── instances_refcocog_val.json
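Before training, it can help to confirm that the layout above is actually in place. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the constant `REF_COCO_LAYOUT` are hypothetical, and the paths are copied from the tree above, relative to the SAMWISE root.

```python
from pathlib import Path

# Entries copied from the expected Ref-COCO layout, relative to the SAMWISE root.
REF_COCO_LAYOUT = [
    "data/coco/train2014",
    "data/coco/refcoco/instances_refcoco_train.json",
    "data/coco/refcoco/instances_refcoco_val.json",
    "data/coco/refcoco+/instances_refcoco+_train.json",
    "data/coco/refcoco+/instances_refcoco+_val.json",
    "data/coco/refcocog/instances_refcocog_train.json",
    "data/coco/refcocog/instances_refcocog_val.json",
]


def find_missing(root, expected=REF_COCO_LAYOUT):
    """Return the expected entries that do not exist under root (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the SAMWISE root directory.
    missing = find_missing(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Ref-COCO layout looks complete.")
```

The check only tests for existence; it does not validate the contents of the annotation files.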
🎥 Ref-Youtube-VOS
Download the dataset from the competition's website, then extract and organize the files. We expect the directory structure to be the following:
SAMWISE
├── data
│   ├── ref-youtube-vos
│   │   ├── meta_expressions
│   │   ├── train
│   │   │   ├── JPEGImages
│   │   │   ├── Annotations
│   │   │   ├── meta.json
│   │   ├── valid
│   │   │   ├── JPEGImages
🎬 Ref-DAVIS17
Download the DAVIS2017 dataset from the website. Note that you only need the two zip files DAVIS-2017-Unsupervised-trainval-480p.zip and DAVIS-2017_semantics-480p.zip.
Download the text annotations from the website.
Then, put the zip files in the directory as follows.
SAMWISE
├── data
│   ├── ref-davis
│   │   ├── DAVIS-2017_semantics-480p.zip
│   │   ├── DAVIS-2017-Unsupervised-trainval-480p.zip
│   │   ├── davis_text_annotations.zip
Unzip these zip files.
unzip -o davis_text_annotations.zip
unzip -o DAVIS-2017_semantics-480p.zip
unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip
Convert the dataset to the Ref-Youtube-VOS format (make sure you are in the main directory).
python tools/data/convert_davis_to_ytvos.py
Finally, unzip DAVIS-2017-Unsupervised-trainval-480p.zip again, since the preprocessing step moves files with mv for efficiency rather than copying them.
unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip
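If the unzip command is not available, the extraction steps can be scripted with Python's standard zipfile module. This is a rough sketch rather than a tool shipped with the repository: the helper `extract_archives` and the constant `DAVIS_ARCHIVES` are hypothetical names, and it assumes the archives sit in data/ref-davis as shown above.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the ref-davis directory above.
DAVIS_ARCHIVES = [
    "davis_text_annotations.zip",
    "DAVIS-2017_semantics-480p.zip",
    "DAVIS-2017-Unsupervised-trainval-480p.zip",
]


def extract_archives(archive_dir, names=DAVIS_ARCHIVES):
    """Extract each named archive into archive_dir (hypothetical helper).

    Existing files are overwritten, similar to `unzip -o`.
    """
    archive_dir = Path(archive_dir)
    extracted = []
    for name in names:
        archive = archive_dir / name
        if not archive.is_file():
            raise FileNotFoundError(f"expected archive not found: {archive}")
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive_dir)
        extracted.append(name)
    return extracted


if __name__ == "__main__":
    # Run from the SAMWISE root directory.
    print(extract_archives("data/ref-davis"))
```

After running the conversion script, calling `extract_archives("data/ref-davis", ["DAVIS-2017-Unsupervised-trainval-480p.zip"])` would correspond to the final re-extraction step.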
🎦 MeViS
Download and unzip the dataset.
unzip -o MeViS_release.zip
The dataset follows a structure similar to Refer-Youtube-VOS.
Each split of the dataset consists of three parts:
JPEGImages, which holds the frame images;
meta_expressions.json, which provides the referring expressions and video metadata;
and mask_dict.json, which contains the ground-truth masks of the objects.
Ground-truth segmentation masks are stored in COCO RLE format,
and expressions are organized in the same way as in Refer-Youtube-VOS.
SAMWISE
├── data
│   ├── MeViS_release
│   │   ├── train
│   │   │   ├── JPEGImages
│   │   │   ├── mask_dict.json
│   │   │   ├── meta_expressions.json
│   │   ├── valid_u
│   │   │   ├── JPEGImages
│   │   │   ├── mask_dict.json
│   │   │   ├── meta_expressions.json
│   │   ├── valid
│   │   │   ├── JPEGImages
│   │   │   ├── mask_dict.json
│   │   │   ├── meta_expressions.json
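To illustrate what the COCO RLE mask format encodes, here is a small pure-Python sketch. COCO RLE stores a binary mask as alternating run lengths of 0s and 1s, starting with 0s, over the pixels in column-major order. The function below, a hypothetical helper named `decode_uncompressed_rle`, handles only the uncompressed variant in which the counts are a plain list of integers. The entries in mask_dict.json presumably use the compressed string variant (an assumption), for which `pycocotools.mask.decode` is the usual tool.

```python
def decode_uncompressed_rle(counts, height, width):
    """Decode uncompressed COCO RLE counts into a list of rows of 0/1 values.

    counts: run lengths alternating between 0s and 1s, starting with 0s,
            over the pixels in column-major order.
    """
    flat = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value  # runs alternate between background and foreground
    if len(flat) != height * width:
        raise ValueError("run lengths do not sum to height * width")
    # Pixel (row, col) sits at index col * height + row in column-major order.
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


if __name__ == "__main__":
    # A 2x3 mask: one background pixel, two foreground pixels, three background.
    for row in decode_uncompressed_rle([1, 2, 3], height=2, width=3):
        print(row)
```

A leading count of 0 means the mask starts with a foreground pixel, which is why the first run always refers to background.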