COST Dataset

December 25, 2023

The COST dataset includes the following components for training and evaluating MLLMs on object-level perception tasks:

  • RGB Images obtained from the COCO-2017 dataset.
  • Segmentation Maps for semantic, instance, and panoptic segmentation tasks, obtained using the publicly available DiNAT-L OneFormer model trained on the COCO dataset.
  • Questions obtained by prompting GPT-4 for object identification and object order perception tasks. You can find the questions in questions.py.
  • Depth Maps obtained using the publicly available ViT-L/14 distilled variant of the DINOv2 DPT model trained on the NYUd dataset.

We represent the information from the segmentation maps and depth maps in text form to obtain the final question-answer pairs. Please refer to Sec. 3.1 of our paper for more details.
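To illustrate what "text form" can look like, here is a minimal sketch that turns per-segment class names and per-object depth values into answer sentences. The sentence templates, the function names, and the rule that a smaller depth value means a closer object are all assumptions made for illustration; the exact format used to build COST is described in the paper, not here.

```python
from collections import Counter


def objects_to_text(labels):
    """Summarize per-segment class names as one sentence (hypothetical template)."""
    counts = Counter(labels)
    parts = []
    for name, n in sorted(counts.items()):
        # Append a count suffix only when a class appears more than once.
        parts.append(name if n == 1 else f"{name} x{n}")
    return "The objects present in the image are: " + ", ".join(parts) + "."


def depth_order_to_text(objects):
    """List object names from nearest to farthest.

    `objects` is a list of (name, depth_value) pairs. Assumption: a smaller
    depth value means the object is closer to the camera.
    """
    ordered = sorted(objects, key=lambda item: item[1])
    names = [name for name, _ in ordered]
    return "The depth order for the objects present in the image is: " + ", ".join(names) + "."


if __name__ == "__main__":
    print(objects_to_text(["person", "dog", "person"]))
    print(depth_order_to_text([("chair", 3.2), ("person", 1.5)]))
```

In this sketch the segmentation answer depends only on class counts and the depth answer only on the ordering of one depth value per object, so both can be derived from the predicted maps without any manual annotation.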

We provide different splits of the COST dataset for training and evaluation.

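| split | Number of Images | Number of QnA pairs | splits from COCO                  |
|-------|------------------|---------------------|-----------------------------------|
| train | 280k             | 280k                | train2017, test2017, unlabeled2017 |
| val   | 5k               | 5k                  | val2017                           |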

File Structure

coco_segm_text
    ├── depth
    │   └── test
    │   │   └── ...
    │   └── train
    │   │   └── depth # contains depth maps for the train2017 split
    │   │   └── panoptic_order.txt # contains answers for object order perception task on images in train2017 split
    │   └── unlabeled
    │   │   └── ...
    │   └── val
    │   │   └── ...
    ├── test
    │   └── ...
    ├── train
    │   └── instance_inference # contains instance masks for train2017 split
    │   └── instance.txt # contains answers for instance object identification task on images in train2017 split
    │   └── panoptic_inference # contains panoptic masks for train2017 split
    │   └── panoptic.txt # contains answers for panoptic object identification task on images in train2017 split
    │   └── semantic_inference # contains semantic masks for train2017 split
    │   └── semantic.txt # contains answers for semantic object identification task on images in train2017 split
    ├── unlabeled
    │   └── ...
    ├── val
    │   └── ...

Citation

If you use the COST dataset, please consider starring ⭐ us on GitHub and citing 📚 us in your research!

@article{jain2023vcoder,
    title={{VCoder: Versatile Vision Encoders for Multimodal Large Language Models}},
    author={Jitesh Jain and Jianwei Yang and Humphrey Shi},
    journal={arXiv},
    year={2023}
}