Microsoft COCO Dataset (Retrieval)

July 31, 2022

(Samples from the COCO Caption dataset. Image credit: "https://arxiv.org/pdf/1504.00325.pdf")

Description

The Microsoft COCO dataset contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.

Task

Cross-modal retrieval: (1) image-text: given an image as the query, retrieve texts from a gallery; (2) text-image: given a text as the query, retrieve images from a gallery.
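As a minimal sketch of the image-text direction, the snippet below ranks a gallery of text embeddings against one image embedding by cosine similarity. The embeddings are toy values and the helper names (`cosine`, `rank_gallery`) are hypothetical; the models on the leaderboard may score image-text pairs differently. The text-image direction works the same way with the roles swapped.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_gallery(query: Sequence[float], gallery: List[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy data (assumption): one image embedding and three caption embeddings.
image_embedding = [1.0, 0.0]
text_embeddings = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]

ranking = rank_gallery(image_embedding, text_embeddings)
print(ranking)  # caption indices, best match first
```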

Metrics

The common metric is recall@k: the percentage of queries for which a correct match appears among the top k retrieved results.

We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval recall score.
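A minimal sketch of how recall@k could be computed from ranked results, assuming the common convention that a query counts as a hit when at least one of its ground-truth matches appears in the top k. The function name `recall_at_k` and the toy rankings are illustrative assumptions, not the evaluation code behind the leaderboard.

```python
from typing import List, Set


def recall_at_k(rankings: List[List[int]], positives: List[Set[int]], k: int) -> float:
    """Percentage of queries with at least one ground-truth item in the top k."""
    hits = sum(
        1
        for ranked, pos in zip(rankings, positives)
        if any(idx in pos for idx in ranked[:k])
    )
    return 100.0 * hits / len(rankings)


# Toy setup (assumption): 2 images, 4 captions; captions 0-1 describe image 0,
# captions 2-3 describe image 1.

# TR: each image is a query, rankings are caption indices.
tr_rankings = [[3, 0, 1, 2], [2, 3, 0, 1]]
tr_positives = [{0, 1}, {2, 3}]

# IR: each caption is a query, rankings are image indices.
ir_rankings = [[0, 1], [1, 0], [1, 0], [1, 0]]
ir_positives = [{0}, {0}, {1}, {1}]

print("TR@1:", recall_at_k(tr_rankings, tr_positives, 1))
print("IR@1:", recall_at_k(ir_rankings, ir_positives, 1))
```

Because each COCO image has several captions, a TR query has multiple valid targets, whereas an IR query has exactly one.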

Leaderboard

(Ranked by TR@1.)

Rank	Model	TR@1	TR@5	TR@10	IR@1	IR@5	IR@10	Resources
1	BLIP	82.4	95.4	97.9	65.1	86.3	91.8	paper, code, demo, blog
2	X-VLM	81.2	95.6	98.2	63.4	85.8	91.5	paper, code
3	ALBEF	77.6	94.3	97.2	60.7	84.3	90.5	paper, code, blog
4	ALIGN	77.0	93.5	96.9	59.9	83.3	89.8	paper
5	VinVL	75.4	92.9	96.2	58.8	83.5	90.3	paper, code
6	OSCAR	73.5	92.2	96.0	57.5	82.8	89.8	paper, code
7	UNITER	65.7	88.6	93.8	52.9	79.9	88.0	paper, code

Auto-Downloading

cd lavis/datasets/download_scripts && python download_coco.py

References

"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick